Data Lake Benefits: Gimme Some Ratings

I rely on scores and quick summaries more than I should. If it’s the Gartner Magic Quadrant, I immediately look to the upper-right-most dots and likely will not read anything from the report’s methodology, and only skim the narrative analysis. If I am buying a washer, I just want the Consumer Reports Harvey Ball ratings. If I am selecting a movie to stream, all I need to know is the Tomatometer rating. My list of short-attention-span sins is long. It would probably tax your reading patience and certainly is beyond my writing fortitude.

We are going to be writing a lot about Data Lakes. For now though, let’s just quickly riff through some of the purported benefits of a Data Lake architecture. If these benefits appeal to you, it may be worth learning more about the subject. I am also including a score, from 1 to 5, in increasing order of attention urgency. If you would like to review the methodology used in developing this score, or the associated narrative discussion, please feel free to contact me.

Benefit One – You can collect data from multiple sources and easily store it.

Amazon S3

You are about to see a very low attention urgency score. This is NOT because this is an unimportant benefit. It is. We just believe that you should be open to a simple solution for this one. Make your Data Lake Amazon Simple Storage Service (Amazon S3). For the past few years you have been urged to create an HDFS Data Lake. We are telling you that there may well be a role for Hadoop or Spark clusters, but first, just get your data into Amazon S3. Say it: “My Data Lake is Amazon S3.” Attention Urgency Score: 1.

Narrative Discussion Avoidance: How you ingest your data into an Amazon S3 Data Lake is a very important subject. We are arbitrarily and summarily not addressing the topic at this time. Our hunch is that you should set up a scalable, secure, and reliable method that supports batch ingestion of flat files and higher-velocity streaming. Your ingestion approach probably does not need to be state-of-the-art for petabyte dumps or massive-velocity real-time cyber security feeds. Setting that edge-case dust aside, there should, however, be a straightforward and reliable ingestion approach that suits 90% of your expected experience.
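To make that 90% case concrete, here is a minimal sketch of what date-partitioned batch ingestion into an S3 bucket might look like in Python. The bucket name, the raw/source/date prefix layout, and the helper names are illustrative assumptions on our part, not a prescribed standard; the actual upload uses boto3’s `upload_file`, which requires AWS credentials to run.

```python
# Sketch of a date-partitioned batch-ingestion scheme for an S3 Data Lake.
# Bucket name, prefix layout, and helper names are illustrative assumptions.
from datetime import date, datetime, timezone
from pathlib import Path


def s3_key(source: str, filename: str, when: date) -> str:
    """Build a date-partitioned key, e.g. raw/crm/2024/01/15/orders.csv."""
    return f"raw/{source}/{when:%Y/%m/%d}/{filename}"


def ingest_batch(directory: str, source: str, bucket: str = "my-data-lake") -> list:
    """Upload every flat file in `directory` under a dated prefix.

    Requires boto3 and AWS credentials; returns the keys written.
    """
    import boto3  # imported lazily so the key logic stays testable offline

    s3 = boto3.client("s3")
    today = datetime.now(timezone.utc).date()
    keys = []
    for path in sorted(Path(directory).glob("*.csv")):
        key = s3_key(source, path.name, today)
        s3.upload_file(str(path), bucket, key)
        keys.append(key)
    return keys
```

The date-partitioned prefix is the design choice doing the work here: it keeps files from multiple sources from colliding and lets downstream query engines prune by source and date.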