I’m having trouble getting to the bottom of what exactly a “data lake” is. Like you, I’ve been hearing the term for a few years now, but I’m still not sure it describes anything that pre-existing industry terminology couldn’t already express.
At the risk of sounding pedantic, I’d like to pick apart a few “data lake” definitions that I’ve found in industry sources. The purpose of this exercise is to illustrate that the term doesn’t communicate any shades of meaning that couldn’t have been expressed just as well using existing terminology.
For starters, it’s obviously a metaphor, just as the terms “data warehouse” and “data mart” started as metaphors. But these latter terms have acquired reasonably precise literal definitions that serve as clear blueprints for data architects. From 2012, here is a good dissection of Bill Inmon’s definition of “data warehouse.” Here’s another article laying out the differences between Inmon and Ralph Kimball on the issue of what constitutes a data warehouse vs. a data mart. In fact, the industry’s two broad schools of architectural thought on the “warehouse” vs. “mart” question, Inmon’s and Kimball’s, are longstanding and widely recognized.
From an architectural standpoint, it’s not clear whether the notion of a “data lake” is crystallizing around an equivalent consensus among practitioners. In trying to get my head around the definitional differences, I did what you too have probably done many times: I Googled it. Here are the first few meaty hits that popped up in my browser regarding what a “data lake” might be:
- Wikipedia: "A massive, easily accessible data repository built on (relatively) inexpensive computer hardware for storing 'big data.' Unlike data marts, which are optimized for data analysis by storing only some attributes and dropping data below the level of aggregation, a data lake is designed to retain all attributes, especially so when you do not yet know what the scope of data or its use will be."
- Edd Dumbill: "[A] place with data-centered architecture, where silos are minimized, and processing happens with little friction in a scalable, distributed environment. Applications are no longer islands, and exist within the data cloud, taking advantage of high bandwidth access to data and scalable computing resource. Data itself is no longer restrained by initial schema decisions, and can be exploited more freely by the enterprise."
- Andrew C. Oliver: “Instead of planning and building complex integrations and carefully constructed data models for analytics, you simply copy everything by default to commodity storage running HDFS (or Gluster or the likes) -- and worry about schemas and so on when you decide what job you want to run.”
- Some Hadoop user group: “[A] place to store practically unlimited amounts of data of any format, schema and type that is relatively inexpensive and massively scalable. Data processing software is available to transform the data from its raw state to a finished product."
I always keep a red pen handy for these kinds of occasions. What if we redlined the terms in these definitions that don't in any way distinguish “data lake” architecturally from any other big-data deployment model? Then we could whip out Occam’s Razor and whittle them down to whatever residue, if any, standard industry terms don’t already express:
- Scalability of the data platform: Stating that a data lake is “massively scalable" is no differentiator. After all, extreme scalability is the essence of all big-data deployment approaches.
- Runtime application execution on the data platform: One definition trumpets the notion that, in data lakes, “applications are no longer islands, and exist within the data cloud, taking advantage of high bandwidth access to data and scalable computing resource." One might interpret this as an oblique reference to the concept of in-database analytics. And that, in turn, is an integral feature of Spark, MapReduce, and other big-data platforms and runtimes, none of which need to be deployed in a “data lake,” however you might define the latter.
- Consolidation, efficiency, and accessibility of the data platform: This ideal is expressed in the phrases "easily accessible," "(relatively) inexpensive computer hardware," and "data-centered architecture, where silos are minimized, and processing happens with little friction in a scalable, distributed environment." None of these features or attributes is specific to the notion of a “data lake.” They’re also intrinsic to the concept of an "enterprise data warehouse."
- Schema-on-read for exploratory analytics on an archival data platform: This feature is alluded to in any or all of the following phrases: “designed to retain all attributes, especially so when you do not yet know what the scope of data or its use will be," "place to store practically unlimited amounts of data of any format, schema and type," “worry about schemas and so on when you decide what job you want to run,” and "data itself is no longer restrained by initial schema decisions, and can be exploited more freely by the enterprise.” These are important architectural concepts for the new era of data scientists who work with big data. But none of them distinguish a data lake from the equivalent, established concept of “data refinery,” unless the latter’s role is constrained to data acquisition, transformation, and cleansing. But even if you allow the term “data lake” to muscle “data refinery” out of one of its longstanding meanings—as an exploratory sandbox for unstructured data—it's not clear that what’s generally known as “schema on read” is a defining feature of this renamed entity. After all, data professionals have long availed themselves of ETL tools, executed on a data refinery, to overcome "initial schema decisions." That has typically involved pulling raw data from source repositories into distributed file systems (wherein all attributes are often retained), and transforming the data from disparate source data formats, schemas, and types to myriad target formats, schemas, and types. This is standard practice everywhere, enabling data to be "exploited more freely by the enterprise" when delivered downstream to data warehouses, data marts, data sandboxes, etc.
A few years ago, my colleague Tom Deutsch had an excellent discussion of schema-on-read vs. schema-on-write, crisply characterizing the former as "write your data first and then figure how you want to organize it later." Deutsch clearly spells out the pros and cons of each approach in different big-data use cases. But nothing about schema-on-read, as a requirement, necessitates any data platform beyond a suitably scalable storage/file system, and Deutsch never once specifies that it has to happen in a new type of big-data platform (nor does he ever use the term "data lake").
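Deutsch's "write first, organize later" characterization can be sketched in a few lines of plain Python, which underscores the point: schema-on-read presupposes nothing more than somewhere to append raw bytes, not any particular big-data platform. The field names and records below are purely hypothetical, chosen only to illustrate the contrast.

```python
import io
import json

# Schema-on-write: the schema gates ingestion. Records that don't conform
# are rejected up front, and attributes outside the schema are dropped.
WRITE_SCHEMA = {"user_id": int, "amount": float}

def ingest_schema_on_write(records, table):
    for rec in records:
        try:
            table.append({k: cast(rec[k]) for k, cast in WRITE_SCHEMA.items()})
        except (KeyError, ValueError):
            pass  # a nonconforming record is lost at write time

# Schema-on-read: ingestion is just an append of the raw record,
# every attribute retained, no conformance check.
def ingest_schema_on_read(records, raw_store):
    for rec in records:
        raw_store.write(json.dumps(rec) + "\n")

def query_schema_on_read(raw_store, fields):
    # The "schema" is decided here, at query time, per job.
    raw_store.seek(0)
    for line in raw_store:
        rec = json.loads(line)
        yield {f: rec.get(f) for f in fields}

records = [
    {"user_id": 1, "amount": "9.99", "coupon": "SPRING"},
    {"user_id": 2, "note": "no amount field"},  # fails the write schema
]

table = []
ingest_schema_on_write(records, table)

raw = io.StringIO()  # stands in for any scalable storage/file system
ingest_schema_on_read(records, raw)
# A later job can ask for attributes the write schema would have dropped.
coupons = list(query_schema_on_read(raw, ["user_id", "coupon"]))
```

The schema-on-write path ends up with a single clean row and has silently discarded both the second record and the `coupon` attribute; the schema-on-read path can still surface either, because organizing the data was deferred to the moment a job actually needed it. That is the whole trade-off Deutsch describes, and nothing in it requires a new category of platform.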
Truth be told, "data lake" is often employed in the same breath as "Hadoop." Here's just one example of an article that makes this conceptual linkage explicit. This should alert you to the fact that this new term is primarily an attempt to favor one type of big-data platform when the discussion turns to conglomeration of data storage, refinement, and sandboxing functions. In fact, the cited article munges all of these functions into an entirely Hadoop-centric "data lake" discussion that, of course, includes schema-on-read as an intrinsic feature.
When you think about it, what everybody's referring to here is essentially an archive, albeit one that serves more of an exploratory data-science modeling role than e-discovery and other traditional archival functions.
Then why not simply call it an “exploratory data archive"? Schema-on-read makes perfect sense as a core feature in such a use case, when the archive exists to support subsequent mining, modeling, and statistical analysis of multistructured data.
One advantage of the term I've just proposed is that it’s a literal description of this use case. It’s not some soggy new metaphor.