I’m having trouble getting to the bottom of what exactly a “data lake” is. Like you, I’ve been hearing the term for a few years now, but I’m still not sure it names anything for which pre-existing industry terminology wasn’t sufficient.

At the risk of sounding pedantic, I’d like to pick apart a few “data lake” definitions that I’ve found in industry sources. The purpose of this exercise is to illustrate that the term doesn’t communicate any shades of meaning that couldn’t have been expressed just as well using existing terminology.

For starters, it’s obviously a metaphor, just as the terms “data warehouse” and “data mart” started as metaphors. But these latter terms have acquired reasonably precise literal definitions that serve as clear blueprints for data architects. From 2012, here is a good dissection of Bill Inmon’s definition of “data warehouse.” Here’s another article laying out the differences between Inmon and Ralph Kimball on the issue of what constitutes a data warehouse vs. a data mart. In fact, the industry consensus on two broad schools of architectural thought, Inmon vs. Kimball, on the “warehouse” vs. “mart” question is longstanding and widely recognized.

From an architectural standpoint, it’s not clear whether the notion of a “data lake” is crystallizing around an equivalent consensus among practitioners. In trying to get my head around the definitional differences, I did what you too have probably done many times: I Googled it. Here are the first few meaty hits that popped up in my browser regarding what a “data lake” might be:

I always keep a red pen handy for these kinds of occasions. What if we redlined every term in these definitions that fails to distinguish “data lake” architecturally from any other big-data deployment model? Then we could apply Occam’s Razor and see whether anything remains that standard industry terms don’t already express:

A few years ago, my colleague Tom Deutsch had an excellent discussion of schema-on-read vs. schema-on-write, crisply characterizing the former as "write your data first and then figure how you want to organize it later." Deutsch clearly spells out the pros and cons of each approach in different big-data use cases. But nothing about schema-on-read, as a requirement, necessitates any data platform beyond a suitably scalable storage/file system, and he never once specifies that it has to happen in a new type of big-data platform (nor does he ever use the term "data lake").
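The point is easy to demonstrate with nothing fancier than a flat file of raw records and a few lines of ordinary code. Here's a minimal sketch in Python (the records, field names, and helper function are all hypothetical, invented for illustration): the data is written with no schema at all, and each reader projects its own structure onto the bytes afterward.

```python
import json

# Schema-on-write would validate and normalize these records before
# storing them; schema-on-read stores the raw lines verbatim and defers
# all structure to query time. (Records and fields are invented purely
# for illustration.)
raw_lines = [
    '{"user": "alice", "action": "login", "ts": 1}',
    '{"user": "bob", "action": "purchase", "amount": 19.99, "ts": 2}',
    '{"user": "alice", "action": "logout", "ts": 3}',
]

def read_with_schema(lines, fields):
    """Impose a caller-chosen schema at read time: project each stored
    record onto `fields`, returning None for anything a record lacks."""
    rows = []
    for line in lines:
        record = json.loads(line)
        rows.append({f: record.get(f) for f in fields})
    return rows

# Two consumers impose two different schemas on the same stored bytes,
# with no up-front data modeling at write time.
sessions = read_with_schema(raw_lines, ["user", "action"])
purchases = [r for r in read_with_schema(raw_lines, ["user", "amount"])
             if r["amount"] is not None]

print(sessions[0])  # {'user': 'alice', 'action': 'login'}
print(purchases)    # [{'user': 'bob', 'amount': 19.99}]
```

Nothing here requires Hadoop, or indeed any particular platform at all; any storage layer that can hold and return raw bytes suffices, which is precisely the point.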

Truth be told, "data lake" is often uttered in the same breath as "Hadoop." Here's just one example of an article that makes this conceptual linkage explicit. This should alert you to the fact that the new term is primarily an attempt to favor one type of big-data platform whenever the discussion turns to consolidating data storage, refinement, and sandboxing functions. In fact, the cited article munges all of these functions into an entirely Hadoop-centric "data lake" discussion that, of course, includes schema-on-read as an intrinsic feature.

When you think about it, what everybody's referring to here is essentially an archive, albeit one that serves more of an exploratory data-science modeling role than e-discovery and other traditional archival functions.

Then why not simply call it an “exploratory data archive”? Schema-on-read makes perfect sense as a core feature in such a use case, where the archive exists to support subsequent mining, modeling, and statistical analysis of multistructured data.

One advantage of the term I've just proposed is that it’s a literal description of this use case. It’s not some soggy new metaphor.