Written by Jerry Held
Welcome to Data Lakes, one of the more unlikely sources of controversy we’ve run across this year.
Gartner kicked off the hullabaloo in July with its report, “The Data Lake Fallacy: All Water and Little Substance,” which held that data lakes “carry substantial risks” – the most important being the enterprise’s “inability to determine data quality or the lineage of findings by other analysts or users that have found value, previously, in using the same data in the lake.” Gartner cited security and access control as well as performance as additional reasons to view data lakes with a skeptical eye.
InfoWorld wasted no time jumping into the fray, referring to Gartner’s report as “nonsense” and calling data lakes “part of a greater movement toward data liberalization.” InfoWorld acknowledged the inherent risk of new technology, but suggested that risk can be mitigated with good security procedures, documentation and governance.
With all due respect to our friends at Gartner and InfoWorld, we think the issue is far more nuanced than their energetic appeals suggest. There is an important middle ground between Gartner’s pre-planned, highly controlled warehouse approach and InfoWorld’s “dump all the data in the lake” free-for-all.
On the surface, the logic behind data lakes is compelling: Put the full volume and variety of your enterprise’s data in one place and you’ll be able to manage it better, protect it better and, ultimately, derive valuable analytic insights. But while data lakes may indeed be able to help enterprises manage the Data Variety problem, without proper attention to the curation of this data, many data lakes are bound to turn into expensive, unproductive Data Swamps. Also, in many (probably most) cases, a single data lake will not be a practical answer. So data curation across multiple lakes, ponds or other storage locations may well be the most important investment. We call this Data Unification.
The fact is that Big Data is still in its infancy. Organizations are increasingly aware that leveraging ALL of the internal and external data available to them has massive potential benefit — be it a commercial competitive advantage or, as in the recent case of the National Institutes of Health mandating that genomic researchers share their data, a public good.
Many of these organizations, however, are still struggling with how best to collect, store, manage and access all that data for all their analytic purposes.
Some continue to attempt to pre-define a global corporate schema and ingest as much data as possible into a well-organized repository. This tends to limit the variety of data that is available and makes it harder to support unanticipated analytic uses.
Others are taking the opposite route: Just save the raw data in one place for any potential analytic purpose, known or unknown. Compelling, right? You keep everything in one place, without destroying or limiting anything you might need down the road. The obvious problem: this approach has you accumulating tons of structured and unstructured data and promising yourself you’ll get back to it — someday. Pretty soon you’re overwhelmed with dark, murky data that you’ll need to invest a whole lot of money in just to see.
Today we call this place a “Data Lake.” Twenty years ago we called it a “Data Warehouse.”
Unfortunately, many who are setting out for a cruise on the Lake don’t remember the hard lessons from the early days in the Warehouse. Huge sums were spent on early warehouse projects that ran very late and far over budget with little or no return. It took years to learn the lesson that data need to be transformed, cleansed and organized properly to realize the hoped-for results. As data warehousing progressed, data transformation, data quality and data integration tools emerged to help streamline the process of getting the data ready for analysis.
Data (Lake) Curation
Today’s data lakes are attempting to store a far larger volume and, very importantly, a much wider variety of data (a recipe for swamp water).
Fortunately, as we saw with warehousing, specific tools (in this case, tools for Data Curation) are emerging to provide enterprises much-needed transparency and enrichment for their data.
Data Curation is the process of preparing diverse data (discovering, analyzing, cleaning, transforming, combining, and de-duplicating it) to work together for analytics and downstream applications. In the past, data curation hasn’t been practical because of the cost and scaling limitations of manual methods. New, scalable approaches like Tamr’s, which combines machine learning with guided human insight, allow enterprises to curate data continuously and pragmatically, taming the data variety “hydra” in the process. Curation becomes part of the normal course of business rather than a one-off project. Further, different curations can be targeted to different analytic purposes based on your defined requirements.
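To make the idea of automated matching plus guided human review concrete, here is a minimal Python sketch. It is an illustration only, not Tamr’s method: a simple string-similarity score from the standard library stands in for a machine learning model, and the record layout, source names and thresholds are all hypothetical. Confident matches are merged automatically, and uncertain ones are queued for a person to decide.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Clean a name: lowercase it, replace punctuation with spaces, collapse whitespace."""
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in name.lower())
    return " ".join(cleaned.split())


def similarity(a: str, b: str) -> float:
    """Score two names between 0 and 1 (a stand-in for a learned matching model)."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def curate(records, auto_threshold=0.9, review_threshold=0.6):
    """Compare every pair of records and sort the pairs into two buckets:
    likely duplicates to merge automatically, and uncertain pairs for human review.
    Both thresholds are illustrative assumptions, not recommended values."""
    merged, needs_review = [], []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            score = similarity(records[i]["name"], records[j]["name"])
            pair = (records[i]["id"], records[j]["id"])
            if score >= auto_threshold:
                merged.append(pair)
            elif score >= review_threshold:
                needs_review.append(pair)
    return merged, needs_review


# Hypothetical customer records drawn from two source systems ("crm" and "erp").
records = [
    {"id": "crm-1", "name": "Acme Corp."},
    {"id": "erp-7", "name": "ACME Corp"},
    {"id": "crm-2", "name": "Acme Corporation"},
    {"id": "erp-9", "name": "Globex Inc"},
]

merged, needs_review = curate(records)
print("merge automatically:", merged)
print("ask a human:", needs_review)
```

Comparing every pair of records is fine for a handful of rows but does not scale; a real system would need to narrow the candidates first. The decisions collected from reviewers could also be fed back to improve the automated matching over time, which is what makes curation continuous rather than a one-off project.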
With this pragmatic, continuous data curation approach, your data stays in the source format or system, a goal of most data lakes. But now, you have a directory and metadata that make the data lake much less murky and more accessible to more people. With this approach, you have the ease of collecting all of your data without the need for a huge, predetermined schema (InfoWorld’s goal). And as curation progresses, you gain clarity about how the data relates (Gartner’s goal).
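One way to picture such a directory is as a small catalog that records where each dataset lives and which fields it holds, without moving the data itself. The Python sketch below is a hypothetical illustration: the class names, locations and field names are invented for the example and do not describe any particular product.

```python
from dataclasses import dataclass


@dataclass
class DatasetEntry:
    """Metadata about one dataset; the data itself stays in its source system."""
    name: str
    location: str       # where the data lives (hypothetical URIs below)
    fmt: str            # source format, left unchanged
    fields: list[str]   # field names discovered during curation


class Catalog:
    """A minimal directory of datasets, searchable by field name."""

    def __init__(self) -> None:
        self._entries: dict[str, DatasetEntry] = {}

    def register(self, entry: DatasetEntry) -> None:
        """Add a dataset's metadata to the directory, or update it."""
        self._entries[entry.name] = entry

    def find_by_field(self, field_name: str) -> list[str]:
        """Return the names of all datasets that contain the given field."""
        return sorted(e.name for e in self._entries.values() if field_name in e.fields)


catalog = Catalog()
catalog.register(DatasetEntry("orders", "hdfs://lake/raw/orders", "csv",
                              ["order_id", "customer_id", "total"]))
catalog.register(DatasetEntry("support_tickets", "s3://pond/tickets", "json",
                              ["ticket_id", "customer_id", "body"]))
catalog.register(DatasetEntry("web_logs", "hdfs://lake/raw/logs", "text",
                              ["timestamp", "url"]))

# Which datasets could be related through a shared customer identifier?
print(catalog.find_by_field("customer_id"))
```

Even a directory this simple shows how relationships between datasets begin to surface: a shared field name is a first hint that two sources describe the same entities, which curation can then confirm or reject.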
So, we see a world with data flowing freely into the data lake, followed by ongoing curation that improves its quality and turns the lake into a clear Data Reservoir capable of supporting high-quality analytics.
One or Many Lakes?
Finally, and again with hindsight from the Data Warehouse experience, the question is whether there will be one lake or many. As Data Warehouses emerged, it didn’t take long for Data Marts to come on the scene. How long after the first attempts at Data Lakes will it be before the first Data Ponds emerge? And how many organizations will have only one Data Lake? These questions lead us to the conclusion that there will be a huge payoff from investments in Data Curation, which will give a better understanding of what data exists in the Lakes, Ponds, Warehouses, Marts and other storage systems around the enterprise.
A well-curated, unified view of data across the organization will provide the best foundation for high-quality analytics.
For more on Tamr’s human-guided machine learning approach, watch our video here.