Data Lake Implementation and Management
For many companies, data lakes are a core part of their DataOps ecosystem. After years of experiencing the unfulfilled promises of data warehouses, data lakes promised to deliver a central repository where data consumers could access enterprise data, both structured and unstructured, for use in analytics and decision-making.
Unfortunately, many of those pristine data lakes quickly turned into data swamps. The data became siloed and dirty, which made it difficult to use and even more difficult to trust. The lakes also lacked governance and a way to master the data to ensure that it was clean, curated, and continuously updated.
Put differently, organizations began to see data lakes as overhyped. Or in the words of Tamr co-founder and CEO Andy Palmer, they are “dead and dying.”
Are data lakes still valuable today?
In their purest form, data lakes provide benefits to the organizations that use them. They enable organizations to bring together structured and unstructured data from a myriad of systems and sources into a single repository that can power analytics, dashboards, and decision-making.
However, when data lakes deteriorate into data swamps filled with dirty, siloed data, they are no longer valuable to their organizations. Disconnected, unclean, ungoverned data is difficult to trust. And when users don’t trust the data, they will not use it to drive decisions.
How to build a valuable data lake
Clean, curated, continuously updated data that crosses organizational silos is valuable because people can trust it enough to act on it. That’s why organizations must prioritize the following when it comes to their data lakes:
1) Appoint someone responsible for integrating and mastering data
When integrating and mastering data within a data lake is everyone’s job, it becomes nobody’s job. That’s why it’s imperative that you give someone the responsibility of integrating and mastering data in the lake. This could be a Chief Integration Officer, or it could simply be someone within your DataOps organization.
2) Eliminate data silos
Developing a strategy to eliminate data silos is another priority. Most organizations today face challenges with data silos, and the problem will persist until you determine how to get rid of them. By creating a strategy to integrate and master data (see point #1), you’ll set your organization up to eliminate data silos once and for all.
3) Make data cleaning everyone’s job
This step is simple. Anyone who enters data into a system within your organization is responsible for cleaning it. Period. End of story. It doesn’t matter whether the person is a data entry analyst or part of the C-suite: keeping data clean is everyone’s responsibility. Hold your organization accountable for keeping data clean and you’ll keep your data lake from turning into a data swamp.
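One way to make that responsibility concrete is to check records at the point of entry, before they reach the lake. The Python sketch below is only an illustration under stated assumptions: the field names (customer_id, name, email), the email pattern, and the validate_record function are hypothetical and do not come from any particular tool.

```python
import re

# Hypothetical rules for a customer record; the field names are illustrative only.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
REQUIRED_FIELDS = ("customer_id", "name", "email")


def validate_record(record: dict) -> list[str]:
    """Return a list of problems found in the record (empty if it looks clean)."""
    problems = []

    # Every required field must be present and non-blank.
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if value is None or not str(value).strip():
            problems.append(f"missing {field}")

    # Only check the email format when an email was actually supplied.
    email = str(record.get("email") or "").strip()
    if email and not EMAIL_PATTERN.match(email):
        problems.append("malformed email")

    return problems
```

A system that accepts data entry could call validate_record on each submission and return the list of problems to the person who entered the record, so the fix happens at the source rather than downstream.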
Managing a data lake
Managing a data lake requires dedication to keeping the data in it clean, integrated, and continuously updated. There are many ways to do this, but the most valuable is to treat data as a product.
Treating data like a product means implementing a data product strategy that brings structure to the ownership, processes, and technology needed to ensure the organization has clean, curated, continuously updated data for downstream consumption. Companies implement data product strategies through the design and use of a data product, which makes data tangible for the organization. A data product is a consumption-ready set of high-quality, trustworthy, and accessible data that people across an organization can use to solve business challenges.
As a result, companies can increase competitiveness by improving the customer experience or creating product differentiation, and deliver value by driving growth, saving money, and reducing risks.
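To make the ownership and "continuously updated" parts of a data product more tangible, here is a minimal Python sketch of a data product descriptor. It is an assumption-laden illustration, not a reference design: the DataProduct class, its fields, and the is_fresh method are hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DataProduct:
    """Hypothetical descriptor for a data product; all fields are illustrative."""

    name: str
    owner: str               # the person or team accountable for the data
    last_updated: datetime   # when the data was last refreshed
    max_age: timedelta       # how old the data may get before it counts as stale

    def is_fresh(self, now: datetime) -> bool:
        """Return True if the data was refreshed within the allowed window."""
        return now - self.last_updated <= self.max_age


# Example: a customer data product that should be refreshed at least daily.
customers = DataProduct(
    name="customers",
    owner="dataops-team",
    last_updated=datetime(2024, 1, 1, 0, 0),
    max_age=timedelta(days=1),
)
```

Recording a named owner and an explicit freshness window alongside the data gives consumers something to check before they rely on it, and tells them whom to contact when the check fails.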