Why Your Data Mesh Strategy Needs Data Mastering

Data mesh has become one of the hottest topics in the data industry. It was certainly top-of-mind for many attendees at the MIT CDOIQ Symposium last week. 

As I explained during my presentation at the event, despite five decades of investment in systems for enterprise data quality, most large companies are still struggling to break down data silos and publish clean, curated, comprehensive data that is continuously updated. One of the primary reasons is that they’re missing a critical component of a modern data ecosystem: enterprise data mastering.

The evolution of the data ecosystem

In the 1990s and 2000s, most data professionals were focused on building enterprise data warehouses, supported by vendors such as Informatica, Talend, and Cloudera. The enterprise data warehouse approach was somewhat successful because a company’s data ecosystem existed as a large monolithic artifact within the organization, where it could be governed and contained.

Fast forward ten years, and we saw the introduction of next-generation analytics tools such as Qlik, Tableau, and Domo that tried to democratize the data ecosystem. Their goal was to have analysts, rather than database administrators, dictate how data should be processed and consumed in a distributed manner, on the assumption that data aggregation doesn’t work.

The paradigm shifted again in the mid-2010s, when cloud infrastructure provided the ability to scale storage and compute quickly and efficiently. Everyone again wanted to aggregate their data in a data lake – or at least move their data to the cloud first and figure out how to use it later.

A new approach to the data ecosystem

Which brings us to today. The data ecosystem is changing constantly, with data volume and variety exploding. Data is also becoming increasingly external, and the reliance on external data is only going to grow. Why? Because in many cases, the best version of the data exists outside of the firewall, not in your organization’s ERP or CRM solution.

To deal with data silos in analytics use cases, there are basically four different strategies you can employ:

  • Rationalization: consolidate data from different systems into one and eliminate the rest
  • Standardization: create consistent vocabularies and schemas and push them from one system to the rest
  • Aggregation: aggregate all the data into a central repository such as a data warehouse
  • Federation: store and govern data in a distributed manner and interconnect different data sources by domain

The key point with these strategies is that “one size does not fit all,” as Tamr co-founder Mike Stonebraker has been arguing convincingly for years. In any successful data project, all four strategies are necessary, but none is sufficient on its own. They all need, in some shape or form, a centralized entity table and persistent universal IDs to link data together. That’s where data mastering comes in.
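To make this concrete, here is a minimal sketch – with entirely hypothetical source systems, record keys, and IDs – of what a centralized entity table and crosswalk might look like: every source record keeps its native key, and the crosswalk maps that key to one persistent universal ID.

```python
# Minimal sketch of a mastered entity table with persistent universal IDs.
# The source systems, record keys, and IDs below are hypothetical examples.

# Raw records as they appear in two source systems (note the differing keys and spellings).
crm_records = [
    {"source": "crm", "source_key": "C-1001", "name": "Acme Corp.", "country": "US"},
]
erp_records = [
    {"source": "erp", "source_key": "88213", "name": "ACME Corporation", "country": "USA"},
]

# The centralized entity table: one row per real-world entity, keyed by a persistent universal ID.
entity_table = {
    "ENT-000042": {"canonical_name": "Acme Corporation", "country": "US"},
}

# The crosswalk links every source record back to its universal ID.
crosswalk = {
    ("crm", "C-1001"): "ENT-000042",
    ("erp", "88213"): "ENT-000042",
}

# Any downstream consumer can now resolve a source record to the mastered entity.
record = erp_records[0]
universal_id = crosswalk[(record["source"], record["source_key"])]
print(universal_id, entity_table[universal_id]["canonical_name"])
```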

With data mesh, the pendulum swings the other way again. According to Zhamak Dehghani, director of emerging technologies at Thoughtworks North America, data mesh is a new enterprise data architecture that embraces “the reality of ever present, ubiquitous, and distributed nature of data.” And it has four aspirational principles:

  1. Data ownership by domain
  2. Data as a product
  3. Data available everywhere (self-serve)
  4. Data governed where it is

In this new paradigm, the data is distributed and external. That’s why traditional, top-down master data management (MDM) simply will not work. Instead, organizations need to start with human-guided, machine learning-driven data mastering. It is a cornerstone for a successful data mesh strategy because it provides a centralized entity table and persistent universal IDs that users can rely on when running distributed queries.
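Continuing the hypothetical sketch above, once each domain publishes its data product keyed by the same universal ID, a “distributed query” across domains reduces to a simple join on that ID:

```python
# Two domain-owned data products, each already keyed by the shared universal ID.
# Domain names, fields, and values are hypothetical.
sales_domain = {"ENT-000042": {"open_opportunities": 3}}
support_domain = {"ENT-000042": {"open_tickets": 7}}

def customer_360(universal_id):
    # A cross-domain query becomes a join on the persistent universal ID.
    return {
        "entity_id": universal_id,
        **sales_domain.get(universal_id, {}),
        **support_domain.get(universal_id, {}),
    }

print(customer_360("ENT-000042"))
# -> {'entity_id': 'ENT-000042', 'open_opportunities': 3, 'open_tickets': 7}
```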

How should you think about data mastering in the context of your data mesh strategy?

Inherent in data mesh is the belief that data is becoming more distributed. And because it’s more distributed, you’ll need a way to resolve the differences in the data. This is really hard.

Consider this: today, we need to think about data in terms of logical entities – customers, products, suppliers … the list goes on. But a dirty secret most companies hide is that they have thousands of sources providing data about these entities, making it difficult, if not impossible, to implement a data mesh.

Companies embarking on a data mesh strategy will quickly realize that they need a consistent version of the best data across the organization. And that the only way to achieve this – at scale – is through machine learning-driven data mastering. 
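As a deliberately simplified illustration of what machine learning-driven matching looks like in practice – using a plain string-similarity score in place of a trained model, and hypothetical thresholds – candidate record pairs are scored, confident matches and non-matches are handled automatically, and only the ambiguous middle band is routed to a human curator:

```python
from difflib import SequenceMatcher

# Hypothetical thresholds: auto-match above 0.9, auto-reject below 0.5,
# everything in between goes to a human curator. A real system would score
# many features with a trained model, not a single string similarity.
AUTO_MATCH, AUTO_NO_MATCH = 0.9, 0.5

def score(a, b):
    return SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()

def triage(pairs):
    matched, review, rejected = [], [], []
    for a, b in pairs:
        s = score(a, b)
        if s >= AUTO_MATCH:
            matched.append((a, b, s))
        elif s <= AUTO_NO_MATCH:
            rejected.append((a, b, s))
        else:
            review.append((a, b, s))  # the human-guided part of the loop
    return matched, review, rejected

pairs = [
    ({"name": "Acme Corp."}, {"name": "ACME Corporation"}),
    ({"name": "Acme Corp."}, {"name": "Apex Industries"}),
]
matched, review, rejected = triage(pairs)
print(len(matched), "auto-matched,", len(review), "for review,", len(rejected), "rejected")
```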

Think about data mastering as a complement to data mesh. On their own, each produces a good result. But when you put them together, the results are even better. Just like two of my favorite complementary flavors, peanut butter and chocolate. 

When you apply human-guided, machine learning-driven data mastering, you clean up your internal and external data sources. You engage in a bi-directional cycle that enables you to cleanse and curate your data so that you can efficiently and effectively realize the promise of distributed data mesh. It’s a continuous loop – and that’s critical because it lets you incorporate changes to your data or its sources over time.
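One way to picture that loop, again as a hedged sketch extending the triage example above: curator verdicts on the ambiguous pairs become labels that feed back into the matcher before the next pass over new or changed records.

```python
# Sketch of the continuous loop: human verdicts become training labels,
# and each pass re-triages newly arrived or changed source records.
labels = []  # accumulated (record_a, record_b, is_match) examples

def mastering_pass(pairs, triage, curator_verdict):
    matched, review, rejected = triage(pairs)
    for a, b, s in review:
        verdict = curator_verdict(a, b)            # human-guided step
        labels.append((a, b, verdict))
        (matched if verdict else rejected).append((a, b, s))
    # A production system would retrain or recalibrate its matching model on
    # `labels` here, before the next pass over new or changed source data.
    return matched, rejected
```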

As you build your data mesh strategy, remember that the key to success is starting with modern data mastering. 

To learn more, schedule a demo.
