Written by Matt Holzapfel
To those unfamiliar, data unification may seem insignificant. After all, it can’t be that hard to unify data, right? Unfortunately, this is a huge misconception. Data unification is an incredibly complex process and one of the biggest challenges many large organizations face today.
Before we dig into how it works, let’s take a look at the definition of data unification:
Data Unification (n): The process of ingesting data from various operational systems and combining it into a single source by performing transformations, schema integration, deduplication, and general cleaning of all the records.
The Data Unification Challenge
To understand the major challenges with data unification, think about all the different programs used at your organization. Each one captures data differently. Now imagine trying to combine all of the data across your organization into one master source. This process is incredibly difficult to achieve at scale, when hundreds of thousands of datasets are involved.
To give you a better idea of what this process entails, here’s a high-level breakdown of the data unification process from Michael Stonebraker’s white paper on The Seven Tenets Of Scalable Data Unification:
- Ingesting data, typically from operational data systems in the enterprise.
- Performing data cleaning, e.g., -99 is often a code for “null,” and/or some data sources might have obsolete addresses for customers.
- Performing transformations, e.g., euros to dollars or airport code to city_name.
- Performing schema integration, e.g., “salary” in one system is “wages” in another.
- Performing deduplication (entity consolidation), e.g., I am “Mike Stonebraker” in one data source and “M.R. Stonebraker” in another.
- Performing classification or other complex analytics, e.g., classifying spend transactions to discover where an enterprise is spending money. This requires data unification for spend data, followed by a complex analysis on the result.
- Exporting unified data to one or more downstream systems.
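To make three of these steps concrete (cleaning, transformation, and schema integration), here is a minimal Python sketch on two toy records. Everything in it is an assumption made for illustration: the field names, the employee names, the airport lookup table, and the fixed exchange rate are invented, and real pipelines handle far more cases than this.

```python
# Toy illustration of cleaning, transformation, and schema integration.
# All names, lookup tables, and the exchange rate are illustrative assumptions.

EUR_TO_USD = 1.10  # assumed fixed rate, not a real quote
AIRPORT_TO_CITY = {"BOS": "Boston", "CDG": "Paris"}  # hypothetical lookup
FIELD_ALIASES = {"wages": "salary"}  # "wages" in one system is "salary" in another


def integrate_schema(record):
    """Rename source-specific field names to the shared field name."""
    return {FIELD_ALIASES.get(key, key): value for key, value in record.items()}


def clean(record):
    """Treat the sentinel value -99 as a missing value."""
    return {key: (None if value == -99 else value) for key, value in record.items()}


def transform(record):
    """Convert euro salaries to dollars and airport codes to city names."""
    out = dict(record)
    if out.get("currency") == "EUR" and out.get("salary") is not None:
        out["salary"] = round(out["salary"] * EUR_TO_USD, 2)
        out["currency"] = "USD"
    if "airport" in out:
        out["city_name"] = AIRPORT_TO_CITY.get(out.pop("airport"))
    return out


def unify(record):
    """Apply the three steps in order to a single record."""
    return transform(clean(integrate_schema(record)))


# Two made-up records from two made-up source systems.
source_a = {"name": "Jane Doe", "salary": -99, "currency": "USD", "airport": "BOS"}
source_b = {"name": "John Roe", "wages": 1000, "currency": "EUR", "airport": "CDG"}

unified = [unify(source_a), unify(source_b)]
```

After this runs, both records share the same field names, the -99 placeholder has become a true missing value, and the euro salary is expressed in dollars. Each of those small functions hides a decision that someone had to make for every source system, which is where the effort goes at scale.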
As you can see, data unification is very complex, which is why the vast majority of today’s organizations face a data mastering crisis.
The Data Preparation Ecosystem
Because of this data mastering crisis, there is an immediate need for organizations that “provide agile, curated internal and external datasets”, as Gartner puts it. These organizations are part of the rapidly expanding data preparation industry that is expected to grow over 18% YoY through 2021.
That explosive growth is largely due to the fact that organizations spend 60% of their time on data prep alone. New tools aim to greatly reduce this time and are quickly becoming the industry standard. According to Gartner’s research, 50% of all new projects will use data preparation tools by 2020.
Data unification is an integral part of this new data preparation ecosystem and an essential input to the tools that analysts and consumers use, such as self-service data prep tools and data catalogs. These users can’t be expected to be productive and generate meaningful business insights without a foundation of trustworthy data, which data unification provides.
The emergence of DataOps and the strategic need to increase analytic velocity in the enterprise have accelerated the move towards this modern architecture.
Out With The Old: The Traditional Data Unification Process Isn’t Effective
Legacy approaches to data unification typically revolve around ETL and MDM.
ETL, or Extract, Transform, and Load, involves writing a global schema up front and then relying on a programmer to understand that schema and write the conversion, cleaning, and transformation routines, as well as all necessary record updates.
MDM, or Master Data Management, involves creating a master record where all entities across the organization are defined and then merging all records to match the master.
Both ETL and MDM are incredibly labor-intensive because they require complex rule systems to unify data. These systems have a high upfront cost to develop and are costly to maintain. As a result, data unification efforts are often limited to a select few high-value data sources.
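To see why hand-written rules get expensive, consider a minimal Python sketch of a single matching rule. The rule and the function names are hypothetical, written only for this illustration; they are not taken from any ETL or MDM product.

```python
def name_key(full_name):
    """Hand-written rule: reduce a name to (first initial, last token)."""
    parts = full_name.replace(".", " ").split()
    return (parts[0][0].lower(), parts[-1].lower())


def same_person(a, b):
    """Two names match if the rule reduces them to the same key."""
    return name_key(a) == name_key(b)


# The rule handles the variation it was written for...
matches_initials = same_person("Mike Stonebraker", "M.R. Stonebraker")

# ...but a "Last, First" layout defeats it and needs yet another rule.
matches_reversed = same_person("Mike Stonebraker", "Stonebraker, Michael")
```

Here `matches_initials` is `True` and `matches_reversed` is `False`. Every new naming convention, data source, or entity type calls for another rule like this one, and each rule has to be written, tested, and maintained by hand.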
In With The New: A New Approach To Data Unification Is Working Wonders
A new, more effective data unification process has emerged. Using concepts from agile software development, giant organizations such as GE have fully mastered their data and gained access to powerful insights that have saved them $80 million and counting.
The agile approach uses a powerful data unification platform and a combination of machine learning and human expertise to conquer the data. The result is data that is unified, mastered, and up-to-date, something that was nearly impossible with the old methods.
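One common way to combine machine scoring with human expertise is to let the machine decide the confident cases and send the uncertain ones to an expert. The Python sketch below shows that idea only; it is not Tamr's implementation. The `similarity` function uses the standard library's `difflib` as a stand-in for a trained model, and the `accept` and `reject` thresholds are arbitrary assumptions.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in for a trained model: plain string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def triage(pairs, score=similarity, accept=0.9, reject=0.5):
    """Split candidate pairs into auto-merge, expert review, and distinct."""
    decisions = {"merge": [], "review": [], "distinct": []}
    for a, b in pairs:
        s = score(a, b)
        if s >= accept:
            decisions["merge"].append((a, b))     # confident match
        elif s < reject:
            decisions["distinct"].append((a, b))  # confident non-match
        else:
            decisions["review"].append((a, b))    # ask a human expert
    return decisions


candidates = [
    ("Mike Stonebraker", "Mike Stonebraker"),
    ("Mike Stonebraker", "M.R. Stonebraker"),
]
decisions = triage(candidates)
```

The design choice is that experts only see the pairs in `decisions["review"]`, so their time is spent on the ambiguous cases rather than on every record. In a real system the scoring function would be a model trained on labeled examples rather than a string comparison.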
According to Forbes, humans create 2.5 quintillion bytes of data each day, and that figure is growing. Businesses need data unification to make sense of this endless data stream so they can make smart, data-driven decisions and compete in a global economy. You’ve heard the expression “knowledge is power.” For modern-day businesses, that knowledge comes from having complete access to reliable, up-to-date data.
To learn more about data unification and how Tamr can help you address these challenges, please reach out or schedule a demo. You can also download a copy of Michael Stonebraker’s white paper, The Seven Tenets Of Scalable Data Unification.