Allie Gilland
Analytics Engineer
October 25, 2021

Understanding The Advantages of Supervised Machine Learning for Master Data Management

Machine learning creates efficiencies, often saving us time and money, and reduces the burden of mundane activities, freeing our attention to focus on bigger topics at hand. Machine learning has created efficiencies in many aspects of day-to-day life, from virtual assistants like Siri and Alexa responding to our needs and GPS predicting traffic, to social media apps pre-tagging photographs and suggesting interests.

The benefits of machine learning for master data management are just as profound. Traditional approaches to master data management require high manual effort, from developing custom, deterministic rules, to maintaining existing rules, to integrating new data sources. Given the attention and manual effort required to maintain traditional MDM solutions, it’s no surprise that 75% of MDM solutions fail, according to a Gartner report: manual effort simply can’t keep up with the ever-changing data landscape.

While next-generation MDM providers like Tamr use machine learning to improve performance and handle the modern demands of scale, a common criticism of machine learning is that it’s a black-box approach. We will describe supervised machine learning and how it drives efficiencies in data management while enabling you to stay in control of your data.

What is Supervised Machine Learning?

Machine learning is a field of artificial intelligence that focuses on the use of data and algorithms to imitate the way humans learn. By identifying patterns, the models gradually improve their accuracy over time. With supervised machine learning, the algorithms are trained on labeled datasets, essentially paired examples of an input and its correct output label. The model is trained until it can recognize the underlying patterns between the input data and the output labels, enabling it to accurately label new data that it hasn’t seen before.

Why Use A Supervised Machine Learning Approach For Data Mastering

In the practice of master data management, supervised machine learning is an efficient way to cluster records by identifying whether two or more records are a match – an essential part of creating a golden record view. When a person provides verified examples of mastered records, the machine learning component learns to make future mastering decisions based on those examples.
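To make the link between pairwise matches and clusters concrete, here is a minimal sketch of how match decisions could be rolled up into clusters using a union-find structure. The record IDs and the `cluster_records` helper are made up for illustration; this is one common way to group matched pairs, not a description of how any particular MDM product does it.

```python
def cluster_records(record_ids, matched_pairs):
    """Group records into clusters, given pairs that were judged to match."""
    # Each record starts out as its own cluster representative.
    parent = {rid: rid for rid in record_ids}

    def find(rid):
        # Walk up to the cluster representative, shortening the path as we go.
        while parent[rid] != rid:
            parent[rid] = parent[parent[rid]]
            rid = parent[rid]
        return rid

    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    clusters = {}
    for rid in record_ids:
        clusters.setdefault(find(rid), []).append(rid)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical record IDs: r1/r2 and r2/r3 were judged to be matches.
clusters = cluster_records(
    ["r1", "r2", "r3", "r4", "r5"],
    [("r1", "r2"), ("r2", "r3")],
)
print(clusters)
```

Because matches are transitive here, r1 and r3 end up in the same cluster even though that pair was never compared directly. Each resulting cluster is a candidate for a single golden record.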

Let’s look at examples of two records that represent the same entity, and how a rules-based approach and a supervised machine-learning approach would differ:

Rules-Based Approach For Matching Customer Records

A rules-based approach tries to code consistent logic for how records should be clustered and what fields or attributes constitute a record match. In plain language, the logic for coding rules for the example below might start with something like:

If (names are the same or approximately the same)
    AND (the postcode is the same)
    AND (URLs aren’t different)
    AND (states are the same, full or abbreviated)
Then the records match.

We can already see that we’re in hot water. How do we define names as being approximately the same? One letter difference? Two letters? What about capital or lowercase? How would it deal with shortened names?
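To see how quickly those questions turn into hard-coded choices, here is a rough Python sketch of the rule above. Everything in it is an assumption made for illustration: the 0.85 similarity cutoff, the tiny state lookup, the field names, and the sample records are all invented, and `difflib.SequenceMatcher` is just one of many ways to score "approximately the same".

```python
from difflib import SequenceMatcher

# Tiny illustrative lookup; a real rule set would need every state and variant.
STATE_ABBREVIATIONS = {"massachusetts": "ma", "california": "ca", "new york": "ny"}


def normalize_state(state):
    cleaned = state.strip().lower()
    return STATE_ABBREVIATIONS.get(cleaned, cleaned)


def names_approximately_equal(a, b, threshold=0.85):
    # The 0.85 cutoff is an arbitrary assumption; someone has to pick it.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def urls_conflict(a, b):
    # "URLs aren't different": only a conflict when both are present and differ.
    return bool(a) and bool(b) and a.lower() != b.lower()


def records_match(r1, r2):
    return (
        names_approximately_equal(r1["name"], r2["name"])
        and r1["postcode"] == r2["postcode"]
        and not urls_conflict(r1.get("url"), r2.get("url"))
        and normalize_state(r1["state"]) == normalize_state(r2["state"])
    )


# Made-up records for illustration only.
acme = {"name": "Acme Corp", "postcode": "00001", "url": "acme.example", "state": "MA"}
acme_variant = {"name": "Acme Corp.", "postcode": "00001", "url": None, "state": "Massachusetts"}
acme_short = {"name": "Acme", "postcode": "00001", "url": None, "state": "MA"}

print(records_match(acme, acme_variant))  # trailing period: still a match
print(records_match(acme, acme_short))    # shortened name: rejected by the cutoff
```

The first pair passes, but the shortened name fails the same rule even though every other field agrees. Loosening the cutoff to catch it would let other, unrelated names through.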

These issues become more challenging as more sources and records are added: new cases keep emerging, and the existing rules won’t catch all of them. For example, it is complex to define rules that can recognize that ‘Uber’ and ‘Uber Technologies Inc’ are the same company, but ‘Zoom’ and ‘ZoomInfo’ are different. The rate of false-positive and false-negative matches quickly grows as the data expands, causing data engineers to invest more time reviewing the data and adding more rules for exceptions. Scale this to hundreds of millions of records, and you can see that it quickly becomes unmanageable. Not only that, but as the rules pile up, it becomes impossible to untangle the logic, making governance a nightmare.
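The ‘Uber’ and ‘Zoom’ cases can be checked directly. The short sketch below scores both pairs with Python’s `difflib.SequenceMatcher` (chosen here only as a convenient, assumed similarity measure) and then searches for a single cutoff that accepts the Uber pair while rejecting the Zoom pair.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Character-level similarity between two names, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


same_company = name_similarity("Uber", "Uber Technologies Inc")
different_companies = name_similarity("Zoom", "ZoomInfo")


def threshold_separates(threshold):
    # A usable cutoff must accept the true match and reject the false one.
    return same_company >= threshold and different_companies < threshold


# Try every cutoff from 0.00 to 1.00 in steps of 0.01.
workable = [t / 100 for t in range(101) if threshold_separates(t / 100)]

print(f"Uber vs Uber Technologies Inc: {same_company:.2f}")
print(f"Zoom vs ZoomInfo:              {different_companies:.2f}")
print(f"Cutoffs that get both right:   {workable}")
```

With this measure, the two different companies score as more similar than the two names for the same company, so no single cutoff works. A rule author would have to add a special case, and then another for the next exception.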

Supervised Machine Learning Approach For Matching Customer Records

With Tamr’s data mastering built on supervised machine learning, a person who is familiar with the data reviews a set of record pairs. That person labels whether or not each pair is a match, meaning the two records represent the same entity, and the machine learning model learns over time to predict these responses. Tamr has a unique feature that allows for fast model training with fewer training questions: it identifies “high-impact” pairs for a person to label as a match or not, providing maximum impact on the model with minimal effort. A machine learning approach can recognize that variants such as ‘Tamr’ and ‘tamer’ represent the same entity. Likewise, given a pair of records containing incorrect and incomplete information, Tamr’s machine learning model can learn from the provided training examples to correctly suggest that the pair is a match.
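The general idea of learning from labeled pairs can be sketched in a few lines of plain Python. This is a toy logistic regression over three hand-picked similarity signals, with made-up records and postcodes. It is not Tamr’s model or feature set. The `most_uncertain` helper shows one common way (uncertainty sampling) to pick a pair worth asking a person about, which is an assumption about how “high-impact” pairs might be chosen rather than a description of Tamr’s method.

```python
import math
from difflib import SequenceMatcher


def pair_features(a, b):
    """Turn a pair of records into numeric similarity signals."""
    name_a, name_b = a["name"].lower(), b["name"].lower()
    return [
        SequenceMatcher(None, name_a, name_b).ratio(),           # overall name similarity
        1.0 if name_a.split()[0] == name_b.split()[0] else 0.0,  # same first word
        1.0 if a["postcode"] == b["postcode"] else 0.0,          # same postcode
    ]


def predict(weights, bias, x):
    """Predicted probability that the pair is a match."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))


def train(examples, labels, learning_rate=0.5, epochs=2000):
    """Fit logistic regression with plain batch gradient descent."""
    weights, bias = [0.0] * len(examples[0]), 0.0
    n = len(examples)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(weights), 0.0
        for x, y in zip(examples, labels):
            error = predict(weights, bias, x) - y
            grad_b += error
            for i, xi in enumerate(x):
                grad_w[i] += error * xi
        bias -= learning_rate * grad_b / n
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
    return weights, bias


def most_uncertain(pairs, weights, bias):
    """Pick the unlabeled pair whose prediction is closest to 50/50."""
    return min(pairs, key=lambda p: abs(predict(weights, bias, pair_features(*p)) - 0.5))


def rec(name, postcode):
    return {"name": name, "postcode": postcode}


# Made-up labeled pairs: 1 = same entity, 0 = different entities.
labeled_pairs = [
    (rec("Uber", "11111"), rec("Uber Technologies Inc", "11111"), 1),
    (rec("Acme Corp", "22222"), rec("Acme Corp.", "22222"), 1),
    (rec("Globex", "33333"), rec("Globex LLC", "44444"), 1),
    (rec("Zoom", "55555"), rec("ZoomInfo", "66666"), 0),
    (rec("Acme Corp", "22222"), rec("Apex Corp", "77777"), 0),
    (rec("Initech", "88888"), rec("Initrode", "88888"), 0),
]
examples = [pair_features(a, b) for a, b, _ in labeled_pairs]
labels = [label for _, _, label in labeled_pairs]
weights, bias = train(examples, labels)

# A pair the model has never seen.
unseen = (rec("Hooli", "99999"), rec("Hooli Inc", "99999"))
unseen_probability = predict(weights, bias, pair_features(*unseen))
print(f"Match probability for unseen pair: {unseen_probability:.2f}")

# Candidate pairs a reviewer could be asked about next.
candidates = [unseen, (rec("Hooli", "99999"), rec("Hoopla", "12345"))]
next_question = most_uncertain(candidates, weights, bias)
```

Nobody wrote a rule saying that a shared first word matters more than raw character similarity. The model picked that up from six labeled examples, which is why it can handle the Uber and Zoom cases that defeat a single similarity cutoff.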

Key Benefits of Supervised Machine Learning for Master Data Management

A data mastering solution built on machine learning will specifically improve the speed, scalability, and accuracy of the classic approach to mastering. We’ll cover these features below:

Speed- Training, deploying, and maintaining your data mastering solution is faster with a machine-learning-driven approach. By reducing manual effort, machine learning eliminates the most time-consuming part of data mastering. While the amount of model training needed varies with the nature of the data, it typically takes only one day of effort to train a mastering model for millions of records. Months and years to deployment with traditional methods turn into days and weeks with machine learning. This allows for faster time to value for your MDM solution. According to a Forrester report, most customers experience an overall efficiency saving of approximately 80%.

Scalability- A traditional rules-based approach can work well for a handful of datasets, but it breaks down with too much volume or variety of data. Machine learning, on the other hand, can scale to handle hundreds of datasets with tens of millions of records. It also gets easier to onboard additional datasets, as the existing model training acts as a foundation to build upon. The repeatable nature of machine-learning-driven mastering allows for continuous and scaled deployment.

Accuracy- A rules-based data mastering solution will only match records that fit the conditions of the rules. However, because your data contains all kinds of misspelled and missing entries, there will be exceptions, and these exceptions fall outside the scope of traditional rules. Machine learning, on the other hand, can handle these types of cases. To improve the accuracy of a rules-based MDM solution enough to catch these exceptions, you would need to spend hours combing through the data and the existing rules, adding logic to try to account for these cases. And even then, you would still miss some. While rules-based data mastering solutions typically have 60-80% accuracy, master data management with machine learning has 90%+ accuracy. This improved accuracy puts trust back into your data, which your business can use to make informed decisions.

Machine Learning Is A Must For Next Generation Master Data Management

To ensure time to value, a data mastering solution needs to be fast, scalable, and repeatable. A machine learning approach to master data management does just that. With simple examples and only minimal manual effort, machine learning can provide a cost-effective, broadly applicable, and trustworthy data mastering solution.