Why Low-Latency Matching Is Critical to Data Mastering at Scale

It’s common to think of data mastering as the way we turn our disparate and dirty data sources into something clean that we can use to power data applications. A few dozen sources of customer data go through the mastering cycle, and now we’ve got a single source of truth for each customer that we can feed into our CRM application, sales analytics visualizations, and churn models. Over the last couple of decades, we’ve gotten much better at this, and are continually innovating in the techniques and technologies we bring to solve the core problem of turning bad data into useful data.

But how much are we doing to actually stop new bad data from being produced going forward? Sure, we can determine that there are 13 duplicate records for customer X and make a single golden record, but are we doing anything to stop that 14th variant record from being produced? Does that CRM user entering customer data have the tools to know that the other 13 exist, and if they did, would they have time to update the golden record with the best data while they are on the phone with the customer? Doubtful. You can put in guardrails to prevent some of this, but in the end, more bad data is coming your way – from your own people!

Most mastering solutions focus on producing clean data for analytical applications, but these operational use cases expose big holes in data quality that must be addressed – in real time. Hundreds of users may be entering duplicate or contradictory data from dozens of systems all the time just so they can complete their time-sensitive tasks, and we’re relying on a handful of data engineers to master our way around the problem. Those are bad odds, regardless of how well equipped those engineers may be with technology that tries to corral and repair deteriorating data. As we’ve discussed in other posts, this is a challenge of data mastering at scale, for which low-latency matching is an essential component.

What we need is a way to quickly signal that the data being entered already exists in some better form – something that won’t be caught by validations that only check the internal consistency of a field or a record. And we need to raise this signal within the tools that people use to work with the relevant data: within the CRM application where they’re entering customer information, within the procurement tool where they’re entering new invoices, and so on. When a user starts entering a new supplier into your ERP system, that screen needs to show existing matches as the information is typed, offer a recommended supplier record or prompt for a truly new entry, and do so in the time it took you to read this sentence.
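As a minimal sketch of this kind of as-you-type lookup – assuming a hypothetical in-memory set of existing supplier records and simple string similarity, not any particular vendor's matching engine – a match service can return ranked candidates for a partially entered name:

```python
from difflib import SequenceMatcher

# Hypothetical existing supplier records, keyed by ID (illustrative data only).
EXISTING_SUPPLIERS = {
    "S-001": "Acme Industrial Supply",
    "S-002": "Acme Ind. Supply Co.",
    "S-003": "Globex Procurement Ltd.",
}

def match_candidates(partial_name, min_score=0.5):
    """Return (id, name, score) tuples for records similar to the text
    typed so far, best matches first."""
    query = partial_name.lower().strip()
    scored = []
    for rec_id, name in EXISTING_SUPPLIERS.items():
        score = SequenceMatcher(None, query, name.lower()).ratio()
        if score >= min_score:
            scored.append((rec_id, name, round(score, 2)))
    return sorted(scored, key=lambda t: t[2], reverse=True)

# As the user types "Acme Ind", both Acme variants surface immediately,
# while the unrelated Globex record stays out of the suggestion list.
print(match_candidates("Acme Ind"))
```

A production matcher would use far more sophisticated similarity models; the point here is only the shape of the interaction – partial input in, ranked existing records out, fast enough to run on every keystroke.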

This requires the power of a full MDM system, but with sub-second response times, delivered through an API that any applicable downstream system can call – without compromising accuracy, and with all of the relevant data presented. It’s a tough challenge to build and maintain a system that can accommodate traditional mastering workloads at enterprise scale, but that can also actively prevent data quality issues by deputizing everyone who creates data to curate it as well. These tools exist. We built one.
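One common way such systems keep lookups sub-second is blocking: comparing a new record only against the small set of existing records that share a cheap index key, rather than against every record. This is a generic sketch of that idea – the key function and class here are illustrative, not any specific product's implementation:

```python
from collections import defaultdict

def blocking_key(name):
    """Crude blocking key: first four alphanumeric characters, lowercased.
    Real systems use phonetic or learned keys; this is illustrative."""
    alnum = "".join(ch for ch in name.lower() if ch.isalnum())
    return alnum[:4]

class MatchIndex:
    def __init__(self):
        self._blocks = defaultdict(list)

    def add(self, rec_id, name):
        self._blocks[blocking_key(name)].append((rec_id, name))

    def candidates(self, name):
        # Only records sharing the block key are considered, so per-request
        # cost is proportional to block size, not total record count.
        return self._blocks.get(blocking_key(name), [])

index = MatchIndex()
index.add("C-1", "Acme Corp")
index.add("C-2", "ACME Corporation")
index.add("C-3", "Initech LLC")
print(index.candidates("acme corp."))  # both Acme records, Initech excluded
```

Fine-grained scoring then runs only on the handful of returned candidates, which is what makes a sub-second response achievable at enterprise scale.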

[Screenshot: low-latency matching embedded in the Salesforce contact entry page, showing a suggested existing contact record]
Pictured above is an example of low-latency matching embedded in the contact entry page within Salesforce.com. As a new contact record is entered, and before the new data is committed, an API call matches it against the mastering system. When a match is found, the applicable master data record is returned to the end user as a suggested existing contact. The user can accept this suggestion – thereby preventing the addition of duplicate data – or indicate that this is indeed a new contact that should be added to the source system. A more granular treatment can be implemented, in which only missing or contradictory field values are included in this feedback workflow. The flexible APIs and configurable user interfaces of modern applications make all of this possible.
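The accept-or-create decision described above can be sketched as a thin layer over the matching call. Everything here is hypothetical – the threshold, the record shapes, and the `entry_decision` helper are assumptions for illustration, not the actual integration:

```python
def entry_decision(new_record, match_results, suggest_threshold=0.8):
    """Decide, before committing, whether to surface an existing master
    record or allow creation of a new one. match_results is assumed to be
    a list of (master_record, score) pairs returned by the matching API."""
    strong = [(rec, s) for rec, s in match_results if s >= suggest_threshold]
    if strong:
        best, score = max(strong, key=lambda p: p[1])
        return {"action": "suggest_existing", "master": best, "score": score}
    return {"action": "allow_new", "master": None, "score": None}

# A strong match triggers a suggestion instead of a duplicate insert.
decision = entry_decision(
    {"name": "Jane Doe", "email": "jdoe@example.com"},
    [({"id": "M-42", "name": "Jane Doe"}, 0.93),
     ({"id": "M-77", "name": "Jan Dough"}, 0.41)],
)
print(decision["action"])  # suggest_existing
```

The more granular variant mentioned above would diff the suggested master against the typed fields and surface only the missing or contradictory values, rather than the whole record.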

In this scenario, we are not merely preventing data pollution at the point of entry; the acceptance or rejection of each matching recommendation is fed back into the mastering system to make it more accurate. This feedback loop provides a virtuous cycle of increasing mastering quality, and actually gets better as more users and data are incorporated into the process – a key attribute of the modern approach to mastering at scale. No one is waiting for a nightly mastering run. No one is learning a new mastering user interface. And everyone is making cleaner data, every day.
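The feedback loop can be pictured as turning each accept or reject into a labeled match example for the mastering system. This is a deliberately minimal sketch under that assumption – a real system would feed these labels into model retraining rather than just accumulating them:

```python
class FeedbackStore:
    """Collect user accept/reject decisions as labeled match examples.
    Illustrative only: a real mastering system would consume these
    labels to retrain or recalibrate its matching models."""

    def __init__(self):
        self.labels = []  # (source_record_id, master_record_id, is_match)

    def accept(self, source_id, master_id):
        # User confirmed the suggestion: a positive training pair.
        self.labels.append((source_id, master_id, True))

    def reject(self, source_id, master_id):
        # User said "this is genuinely new": a negative training pair.
        self.labels.append((source_id, master_id, False))

store = FeedbackStore()
store.accept("crm-1001", "M-42")   # duplicate caught and merged
store.reject("crm-1002", "M-42")   # genuinely new contact
positives = [label for label in store.labels if label[2]]
print(len(store.labels), len(positives))  # 2 1
```

Because every user interaction produces a label, the matcher improves fastest exactly where people are entering data most often – which is why the cycle strengthens as more users and systems join it.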