At the most basic level, data deduplication refers to the removal of redundant or duplicate data. It is an ongoing process to ensure there is no excess data in your database, and that you’re using only a single copy of truth, or the golden record, for analytics or operations.
Redundant or duplicate data can harm your business and your strategy in many ways, in both operational and analytical use cases. From an operational perspective, it’s hard to answer questions like “Which account is the right one to contact?”
From an analytics perspective, it’s hard to answer questions like “Who are my top paying customers by revenue?”
Data deduplication has a lot of overlap with data unification, where the task is to ingest data from multiple systems and clean it. It also overlaps with entity resolution, where the task is to identify the same entity across different data sources and data formats.
What are the benefits of deduplication?
Data deduplication can benefit your business in a myriad of ways. For example, improved data quality can lead to more cost savings, more effective marketing campaigns, improved return on investment, improved customer experience, and more.
Improve cost savings. This is the most obvious and direct benefit. First of all, it reduces data storage costs. Then, it helps save on data preparation and correction costs. Data analysts no longer need to spend 80% of their time on tasks such as data wrangling and transformation and can instead focus on more valuable data analysis. It can also help reduce employee churn.
More accurate analytics. In the example mentioned above, we were unsure which customer is our highest paying customer. In general, duplicate data really distorts a company’s visibility into its customer base and can derail analytics efforts. Data deduplication helps provide the team with the most accurate data, and eventually improves analytics performance.
Better customer experience. Duplicate data can cause companies to focus on the wrong targets, and even worse, to contact the same person multiple times. Data deduplication helps provide the customer success team with a holistic view of their customers and provides the best customer experience possible.
When is data deduplication used?
If you deal with real business data, then you have certainly faced the headaches of duplicate data. Whether it comes from customers filling out forms, your team manually entering data, or data imports from third-party platforms, duplicate data follows certain patterns, and it can be quite difficult to get rid of as a result. Data deduplication can help overcome duplicate data caused by the situations below.
Different expressions. One of the most common ways duplicate data is created in databases is through common terms expressed in different ways. For example, Tamr Inc. and Tamr Incorporated. A human can look at the two records and know instantly that it’s the same company. But databases will treat them as if they are two distinct records. The same problem happens with job titles: VP, V.P., and Vice President are good examples.
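Here is a minimal sketch of how these variations might be normalized before records are compared. The abbreviation maps and the `normalize` function are illustrative assumptions, not part of any particular tool, and real maps would be far longer.

```python
import re

# Illustrative abbreviation maps (assumptions); real ones would be far longer.
COMPANY_ABBREVIATIONS = {"inc": "incorporated", "corp": "corporation", "ltd": "limited"}
TITLE_ABBREVIATIONS = {"vp": "vice president"}

def normalize(text: str, abbreviations: dict) -> str:
    """Lowercase, strip punctuation, and expand known abbreviations."""
    tokens = re.sub(r"[^\w\s]", "", text.lower()).split()
    return " ".join(abbreviations.get(token, token) for token in tokens)

# Both spellings collapse to the same comparison key.
same_company = (
    normalize("Tamr Inc.", COMPANY_ABBREVIATIONS)
    == normalize("Tamr Incorporated", COMPANY_ABBREVIATIONS)
)
same_title = (
    normalize("V.P.", TITLE_ABBREVIATIONS)
    == normalize("Vice President", TITLE_ABBREVIATIONS)
)
```

The normalized string is only a comparison key. The original value should still be stored as entered.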
Nicknames (short names). People are often known by multiple names, such as a more casual version of their first name, a nickname, or simply initials. For example, someone named Andrew John Wyatt might be known as Andy Wyatt, A.J. Wyatt, or Andy J. Wyatt. In all cases, these name variations can easily create duplicate records in a database such as your CRM system.
Typos (fat fingers). Whenever humans are responsible for inputting data, there are going to be data quality issues. The average human data entry error rate can be as high as 4%, which means one in 25 keystrokes could be wrong. You might run into issues like “Gooogle” or “Amason” in company names and misspelled names such as “Thomas” typed as “Tomas”. In either case, they will create duplicate records.
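Typos like these can often be caught with approximate string matching. The sketch below uses Python's standard-library `difflib`. The 0.8 threshold and the function names are assumptions chosen for illustration, and a real system would tune the threshold against labeled examples.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio between two strings, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def is_probable_typo(a: str, b: str, threshold: float = 0.8) -> bool:
    """Flag two values as likely variants of each other (threshold is an assumption)."""
    return similarity(a, b) >= threshold

for entered, known in [("Gooogle", "Google"), ("Amason", "Amazon"), ("Tomas", "Thomas")]:
    print(entered, known, round(similarity(entered, known), 3))
```

A high score only suggests a duplicate. “Tomas” and “Thomas” could also be two different people, so other fields should confirm the match.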
Titles & Suffixes. Contact data may include a title or a suffix, and those can cause duplicate data as well. A person called Dr. Andrew Wyatt and a person called Andrew Wyatt could be created as separate records and live in different data systems, although they could be the same person.
Website / URLs. In records that store an organization’s website, the URL field may or may not include “HTTPS://” or “www.”. Furthermore, different records might have different top-level domains, such as amazon.com vs. amazon.co.uk. All of these differences will cause data duplication.
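As a rough sketch, the scheme and “www.” differences can be stripped with the standard library's `urllib.parse` before URLs are compared. The helper name `normalize_domain` is hypothetical.

```python
from urllib.parse import urlparse

def normalize_domain(url: str) -> str:
    """Reduce a URL to a bare lowercase host without a leading 'www.'."""
    url = url.strip().lower()
    if "://" not in url:
        # urlparse only recognizes the host when a scheme is present.
        url = "http://" + url
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host

print(normalize_domain("HTTPS://www.amazon.com/"))
print(normalize_domain("amazon.com"))
```

This does not decide whether amazon.com and amazon.co.uk are the same organization. That is a matching decision, not a formatting one.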
Number formats. The most common ones are phone numbers and dates. There are many ways to format a phone number. For example, 1234567890, 123-456-7890, (123)-456-7890, and 1-123-456-7890. In the case of dates, there are also many ways to represent them. For example, 20220607, 06/07/2022, and 2022-6-7. Number fields are also prone to typos and other issues, causing different representations of the same value.
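The formats listed above can be brought to one canonical form with a few lines of code. This sketch makes two assumptions: phone numbers are North American ten-digit numbers, and slash-separated dates are month-first. Both would need adjusting for international data.

```python
import re
from datetime import datetime

def normalize_phone(raw: str) -> str:
    """Keep digits only; drop a leading country code 1 (North American assumption)."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    return digits

# Assumed set of accepted formats; slash dates are read as month/day/year.
DATE_FORMATS = ("%Y%m%d", "%m/%d/%Y", "%Y-%m-%d")

def normalize_date(raw: str) -> str:
    """Return the date in ISO 8601 form, trying each known format in turn."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {raw!r}")

phones = ["1234567890", "123-456-7890", "(123)-456-7890", "1-123-456-7890"]
dates = ["20220607", "06/07/2022", "2022-6-7"]
print({normalize_phone(p) for p in phones})
print({normalize_date(d) for d in dates})
```

Each set collapses to a single value, so the four phone strings and three date strings are recognized as the same underlying data.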
Partial matches. This is one of the more complex issues and something not easily resolved by traditional rules or simple match algorithms. In the case of partial matches, the records share similarities with each other but are not exactly the same entity. Consider Harvard University, Harvard Business School, and Harvard Business Review Publishing. From an affiliated organization perspective, they are all affiliated with Harvard University. But from a mail delivery perspective, they would be distinct entities.
The result is that there are many ways duplicate records are created in data systems and the actual database will have a combination of these factors that contribute to data duplication. In the process of deduplication, you need to consider many – if not all – of these factors.
How does data deduplication work?
As discussed before, data deduplication has a lot of overlap with data unification and entity resolution. And there is a system of tools dedicated to solving this problem: Master Data Management (MDM). But in its simplest form, data deduplication is just the process of ensuring that only a single copy of truth, or the golden record, is used for analytics or operations.
There are traditional approaches to data deduplication, such as data standardization, relying on external IDs, and fuzzy matching with rules. But these approaches only partially work, because of all the variations of data problems mentioned in the previous section and the growing volume and variety of data.
Data standardization. For small data volumes, standardizing fields such as dates, phone numbers, and even addresses can be enough to solve the problem. But traditional methods such as ETL pipelines cannot easily deal with new data sources and new varieties.
Fuzzy match with rules. This method uses a combination of fuzzy matching (approximate string matching) and complicated rules to match potential duplicates. But the number of rules quickly skyrockets when there are multiple data systems in play with each other. Quickly, it becomes very difficult to maintain those rules.
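To make this concrete, here is a small sketch of fuzzy matching combined with rules. The field names (`name`, `email`, `zip`), the two rules, and the 0.85 threshold are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher

def fuzzy(a: str, b: str) -> float:
    """Approximate string similarity in the range 0..1, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def same_nonempty(r1: dict, r2: dict, field: str) -> bool:
    """True only when both records have the field and the values are equal."""
    return bool(r1.get(field)) and r1.get(field) == r2.get(field)

def is_duplicate(r1: dict, r2: dict) -> bool:
    # Rule 1: an identical, non-empty email is treated as a match.
    if same_nonempty(r1, r2, "email"):
        return True
    # Rule 2: a similar name AND the same postal code.
    if fuzzy(r1["name"], r2["name"]) >= 0.85 and same_nonempty(r1, r2, "zip"):
        return True
    return False

a = {"name": "Andrew Wyatt", "email": "", "zip": "02138"}
b = {"name": "Andrew Wyat", "email": "", "zip": "02138"}
print(is_duplicate(a, b))
```

Two rules are manageable. But every new source system tends to bring its own fields and exceptions, and each one usually means another rule to write and maintain.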
Relying on external IDs. Sometimes, the data itself already has a primary key that you can rely on to deduplicate data records. In the case of people, it could be a social security number. Or in the case of companies, it could be DUNS numbers. But DUNS numbers may not always exist and they are expensive to acquire.
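When a trustworthy external ID is present, deduplication reduces to grouping by that key. In this sketch, the `duns` field name and the ID values are placeholders, not real DUNS numbers. Records without an ID are set aside, which is exactly the gap described above.

```python
from collections import defaultdict

def group_by_external_id(records: list, key: str = "duns"):
    """Group records sharing an external ID; return (groups, records_without_id)."""
    groups = defaultdict(list)
    unkeyed = []
    for record in records:
        external_id = record.get(key)
        if external_id:
            groups[external_id].append(record)
        else:
            unkeyed.append(record)
    return dict(groups), unkeyed

records = [
    {"name": "Tamr Inc.", "duns": "000000001"},          # placeholder ID
    {"name": "Tamr Incorporated", "duns": "000000001"},  # same placeholder ID
    {"name": "Tamr", "duns": None},                      # no ID available
]
groups, unkeyed = group_by_external_id(records)
print(len(groups["000000001"]), len(unkeyed))
```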
Machine learning. Look for solutions that use a machine learning-first approach to entity resolution. Machine learning improves with more data. Rules do not. Machine learning increases automation and frees up technical resources by up to 90%. Rules-only approaches have the opposite effect.
Persistent IDs. Data and attributes tend to change over time even as you deduplicate them. As you reconcile different records into the same entity, maintaining a persistent ID can help provide you with a longitudinal view of the entity.
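One way to picture persistent IDs is a small registry that gives each entity a stable ID and keeps a redirect when two entities are merged. The `EntityRegistry` class, its methods, and the “E1”-style IDs are hypothetical, a sketch of the idea rather than how any specific MDM product implements it.

```python
import itertools

class EntityRegistry:
    """Assigns stable entity IDs and keeps redirects when entities are merged."""

    def __init__(self):
        self._counter = itertools.count(1)
        self._entity_of = {}  # source record key -> entity ID first assigned
        self._redirect = {}   # retired entity ID -> surviving entity ID

    def resolve(self, entity_id: str) -> str:
        """Follow redirects until reaching the surviving entity ID."""
        while entity_id in self._redirect:
            entity_id = self._redirect[entity_id]
        return entity_id

    def register(self, record_key: str) -> str:
        """Return the current entity ID for a source record, creating one if new."""
        if record_key not in self._entity_of:
            self._entity_of[record_key] = f"E{next(self._counter)}"
        return self.resolve(self._entity_of[record_key])

    def merge(self, key_a: str, key_b: str) -> str:
        """Merge two records' entities; the older (lower-numbered) ID survives."""
        a, b = self.register(key_a), self.register(key_b)
        if a != b:
            survivor, retired = sorted((a, b), key=lambda e: int(e[1:]))
            self._redirect[retired] = survivor
        return self.resolve(a)

registry = EntityRegistry()
first = registry.register("crm:1001")
second = registry.register("erp:77")
merged = registry.merge("crm:1001", "erp:77")
```

Because retired IDs redirect instead of disappearing, a report built against the old ID can still find the entity after the merge.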
Enrichment data. Data enrichment integrates your internal data assets with external data to increase the value of these assets. It also automatically standardizes many of the fields discussed above, like addresses, phone numbers, and other fields that can be used to identify the records, thus making it easier to identify duplicate records as well.
How to prevent duplicates from being created?
Healthy, next-gen data ecosystems have the ability to simultaneously process data in both batch and streaming modes. And this needs to occur not only from source to consumption but also back to the source, which is often an operational system (or systems) itself.
Having the ability to read and write in real time, or near real time, also prevents you from creating bad data in the first place. For example, real-time reading can enable autocomplete functions to block errors at the point of entry. And real-time writing through the MDM services and match index can help share good data back to the source systems. With real-time or near real-time reading and writing, it’s easy to identify a duplicate and then either block it from entering the database, automatically merge it with the existing record, or send it to a data steward for review.
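The block, merge, or review decision can be sketched as a simple routing function that runs at the point of entry. The `route_incoming` function, the in-memory list standing in for a match index, and both thresholds are assumptions for illustration. A production match index would compare many fields, not just a name.

```python
from difflib import SequenceMatcher

def route_incoming(new_name: str, existing_names: list,
                   merge_at: float = 0.95, review_at: float = 0.80):
    """Decide what to do with a new record based on its best match score."""
    best_name, best_score = None, 0.0
    for name in existing_names:
        score = SequenceMatcher(None, new_name.lower(), name.lower()).ratio()
        if score > best_score:
            best_name, best_score = name, score
    if best_score == 1.0:
        return ("block", best_name)    # exact duplicate: do not insert
    if best_score >= merge_at:
        return ("merge", best_name)    # near-certain match: merge automatically
    if best_score >= review_at:
        return ("review", best_name)   # plausible match: ask a data steward
    return ("insert", None)            # no likely match: accept as new

existing = ["Google", "Amazon", "Harvard Business School"]
for candidate in ["google", "Harvard Business Schol", "Gooogle", "Tamr"]:
    print(candidate, route_incoming(candidate, existing))
```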
Next-generation data mastering tools combine components such as machine learning, persistent IDs, and data enrichment to deal with duplicate issues and solve problems that traditional tools can’t.
Duplicate data, along with many other pitfalls, is unavoidable in today’s environment where data grows exponentially each day. But with the right next-gen MDM tools, deduping your data becomes much easier.