Tamr Insights
October 25, 2019

3 Ways Machine Learning Can Help You Know Your Customer

I spent many years in the AML and KYC analytics industry, where I created analytics and metrics to help identify deficiencies in risk rating methods and converted data into meaningful insight into customer behavior and potential red flags. I also have to confess that doing cool things like this used to be only 20% of my day – the remaining 80% was spent conducting data quality checks, finding primary and foreign keys to combine my data sources, counting the blanks in each field, and really just cleaning up and normalizing my data. Inaccurate data yields inaccurate results or, as people say, “garbage in – garbage out”, so cutting corners at this stage often leads to dire consequences later in the process.

How does this really relate to KYC, you might ask? KYC, and AML in general, is an industry that requires comprehensive identification and due diligence procedures for institutions to remain competitive in the market and compliant with regulations. As technology evolves, many of these procedures are conducted using a mixture of data-heavy systems, analytics, and some form of automated processing. This doesn’t only affect analytics and data teams – business users consume analytical insight and use tools that rely heavily on data at some stage, such as various KYC case management suites or transaction monitoring software. Without accurate and comprehensive data from multiple systems and sources, we might not be able to identify that Walter White, a chemistry teacher, leads a secret life as Walter White, the kingpin.

When someone thinks about using machine learning in any industry, forecasting and predictive analytics immediately come to mind. I have no doubt that many readers of this post already use some type of machine learning through their analytics teams or by consuming reports created using machine learning techniques. Instead, I want to focus on how machine learning can ensure data accuracy and completeness, make a tangible business impact, and minimize the 80% of time spent on manual work so that you can spend more of your time doing cool things and leave the grunt work to the machines.

#1 Golden Records

“Golden records” is a term used to describe a dataset that represents the most complete and truthful version of given records, the so-called “single version of the truth”. One of the main considerations to keep in mind when creating golden records is accurate matching and merging of records, as the same record might have small differences across various sources. For example, Michael Jones from our account owners list can be recorded as Mike Jones in our CRM tool.

Currently, most enterprises use an MDM-based solution to create customer golden records. Each golden record gets created from existing customer data and typically uses a defined ruleset to identify whether a record already exists in golden records. So combining Michael Jones and Mike Jones will require a few rules to accommodate the different spellings of the first name and other minor differences. While fairly straightforward when dealing with a small number of sources and examples, this process becomes challenging as the number of sources and the variety of data increase, because each additional source brings new rules and complexities that need to be accounted for. The initial creation of golden records this way may take many months depending on the number and size of the underlying data sources.
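To make the rules-based approach concrete, here is a minimal sketch in Python of what such a matcher tends to look like. The nickname table and the function names are made up for illustration and are not taken from any particular MDM product; a real ruleset would be far larger.

```python
# Hypothetical nickname rules; every new spelling variant needs a new entry.
NICKNAME_RULES = {
    "mike": "michael",
    "mick": "michael",
    "bill": "william",
}


def normalize_name(full_name: str) -> str:
    """Lowercase the name and expand any nickname covered by a rule."""
    parts = full_name.lower().split()
    return " ".join(NICKNAME_RULES.get(part, part) for part in parts)


def same_customer(name_a: str, name_b: str) -> bool:
    """Treat two records as the same customer if their normalized names agree."""
    return normalize_name(name_a) == normalize_name(name_b)


print(same_customer("Michael Jones", "Mike Jones"))   # covered by a rule
print(same_customer("Michael Jones", "Mikey Jones"))  # no rule for "mikey" yet
```

The second call is the point of the sketch: "Mikey" is not in the rule table, so the match is missed until someone notices and writes another rule.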

After creating the golden records, this dataset needs to be maintained to accommodate an increase in the underlying sources and data variety. Incremental updates and changes bring additional challenges to golden record curation. Frequent comparison to existing records to identify new records usually leads to dozens of new rules being created to increase the accuracy of non-one-to-one matches. As a result, incremental updates to the golden records usually take several days to weeks to finalize.

To summarize, some of the biggest drawbacks of a traditional rules-based approach are:

  • Creation of new golden records from scratch and onboarding new data sources are time-consuming
  • Incremental records require numerous rules to account for typos, spelling changes, and other data quality aspects
  • Accuracy is subjective and dependent on timely identification and creation of new rules

All of these drawbacks can be addressed by looking outside the box – beyond a rules-based approach – and this is where machine learning shines. Data mastering at scale powered by machine learning unlocks the ability to match records using multiple fields that may differ from source to source. In fact, Tamr’s patented algorithm goes even further by utilizing a human-in-the-loop approach to get the best of both worlds – the robustness of machine learning and the subject matter expertise of humans interacting with the algorithm. This human-guided machine learning approach leads to a few notable advantages:

  • Dozens to thousands of rules can be replaced with a model that learns from the data
  • Accuracy is increased by using human-augmented machine learning, where a human acts as an expert reviewer to direct the learning
  • Processing time is reduced significantly for incremental updates and new data source additions
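
As a rough illustration of matching on multiple imperfect fields with a human in the loop, the sketch below scores a pair of records across several fields and routes the uncertain cases to a reviewer. This is not Tamr’s algorithm: the field names, weights, and thresholds are assumptions chosen for the example, and the hand-set weights stand in for what a trained model would learn from data and reviewer feedback.

```python
from difflib import SequenceMatcher

# Assumed field weights; in a learned model these would come from training.
FIELD_WEIGHTS = {"name": 0.5, "address": 0.3, "phone": 0.2}


def field_similarity(a: str, b: str) -> float:
    """String similarity in [0, 1] after basic normalization."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def match_score(rec_a: dict, rec_b: dict) -> float:
    """Weighted average of per-field similarities."""
    return sum(
        weight * field_similarity(rec_a[field], rec_b[field])
        for field, weight in FIELD_WEIGHTS.items()
    )


def triage(score: float, auto_match: float = 0.9, auto_reject: float = 0.5) -> str:
    """Accept or reject confident cases; send the rest to a human reviewer."""
    if score >= auto_match:
        return "match"
    if score < auto_reject:
        return "no_match"
    return "needs_review"


crm = {"name": "Mike Jones", "address": "12 Oak St", "phone": "5550100"}
owners = {"name": "Michael Jones", "address": "12 Oak Street", "phone": "555-0100"}
print(triage(match_score(crm, owners)))
```

No single field matches exactly here, yet the combined score is high enough that the pair is sent to a reviewer instead of being silently dropped, which is the behavior a pile of exact-match rules struggles to reproduce.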

#2 Household/beneficial ownership

With FinCEN finalizing the ruling on beneficial ownership identification, 82% of financial institutions have difficulty implementing or modifying existing systems to account for beneficial ownership data (according to a 2019 AML survey by RSM).

Two of the obstacles enterprises need to overcome here include:

  • Improvements to verification of beneficial ownership beyond customer questionnaire data
  • Lack of an easy-to-use single view of the beneficial ownership network

While the latter is fairly straightforward to theorize, the former requires some clarification. Identification of beneficial ownership is a regulatory requirement in many jurisdictions, including the U.S. Inability to properly identify real owners of legal entities may lead to inadvertent dealings with sanctioned entities, foreign correspondent banks, and direct and indirect participation in money laundering and terror financing. All of these lead to heavy regulatory fines that may reach billions of dollars. In addition to legal ramifications, an enterprise can be exposed to unnecessary financial risk, reputational damage, and loss of customer confidence.

Machine learning can be used to create “household” and “beneficial ownership” views, where an algorithm identifies patterns to group related customers into clusters. As we power through with identification of beneficial ownership clusters, additional sources, such as marketing, CRM, and web data, can be used to help define these clusters. This can be further boosted by a check against a curated sanctions list that ignores language barriers and transliteration nuances, and we have a winning combo.
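
The clustering idea can be sketched in a few lines of Python. In this simplified version, customers who share an address or a phone number are linked, and linked customers are merged transitively with a union-find structure. Exact matching on shared attributes is an assumption made to keep the example short and stands in for the patterns a trained model would identify; the field names and the function name are hypothetical.

```python
from collections import defaultdict


def cluster_customers(customers, link_fields=("address", "phone")):
    """Group customer ids that are connected through any shared field value."""
    parent = {c["id"]: c["id"] for c in customers}

    def find(x):
        # Walk up to the cluster root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    first_seen = {}  # (field, normalized value) -> first customer id with it
    for c in customers:
        for field in link_fields:
            value = c.get(field)
            if not value:
                continue  # blanks must not link unrelated customers
            key = (field, value.strip().lower())
            if key in first_seen:
                union(first_seen[key], c["id"])
            else:
                first_seen[key] = c["id"]

    clusters = defaultdict(set)
    for cid in parent:
        clusters[find(cid)].add(cid)
    return sorted(sorted(members) for members in clusters.values())


customers = [
    {"id": "A", "address": "12 Oak Street", "phone": "555-0100"},
    {"id": "B", "address": "12 oak street", "phone": "555-0101"},
    {"id": "C", "address": "7 Elm Road", "phone": "555-0101"},
    {"id": "D", "address": "3 Birch Lane", "phone": None},
]
print(cluster_customers(customers))
```

Customer C never shares an address with A, but ends up in the same cluster through B. That kind of indirect link is what a single view of a household or ownership network is meant to surface.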

#3 Customer 360 and Enrichment

In addition to creating comprehensive clusters of beneficial ownership, the same agile approach to data mastering can be used to generate a true 360-degree view of all your customers.

A customer 360 view requires a lot more than simply using customer-provided data at onboarding. In the current connected world, dozens to hundreds of data sources can be used to enrich customer information for various purposes – from maintaining complete records to using PII-stripped data for targeted marketing and new product development.

Interestingly, while over 90% of enterprises rely on customer responses to enhance customer risk profiles, over 50% report difficulties validating this information and only 55% consult third-party sources to verify beneficial ownership (according to a 2019 Thomson Reuters AML Insights Report). A small but growing number of institutions, now at 27%, have started to use third-party customer screening enhancements. The biggest hurdles in choosing third-party sources and systems center on data quality – well-structured and accurate data are consistent priorities (2018 and 2019 Thomson Reuters AML Insights Reports). Another notable hurdle that enterprises need to overcome is the speed of implementation, which depends heavily on the ability of a new data source or system to integrate with your existing infrastructure. According to the 2018 report, 85% of financial institutions are apprehensive about bringing in new sources and systems due to concerns about the speed of implementation.

As you may have guessed, I am going to suggest machine learning as the solution. Here at Tamr, we use a tried and tested algorithm that enables matching customers across data sources using imperfect fields and attributes. For example, Mike Daniels and Michael Daniels can be the same person, while Michel Daniels can be either Michael or Michelle. The algorithm looks beyond the names and will help identify the best way to match customers with imperfect data, significantly increasing speed of implementation and accuracy. Of course, some of this matching requires human experts to review some questions as a second pair of eyes – the algorithm chooses a few items to verify and then fine-tunes itself to become even more accurate. With the biggest hurdles out of the way, enterprises can be on their way to becoming analytics-driven organizations.
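
The “choose a few items to verify, then fine-tune” loop can be illustrated with a generic technique called uncertainty sampling. The sketch below is an assumption-laden stand-in and not a description of Tamr’s implementation: it picks the candidate pairs whose match scores sit closest to the decision boundary, and once a reviewer has labeled them, it re-fits the boundary to agree with those labels. The function names and sample scores are hypothetical.

```python
def pick_for_review(scored_pairs, boundary=0.5, k=2):
    """Return the k (pair_id, score) items the model is least sure about."""
    return sorted(scored_pairs, key=lambda pair: abs(pair[1] - boundary))[:k]


def refit_threshold(labeled, default=0.5):
    """Pick the cut-off that agrees with the most reviewer labels.

    `labeled` is a list of (score, is_match) tuples from human review.
    Candidate cut-offs are the midpoints between neighboring scores.
    """
    scores = sorted({score for score, _ in labeled})
    candidates = [(lo + hi) / 2 for lo, hi in zip(scores, scores[1:])]
    if not candidates:
        return default  # not enough feedback to move the boundary

    def agreement(threshold):
        return sum((score >= threshold) == is_match for score, is_match in labeled)

    return max(candidates, key=agreement)


scored = [("p1", 0.95), ("p2", 0.52), ("p3", 0.10), ("p4", 0.45)]
print(pick_for_review(scored))  # the two pairs nearest the 0.5 boundary

feedback = [(0.2, False), (0.4, False), (0.6, True), (0.9, True)]
print(refit_threshold(feedback))
```

Asking about the borderline pairs first means each expert answer carries more information than a review of pairs the model already scores confidently.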


As I contemplate some notable cases over the past decade, I can’t stress enough the importance of having a scalable customer mastering process. An infamous Ponzi scheme creator comes to mind as one of the biggest cases in recent memory. A 360-degree view of customers could have saved billions in losses and fines. A bank became well-known as the unofficial “official” bank of cartels by allowing funds from illicit activities to enter its system. A 360-degree view of those customers combined with external news and other enrichment could have led to the story never developing past a customer rejection notice. In the past decade, fines for KYC, AML and Sanctions violations reached $26 billion – that’s a billion with a “b”. In fact, regulatory fines have reached hundreds of millions of dollars so far in 2019 alone! And these are only the regulatory fines – financial risk exposure, reputational damage, incorrect asset calculations and inaccurate market capitalization are harder to calculate.

As technology evolves, existing processes need to keep up. There are many ways to implement machine learning in KYC and AML, and data quality and accuracy are areas that are often overlooked by enterprises. Improved data quality can lead to more accurate results and more robust customer identification, and in today’s world, this is not just compliance, but a vital step in risk mitigation.

Stay tuned for Part 2 where I will discuss additional ways machine learning can boost your KYC and AML programs.