Written by Aidar Orunkhanov
You have probably experienced a simplified version of low-latency matching in action. When you start the checkout process at many popular online marketplaces and begin typing your address, the address bar shows autocomplete suggestions, even if you have never shopped at the site before. This approach is low-latency: the suggestion appears nearly instantaneously. But there is no sophisticated matching on the back-end, so a single-letter deviation from the correct street spelling will no longer yield the same suggestions. It is near-instantaneous for a reason: the entry you started to type is matched one-to-one against existing records, and if no exact match is found, nothing is shown. This is where machine learning comes into play, allowing more sophisticated matching to happen on the back-end while keeping the response near-instantaneous. The key difference is that, before displaying suggestions, the entry is compared to existing records using more complex algorithms to find the best match. Tamr's low-latency matching works the same way: it returns results in near-real-time, so you can carry on with the rest of the process using the accurate address.
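To make the contrast concrete, here is a minimal sketch (not Tamr's actual algorithm, and the addresses are made up) of exact one-to-one lookup versus similarity-based matching, using Python's standard-library `difflib`:

```python
import difflib

known_addresses = [
    "123 Main Street, Springfield",
    "456 Oak Avenue, Shelbyville",
]

def exact_suggest(query):
    """Exact prefix lookup: a single typo and nothing is returned."""
    return [a for a in known_addresses if a.lower().startswith(query.lower())]

def fuzzy_suggest(query, cutoff=0.6):
    """Similarity-based lookup tolerates small spelling deviations."""
    return difflib.get_close_matches(query, known_addresses, n=3, cutoff=cutoff)

print(exact_suggest("123 Mian Street, Springfield"))  # [] because of the typo
print(fuzzy_suggest("123 Mian Street, Springfield"))  # still finds Main Street
```

The fuzzy version scores each candidate by character-level similarity instead of demanding an exact prefix, which is the basic idea behind tolerating typos while staying fast enough for interactive use.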
The same technique has tremendous application in KYC, and it can be used at the stage of the process most prone to data-entry errors: initial onboarding. Onboarding is the process of introducing a new customer to your enterprise. In the financial world it involves a range of activities required by regulations (and by risk-mitigation practices), such as customer identification, due diligence or enhanced due diligence, and risk rating. Most of these take the form of several automated and a few manual checks. Some checks require screening the customer against Specially Designated Nationals or politically exposed persons lists, or looking for adverse media and company ownership. Other checks involve comparing the customer to the master list of existing or rejected customers to ensure accuracy and prevent record duplication. If the customer is identified as an existing customer, all prior information can be reused; otherwise a new record is created.
In reality, none of these checks is meaningful if the initial biographical information is inaccurate. Unfortunately, onboarding is the step most prone to data-entry errors. This can be attributed to various factors, some as simple and innocuous as typos or pressing "Tab" too early after typing in a customer name or address. This is how customers from South Korea (commonly saved as "Korea, Republic of") get assigned to North Korea ("Korea, Democratic People's Republic of"). Other errors happen because customers present documentation with different spellings of their name, or because of name changes. While the majority of these mishaps are harmless, some are deliberate attempts to create multiple identities for money laundering or to bypass sanctions screening.
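One plausible mechanism for the Korea mix-up, sketched as a hypothetical form widget: a naive autofill accepts the first alphabetical prefix match when the user presses Tab, and "Democratic" sorts before "Republic":

```python
# Country names as they typically appear in form dropdowns, sorted alphabetically.
countries = sorted([
    "Korea, Democratic People's Republic of",
    "Korea, Republic of",
])

def tab_autocomplete(prefix):
    """Naive autofill: commit the first alphabetical prefix match on Tab."""
    for country in countries:
        if country.startswith(prefix):
            return country
    return None

# Typing "Korea" and pressing Tab too early selects North Korea.
print(tab_autocomplete("Korea"))
```

A matching step that scores candidates against the rest of the customer's record, instead of committing to the first prefix hit, avoids this class of error.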
Low-latency matching allows near-real-time comparison against existing records and a variety of sources at the onboarding stage, to ensure that potential new customers are not existing customers and are not on any reference lists. Because machine learning supports non-exact matching, spelling variations and use of non-Latin alphabets can be accounted for, which helps in two ways right from the beginning:
- prevent duplicate records for the same customer, essentially ensuring accuracy of all records
- prevent any potential fraud or attempts to bypass regulations
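The non-exact matching described above can be sketched in a few lines. This is an illustrative toy, not Tamr's method: it strips accents with Unicode normalization so that spelling variants compare closely, then scores candidates against an invented list of existing customers:

```python
import difflib
import unicodedata

def normalize(name):
    """Strip accents and case so 'José' and 'Jose' compare as equal."""
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).lower()

def match_score(candidate, existing):
    return difflib.SequenceMatcher(None, normalize(candidate), normalize(existing)).ratio()

# Hypothetical master list of existing customers.
existing_customers = ["José García", "Anna Müller", "John Smith"]

def screen_new_customer(name, threshold=0.85):
    """Flag likely duplicates of existing customers at onboarding time."""
    return [c for c in existing_customers if match_score(name, c) >= threshold]

print(screen_new_customer("Jose Garcia"))  # flags "José García" despite missing accents
```

An exact string comparison would treat "Jose Garcia" as a brand-new customer; the normalized similarity score catches the duplicate before a second record is created.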
Enhanced Reporting & Visualization
This is the most important point of all. As the world becomes more data-driven, making smart, data-driven decisions has become the norm. Most of us already know that machine learning is used in data science to uncover never-before-seen insights, and you are probably already using some degree of machine learning in your current reporting, visualization, and decision-making. As I mentioned before, any insight coming out of a reporting suite is only as accurate as the data going in, and using machine learning to master your data will improve your enterprise's data accuracy. The distribution of customers you thought you had might be completely different from the actual picture: the enterprise that looked small-business-oriented with a few large customers may turn out to be dominated by medium-to-large customers once all your customer data is mastered and duplicates and related entities are properly combined.
See the chart below, which shows customer segmentation before (on the left) and after (on the right) using machine learning to master your data. Very different pictures indeed!
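The effect is easy to reproduce with a toy dataset (the records and the crude merge key below are invented for illustration): one large customer split across duplicate entries looks like several mid-sized customers until the duplicates are combined and their revenue summed:

```python
from collections import Counter

# Hypothetical raw records: one large customer split across three duplicates.
raw = [
    ("Acme Corp", 40_000),
    ("ACME Corporation", 35_000),
    ("Acme Corp.", 30_000),
    ("Tiny Shop", 5_000),
]

def segment(revenue):
    return "large" if revenue >= 100_000 else "medium" if revenue >= 25_000 else "small"

# Before mastering: each duplicate record counts as a separate customer.
before = Counter(segment(rev) for _, rev in raw)

# After mastering: duplicates merged on a crude normalized key, revenue summed.
merged = {}
for name, rev in raw:
    key = name.lower().rstrip(".").replace(" corporation", " corp")
    merged[key] = merged.get(key, 0) + rev
after = Counter(segment(rev) for rev in merged.values())

print(before)  # three "medium" customers and one "small"
print(after)   # one "large" customer and one "small"
```

A real mastering pipeline would use learned matching rather than a hand-written key, but the reporting consequence is the same: the segmentation chart changes shape once duplicates stop being counted as independent customers.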
When choosing a machine learning solution, make sure it lets you continue using your favorite reporting and visualization tools, now enhanced with mastered data and dozens of new data sources. Tamr's API standardizes connectivity to most existing analytical software. My favorite visualization tool is Tableau: in addition to robust functionality, it plays well with extensions and with methods to improve your data quality.