Written by Afsana Afzal
You’ve likely invested in Know Your Customer (KYC) programs, whether for a traditional reason (risk assessment and regulatory compliance) or for a strategic growth reason (sell more, serve customers better). And they’re humming along, presumably.
Or are they?
If you don’t have clean data feeding your KYC programs, you can’t possibly have a real-world picture of your customers, one that’s trustworthy enough for making critical decisions.
Customer-related data, perhaps even more so than other data, faces an uphill battle in the “clean data wars.” It resides in many silos (e.g., ERP applications, Salesforce.com, CRM systems). There’s the natural “drift” and disconnect that happens when so many different people are creating, adding to, and updating customer data as part of their daily jobs. And there’s value in comparing and enriching your customer data with external data (for example, a licensed, “gold-tier” file of vetted data on global companies), which brings still more data into the mix.
Traditional data unification methods are too slow, particularly given the variety of customer-related data. Processes like data deduplication, records clustering, schema mapping, and entity resolution and mastering take too much time to execute properly at any kind of scale. Some of these processes require considerable skill and intimate knowledge of the data. With traditional Master Data Management (MDM) and Extract-Transform-Load (ETL) software, processes operate according to top-down rules for data flow and logic. Business people define these rules, which must then be interpreted, coded, and deployed by programmers. Often this process has to be repeated over and over again, whenever the data changes or new datasets are added.
The result: unacceptable latency for real-time-critical applications like KYC, particularly for businesses with a large, global customer base.
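The brittleness of that rule-based approach is easiest to see in code. Below is a hypothetical sketch (not taken from any real MDM product) of the kind of matching logic such systems accumulate; every new dataset or naming convention means another rule for programmers to write, test, and redeploy:

```python
# Hypothetical hand-coded matching rules of the kind a traditional
# MDM/ETL pipeline encodes. Each rule was added for one dataset's quirks.
def same_customer(a: dict, b: dict) -> bool:
    # Rule 1: exact match on a normalized legal name
    if a["name"].strip().lower() == b["name"].strip().lower():
        return True
    # Rule 2: match on tax ID when both records have one
    if a.get("tax_id") and a.get("tax_id") == b.get("tax_id"):
        return True
    # Rule 3, added later: strip common corporate suffixes first
    suffixes = (" inc", " ltd", " llc", " corp")
    def strip_suffix(name: str) -> str:
        n = name.strip().lower()
        for s in suffixes:
            n = n.removesuffix(s)
        return n.strip()
    return strip_suffix(a["name"]) == strip_suffix(b["name"])

print(same_customer({"name": "Acme Inc"}, {"name": "ACME"}))  # True
```

The next dataset that spells names differently, or uses a registration number instead of a tax ID, forces another round of rule-writing and redeployment.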
Breaking the Data Unification Logjam
As an example of this process, let’s take a look at a popular KYC application: risk assessment.
A global financial institution needs to perform ongoing risk assessment on its customer list to ensure that they are all legitimate customers (e.g., not shell companies, money launderers, or terrorists, or even something as benign as a company whose name sounds similar to an actual customer’s). The customers are globally spread out, large and small, and include commercial businesses as well as various governments.
To get that real-world view of its customers, the institution needs to develop a deduplicated and up-to-date list of customers. Here are the hoops the institution has to jump through to get there:
Data Ingestion: Ingest the data from the various siloed data sources.
Schema Mapping: Align the ingested data to a canonical schema. Wade through the various datasets for similarly named fields that mean the same thing, such as “Org Name,” “Name,” or “Organization.” One dataset may have multiple columns such as “Org Name,” “Alt Org Name,” and “Alias” (e.g., the company’s stock symbol), and others may have duplicate columns (a frequent occurrence). Resolve and map the data into a unified attribute called, for example, Organization Name. Integrate this into your schema. Repeat for other attributes as necessary.
Transformation: Clean up the resulting data, such as two input columns that prove duplicative upon further examination, and put it into the shape and form the institution prefers (this is the “T” in ETL).
Mastering: At this point, the data has been mapped to a unified schema and into the standardized or normalized format desired. It’s pretty clean, but it still has duplicate rows. In the mastering phase, knowledgeable people must define what records may be duplicates and label data according to business rules to create clusters of like records.
Validation: This involves people going through similar clusters of records and making sure they are being clustered appropriately. They add more labels and other metadata to make the records searchable and (hopefully) findable down the road.
Golden Records: Finally, IT/data experts create a single canonical, validated record that describes the entity, based on the institution’s business needs and desired view of the data.
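To make the effort concrete, here is a toy end-to-end sketch of those steps in Python. All field names, sources, and matching logic here are hypothetical, and a real institution’s pipeline would run each step at vastly larger scale:

```python
from itertools import groupby

# Steps 1-2. Ingestion + schema mapping: per-source column -> canonical attribute
SCHEMA_MAP = {
    "crm": {"Org Name": "org_name", "Tel": "phone"},
    "erp": {"Organization": "org_name", "Phone Number": "phone"},
}

def ingest(source: str, rows: list) -> list:
    mapping = SCHEMA_MAP[source]
    return [{mapping[k]: v for k, v in row.items() if k in mapping}
            for row in rows]

# Step 3. Transformation: normalize into the preferred shape
def transform(row: dict) -> dict:
    row["org_name"] = row["org_name"].strip().lower()
    row["phone"] = "".join(ch for ch in row.get("phone", "") if ch.isdigit())
    return row

# Steps 4-5. Mastering + validation: cluster rows a reviewer would
# label as duplicates (here, naively, by exact normalized name)
def cluster(rows: list) -> list:
    key = lambda r: r["org_name"]
    return [list(g) for _, g in groupby(sorted(rows, key=key), key=key)]

# Step 6. Golden record: pick the most complete record per cluster
def golden(clusters: list) -> list:
    return [max(c, key=lambda r: sum(bool(v) for v in r.values()))
            for c in clusters]

rows = ingest("crm", [{"Org Name": " Acme ", "Tel": "555-0100"}]) \
     + ingest("erp", [{"Organization": "acme", "Phone Number": ""}])
records = golden(cluster([transform(r) for r in rows]))
print(records)  # [{'org_name': 'acme', 'phone': '5550100'}]
```

Every one of these functions hides decisions (which columns map where, what counts as a duplicate, which record wins) that, in the manual process, someone must specify and maintain by hand.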
Sounds relatively simple, right?
Now imagine that you have one million records, or more (we’ve seen it), and that you’re going through them in Excel spreadsheets (we’ve seen this, too).
Machine Learning to the Rescue
By automating data unification using machine learning wherever possible, the institution can break the logjam in getting to (and maintaining) that up-to-date real-world picture of its customers.
The Tamr system uses a machine-learning-driven process with a human-in-the-loop component. Once trained, Tamr models can automate ~80% of the unification work, invoking knowledgeable humans only to resolve disagreements between data records or resolve outliers by answering simple “yes or no” questions. Here’s how the Tamr system simplifies the time-intensive activities above:
Data Ingestion: Tamr works with semi-structured and structured data. It can take all of your datasets directly from the source, in whatever format.
Schema Mapping: Tamr can take your defined schema or data model, import it, and create links between the columns from your input datasets and unified-attribute target columns. (We can define a schema for you if you don’t already have one.) Provide Tamr with a couple of examples, two to three for each of your target unified attributes, such as “Organization” or “Phone Number.” For the rest of your datasets, Tamr models will learn from those examples, looking not just at the column name but also at any associated metadata or descriptions, as well as the actual data within those columns. Tamr can thus deduce that numbers separated by dashes are probably phone or fax numbers, that words with an @ sign are probably email addresses, and so on. If you provide examples from your first two datasets, Tamr models can use them to automate the mapping for the rest of your datasets in your KYC project.
Transformation: Tamr’s transformation engine, built on Spark, can easily handle full-scale data manipulations and transformation. So you can normalize and standardize your data when it comes in.
Mastering: Here, the Tamr system really shines with its ML-driven, human-guided approach. Tamr automatically asks subject matter experts (SMEs) to resolve differences between records and label data using simple “yes or no” questions. To get to a unified entity, our financial institution can modulate the process based on business goals and needs, creating a custom model. Tamr learns from the data signature of those labeled records or labeled pairs, builds a machine-learning model, and then applies it to the rest of the data.
SMEs need only provide a handful of responses; the machine does the rest. Creating models, a process that could have been excruciating and prolonged, is now easy, automated, and (most importantly) scalable across the KYC project.
Validation: Tamr provides a user interface for viewing record clusters and filtering them according to Tamr-generated confidence metrics based on the data signature, or according to other metrics of your choice. A sample metric might be whether the customer is a company, government, or NGO, each of which can have a different risk profile and different KYC compliance requirements. ML Alert: by validating the records most important to you and confirming that Tamr is clustering them properly, businesses can be confident that the model applies the same learning to the rest of their datasets. In other words, you can validate just 1% of your records; Tamr models then validate the remaining 99% properly. Anyone can open up Python and create an ML model, but only Tamr has the infrastructure to apply that model in a scalable, accountable way.
Tamr also creates lineage, capturing the internal logic, labeling and organizational knowledge behind your data unification. Has the business definition of a unique entity changed since last year? A new data steward can easily locate the history of a piece of data, correct the original labeling, maybe put in a comment and retrain the model. Complete visibility of the Tamr process allows businesses to keep their organizational knowledge in a single place instead of in the heads of individuals.
Golden Records: Tamr speeds the creation of golden records to any level of granularity, from required fields to desired metrics.
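As an illustration only (Tamr’s actual models are far more sophisticated), the human-in-the-loop idea above can be sketched in a few lines: a handful of expert “yes or no” answers on record pairs train a simple match model, which is then applied automatically to every remaining pair. All records and the single-feature model here are hypothetical:

```python
from difflib import SequenceMatcher

def similarity(a: dict, b: dict) -> float:
    # Single toy feature: string similarity of organization names
    return SequenceMatcher(None, a["org_name"], b["org_name"]).ratio()

def learn_threshold(labeled_pairs: list) -> float:
    # "Training": pick the midpoint between the lowest expert-confirmed
    # match score and the highest expert-confirmed non-match score
    yes = [similarity(a, b) for a, b, is_match in labeled_pairs if is_match]
    no = [similarity(a, b) for a, b, is_match in labeled_pairs if not is_match]
    return (min(yes) + max(no)) / 2

# A handful of expert answers to "are these the same customer?"
labeled = [
    ({"org_name": "acme corp"}, {"org_name": "acme corporation"}, True),
    ({"org_name": "acme corp"}, {"org_name": "zenith bank"}, False),
]
threshold = learn_threshold(labeled)

# The model now scores the remaining pairs with no further human input
pair = ({"org_name": "globex inc"}, {"org_name": "globex incorporated"})
print(similarity(*pair) > threshold)  # True
```

The point of the sketch is the division of labor: humans answer a few simple questions, and the learned model generalizes their judgment across the full dataset.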
As Tamr data ops engineers, we start by defining the questions that most matter to your business (metrics). We then train our models using the data yielded by your current unautomated data unification process, including applying those carefully developed rules where they are vital. For example, we applied the financial institution’s rules to their risk segmentation and classification/categorization steps, something not easily capturable by ML, while Tamr went solo for the entity-matching step, for which ML is perfectly suited. Tamr can augment your process with machine learning when it’s a good fit, and take a step back when it’s not.
Thanks to its human-guided machine learning, Tamr can let the data speak for itself, without the additional, time-consuming and potentially confusing layer of business rules and extensive, error-prone human involvement.
It’s About Time
KYC truly is in the eye of the beholder; it’s different for every company. Our financial institution above had very strict process requirements for risk assessment, and rules that were working for it. A retail company with a customer360 program might need only a deduplicated “golden record,” with some profiling of its customers overall, a particular customer, or its top 1,000 customers.
But the basic approach is generally the same. And a common thread is time and effort.
An ML-driven, human-in-the-loop approach to unifying customer data can save enormous time in data unification by reducing the amount of manual effort up front and by speeding availability of clean, current and correctly classified data to business analysts or applications, improving analytic outcomes.
Once you set up the technical processes aligned with your business and the metrics that matter using Tamr, it’s much easier to add new questions, new data sets, and so on. Once created, Tamr models run and update (increasingly autonomously) in the background.
Here are some of the results we’ve seen from applying this approach to the spectrum of data unification activities in various KYC programs:
- a 75% reduction in the manual effort involved in customer-data integration and delivery of clean data to the company’s next-generation analytics platform (health care)
- ingestion and profiling of 35 large data sources with 3.7 million rows of data to produce 325,000 clusters of customer records, all in less than six months (financial services)
- ability to onboard a new system from landing data to mastery in just 5-7 days and to create a new golden record in a maximum of two days (financial services)
- a 40% reduction of duplicative customer records to feed a customer360 program (manufacturing)
Whether you’re an established KYC user or a newbie, Tamr can help you understand what data you have, unify it across data silos to get “ground truth,” and then keep it continually updated using ML-driven updates to your data models. Fast. Truly giving you the Power of Now.