At Tamr we define “schema mapping” as taking one or more datasets with similar content but varying formats and structure and consolidating them into a single data model with a standard set of tables, columns, and formats. There is no shortage of tools for this, from SQL or Python code to Excel-based “mapping rules” specifications. And while this is manageable for a few sources of data, it quickly breaks down as you add more datasets and scale up the variety of sources and formats.
Use cases for large scale schema mapping usually fall into two buckets:
Retrospective: You need to standardize data from many legacy systems or projects.
Prospective: You want the ability to integrate new 3rd party datasets over time and reduce the incremental cost.
Problems with the Rules-Based Approach at Scale
In a large-scale project, a manual or rules-based approach to schema mapping tends to break down for several reasons.
As you add sources of data you end up writing and testing very similar code for each source and/or creating manual specs based on metadata. It’s simply impossible to anticipate all the variations of naming — including spelling, synonyms, abbreviations, conventions — that may be present across various data sources. Beyond that, there are a number of transformation tasks required to help data fit the target schema, such as combining fields, filtering out records, joining tables, and formatting values. While this can be done using rules, such rules are difficult to maintain, and adding new data sources or changing the target data model becomes very expensive.
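To make the brittleness concrete, here is a minimal sketch of what per-source mapping rules look like in practice. All source names, column names, and transforms below are hypothetical; the point is that every new source needs its own hand-written spec.

```python
# Hand-written, per-source mapping rules (illustrative names only).
# Every new source requires another entry like these, written and tested by hand.
SOURCE_RULES = {
    "crm_export": {
        "renames": {"CUST_NM": "customer_name", "PH": "phone"},
        "transforms": {"phone": lambda v: v.replace("-", "")},
    },
    "legacy_erp": {
        "renames": {"CustomerName": "customer_name", "Telephone": "phone"},
        # Same target field, yet a different cleanup rule for this source's format.
        "transforms": {"phone": lambda v: "".join(c for c in v if c.isdigit())},
    },
}

def apply_rules(source, record):
    """Map one raw record into the target schema using that source's rules."""
    rules = SOURCE_RULES[source]
    out = {}
    for raw_col, value in record.items():
        target = rules["renames"].get(raw_col)
        if target is None:
            continue  # unmapped columns are silently dropped
        fn = rules["transforms"].get(target)
        out[target] = fn(value) if fn else value
    return out

print(apply_rules("crm_export", {"CUST_NM": "Acme Corp", "PH": "617-555-0100"}))
# -> {'customer_name': 'Acme Corp', 'phone': '6175550100'}
```

Note that nothing here generalizes: a third source with a column named `Cust_Name` would match neither spec and would be dropped until someone writes yet another rule.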
Also, because the transformation work gets technical pretty quickly, you have to separate out the work: a subject matter expert (SME) defines the rules or specs (e.g. map these two fields to this field and apply this logic) and then a developer writes code to implement this logic in whatever language and make sure it scales. To some extent, this means the work is being done twice: writing pseudo-code in a spec and then the actual code for implementation. Because the SME cannot foresee all the complications in the data at design time, a lot of time is wasted going back and forth because neither person understands the whole process.
Why Standardization isn’t the (Complete) Solution
The logical thing to do here is to develop standards and enforce them at the source going forward. Standardization is great, but it tends to fall short for three reasons:
Like it or not you will get new sources of data. Your data scientists want to use 3rd party data. Your company will continue to do mergers and acquisitions.
Unless you are building all those source systems yourself, the systems you use will corrupt your beautiful standards in slight ways (e.g. system fields and naming conventions). You don’t want your standard data model to be tied to any given source system.
It’s inevitable that your standard data model will evolve over time, and the more manual this process is the harder it gets to make any changes.
So while standards are the right thing to do, you have to acknowledge that this doesn’t solve all of your problems.
Benefits of Machine Learning for Large Scale Schema Mapping
Machine learning can help you manage the mapping and transformation of many data sets into a common data model in a scalable way by:
Greatly reducing the time to add new sources of data
Enabling a small team to manage many data sources
Providing resiliency to changes in the target schema
Improving the quality of the data by letting subject matter experts do more
How Tamr Uses Machine Learning
Tamr uses machine learning to tackle the challenge of mapping many source datasets with similar content but different formats. Rather than writing and maintaining static rules, the software recommends mappings using a model trained on many example mappings. When new datasets are added to the system, a machine learning model classifies each source column by evaluating its similarity to all the previously mapped columns. Fuzzy matching of both metadata (name, description, etc.) and data values for each column makes the system very resilient to the small variations across sources that are so maddening in a rules-based approach.
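The core idea — scoring a new column’s similarity against previously mapped columns — can be sketched with simple string similarity. This is not Tamr’s actual model (which also uses data values and training feedback); it is a minimal, stdlib-only illustration, and all column names are made up.

```python
# Minimal sketch: recommend a target column for a new source column by
# fuzzy-matching its name against source columns mapped in the past.
from difflib import SequenceMatcher

def normalize(name):
    """Lowercase and strip separators so CUST_NM, cust-nm, and CustNm compare alike."""
    return "".join(ch for ch in name.lower() if ch.isalnum())

def recommend_mapping(new_column, known_mappings):
    """Return (target_column, score) for the most similar previously seen column."""
    best_target, best_score = None, 0.0
    for seen_source, target in known_mappings.items():
        score = SequenceMatcher(None, normalize(new_column), normalize(seen_source)).ratio()
        if score > best_score:
            best_target, best_score = target, score
    return best_target, best_score

# Mappings confirmed on earlier sources (hypothetical).
known = {"CUST_NM": "customer_name", "CustomerName": "customer_name", "PH_NUM": "phone"}

target, score = recommend_mapping("cust_name", known)
print(target, round(score, 2))  # recommends customer_name with a high score
```

A rules engine would have missed `cust_name` entirely unless someone had anticipated that exact spelling; the similarity-based approach degrades gracefully instead, which is what makes it resilient to naming variation.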
It’s important to understand that machine learning won’t solve this mapping challenge 100% of the time. So a core part of Tamr is the workflow for a subject matter expert (SME) to accept/reject/correct these recommendations, without needing to review every one of them. Additionally, once columns are mapped, the user can implement any needed transformations to help the data fully fit the target schema. This enables you to move from a 100% manual process to an 80-90% automated process without sacrificing quality.