Those of you who found your way to this blog are probably quite comfortable with the concept of ETL (Extract, Transform, Load) and its cousin ELT (Extract, Load, Transform). ETL means taking data stored in one location, transforming it for quality or normalization purposes, and loading it into another location. The value of this architectural pattern is the logical separation of processing steps. ETL, when properly implemented, enables anyone in the business to easily grab and immediately use data, without struggling through a complex database structure or quality issues that threaten the validity of analysis. Data transparency, accessibility, and quality are the core purposes of ETL. However, the most common ETL practices today, namely programmatic ETL, metadata-driven ETL, and Master Data Management (MDM) solutions, cannot alone cope with the rapidly growing and diversifying data landscape. To take full advantage of the opportunities of big data, organizations can use a probabilistic, algorithmic approach to data unification to improve the results obtained through current ETL processes.
ETL, like most software, has evolved and expanded since its inception. Traditional, code-driven ETL involves functions that read data, potentially change, normalize, or aggregate it, and determine join conditions for various data sets. These rules are often designed by a domain expert and implemented by an ETL developer based on assumptions about the data. For example, a join condition might state that when the first name, last name, and address of two records are the same, the records represent the same person (i.e., WHERE a.first_name = b.first_name AND a.last_name = b.last_name AND a.address = b.address). One drawback of programmatic ETL is that in order to add a new data source, the team has to create a unique set of rules specific to that source. This means that the cost to integrate each additional source continues to rise as the rules become more complex, intricate, and numerous. Another version, metadata-driven ETL, uses metadata such as database schemas, data formats and types, and primary key/foreign key relationships to automatically drive programmatic ETL. This design is fragile, as it depends on pre-existing knowledge of the data, and the metadata is often incomplete or out of date. With heterogeneous data especially, metadata becomes unsustainably complex.
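To make the fragility of a rule-based join concrete, here is a minimal Python sketch of the name-and-address rule described above. The record layout, the sample data, and the helper names (`normalize`, `rule_key`, `exact_join`) are assumptions made for illustration; they are not taken from any particular ETL tool.

```python
FIELDS = ("first_name", "last_name", "address")


def normalize(value):
    """Lowercase and collapse whitespace so trivial formatting differences do not block a match."""
    return " ".join(value.lower().split())


def rule_key(record):
    """Build the join key the rule compares: normalized first name, last name, and address."""
    return tuple(normalize(record[field]) for field in FIELDS)


def exact_join(source_a, source_b):
    """Pair records from two sources whose rule keys are identical."""
    index = {}
    for record in source_b:
        index.setdefault(rule_key(record), []).append(record)
    return [(a, b) for a in source_a for b in index.get(rule_key(a), [])]


# Hypothetical sample data from two sources.
crm = [
    {"first_name": "Ann", "last_name": "Lee", "address": "12 Main St"},
    {"first_name": "Bob", "last_name": "Ray", "address": "4 Oak Avenue"},
]
billing = [
    {"first_name": "ann", "last_name": "LEE", "address": "12  Main St"},
    {"first_name": "Bob", "last_name": "Ray", "address": "4 Oak Ave"},
]

matches = exact_join(crm, billing)
for a, b in matches:
    print(a["first_name"], a["last_name"], "<->", b["address"])
```

In this sample the rule links the two Ann Lee records but misses Bob Ray, because "Avenue" and "Ave" differ. Closing that gap means writing another rule, which is the rising per-source cost described above.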
MDM came on the scene with an additional focus on quality, and promised to provide a method and workflow to organize your data ecosystem and maintain data systems. The obstacle to a successful deployment is that effective MDM depends intensely on data quality for profiling and recommendations, and on expert knowledge of the data for manually creating rules and for entity resolution. This dependence on top-down rules and assumptions creates a system that is brittle in the face of data changes or additional sources, requiring constant manual maintenance. Even accurate rules only create more manual effort, as any attempt at entity resolution requires human input to validate. Even with the aid of data profiling, an MDM system can easily mismatch or ignore less-than-pristine data. Clearly, a tool that requires constant supervision and cannot work with data in its current state is doomed to failure, or, at the very least, to an expensive, time-consuming, and imprecise production.
The next evolution of ETL lies in understanding and responding to the realities of working with data at scale. Data is constantly changing and growing, and no one person has all the necessary knowledge. Rules don’t extrapolate to dynamic data. Rules don’t scale when linking hundreds or thousands of sources. Rules based on top-down assumptions about the data don’t scale across businesses where resources and data knowledge are as segregated as the data sources themselves.
Using a probabilistic, bottom-up approach places the bulk of the data unification effort on machine learning algorithms that look across all your data for signals. This allows for a more flexible and holistic approach to data unification, as the model uses the data in its entirety and extends efficiently to every additional data source. Experts provide feedback to the machine learning model by answering basic yes-or-no questions, which the model uses to learn and to estimate its own precision. As a result, the model stays flexible and precise where specific rules break. The experts don’t impose rules on the system; instead, the system learns holistically and probabilistically from the experts’ input in order to generate a model rather than a set of rules. This approach can partner with existing ETL and MDM systems, using the processes already in place as additional signals and making recommendations for appropriate ETL rules and transformations.
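As a rough illustration of learning from yes-or-no feedback instead of hand-written rules, the sketch below trains a tiny logistic regression on per-field similarity scores. It is a toy under stated assumptions, not a description of any product's actual algorithm: the fields, the sample answers, the use of `difflib.SequenceMatcher` as the similarity signal, and the names `features`, `predict`, and `train` are all chosen for illustration.

```python
import math
from difflib import SequenceMatcher

FIELDS = ("first_name", "last_name", "address")


def rec(first, last, address):
    """Build a hypothetical person record."""
    return {"first_name": first, "last_name": last, "address": address}


def features(a, b):
    """One similarity score in [0, 1] per field for a candidate pair of records."""
    return [SequenceMatcher(None, a[f].lower(), b[f].lower()).ratio() for f in FIELDS]


def predict(weights, bias, x):
    """Estimated probability that the pair with feature vector x is the same entity."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))


def train(labeled_pairs, epochs=2000, lr=0.5):
    """Fit a logistic regression to expert yes/no answers with plain gradient descent."""
    weights, bias = [0.0] * len(FIELDS), 0.0
    for _ in range(epochs):
        for a, b, is_match in labeled_pairs:
            x = features(a, b)
            error = predict(weights, bias, x) - (1.0 if is_match else 0.0)
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias


# Hypothetical expert answers to "are these two records the same person?"
expert_answers = [
    (rec("Ann", "Lee", "12 Main St"), rec("ann", "LEE", "12 Main Street"), True),
    (rec("Bob", "Ray", "4 Oak Avenue"), rec("Robert", "Ray", "4 Oak Ave"), True),
    (rec("Ann", "Lee", "12 Main St"), rec("Dan", "Cole", "88 Pine Rd"), False),
    (rec("Bob", "Ray", "4 Oak Avenue"), rec("Amy", "Lee", "7 Elm Ct"), False),
]

weights, bias = train(expert_answers)

# Score two unseen candidate pairs.
query = rec("Bob", "Ray", "4 Oak Avenue")
score_match = predict(weights, bias, features(query, rec("Bob", "Ray", "4 Oak Ave")))
score_other = predict(weights, bias, features(query, rec("Eve", "Moss", "301 Birch Ln")))
print(f"likely match: {score_match:.2f}, unlikely match: {score_other:.2f}")
```

Unlike an exact-match rule, the model returns a probability, so the "Avenue" versus "Ave" pair can still score as a likely match. In a fuller system, pairs whose scores fall near 0.5 would be natural candidates to send back to experts as new yes-or-no questions.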
Tamr takes the bottom-up, probabilistic approach to data integration. Machine learning algorithms perform most of the work, unifying up to 90% of available attributes and entities (person, place, thing, etc.) without human intervention. When human guidance is necessary, Tamr generates questions for data experts, aggregates their responses, and feeds them back into the system. This feedback enables Tamr to continuously improve its accuracy and speed.
The benefits of such an approach are numerous. Time to analysis can be greatly accelerated, as integration projects that previously took months to finish can be completed in just days or weeks. Further, the approach provides greater transparency into the available data sources and the data experts responsible for managing them.