Design Hubris in Data Hubs -- Why Variety Requires a New Design
Every data-driven organization wants its analysts to use accurate and extensive data, especially in a world defined by radical data variety. Tamr takes a probabilistic approach to curating data variety at scale, in a way that fits into the current enterprise data architecture. This post explains the lessons we’ve learned from traditional data variety systems, like the data hub, and how those fit into Tamr.
By now we’re all familiar with the data lake model: centralize all of the enterprise’s data to build new analytics and data sets. Data lakes are inexpensive, flexible data stores, but (as Jerry Held wrote recently in this blog) they don’t provide any capability to manage variety. As a result, some enterprises turn to data hubs, primarily to maintain data across systems. For instance, customer data is often spread across transactional, analytical, and application-specific systems, even when all of this data is dumped into a lake. The data hub is supposed to provide the capability to help keep customer records correct and up-to-date in and across each one of these systems.
While this is a critical capability for any enterprise, hubs rarely meet their own high expectations. The number one reason that these projects fail? Hubris in the design phase combined with rule-based tools used in implementation. Over and over again, decisions are made based on false assumptions about data variety only to find that, unsurprisingly, database administrators and analysts (even the mystical data scientists) can’t deterministically manage the massive complexity of enterprise data. We see this most blatantly in the MDM world, where 6-month projects are routinely re-estimated at 18 months. But, really, whenever data has to be unified across systems, homogeneity assumptions cause bad decisions.
Data variety exists because different applications actually require different models for the same data: there usually is no one-size-fits-all model for data across a whole organization. While the traditional top-down approach is unsustainable, the capability a data hub provides is very useful. It’s therefore worth reviewing the hub architectures and drawing analogies to Tamr.
Hub-like Architectures
Data hubs fall along a spectrum defined by two architectures, with the registry at one end and the repository at the other. The registry model contains only the keys needed to find all the related records in the different systems, plus the mappings between these keys. Implicitly, records are never cached. Instead, large, distributed queries are run in each of the different systems, enforcing transient integrity at run-time. The repository model, on the other hand, continuously maintains data records in a single DBMS. Both models have desirable qualities and are not mutually exclusive.
Registries are often preferred because there is no need to change the underlying data store. The registry hub can join records across each of the systems without ingesting these records and run data quality transformations, deduplication and rationalization in-place. Without persistence, though, it’s challenging to build a strong notion of correctness since you can’t version or track the lineage of individual records over time. Repositories are bulkier constructs, but allow for a much stronger sense of data fidelity. Usually applications and analytics are instrumented to consume the ‘golden’ data set from the repository. The repository’s challenge is coming up with a single data set that meets the requirements of all the possible downstream systems.
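To make the contrast concrete, here is a minimal sketch of the two models in Python. Everything in it is hypothetical and chosen for illustration: the in-memory `CRM` and `BILLING` dictionaries stand in for real source systems, and the `Registry` and `Repository` classes are not taken from any product.

```python
from dataclasses import dataclass, field

# Hypothetical source systems, each keyed by its own local ID.
CRM = {"c-17": {"name": "Acme Corp", "phone": "555-0100"}}
BILLING = {"b-902": {"name": "ACME Corporation", "balance": 1250.0}}
SOURCES = {"crm": CRM, "billing": BILLING}


@dataclass
class Registry:
    """Stores only cross-system keys; records are fetched at query time."""
    links: dict = field(default_factory=dict)  # entity_id -> {system: local_key}

    def link(self, entity_id, system, local_key):
        self.links.setdefault(entity_id, {})[system] = local_key

    def resolve(self, entity_id):
        # Fan out to every linked source system on each call; nothing is cached.
        return {system: SOURCES[system][key]
                for system, key in self.links[entity_id].items()}


@dataclass
class Repository:
    """Persists a consolidated record and keeps every prior version."""
    versions: dict = field(default_factory=dict)  # entity_id -> [record, ...]

    def upsert(self, entity_id, record):
        # Store a copy so later changes to the caller's dict don't alter history.
        self.versions.setdefault(entity_id, []).append(dict(record))

    def golden(self, entity_id):
        return self.versions[entity_id][-1]

    def history(self, entity_id):
        return list(self.versions[entity_id])


registry = Registry()
registry.link("acme", "crm", "c-17")
registry.link("acme", "billing", "b-902")

repository = Repository()
repository.upsert("acme", {"name": "Acme Corp", "phone": "555-0100"})
repository.upsert("acme", {"name": "Acme Corporation", "phone": "555-0100",
                           "balance": 1250.0})
```

The registry answers a lookup by querying the sources again each time, so it has no record of what the data looked like earlier. The repository can return both the current golden record and its history, at the cost of having to decide what that single record should contain.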
Most systems in practice follow a hybrid approach, managing only a limited set of important data and metadata with a system of top-down rules. For example, a financial data management repository might support the three most valuable, overlapping fields: Account Name, Number, and Branch. Rules may be applied to fix these columns across the sources, for example to ensure consistent formatting of Account Name. Such a system may also track the relationship keys of these accounts in the most critical operational data stores to ensure that the operational systems are using the same values as the repository. The problem with this design pattern is that it’s limited and static. It’s very challenging to maintain fidelity, even in just those top sources and fields, much less extend to less common or new fields that might hold the key to answering valuable analytical questions.
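As an illustration of why such rules are limited and static, here is a small sketch of a top-down formatting rule for Account Name. The suffix table and the `format_account_name` function are assumptions made up for this example, not rules from any real system.

```python
import re

# Hypothetical, hand-maintained rule table for legal suffixes.
LEGAL_SUFFIXES = {"inc": "Inc.", "corp": "Corp.", "llc": "LLC", "ltd": "Ltd."}


def format_account_name(raw: str) -> str:
    """Apply fixed rules: trim, collapse whitespace, normalize known suffixes."""
    words = re.sub(r"\s+", " ", raw.strip()).split(" ")
    formatted = []
    for word in words:
        key = word.lower().rstrip(".,")
        # Known suffixes get a canonical spelling; other words are capitalized.
        formatted.append(LEGAL_SUFFIXES.get(key, word.capitalize()))
    return " ".join(formatted)


cleaned = format_account_name("  acme   CORP ")
variant = format_account_name("ACME corporation")
```

Here `cleaned` becomes "Acme Corp." while `variant` becomes "Acme Corporation": the rule fixes the formatting it was written for, but the two values still disagree until someone adds another rule. Each new source or field tends to need the same kind of manual extension.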
Tamr’s Hybrid Architecture
Tamr also follows a hybrid architecture, the key difference being that Tamr embraces the variety and helps organizations take advantage of all their data, not just a limited set. By combining probabilistic machine learning with experts to manage data variety at scale, Tamr is able to extend the registry and repository model to tens or hundreds of data sources. Instead of just supporting the operationally valuable fields, Tamr stores and tracks all fields from all sources.
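The general idea of pairing a probabilistic score with expert review can be sketched as follows. This is not Tamr's algorithm: the string similarity from Python's standard `difflib` module stands in for a trained model, and the `accept` and `reject` thresholds are arbitrary values chosen for the example.

```python
from difflib import SequenceMatcher


def match_score(a: dict, b: dict) -> float:
    """Average string similarity over shared fields (stand-in for a learned model)."""
    shared = sorted(set(a) & set(b))
    if not shared:
        return 0.0
    sims = [SequenceMatcher(None, str(a[f]).lower(), str(b[f]).lower()).ratio()
            for f in shared]
    return sum(sims) / len(sims)


def triage(a: dict, b: dict, accept: float = 0.9, reject: float = 0.5):
    """Decide confidently where possible; route uncertain pairs to a person."""
    score = match_score(a, b)
    if score >= accept:
        return "match", score
    if score <= reject:
        return "non-match", score
    return "ask-expert", score


decision, score = triage({"name": "Acme Corp"}, {"name": "Acme Corporation"})
```

With these thresholds the pair above lands in the uncertain band, so `decision` is "ask-expert". In a real system the expert's answer would presumably be fed back as training data so that fewer pairs need review over time; that feedback loop is not shown here.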
At a large pharma client, for example, Tamr has built up a dynamic schema across thousands of Excel spreadsheets. It’s easy to add and remove fields as sources change or new sheets must be integrated. And instead of tracking the keys in the most critical operational systems, Tamr manages entities and all corresponding IDs, whether they’re in supported data stores, OLAP systems, or even external data sets. Finally, since we assume that data sources are constantly evolving, the attributes and keys are continuously tracked. That way the registry and repositories can easily be synchronized to upstream or downstream sources.
Whether you’re on-prem with a Cloudera Hadoop cluster, distributed with SAP HANA, or in the cloud with AWS Redshift (just to name a few), you still need a unified view across different systems. Even the most scalable storage and compute infrastructure needs data usability and fidelity. Having a comprehensive registry and repository means fewer one-off data integrations, more direct data flow, and new signals to drive impactful analytics. The result is a connected enterprise data ecosystem that is machine-driven and human-guided.
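One way to picture a dynamic schema is as a mapping from unified attributes to the source columns that feed them, which can grow or shrink as sources come and go. The sketch below is a toy under that assumption; the `DynamicSchema` class, the sheet names, and the column names are all hypothetical.

```python
from collections import defaultdict


class DynamicSchema:
    """Maps unified attribute names to the source columns that feed them."""

    def __init__(self):
        self.mappings = defaultdict(dict)  # attribute -> {source: column}

    def map_column(self, source, column, attribute):
        self.mappings[attribute][source] = column

    def drop_source(self, source):
        # Remove a retired source; attributes with no remaining feed disappear.
        for attribute in list(self.mappings):
            self.mappings[attribute].pop(source, None)
            if not self.mappings[attribute]:
                del self.mappings[attribute]

    def unify(self, source, row):
        # Translate one source row into unified attribute names.
        return {attribute: row[columns[source]]
                for attribute, columns in self.mappings.items()
                if source in columns and columns[source] in row}


schema = DynamicSchema()
schema.map_column("sheet_a.xlsx", "Compound ID", "compound_id")
schema.map_column("sheet_b.xlsx", "cmpd_no", "compound_id")
schema.map_column("sheet_b.xlsx", "Assay", "assay_name")

unified = schema.unify("sheet_b.xlsx", {"cmpd_no": "X-42", "Assay": "binding"})
```

Adding a new sheet is a matter of registering its columns, and retiring one removes only its own mappings, so the unified attributes follow the sources rather than being fixed up front.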