Taming the Hydra of Data Variety

Among Big Data challenges, Variety is the hardest to deal with through automation. Variety comes in the form of incompatible data formats, non-aligned data structures, and inconsistent data semantics. These are not merely a consequence of insufficient standards: although accidental data variety is real, an enormous portion of variety is intentional. Data from disparate systems is designed to meet the disparate needs of those systems. In other words, data variety is a human problem: it is created deliberately by humans as we tailor data to meet specific needs. It is not until we collect and re-purpose that data that it becomes ‘variety.’ Because of this, data variety is a hydra that will continuously regenerate, no matter how often we try to defeat it. To succeed with big data, we need to tame data variety, not defeat it. But how?

Master data management (MDM), data governance, and massive public data harmonization projects are meant to reduce data variety, not accommodate it. In each of these, the work of getting data from disparate sources to fit together, and of getting the results of big data analysis to fit with enterprise data, is left to ad-hoc, manual effort. Existing tools for bridging these gaps are excellent at enabling data technicians to explore and execute the necessary steps, but they still leave it to the technician to determine the right steps. Because this process of understanding and transforming data is human-driven, it becomes the bottleneck that limits the value of big data.

Since the problem of data variety has a human cause, it is fair to expect that taming data variety has a human solution. The trouble with existing approaches is not that they involve humans, but that they leave it to humans to figure out what to do – to drive the entire data curation process. Tamr’s approach to dealing with data variety automates as much of the process as possible, directly engaging data experts only when human input is necessary. We describe this system as machine-driven but human-guided, and it is this approach that enables Tamr to tame data variety. How does Tamr accomplish this?

As data is registered with the Tamr system, it is incorporated into a data inventory – a repository of all the varieties of data known to the system. Although this data inventory may seem similar to some crowd-sourced data inventory projects, the emphasis is on identifying the varieties of data actually observed and in use within an organization (although data exemplars can also be registered). And, unlike other models of crowd-sourced curation, the system engages many kinds of data experts – not just technicians – and automatically manages their workload so that maximum quality can be achieved with minimum effort. Furthermore, this effort not only results in a single collection of high-quality data, but also captures curation actions that can be incorporated into source systems, and it teaches the Tamr system how to work across the varieties of data, improving automation and accuracy when working with any related data. This automatic re-application of human effort and insight is the key to keeping up with data variety.
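To make the "machine driven, human guided" idea concrete, here is a minimal sketch of the general pattern. It is not Tamr's implementation: the class `AttributeMatcher`, the two thresholds, and the string-similarity scorer are all hypothetical stand-ins chosen for illustration. The machine decides the clear cases on its own, queues only the ambiguous ones for an expert, and remembers each expert answer so that the same question is never asked twice.

```python
from difflib import SequenceMatcher

# Hypothetical thresholds, chosen only for illustration.
AUTO_ACCEPT = 0.85   # at or above this score, the machine accepts the match
AUTO_REJECT = 0.40   # at or below this score, the machine rejects the match


def similarity(a: str, b: str) -> float:
    """Crude string similarity in [0, 1]; a stand-in for a learned model."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


class AttributeMatcher:
    """Decides whether two attribute names refer to the same thing."""

    def __init__(self):
        # Remembered expert answers: (source, target) -> True/False
        self.expert_decisions = {}
        # Ambiguous pairs waiting for an expert
        self.review_queue = []

    def match(self, source: str, target: str) -> str:
        key = (source.lower(), target.lower())
        # Re-apply earlier human input before doing anything else.
        if key in self.expert_decisions:
            return "match" if self.expert_decisions[key] else "no_match"
        score = similarity(source, target)
        if score >= AUTO_ACCEPT:
            return "match"
        if score <= AUTO_REJECT:
            return "no_match"
        # Only the uncertain middle band reaches a human.
        if key not in self.review_queue:
            self.review_queue.append(key)
        return "needs_review"

    def record_expert_decision(self, source: str, target: str, is_match: bool) -> None:
        key = (source.lower(), target.lower())
        self.expert_decisions[key] = is_match
        if key in self.review_queue:
            self.review_queue.remove(key)


if __name__ == "__main__":
    matcher = AttributeMatcher()
    print(matcher.match("cust_name", "customer_name"))   # ambiguous: queued for review
    matcher.record_expert_decision("cust_name", "customer_name", True)
    print(matcher.match("cust_name", "customer_name"))   # answered from the stored decision
```

In a real system the scorer would be a model retrained on the accumulated expert decisions rather than a fixed string comparison; the sketch only shows how human effort can be limited to the uncertain cases and then reused.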

Rather than limit data variety, the Tamr system accommodates it, providing transparency and automation to enable other systems to traverse the quagmire of variety and unleash the value of all data.