Written by Ihab Ilyas
Of the well-known challenges of Big Data, volume is the one we are getting better and better at handling: we buy more machines, we “shard” tables, and we even port solutions to clusters and MapReduce platforms. But it is the “Variety” dimension that is spoiling all the fun. Enterprises struggle even to define the entirety of their data inventory, much less make use of it. To cope with rising demands to diversify their businesses and cater to wider audiences, business units build their own silos (many of them), fragmenting enterprise-wide data assets into isolated, independently managed data reserves trapped behind application bars. This homegrown data variety only adds to the vast reserves of underutilized data that already exist within and outside the enterprise. It’s a typical case of being trapped in a “local minimum,” with a huge missed opportunity to enable new analysis that uses all the data and to drive innovation by fueling faster, better decisions. In short, tackling the variety problem is emerging as one of the most valuable capabilities of any data management solution.
Leverage All Data
Applications and analytics have been seen as the main (and sometimes the only) driver of which data should be curated (i.e., prepared, linked, cleaned, etc.). Only the data thought to be important to the upper layers of the intelligence stack gets ingested, massaged, linked and transformed in order to serve the driver application. The result is a highly application-centric stack that fails to realize the full value of available data. A quote from former HP CEO Lew Platt, all the way back in the mid-1990s, emphasized the need to listen to the data rather than to the application: “If only HP knew what HP knows, we would be three times more productive.” Today, an alternative approach that curates, links and leverages “all” data assets will not only serve pre-designed applications in a much richer way, but will also open the door to a whole new world of opportunities, applications and even businesses.
From Data Variety to Data Opportunity
Leveraging the large number of siloed and untapped reserves of data calls for enterprise-scale curation and integration of heterogeneous sources with different schemas and semantics, and very likely with substantial redundancy, duplication and inconsistency.
Fortunately, all enterprises have stewards and custodians who best understand the data, the unwritten semantics embedded in its structure, and the consequences of updating it. And we now have machines that celebrate redundancy and employ it very efficiently to extract high-quality facts. Machine learning techniques, for example, leverage redundancy in data to learn models that classify facts, link related entities, or even prescribe data curation and repair procedures.
Unfortunately, humans and machines have historically had only limited success in working together to curate and integrate data at scale. Human experts are scattered across the enterprise, with very fragmented expertise and even less capacity to deal with large volumes. And while machines can munch on large amounts of data, they have limited insight into data semantics and cannot make final decisions on updating mission-critical data. Past hybrid human-machine solutions have had programmers writing, for example, data transformation scripts or tuning the parameters of automated solutions. While these IT personnel best understand the machine, they have very little exposure to the data and are definitely the wrong people to update enterprise data.
Tamr’s approach to collaborative curation combines powerful machine learning algorithms with collective human insight to identify sources, understand relationships and connect the massive variety of siloed data. Tamr’s system employs machines to consume a large number of data signals from all available sources across silos and come up with the best possible curation suggestions. Data experts are then repeatedly and asynchronously pulled into the curation process to make final update decisions, to provide hints and semantics, or to request more evidence from the machine.
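To make the division of labor concrete, here is a minimal Python sketch of a “machine suggests, human decides” loop. It is an illustration only and not Tamr’s actual system or algorithm: the names `Suggestion`, `suggest_matches`, `curate` and `ask_expert` are hypothetical, and a simple string-similarity score from the standard library stands in for the machine learning models described above.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from itertools import combinations


@dataclass(frozen=True)
class Suggestion:
    """A machine-generated proposal that two records refer to the same entity."""
    left: str
    right: str
    score: float  # string similarity in [0, 1]; a stand-in for a learned model


def suggest_matches(records, min_score=0.5):
    """Machine step: score every pair of records and keep the plausible matches.

    Assumption: case-insensitive string similarity is the only signal used.
    """
    suggestions = []
    for left, right in combinations(records, 2):
        score = SequenceMatcher(None, left.lower(), right.lower()).ratio()
        if score >= min_score:
            suggestions.append(Suggestion(left, right, score))
    # Strongest suggestions first, so experts see the best evidence early.
    return sorted(suggestions, key=lambda s: s.score, reverse=True)


def curate(suggestions, ask_expert):
    """Human step: an expert makes the final decision on each suggestion.

    `ask_expert` is a hypothetical callback returning True to accept a match.
    """
    return [(s.left, s.right) for s in suggestions if ask_expert(s)]


if __name__ == "__main__":
    records = ["Acme Corp", "ACME Corp", "Globex"]
    proposed = suggest_matches(records)
    # Here the "expert" simply accepts everything the machine proposes.
    print(curate(proposed, ask_expert=lambda s: True))
```

In this sketch the machine never updates anything on its own; it only ranks candidates, and every accepted link passes through the `ask_expert` callback, which in practice could be an asynchronous review queue rather than an inline function.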
Tamr’s “machine driven, human guided” approach promises an impact well beyond the efficiencies gained from connecting and enriching the full variety of an enterprise’s data quickly and cost-effectively. If enterprises can really leverage all their data and “come to know what they already know,” they’ll have decision and innovation engines capable of driving significant enterprise value.