Taming Data Variety

Taming Data Variety: Webinar Abstract

Over the past 20 years, companies have invested an estimated $3-4 trillion in IT systems to automate and optimize key business processes. These systems, which are largely dedicated to a single business function or geography, generate enormous amounts of disparate data that is typically stored in one or more data lakes or warehouses.

Now, with billions being invested in Big Data storage and access and next-generation analytics platforms, companies are beginning to analyze the data stored in these centralized systems. However, the variety of data collected leads to natural silos, which are rapidly becoming a bottleneck for analysis. Organizations are quickly discovering that data lakes may make information easier to manage by placing it in one location, but without proper attention to curation, these lakes can turn into expensive, unproductive data swamps.

During a 30-minute webinar, join data-industry veteran Andy Palmer as he discusses how enterprise organizations are leveraging new approaches to delivering the cleanest, widest view of data to downstream analytic tools.

Taming Data Variety: Q&A From the Webinar

Q: What types of data does Tamr work with? Structured? Unstructured?

Tamr works across your full range of structured and semi-structured data sources, whether large or small, internal or external. For unstructured data, Tamr typically leverages partners to add structure to unstructured sources when necessary.

Details:

  • Tabular data formats, such as CSV or XLSX
  • Relational databases (via JDBC). Tamr leverages PK/FK relationships between your data tables to enhance the quality of your enriched data.
  • Semi-structured data, such as JSON, XML and YAML (support for many use cases now, full support coming)
  • RDF
  • Full text (via preprocessing into RDF or semi-structured data)

You may want to download our technical overview and datasheet for more information.

Q: What data formats can be accessed by Tamr? Does Tamr require relational databases, APIs etc… or could it be used for a large number of flat files/CSVs?

Tamr can access any data that’s tabular in nature via its set of RESTful APIs. In one customer engagement, Tamr was used to unify the attributes and records contained in thousands of CSV files. These CSVs are often used to enrich relational sources like an EDW (e.g. an Oracle Database).
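To make the idea of unifying attributes across many flat files concrete, here is a minimal sketch in Python. It is not Tamr's API or method: the canonical schema, the function names, and the 0.6 similarity threshold are all assumptions chosen for illustration, and simple string similarity from the standard library stands in for whatever matching a real system would use.

```python
import csv
import io
from difflib import SequenceMatcher

# Hypothetical target schema, invented for this example.
CANONICAL = ["supplier_name", "city", "country"]


def normalize(name):
    """Lowercase a header and turn separators into underscores."""
    cleaned = "".join(ch if ch.isalnum() else "_" for ch in name.strip().lower())
    return cleaned.strip("_")


def map_header(header, threshold=0.6):
    """Return the most similar canonical attribute, or None if none is close enough."""
    best, best_score = None, 0.0
    for target in CANONICAL:
        score = SequenceMatcher(None, normalize(header), target).ratio()
        if score > best_score:
            best, best_score = target, score
    return best if best_score >= threshold else None


def unify(csv_texts):
    """Read several CSV documents and emit rows that share the canonical schema."""
    rows = []
    for text in csv_texts:
        reader = csv.DictReader(io.StringIO(text))
        mapping = {h: map_header(h) for h in (reader.fieldnames or [])}
        for record in reader:
            unified = dict.fromkeys(CANONICAL)  # unmapped attributes stay None
            for source, target in mapping.items():
                if target is not None:
                    unified[target] = record[source]
            rows.append(unified)
    return rows


file_a = "Supplier Name,City,Country,Notes\nAcme,Boston,US,preferred\n"
file_b = "supplier,CITY,country_code\nGlobex,Paris,FR\n"
unified_rows = unify([file_a, file_b])
```

In this sketch, columns that do not resemble any canonical attribute (such as Notes) are dropped. A production system would more likely flag them for review than discard them silently.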

Our technical whitepaper and datasheet contain additional information.

Q: Is this a continuous monitoring process or is there a scheduled process to determine what everything is? What sort of overhead does this add to the environment?

Tamr continuously unifies attributes and records in the background as data changes or new sources come online. This process does not slow down existing operations.

More information can be found in our technical whitepaper.

Q: What customers do you have using Tamr today? What are the most common use cases you are seeing?

A few of Tamr’s customers include General Electric, Toyota Motor Europe, Roche, Novartis, and Thomson Reuters. Some common use cases we have seen include Procurement Optimization (unifying supplier records and attributes) and CDISC Conversion (converting and packaging clinical study data for FDA submission).

Generally, we see that large enterprises have an enormous wealth of data that they are unable to leverage for analytics, as traditional rule-based approaches to data integration like ETL struggle to scale to their full range of available sources. According to some analysts, many of these enterprises are leaving 80% or more of their available data on the table.

Tamr offers a scalable, enterprise-grade solution to that problem. Using a bottom-up, probabilistic approach to data unification, Tamr leverages machine learning to automatically match attributes and entities across your full range of data sources, often accomplishing 90% of the task without human intervention. When human guidance is necessary, Tamr generates questions for data experts and feeds their answers back into the system. This feedback enables Tamr to continuously improve its accuracy and speed over time, whereas a rule-based approach becomes more fragile as additional sources are introduced.
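The general pattern described here (match automatically when confidence is high, ask an expert when it is not, and reuse the answer) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Tamr's implementation: the `Matcher` class, its thresholds, and the use of string similarity as a confidence score are hypothetical, and storing expert answers in a lookup table is a simple stand-in for retraining a machine learning model.

```python
from difflib import SequenceMatcher


class Matcher:
    """Toy entity matcher with an expert-feedback loop (illustrative only)."""

    def __init__(self, accept=0.9, reject=0.5):
        self.accept = accept  # at or above this score: match automatically
        self.reject = reject  # below this score: reject automatically
        self.labels = {}      # expert answers, keyed by unordered name pair

    @staticmethod
    def _key(a, b):
        return frozenset((a.lower(), b.lower()))

    def decide(self, a, b):
        """Return 'match', 'no_match', or 'ask_expert' for a pair of names."""
        key = self._key(a, b)
        if key in self.labels:  # an expert has already answered this pair
            return "match" if self.labels[key] else "no_match"
        score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if score >= self.accept:
            return "match"
        if score < self.reject:
            return "no_match"
        return "ask_expert"

    def record_answer(self, a, b, is_match):
        """Feed an expert's answer back so the same question is not asked again."""
        self.labels[self._key(a, b)] = is_match


matcher = Matcher()
first = matcher.decide("General Electric", "General Electric Co.")
matcher.record_answer("General Electric", "General Electric Co.", True)
second = matcher.decide("General Electric Co.", "General Electric")
```

With these thresholds the first call falls in the uncertain band and is routed to an expert. After the answer is recorded, the same pair is resolved without asking again, regardless of argument order.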

More information on how this works can be found in our technical whitepaper.