Data curation is the cornerstone of modern business intelligence pipelines. Data must first be “Cataloged” and “Connected” before it can be “Consumed” by applications and analytics. In this paper, we describe these three tasks and how the Tamr platform addresses their engineering and algorithmic challenges.
Data must first be “Cataloged and Profiled”, then “Cleaned and Connected”, before it can finally be “Published and Consumed” by applications and analytics. Although intuitive, these three functions correspond closely to three fundamental and notoriously difficult technical challenges in the data cleaning and integration literature: data profiling; schema integration and record linkage; and information exchange and data consolidation.
At the front line of incoming data from the various data sources is the task of ingesting and registering a data source into the enterprise data repository. In reality, data integration is “never done”: new data sources and new data records arrive regularly and continue to challenge the current “global schema.”
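To make the cataloging and profiling step concrete, the following is a minimal sketch of what registering a source might record about each attribute. It is an illustration under our own assumptions, not a description of Tamr's implementation: the function names `profile_column` and `profile_source`, the list-of-dicts input format, and the choice of statistics are all hypothetical.

```python
from collections import Counter


def looks_numeric(value):
    """Return True if the value can be parsed as a float."""
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False


def profile_column(values):
    """Summarize one attribute: size, missing values, distinct values, rough type."""
    non_null = [v for v in values if v is not None and v != ""]
    all_numeric = bool(non_null) and all(looks_numeric(v) for v in non_null)
    return {
        "count": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(set(non_null)),
        "inferred_type": "numeric" if all_numeric else "text",
        "top_values": Counter(non_null).most_common(3),
    }


def profile_source(rows):
    """Profile every attribute of a source given as a list of dicts (hypothetical format)."""
    columns = sorted({key for row in rows for key in row})
    return {col: profile_column([row.get(col) for row in rows]) for col in columns}


if __name__ == "__main__":
    # Illustrative records only; not drawn from any real source.
    source = [
        {"name": "Acme", "revenue": "10"},
        {"name": "Acme", "revenue": ""},
        {"name": "Beta", "revenue": "12.5"},
    ]
    for column, stats in profile_source(source).items():
        print(column, stats)
```

Because new sources keep arriving, a profile of this kind would be recomputed per source at registration time rather than once for a fixed global schema.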
The connection layer is responsible for establishing connections among data attributes and features, and among data entities and records. These connections provide the glue that enables multiple consumption models over the same data repository and allows applications and analytics to define their own “global” schema much later in the processing pipeline.
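As a rough illustration of connecting records, the sketch below links records whose name fields share most of their tokens. It assumes token-level Jaccard similarity and a fixed threshold purely for exposition; the paper does not specify this technique, and the names `jaccard`, `link_records`, and the `threshold` parameter are hypothetical.

```python
from itertools import combinations


def tokens(text):
    """Lowercase the text and split it into a set of word tokens."""
    return set(text.lower().replace(",", " ").replace(".", " ").split())


def jaccard(a, b):
    """Jaccard similarity of the token sets of two strings (0.0 if both are empty)."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def link_records(records, key, threshold=0.5):
    """Return (id_a, id_b, score) for every pair whose `key` similarity meets the threshold."""
    links = []
    for (id_a, rec_a), (id_b, rec_b) in combinations(records.items(), 2):
        score = jaccard(rec_a[key], rec_b[key])
        if score >= threshold:
            links.append((id_a, id_b, round(score, 2)))
    return links


if __name__ == "__main__":
    # Illustrative records from two imagined sources.
    records = {
        "a1": {"name": "Acme Corp."},
        "b7": {"name": "ACME Corp"},
        "c3": {"name": "Beta Industries"},
    }
    print(link_records(records, key="name"))
```

The pairwise comparison is quadratic in the number of records, so a sketch like this only suits small inputs; it is meant to show the kind of connection being produced, not how to compute it at enterprise scale.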
Connections among the various input schemas and records enable a diverse set of consumption models, including: (1) pushing these connections into the organization’s ETL solution; (2) enriching existing data sources with new connections and mappings discovered by Tamr; and (3) creating a consolidated view of the data, with the most prevalent global schema and unified data records that match the enterprise’s existing warehouse.
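The third consumption model, a consolidated view, can be sketched as merging each group of linked records into a single unified record. The example below assumes a simple majority-vote rule per attribute, which is our own simplification and not necessarily how Tamr consolidates data; the function name `consolidate` and the record layout are hypothetical.

```python
from collections import Counter


def consolidate(cluster):
    """Merge linked records, keeping the most frequent non-empty value per attribute."""
    unified = {}
    attributes = sorted({attr for record in cluster for attr in record})
    for attr in attributes:
        values = [r[attr] for r in cluster if r.get(attr) not in (None, "")]
        if values:
            # most_common(1) returns the single most frequent value and its count.
            unified[attr] = Counter(values).most_common(1)[0][0]
    return unified


if __name__ == "__main__":
    # Three illustrative records already linked to the same entity.
    cluster = [
        {"name": "Acme Corp", "city": "Boston"},
        {"name": "Acme Corp", "city": ""},
        {"name": "ACME", "city": "Boston", "phone": "555-0100"},
    ]
    print(consolidate(cluster))
```

Attributes present in only some sources, such as `phone` here, still appear in the unified record, which is one way the same connections can also serve to enrich existing sources.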