Written by Paul Courtney
Origin story of FAIR
We now have technologies that can store massive amounts of data in data warehouses and data lakes. And yet, one of the most vexing problems our customers still face is actually finding and using the data they need to conduct business efficiently. Data assets are often developed locally, with formats and terminologies known only to their direct owners, so they end up separated by distance and technology. Even as data centers have expanded, organizations have not pooled their data; they have merely created larger, still disconnected, data silos. This stems from an old mindset in which a data asset is created for a single purpose, with no thought or planning given to its further use.
The problem of siloed data has persisted in the science community despite efforts over the last 15 years to identify and address its root causes, both technical and organizational. However, an initiative that began at a meeting in 2014 led to the publication in 2016 of a set of principles, called FAIR, that seems to be making progress.
The short version of the FAIR data principles is:
- Findable – can you locate a data asset as easily as you can a library book using the electronic card catalog?
- Accessible – if you know where the asset is, can you use a standard internet browser to navigate to a URL, download the asset and view it without requiring special software?
- Interoperable – does the data use standard terminologies and formats?
- Reusable – if the answers to the previous questions are “Yes”, then the data asset is reusable.
These four “pillars” are expanded into 15 FAIR data principles and 14 metrics, along with some examples.
Rich metadata (data describing data) is a key aspect of “FAIRifying” data. For instance, for data to be Findable, it is not enough to simply locate the data asset; metadata about the asset should describe the context, quality and characteristics of the data. It is important to know something about the quality of the data (completeness, when it was last updated, which formats are used), how the data asset was created (if the dataset resulted from data integration, what were the original sources?) and perhaps who the owner is, in case one has questions about the data. The metadata can then be stored in a data catalog to inform users in their search for data.
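To make the idea of rich metadata concrete, here is a minimal sketch of what a metadata record for a data asset might look like before it is handed to a data catalog. The class name `AssetMetadata`, its field names and the sample values are all hypothetical, chosen only to mirror the items listed above (quality, provenance, owner); they do not follow any particular catalog's schema or any published FAIR metadata standard.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class AssetMetadata:
    """Hypothetical catalog record; field names are illustrative only."""
    identifier: str        # stable ID a user can search for (Findable)
    title: str
    owner: str             # who to contact with questions about the data
    last_updated: str      # ISO 8601 date of the last refresh
    formats: list[str]     # file or serialization formats used
    completeness: float    # fraction of non-null values, 0.0 to 1.0
    sources: list[str] = field(default_factory=list)  # original inputs, if integrated


def to_catalog_json(meta: AssetMetadata) -> str:
    """Serialize the record as JSON so a catalog could ingest it."""
    return json.dumps(asdict(meta), indent=2, sort_keys=True)


# Sample values are made up for illustration.
record = AssetMetadata(
    identifier="asset-0001",
    title="Unified customer list",
    owner="data-steward@example.com",
    last_updated="2020-01-15",
    formats=["CSV"],
    completeness=0.97,
    sources=["crm_export", "billing_export"],
)

print(to_catalog_json(record))
```

A record like this is deliberately small; a real catalog would likely require additional fields such as a license, an access URL or controlled vocabulary terms.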
The Role of Tamr Unify in FAIRification
Tamr Unify can capture information about the input data sources as part of the ingestion process for upstream provenance.
Tamr also has metadata available about its own internal processes:
- Schema mapping from input data sources to the target schema
- Any transformations applied to the data to create the unified dataset
For data quality:
- Data profile and quality measures for key attributes of the unified dataset, such as percentage null, most frequent values, maximum and minimum
- Statistics about Mastered data, such as the percentage of labeled pairs, and Tamr’s Precision, Recall & Accuracy
- Statistics about Classified data, such as average confidence, total verified records, percentage verified records
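For readers unfamiliar with these measures, the sketch below shows how such data quality statistics are commonly computed. It is a generic illustration using the standard definitions of precision, recall and accuracy, not a description of how Tamr Unify calculates them. The function names `profile_column` and `pair_metrics` and the sample inputs are hypothetical.

```python
from collections import Counter


def profile_column(values, top_n=3):
    """Profile one attribute: percentage null, most frequent values, min, max."""
    total = len(values)
    present = [v for v in values if v is not None]  # treat None as null
    percent_null = 100.0 * (total - len(present)) / total if total else 0.0
    return {
        "percent_null": percent_null,
        "most_frequent": Counter(present).most_common(top_n),
        "minimum": min(present) if present else None,
        "maximum": max(present) if present else None,
    }


def pair_metrics(tp, fp, fn, tn):
    """Precision, recall and accuracy from counts of labeled record pairs.

    tp/fp: pairs predicted as matches that are / are not true matches.
    fn/tn: pairs predicted as non-matches that are / are not true matches.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    return {"precision": precision, "recall": recall, "accuracy": accuracy}


# Made-up inputs for illustration.
profile = profile_column([10, None, 25, 10, 40])
metrics = pair_metrics(tp=8, fp=2, fn=2, tn=88)
print(profile)
print(metrics)
```

Storing the resulting dictionaries alongside the dataset gives a catalog user a quick view of how complete and how trustworthy the asset is.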
For data description:
- Upon ingestion of input data sources, the characteristics of those sources, such as curator, last time updated, and terminologies and standards used, can be made part of the input dataset metadata
The output dataset from Tamr Unify is placed in the data repository as a data asset, and its metadata is captured and made available to a data catalog for ingestion. This metadata is exactly what is needed to FAIRify the data asset and “facilitate knowledge discovery by assisting humans and machines in their discovery of, access to, integration and analysis of, task-appropriate…data and their associated algorithms and workflows.” As you can see, all kinds of data can be FAIRified: not just life science data, but also mastered vendor lists, classified spare parts and mastered customer lists.