Why is data curation undervalued now? (But it won’t be in the future…)

Last week, James Malone had a great post on the topic of data curation. He raised a number of great points about the technical and institutional challenges, and about the difficulty of quantifying the value and urgency of data curation. We’ve come a long way in putting analytics front and center in discovering new therapeutics, but it’s clear we still have work to do to solve the “data” problem.

I’ve been working in the biopharma field for the past 20+ years, first as a bench scientist, then migrating to the computational side some 15 years ago during the initial boom of genomics. In the early days, experimental data was collected and could be managed with tools like Excel. Little effort was spent on making the raw or processed data widely available; the primary focus was on making new discoveries and publishing papers. Over the past few years, large-scale projects like the Encyclopedia of DNA Elements (ENCODE) and The Cancer Genome Atlas (TCGA) have generated vast amounts of readily accessible data. With projects like these, the effort has gone into processing and into efficient, thorough analytics. With the accumulation of both small-scale and large-scale experiments, the next step for many of these organizations will be to combine the two and deliver clean, curated data across both.

Combining information from various sources helps us better understand the gaps and overlaps in scientific knowledge, which is especially important as the complexity of information grows. The effort and expense of generating such data is only as useful as researchers’ ability to find, access, and integrate it. At large R&D organizations, scientists may not even be aware of research being done within their own company (much less externally). Curating effectively is an opportunity to improve both scientific efficiency and governance, and even to change the method of scientific discovery.

At Tamr, we’ve been working with several biopharmas on curating public data repositories from EBI and NCBI, each of which contains thousands of studies. Even though all the data is tabular, it’s still very challenging to integrate: the metadata doesn’t clearly align, and some studies have complex designs with slight variations. The cost of doing this curation manually with traditional tools would skyrocket. Instead, we use a machine-driven, human-guided approach to automate the majority of the curation tasks: mapping attributes and ontologies, and reconciling data. The end result is a repository of curated sources mapped to a common unified schema using known ontologies and preferred terms.
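To make the idea concrete, here is a minimal sketch of what machine-driven, human-guided attribute mapping can look like. This is an illustrative toy in Python, not Tamr’s actual system: the unified schema, the synonym lists, the AUTO_ACCEPT threshold, and the helper functions are all assumptions made up for the example.

```python
# A minimal sketch of machine-driven, human-guided attribute mapping.
# Illustrative only -- the schema, synonyms, and threshold are assumptions.
from difflib import SequenceMatcher

# Unified schema: preferred term -> known synonyms (e.g. from an ontology).
UNIFIED_SCHEMA = {
    "organism":  ["species", "taxon", "organism name"],
    "tissue":    ["tissue type", "organ", "body site"],
    "sample_id": ["sample", "sample name", "biosample accession"],
}

AUTO_ACCEPT = 0.85  # confidence above which a mapping is applied automatically


def similarity(a: str, b: str) -> float:
    """Normalized string similarity between two attribute names."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def map_attribute(source_attr: str):
    """Propose the best unified-schema target for a source attribute.

    Returns (target, score, needs_review): proposals scoring below
    AUTO_ACCEPT are routed to a human curator instead of auto-applied.
    """
    best_target, best_score = None, 0.0
    for target, synonyms in UNIFIED_SCHEMA.items():
        for candidate in [target] + synonyms:
            score = similarity(source_attr, candidate)
            if score > best_score:
                best_target, best_score = target, score
    return best_target, best_score, best_score < AUTO_ACCEPT


# Example: map the columns of one study's metadata table.
for column in ["Organism Name", "BodySite", "smpl"]:
    target, score, review = map_attribute(column)
    status = "human review" if review else "auto-mapped"
    print(f"{column!r:>16} -> {target!r} (score={score:.2f}, {status})")
```

In practice the matching model would be learned from curator feedback rather than simple string similarity, but the division of labor is the same: the machine proposes mappings at scale, and humans review only the low-confidence cases.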

“Biology today needs a robust, expressive, computable, quantitative, accurate and precise way to handle data. It is time to recognize that biocuration and biocurators are central to the future of the field.”
– “Big Data: The Future of Biocuration,” Nature (2008)

To learn more about how Tamr is working with biopharmas on data curation, register for a demo.