Written by Paul Courtney
After a successful early career in R&D in Silicon Valley, I spent 12 years working as a carpenter.
This may sound like a big U-turn. But while I loved the intellectual side of science, I really loved the people side of construction. I got to build something, turning raw materials into gratifying, highly visible results: houses that enabled life and buildings that enabled commerce.
Fast-forward to 2019. I get the same kind of rush daily as lead data-ops engineer for Life Sciences at Tamr.* And for good reason.
While the people, processes and tools involved in data pipelining obviously differ, the building method and life-changing results are similar. Modern data pipelines help turn the raw materials of life (clinical research data) into analytic outcomes that, in turn, speed the availability of new life-enhancing and -saving therapies.
In Life Sciences, we have never had more or better raw materials to work with. There are oceans of data continuously generated by R&D researchers and clinicians; new tools and processes for capturing, organizing and sharing data, such as the cloud, data lakes, AI/ML and Data Ops; and new roles (such as data scientists) that help turn diverse data into “finished goods”: data usable by many different people and analytic models.
But some critical building processes haven’t kept up: specifically, data unification. With all these assets, success in Life Sciences hinges on access to clean and continuously updated data.
New Construction Method Required
Current data-unification methods, namely traditional rules-based Master Data Management (MDM) and Extract-Transform-Load (ETL), can’t cope with the variety of structured data involved in Life Sciences. That variety is further complicated by growing data volume and volatility.
Too often, the result is that significant resources must be devoted to cleaning “dirty” data, which slows R&D, delays drug approvals and impedes scientific discovery.
Using rules-based methods, programmers engage in an endless cycle of iteration. As data is created or changes, they must collect business rules from subject-matter experts (SMEs), draft each new rule, have it reviewed by the SME, revise it until it is validated, and then deploy the code. The process is time-consuming, people-bound and error-prone.
But what if you could apply AI/machine learning to this process? ML models could do much of the work, making much of the programming unnecessary. Once created, the models automatically compare incoming data against what they have learned and update it as changes occur; data conflicts and outliers are resolved through an integrated, human-in-the-loop component. This is the philosophy behind the Tamr system.
Data unification becomes more scalable when you can use ML to largely automate tasks like classification, schema mapping and entity mastering, and to clean and harmonize data across life-science entities such as assays, biomarkers and clinical studies to populate translational data repositories.
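To make the human-in-the-loop idea concrete, here is a toy sketch (not Tamr’s actual algorithm, and with invented record names): a crude string-similarity score stands in for a trained matching model, pairs above a high threshold are accepted automatically, and pairs in the uncertain middle band are routed to an SME review queue.

```python
from difflib import SequenceMatcher

# Hypothetical assay records from two source systems; the names and
# fields are illustrative only.
records_a = [{"id": "A1", "assay": "IL-6 ELISA plasma"}]
records_b = [{"id": "B1", "assay": "Interleukin-6 ELISA (plasma)"},
             {"id": "B2", "assay": "Creatinine, serum"}]

AUTO_MATCH = 0.85   # accept the pair with no human involvement
REVIEW_BAND = 0.50  # between this and AUTO_MATCH: ask an SME

def score(a, b):
    """Crude string similarity standing in for a trained ML matcher."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

auto, review, reject = [], [], []
for ra in records_a:
    for rb in records_b:
        s = score(ra["assay"], rb["assay"])
        pair = (ra["id"], rb["id"], round(s, 2))
        if s >= AUTO_MATCH:
            auto.append(pair)
        elif s >= REVIEW_BAND:
            review.append(pair)   # human-in-the-loop queue
        else:
            reject.append(pair)
```

Here the IL-6 pair lands in the review queue and the creatinine pair is rejected; as SMEs validate queued pairs, a real system would feed those answers back as training data so the thresholds cover more of the data automatically.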
By taking this approach using the Tamr system, GSK created unified R&D data lakes from fragmented research and clinical study domain silos, and built robust data pipelines processing over 10 billion records a day. GSK mapped over 40,000 datasets across 1,000 studies to 36 different SDTM domains.
How It Works
Modern data unification first requires deeply understanding both the data (the raw material) and the desired output (the “data houses” that our customers want/need). This is basically the “custom plumbing” that exists in every company. There’s no “silver hammer” for this step. Tamr data engineers can help here (and we learn a lot with each new customer).
With your foundation and infrastructure in place, you next build data models that will eventually run as independently as possible, reducing constant rules-juggling, manual classification or hands-on schema updating. The Tamr system essentially uses AI/ML to clean the “low-hanging fruit” data (~80%), automatically involving knowledgeable humans when necessary for resolving unclear data relationships or outliers. The results eventually get fed into your data models, which get smarter over time as SMEs train Tamr and work more efficiently with Tamr running in the background.
An important point: Just as you still need to occasionally employ electricians to keep your new house running (unless you have a death wish) or painters to keep it marketable, you will always need people to keep your data models optimized and running. For instance, if you introduce new input datasets that differ significantly from existing ones, the schema-mapping and transformation models Tamr was trained on will need retraining. And your current, time-tested tools (like MDM, with its carefully developed rules) may not go away; they’ll simply work better and more efficiently with cleaner data.
But it’s all about keeping the right people involved and doing the highest-value things with your data. So, for example, we give data scientists an interface where they can train models on critical data assets, reconfigure machine-learning parameters and monitor the results. Subject-matter experts get an interface that makes it simple and fast to validate a piece of data or resolve a conflict between two pieces of data. Tamr can thus efficiently capture and codify some of the “tribal knowledge” in your organization, making it scalable and repeatable. Everyone, and every process, gets continually smarter.
How Life Sciences Benefit
This new process can benefit Life Sciences pretty much across the board.
CDISC Conversion: Pharmaceutical companies have found that converting their clinical and labs data (which may have come from multiple different CROs) to the CDISC SDTM standard enables, for instance, analysis across studies and diseases. This can shorten the discovery phase of potential new indications for existing compounds.
Further, data submitted to the FDA for approval of a new drug must adhere to the CDISC SDTM standard. But unifying, cleaning and converting diverse data to SDTM is time-consuming, people-bound and expensive. The clinical, biomarker and biospecimen data needed for analysis and approval typically resides in many different systems (eCRF, LIMS and others), created by different people at different times for different reasons, using different naming conventions and in-house standards. Whew: that’s variety.
Being able to get data into CDISC models quickly and cost-efficiently could conceivably streamline both the discovery and submission phases.
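As a deliberately simplified illustration of the mapping problem, the sketch below matches invented source column names against a few real variables from the SDTM LB (laboratory) domain. The seed mapping stands in for SME-validated training; real SDTM conversion also involves controlled terminology, derivations and much more.

```python
from difflib import get_close_matches

# Invented source columns, as they might arrive from a CRO export.
source_columns = ["lab_test_code", "LabTestName", "result_value",
                  "result_units", "collection_date"]

# A hand-built seed mapping an SME might validate once; a learned model
# would generalize it to the next CRO's naming convention.
seed = {"lab_test_code": "LBTESTCD", "LabTestName": "LBTEST",
        "result_value": "LBORRES", "result_units": "LBORRESU",
        "collection_date": "LBDTC"}

def map_column(col):
    """Return (sdtm_target, needs_review): exact seed hits map
    automatically; anything else is fuzzily matched against known
    source names and flagged for SME review."""
    if col in seed:
        return seed[col], False
    close = get_close_matches(col, seed.keys(), n=1, cutoff=0.6)
    return (seed[close[0]], True) if close else (None, True)

print(map_column("lab_test_code"))  # exact hit: ('LBTESTCD', False)
print(map_column("result_val"))     # fuzzy hit: ('LBORRES', True)
```

The `needs_review` flag is the same human-in-the-loop pattern as before: fuzzy and failed matches go to an expert rather than silently into the pipeline.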
Pharmas may also want to maintain their own in-house standards (for SDTM and other parts of CDISC), which will naturally “drift” as new data is captured from research. New discoveries, changes to previous discoveries and so on invariably create variety, inadvertently resulting in new, de facto “standards.” The Tamr approach can help adjust to standards drift.
Real World Evidence Integration: In 2016, the FDA recognized real world evidence (RWE), such as physician findings from off-label use of approved drugs, as legitimate for proving efficacy of new therapies. Coming from diverse sources, RWE invariably introduces more data variety, which Tamr-ized data models can resolve. For one customer, Tamr built a robust pipeline integrating 30 billion records a week from RWE sources into 100+ billion record outputs. The Tamr system could help shorten FDA approval time and cost by incorporating new sources of data faster.
R&D Data Monetization: Because of the monumental effort required in preparing CDISC-compliant submissions for the FDA, potentially important clinical data not used in submissions often gets relegated to storage. Where it languishes. Who knows what value exists there for the future? Most pharmas may never know. With Tamr, you can afford to use all of your data, cutting through its variety to make it accessible and usable in data models and clinical data lakes.
The Tamr approach also helps address two related problems: increasing data volume and data volatility. Tamr customers like GSK have data models and complex analytics cranking along against huge volumes of incoming data, including very verbose data like biomarkers. Because the Tamr system is based on Spark, it can hand off gigabytes of data for processing without slowing down. Ditto for legacy data.
Life sciences R&D data is constantly changing. Discoveries are happening all the time. For example, a molecule originally thought to be part of a particular pathway turns out, on further research, to belong to another, requiring a new way to measure it. Or new datasets come in and the data doesn’t look quite right; perhaps the name of a variable has changed significantly. Tamr models can be retrained quickly to adapt to such changes, rather than requiring programmers to constantly write new code.
All of these use cases benefit from cleaner, more current and continuously updated data, which improves analytic outcomes.
Life Sciences customers can benefit from Tamr throughout their operations, not just in R&D. Simplifying data unification can help companies in every phase of bringing a new drug to market, including:
- Procurement Analytics: Be more efficient in global supply chain management by saving money in acquiring the tools, parts and materials needed for R&D and operations.
- Customer Analytics: Get a 360-degree, unified and segmented view of healthcare providers and payers, and how they are distributing medical products in the market.
- Product Analytics: Unify product sales data from distributors and retailers with internal product data to understand what’s selling, where and at what price.
- Healthcare Analytics: Reconcile all stakeholder and data transactions throughout the healthcare value chain.
* Eventually, my interest in programming and a fortuitous job supporting a clinical research project at Dartmouth Medical School sealed my “fate,” getting me hooked on Life Sciences forever.