Written by Sohaiyla Khalili
Getting effective new drugs to patients quickly is the lifeblood of every pharmaceutical company, and success in the endeavor hinges on analytical capabilities. The issue: as pharma companies grow and evolve, their data—especially R&D data—is often kept within silos created for particular scientists, experiments, or clinical trials. This fragmentation can make it nearly impossible to access and use data for exploratory analysis and decision-making about new medicines.
But imagine what would be possible if pharmaceutical firms could provide analytics-ready data across all of R&D, not just within departmental silos, in a timely manner. One leading pharmaceutical company did just that. The company transformed its R&D organization through data unification, improving decision-making about new medicines and ultimately bringing them to patients faster.
The biggest hurdle in the transformation was integrating diverse data. But first, to determine where to start, the R&D data team identified more than 20 use cases describing the questions scientists wanted to answer with the unified data. The team then narrowed the field to 10 based on value, importance to driving key decisions, and role in addressing important scientific questions.
They also analyzed what other companies in the industry were doing with data to hone their priorities. They found that the largest group of firms concentrated on real-world evidence (RWE) from sources such as patient electronic health records, claims databases, biobanks, and personal health devices. Another group focused on clinical trial data. A third emphasized DNA sequencing data. As part of its transformation, the company wanted to incorporate all three of these types of data in a holistic way.
Finding a Scalable, Fast Approach to Data Management
Once the priorities were set, it was time to begin the daunting task of data harmonization. The company had over 30 billion records that needed to be integrated into a single output format that analysts could use.
The pharmaceutical giant quickly ruled out a rules-based Master Data Management (MDM) approach because it would have taken too much time and manual effort to implement. There were millions of data elements to integrate, and any rules-based MDM approach would be nearly impossible to scale efficiently.
Instead of traditional MDM, the firm chose a machine learning-driven, human-guided data unification approach that could effectively harmonize and clean duplicate and dirty data spread across many silos. This approach relied on probabilistic matching: making educated guesses that multiple similar fields refer to the same entity, even when they are described differently. The data was to be combined into a single Hadoop-based data lake with three domains: data from experiments, data from clinical trials, and genetic data.
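To make the idea of probabilistic matching concrete, here is a minimal sketch in Python. It is not the company's implementation: the field names, weights, and thresholds are hypothetical, and the fixed weights stand in for what a machine learning model would learn from expert feedback. It scores how likely two records are to describe the same entity and routes uncertain pairs to a human reviewer.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity between two strings, ignoring case and outer whitespace."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def match_probability(rec_a: dict, rec_b: dict, weights: dict) -> float:
    """Weighted average of per-field similarities.

    The weights are hypothetical; a learned model would replace this hand-set scoring.
    """
    total = sum(weights.values())
    score = sum(
        w * similarity(rec_a.get(field, ""), rec_b.get(field, ""))
        for field, w in weights.items()
    )
    return score / total


def classify(probability: float, accept: float = 0.85, review: float = 0.6) -> str:
    """Accept confident matches, send borderline pairs to an expert, reject the rest."""
    if probability >= accept:
        return "match"
    if probability >= review:
        return "review"
    return "no_match"


# Hypothetical example: the same compound recorded differently in two silos.
weights = {"name": 2, "unit": 1}
a = {"name": "Aspirin ", "unit": "mg"}
b = {"name": "aspirin", "unit": "MG"}
print(classify(match_probability(a, b, weights)))
```

The "review" band is where the human-guided part comes in: expert decisions on those pairs are the kind of input a learning system could use to improve its scoring over time.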
The goal was to get 100% of the data into the lake within three months—an inconceivable objective using traditional MDM approaches. In the end, the company was able to process 30 billion records in a single week. The data team also created an integration layer that would allow analysts to work seamlessly across all three data domains: RWE, clinical trial, and DNA sequencing data.
Using machine learning that relies on input from human experts, the team was able to prepare the data to help maximize scientific insights. For instance, it was challenging to combine clinical trials data because of the variance in the way trials are conducted and how their results are recorded. The team ingested the trials data (originally in proprietary internal formats) and mapped it to the industry standard format, and the machine learning models steadily improved at the mapping. The initial outcomes were highly accurate and became more accurate over time, even with relatively little human intervention. As a result, clinical trials data is now in a consistent format across R&D for exploratory analysis.
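The mapping step described above can be illustrated with a small Python sketch. This is an assumption-laden illustration rather than the team's actual pipeline: the target field names in STANDARD_FIELDS are hypothetical (the article does not name the standard's fields), and simple string similarity stands in for a trained model. Mappings confirmed by an expert always take precedence over automatic suggestions.

```python
from difflib import SequenceMatcher

# Hypothetical target schema; the real standard's field names are not given in the article.
STANDARD_FIELDS = ["subject_id", "visit_date", "dose_amount"]


def normalize(name: str) -> str:
    """Lowercase a field name and drop everything except letters and digits."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def suggest_mapping(source_field: str, confirmed: dict, threshold: float = 0.6):
    """Return (target_field, origin), where origin is 'expert', 'suggested', or 'unmapped'."""
    # Expert-confirmed mappings win over any automatic suggestion.
    if source_field in confirmed:
        return confirmed[source_field], "expert"
    scored = [
        (SequenceMatcher(None, normalize(source_field), normalize(target)).ratio(), target)
        for target in STANDARD_FIELDS
    ]
    best_score, best_target = max(scored)
    if best_score >= threshold:
        return best_target, "suggested"
    # Nothing is close enough, so leave the field for a human to map.
    return None, "unmapped"


# Hypothetical usage: one mapping was already confirmed by an expert.
confirmed = {"PT_NUM": "subject_id"}
for field in ["PT_NUM", "Subject ID", "qqq"]:
    print(field, "->", suggest_mapping(field, confirmed))
```

Each expert decision added to `confirmed` reduces the number of fields that need review on the next run, which mirrors, in a very simplified way, how accuracy can improve with relatively little human intervention.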
Over the course of the project, the company integrated disparate data sources; proved the ability to process 30 billion records in a single week; improved R&D designs and analyses; automated and streamlined data analyses; confirmed both drug safety and efficacy to regulators; consolidated data conversion specs; and, most importantly, enabled scientists and researchers to innovate faster.