Written by Keziah Katz
If you need to analyze big data, the first step is to convert the data into a format that enables analysts to ask common questions quickly. This often involves reshaping data to enable frequent queries and bringing together data from multiple sources. Preparing big data for analytics requires scalable data processing and an approachable user interface. Over the past year, we have been transforming billions of health care records for the pharmaceutical giant GSK. This is the largest data set processed through Tamr Unify, but thanks to Apache Spark, used as our underlying compute engine and our robust user-friendly transformations system, we were able to efficiently complete computationally intensive tasks.
Our goal was to convert purchased Real World Evidence (RWE) data to the OMOP Common Data Model. Real World Evidence is medical information collected from sources like hospital records and insurance claims. It is used to analyze the effectiveness of medications in a real-life setting. As this data can be collected and purchased from a variety of sources, a common data schema is needed to simplify analysis across sources. The OMOP Common Data Model was designed as a common format for observational medical data and was chosen by GSK as the common schema for their analyses. The conversion required many computationally intense transformations including group-bys, joins, and unpivots.
It Started with Testing
In order to understand how Tamr Unify would perform these transformations on billions of records, we ran a number of tests. We subsampled our largest input dataset (10 billion records) and ran jobs with a single transformation of each type. These tests allowed us to understand in a controlled fashion how execution time increased with scale and which transformations are most expensive. We found that the large scale join was the most expensive operation. We used this information to guide how we wrote the transformations for this project and avoided large scale joins wherever possible.
Figure legend: All runs were completed with 40 executors with 2 cores and 40GB of memory each. Unpivot was completed on 2 columns. Join was a self join (equally sized tables with 1:1 keys). Group-by was a set collection on 9 aggregated columns by 1 key column.
Optimizing the Spark Configuration
We were using shared compute resources for this project and wanted to determine how job execution time was influenced by various Spark settings. This allowed us to determine how many resources to request in order to collect our jobs in a reasonable amount of time without monopolizing shared compute resources.
We first examined the value of allocating additional resources to the Spark driver. We found that driver resources had little effect on execution time, even when transforming 10 billion records. We also wanted to determine how to best distribute requested cores between few executors with many cores and many executors with few cores each when transforming big data. We found that a balanced approach reduced both execution time and the likelihood of job failure.
Figure legend: All runs completed for a Group-By transformation with 9 columns aggregated using collect_set on 1 key column.
Figure legend: All runs completed for a Group-By transformation with 9 columns aggregated using collect_set on 1 key column. Transformation of 5B records with 20 executors failed to complete.
Leaving the Lab
The next step was to leave the ‘lab’ and apply our findings to writing real transformations on the real data. Tamr Unify’s preview feature allowed us to quickly test transformations on a sample of data before submitting full scale Spark jobs, allowing us to fail fast and quickly develop the transformations required for conversion to the OMOP Common Data Model. Some input datasets required very similar transformations which resulted in a natural test of scaling in our real use case. The previous charts were produced under simplistic and highly controlled conditions; they each tested a single transformation on a single dataset. A practical, and more realistic example is shown in the chart below. Here, each point represents a Tamr Unify project in which 3 full scale datasets are combined. These tables vary in both column and row number, but the same basic transformations are applied to all of them. Some columns are transformed linearly; there are two joins onto tables containing millions of records and there is a single filter. For this combination of transformations, execution time scaled linearly with the increase in record count.
Figure legend: These jobs were all run with 60 executors with 4 cores each and 40GB of memory.
Tamr Unify is able to apply complex transformations to billions of records, demonstrating that it is a scalable data processing system. We are currently processing billions of records for GSK and are excited to work at this scale with more customers. If you’re interested in a demo, please reach out.