Next-Generation Genomics Analysis with Apache Spark
The genome is “the blueprint of life,” a repeating string composed of four letters, whose order and configuration lay out the plan for each individual’s growth and development. Genomics is the study of the structure, function, and evolution of genomes at a variety of scales: from the single cells of a cancer tumor to the genomes of an entire population of individuals. Scientists use “sequencers” to look at the molecular structure of the genome much the same way that astronomers use telescopes to examine the composition of stars, and what they see with these molecular telescopes holds the potential to find new drugs, diagnose patients, uncover the genealogy of entire populations, and discover the genetic bases for human disease.
Genomics is also in the middle of a massive technological revolution: over the past decade, the sequencers used by scientists have improved in cost, quality, and speed at exponential rates. Fifteen years ago, it took billions of dollars and years of work for an international consortium of researchers to produce a single human genome; today a single sequencing center can sequence a human genome in a single day for roughly $1,000. Thousands of human genomes have been sequenced, and projects to sequence hundreds of thousands or millions of genomes are already underway.
Even as the experimental machinery of genomics has advanced, however, its computational support (the tools and methods that convert raw data into clinical findings and research discoveries) has not kept pace. Genomics software today runs much the way it did ten years ago: discrete tools, scripting for workflow, files instead of databases, file formats in place of data models, and little to no parallelism.
Spark is an ideal platform for organizing large genomics analysis pipelines and workflows. Its compatibility with the Hadoop platform makes it easy to deploy and support within existing bioinformatics IT infrastructures, and its support for languages such as R, Python, and SQL eases the learning curve for practicing bioinformaticians. Widespread use of Spark for genomics, however, will require adapting and rewriting many of the common methods, tools, and algorithms that are in regular use today.
This talk will present ADAM, an open-source library for bioinformatics analysis, written for Spark and hosted by the AMPLab. We will discuss both the places where Spark’s ability to parallelize an analysis pipeline is a natural fit for genomics methods and some methods that have proven more difficult to adapt. We will also cover ADAM’s use of technologies like Avro, for schema specification, and Parquet, for compressed file formats, in conjunction with its Spark-based workflows.