A scientific team at the University of Chicago reports that genome analysis can be radically accelerated by relying on supercomputers such as the one they used. This computer, known as Beagle and based at Argonne National Laboratory, is able to analyze 240 full genomes in about two days, according to the researchers.
Although the time and cost of sequencing an entire human genome have plummeted, analyzing the resulting three billion base pairs of genetic information from a single genome can take many months with most current approaches.
“This is a resource that can change patient management and, over time, add depth to our understanding of the genetic causes of risk and disease,” said study author Elizabeth McNally, M.D., Ph.D., the A. J. Carlson Professor of Medicine and Human Genetics and director of the Cardiovascular Genetics clinic at the University of Chicago Medicine.
“The supercomputer can process many genomes simultaneously rather than one at a time,” said first author Megan Puckelwartz, a graduate student in McNally’s laboratory. “It converts whole-genome sequencing, which has primarily been used as a research tool, into something that is immediately valuable for patient care.”
The team published their study (“Supercomputing for the parallelization of whole genome analysis”) in Bioinformatics.
Because the genome is so vast, those involved in clinical genetics have turned to exome sequencing, which focuses on the 2% or less of the genome that codes for proteins. This method is often useful. An estimated 85% of disease-causing mutations are located in coding regions. But the rest, about 15% of clinically significant mutations, come from noncoding regions, once referred to as “junk DNA” but now known to serve important functions. If not for the tremendous data-processing challenges of analysis, whole-genome sequencing would be the method of choice, claim the University of Chicago scientists.
To test the system, Dr. McNally’s group used raw sequencing data from 61 human genomes and analyzed that data on Beagle. They used publicly available software packages and one quarter of the computer’s total capacity. They found that shifting to the supercomputer environment improved accuracy and dramatically increased speed.
“We adapted a Cray XE6 supercomputer [Beagle] to achieve the parallelization required for concurrent multiple-genome analysis. This approach not only markedly speeds computational time but also results in increased usable sequence per genome,” wrote the investigators. “Relying on publicly available software, the Cray XE6 has the capacity to align and call variants on 240 whole genomes in approximately 50 hours. Multisample variant calling is also accelerated.”
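To put the reported figure in perspective, the short calculation below converts "240 whole genomes in approximately 50 hours" into an amortized time per genome. It is only a back-of-the-envelope sketch: the function and variable names are illustrative, and the result describes batch throughput, not how long any single genome takes when the genomes are processed concurrently.

```python
def amortized_minutes_per_genome(genomes: int, wall_clock_hours: float) -> float:
    """Wall-clock time of the whole batch divided evenly across its genomes."""
    return wall_clock_hours * 60 / genomes


# Figures quoted by the investigators: 240 genomes in roughly 50 hours.
batch_genomes = 240
batch_hours = 50.0

per_genome = amortized_minutes_per_genome(batch_genomes, batch_hours)
print(f"{per_genome:.1f} minutes per genome (amortized)")  # 12.5
```

Because the genomes run side by side, each one still takes far longer than 12.5 minutes from start to finish. The gain comes from finishing the whole batch together.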
“Improving analysis through both speed and accuracy reduces the price per genome,” said Dr. McNally. “With this approach, the price for analyzing an entire genome is less than the cost of looking at just a fraction of the genome. New technology promises to bring the costs of sequencing down to around $1,000 per genome. Our goal is to get the cost of analysis down into that range.”