February 1, 2017 (Vol. 37, No. 3)
Already Losing Their Grip, Researchers Are Turning to High-Performance Computing
Today, the data deluge is affecting scientists and researchers in genomics and other life-science organizations in a profound way. First, they are unable to manage the avalanche of data generated by an ever-growing number of sources; second, the problem is compounded because they lack the computing capacity to turn all of this data into real scientific insights.
We are at an inflection point in the field of genomics. The cost of sequencing a human genome is less than $1,000 today and is expected to drop even further (compared with the $3 billion cost in 2003). As sequencing becomes cheaper and more routine, data volumes are climbing: a single human genome “run” produces about half a terabyte of raw image files. These files contain complex, highly granular, unstructured scientific data that is difficult to manage and analyze.
Technology advances over the past decade have left scientists challenged to manage the overwhelming amount of unstructured genomics data generated across academic, clinical, and pharmaceutical research. Many organizations now require more advanced data analysis and management for applications such as drug development, identifying the root causes of diseases, and creating personalized treatments in the clinic. The modern approach to sequencing a genome is a complex, multistep process that includes generating sequence reads, assembling those reads to map the genome, analyzing the sequence for variants, and resequencing.
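To make the computational weight of that pipeline concrete, the short Python sketch below mimics two of the stages named above (mapping sequence reads onto a reference and flagging variants) using a few toy strings. All sequences and function names are illustrative; real pipelines hand these stages to dedicated aligners and variant callers distributed across many compute nodes, and it is that scale-up which drives the need for HPC.

```python
# Toy illustration of two pipeline stages: mapping reads to a reference and
# calling variants. Sequences and function names are illustrative only.

REFERENCE = "ACGTACGTTAGGCTAACGT"                 # toy reference sequence
READS = ["ACGTACGT", "TAGGCTAA", "TAGGCTTA"]      # toy reads; the last carries a mismatch

def map_reads(reference, reads):
    """Place each read at the reference position with the fewest mismatches."""
    placements = []
    for read in reads:
        best_pos, best_mismatches = None, len(read) + 1
        for pos in range(len(reference) - len(read) + 1):
            window = reference[pos:pos + len(read)]
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches < best_mismatches:
                best_pos, best_mismatches = pos, mismatches
        placements.append((read, best_pos))
    return placements

def call_variants(reference, placements):
    """Report positions where a mapped read disagrees with the reference."""
    variants = []
    for read, pos in placements:
        for offset, base in enumerate(read):
            if base != reference[pos + offset]:
                variants.append((pos + offset, reference[pos + offset], base))
    return variants

if __name__ == "__main__":
    placements = map_reads(REFERENCE, READS)
    print("Variants found:", call_variants(REFERENCE, placements))
```

Even this naive version scans the full reference for every read; a real run compares hundreds of millions of reads against a three-billion-base genome, which is exactly the kind of workload that pushes analysis onto HPC systems.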
The reality is that the technology we have relied on for the past decade is not powerful enough to analyze this crucial data, and it certainly will not be up to the task in the coming years as companies continue to innovate and the volume of data we need to analyze keeps growing. More people will have their genomes sequenced: from a few thousand today to millions within the next decade. The key for technologists is to make a future possible in which all of this data can be analyzed, through advances in sophisticated high-performance computing (HPC), or supercomputing, and big data technology.
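A back-of-the-envelope calculation shows why this growth outstrips legacy systems. The Python sketch below multiplies the roughly half terabyte of raw data per sequencing run cited above by the projected number of sequenced genomes; the per-run figure comes from this article, while the cohort sizes simply restate the "few thousand today to millions within a decade" range as concrete, illustrative numbers.

```python
# Rough scale estimate: raw data footprint of sequencing cohorts, assuming
# ~0.5 TB of raw output per genome "run" (the figure cited in this article).
# Cohort sizes are illustrative stand-ins for "a few thousand" and "millions".
TB_PER_GENOME_RUN = 0.5

for label, genomes in [("today (a few thousand)", 5_000),
                       ("next decade (millions)", 5_000_000)]:
    total_tb = genomes * TB_PER_GENOME_RUN
    print(f"{label}: {genomes:,} genomes -> {total_tb:,.0f} TB "
          f"(~{total_tb / 1_000:,.1f} PB) of raw data")
```

Going from a few petabytes to thousands of petabytes of raw data within a decade is the scale jump that conventional infrastructure was never designed to absorb.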
Big Data—An Issue in Itself
The pursuit of personalized medicine is generating an explosion of data as physicians and researchers aim to identify the best course of treatment for an individual based on that person’s particular expression of a disease and tolerance for treatment. And personalized medicine is growing: funding for genomics research is on the rise, genetic testing is being commercialized, and some insurance providers now cover genome sequencing.
To illustrate the data problem, one provider—Kaiser Permanente—began a nationwide push to collect DNA samples, medical records, and questionnaire responses from more than 210,000 patients, creating one of the largest and most comprehensive national research banks for precision-medicine data in the world. With this initiative, Kaiser Permanente and University of California, San Francisco researchers hope to identify specific genes that influence a variety of genetic disorders, with the goal of using their findings to improve diagnostics, treatment, and prevention.
To comb through such complex, highly granular, unstructured scientific data successfully, scientists need massive computing power, high-speed analytics, and flexibility. Legacy computing systems have not kept pace with these demands because they lack the horsepower to move data at such volumes and cannot scale with the rate of data growth.
It may seem overwhelming, but with modern supercomputing technology in place, research organizations can keep up with continually increasing data volumes and still generate useful scientific insights.
Managing and Sharing New Data Insights
To achieve scientific breakthroughs in today’s data-intensive world, research teams must be able to analyze large datasets more easily and quickly. Last year, the Inova Translational Medicine Institute (ITMI) purchased an HPC system so researchers could glean new insights from its premier genomic databases, allowing them to diagnose patients with more accuracy and speed, and ultimately to deliver a higher level of treatment and care.
The ITMI system was used to power data-intensive workloads spanning 25,000 genomes and to simplify data management by enabling researchers to develop and use their own code rather than adapting to a more generic program that didn’t fit their needs as well. With this flexibility, ITMI significantly reduced its administrative IT burden while increasing its research workflow capabilities, enabling the organization to devote greater resources to its patients’ challenging chronic diseases.
As scientific organizations grapple with data overload, many will also invest in new supercomputing solutions to improve the management and accessibility of their data. These systems accelerate workflows and speed assembly and analysis operations, ultimately giving researchers considerably faster time to discovery. HPC systems can query massive databases an order of magnitude faster than legacy systems, explore much larger datasets, and allow researchers to undertake many more data investigations simultaneously.
Why Data Storage Is Crucial
One of the biggest challenges in genomics research is that datasets must be stored, analyzed, and then stored again. To put this into perspective, Human Longevity recently partnered with AstraZeneca to sequence and analyze up to 500,000 DNA samples from clinical trials. The program is slated to generate one million integrated health records with genomic, molecular, and clinical data by 2020—an astounding amount of data—all of which must be stored externally, brought across the network into a compute system, analyzed, and then moved back to external storage. This process places an incredible burden on a traditional IT infrastructure.
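The weight of that round trip can be estimated with a quick sketch. The Python below combines the roughly half-terabyte-per-genome figure cited earlier with the 500,000 samples in the Human Longevity and AstraZeneca program; the 10 Gb/s network link is an assumed value chosen purely for illustration, not a figure from the program.

```python
# Rough estimate of the data-movement burden when genomic data is pulled from
# external storage, analyzed, and written back. Per-genome size (~0.5 TB) and
# sample count (500,000) come from this article; the 10 Gb/s link is assumed.
TB_PER_GENOME = 0.5
SAMPLES = 500_000
LINK_GBPS = 10                                   # assumed network bandwidth

total_tb_moved = SAMPLES * TB_PER_GENOME * 2     # into compute and back out
total_bits = total_tb_moved * 1e12 * 8           # decimal terabytes -> bits
transfer_seconds = total_bits / (LINK_GBPS * 1e9)

print(f"Data moved: {total_tb_moved:,.0f} TB round trip")
print(f"Time on a single {LINK_GBPS} Gb/s link: "
      f"{transfer_seconds / (86_400 * 365):,.1f} years")
```

Even spread across many parallel links, data movement at this scale dominates the workflow, which is why storage and compute have to be engineered as a single system rather than bolted together.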
Most storage-management infrastructures were not built to handle the strain of these workloads. They cannot provide the scalability, sustained performance, and long-term survivability that today’s massive biomedical applications demand.
The Modern Era of the Supercomputer
Genomics research is poised to continue its data explosion as technologists deliver petascale—and soon exascale—solutions capable of handling data volumes that were unfathomable just a few years ago. Tackling these big data problems is not for the faint of heart, but the good news is that supercomputing systems have become more affordable and less complex.
Supercomputers serve multiple functions in genomics, including organizing and recognizing patterns within research data, annotating genetic sequences, and image modeling.
It’s important for research institutions to look for modern HPC solutions that can not only analyze data but also store it easily while keeping it accessible to other researchers. As customer needs change with the volume of incoming data, the technology must adapt with them. At SGI, we offer storage systems that integrate easily with high-performance computing and data analytics systems.
Modern HPC systems provide a large-scale storage-virtualization and data-management platform specifically engineered to handle the enormous amounts of structured and unstructured content generated by life sciences applications. In the race to collect, study, link, and analyze critical biomedical research data for personalized medicine, SGI makes it easier for research institutes and labs to accelerate the path to successful analyses and innovation.
All of these capabilities will help identify the origins of diseases, speed the identification of biomarkers, and make it easier to deliver personalized, more targeted treatments to patients. Through genomic sequencing and stem-cell research, researchers are challenged to produce new, high-quality work that brings science one step closer to personalized drug discovery and therapy, and to cures for cancer and other devastating diseases. HPC systems are at the forefront, allowing leading research organizations to deliver breakthrough discoveries in the life sciences.
Gabriel Broner ([email protected]) is vice president and general manager of high-performance computing at SGI.