Sequencing Technologies Hold the Key
Different DNA sequencing technologies are used to sequence a human genome. Sequencing of clinical samples is driven by research requirements such as urgency, depth of coverage, and budget. Competing technologies are available from Illumina, Complete Genomics, and Life Technologies (SOLiD), alongside emerging technologies such as Ion Proton and tools from Pacific Biosciences.
The key facets of genome sequencing analysis include single-nucleotide variants (SNVs), copy-number variations (CNVs), more complex structural variants, and regions showing loss of heterozygosity (LOH). Clinical and biological researchers apply these data types regularly to arrive at decisions, such as determining cancer treatment or characterizing patterns of genetic variation. In addition, the data can be used to highlight clinically actionable genetic variants or known drug metabolism variants, and to support molecular ancestry and comparative genomics studies.
“Regardless of the sequencing platform selected for the genome analysis, there is an urgent need to convert the sizeable volumes of digital data into an accessible format that can be used to direct subsequent research,” says Stephen Rudd, Ph.D., formerly with the Malaysian Genomics Resource Centre Berhad (MGRC), and currently head of computational biology at QFAB, Institute for Molecular Biosciences, Queensland Bioscience Precinct, The University of Queensland.
As yet, there is no ideal single workflow that performs all of the analyses in a streamlined fashion that is accessible to a researcher with limited bioinformatics expertise. The MGRC set out to address this dearth of researcher-friendly solutions by implementing a portable application pipeline that could be deployed close to the researcher.
“The One-Click Pipeline and Hotspot data presentation layer are part of MGRC’s portfolio of genome bioinformatics tools and are used to present content to clinical researchers, biologists, and support bioinformaticians,” says Dr. Rudd.
Genome One-Click applies a sequential pipeline of sequence transformations. After stringent quality control, reads are mapped against the reference human genome using software such as SXMapper. The mapping process helps to establish the key metrics for variation detection, such as the uniqueness of the sequence context, the local repetition, and the depth of coverage.
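To make the depth-of-coverage metric concrete, here is a minimal sketch of how per-base depth can be derived from mapped read positions. It is an illustration only and not MGRC's implementation; the function name `per_base_depth`, the half-open interval convention, and the toy read coordinates are all assumptions.

```python
def per_base_depth(reads, region_start, region_end):
    """Count how many reads overlap each position in [region_start, region_end).

    `reads` is a list of (start, end) half-open intervals on one chromosome.
    This layout is a hypothetical stand-in for real alignment records.
    """
    depth = [0] * (region_end - region_start)
    for start, end in reads:
        # Clip each read to the region of interest before counting.
        lo = max(start, region_start)
        hi = min(end, region_end)
        for pos in range(lo, hi):
            depth[pos - region_start] += 1
    return depth


# Three toy reads of length 5 over a 10-base region (made-up coordinates).
reads = [(0, 5), (2, 7), (4, 9)]
depth = per_base_depth(reads, 0, 10)
mean_depth = sum(depth) / len(depth)
print(depth, mean_depth)
```

Positions covered by few reads, or by none, are where variant calls are least reliable, which is why depth is tracked alongside sequence uniqueness and local repetition.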
Subsequent iterative steps help to characterize possible variants and to assign suitable parameters for the variation assignment. Results from Genome One-Click are available as text files or BAM files. The technology can typically map a human genome sequenced to 50x coverage and determine its variations within 48 hours.
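The scale implied by 50x coverage can be estimated with the standard relation coverage = (number of reads × read length) / genome size. The sketch below is a back-of-the-envelope calculation under stated assumptions: a genome size of roughly 3.1 billion bases and a read length of 100 bases are illustrative values, not figures from MGRC, and the read count changes substantially with either one.

```python
import math


def reads_for_coverage(genome_size_bp, read_length_bp, target_coverage):
    """Number of reads needed so that reads * length / genome >= coverage."""
    return math.ceil(target_coverage * genome_size_bp / read_length_bp)


GENOME_SIZE_BP = 3_100_000_000  # assumed approximate human genome size
READ_LENGTH_BP = 100            # assumed short-read length
TARGET_COVERAGE = 50            # the 50x figure quoted in the text

n_reads = reads_for_coverage(GENOME_SIZE_BP, READ_LENGTH_BP, TARGET_COVERAGE)
total_bases = n_reads * READ_LENGTH_BP

# Sustained rate needed if the whole 48-hour window were spent on mapping
# (a simplifying assumption; real pipelines also spend time on other steps).
reads_per_second = n_reads / (48 * 3600)

print(f"{n_reads:,} reads, {total_bases:,} bases, "
      f"{reads_per_second:,.0f} reads/s")
```

Under these assumptions the pipeline would need to sustain several thousand reads per second, which gives a sense of why mapping performance dominates turnaround time.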
“At the heart of the One-Click Genome analysis pipeline is MGRC’s genome mapping software, built upon the core Synasuite set of bioinformatics tools and delivering a high-performance genome analysis without the need for computer clusters and oversized data centers,” explains Dr. Rudd.
Hotspot is a mutation-mining framework used to explore and characterize genome-scale data by curating the Genome One-Click results into a relational data structure. According to Dr. Rudd, “our one-click pipeline has mapped tens of pairs of genomes to the reference human genome and the resulting mapping data is stored in the Hotspot system.” Researchers can query candidate features through password-protected online access to the genome data, using a set of filters and controls.
For example, queries can be built to identify novel SNPs located in miRNA genes in tumor samples but not in their paired normal samples. In other cases, experts can determine nonsynonymous protein substitutions that segregate between Chinese and Malay individuals. The variants within the databases carry the confidence scores calculated by Genome One-Click. The results can be filtered for the strict canonical and ideal variants, or relaxed to accommodate some of the more speculative content, Dr. Rudd explains.
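The tumor-versus-normal query described above can be sketched against a small relational store. The example below uses Python's built-in sqlite3 module with an in-memory database; the table names, column names, sample identifiers, gene name, and confidence thresholds are all hypothetical and are not Hotspot's actual schema or scoring scale.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE variants (            -- hypothetical schema, for illustration
    sample_id TEXT, chrom TEXT, pos INTEGER,
    ref_allele TEXT, alt_allele TEXT,
    confidence REAL,               -- assumed score between 0 and 1
    known INTEGER                  -- 1 if already catalogued, 0 if novel
);
CREATE TABLE mirna_genes (
    name TEXT, chrom TEXT, gene_start INTEGER, gene_end INTEGER
);
""")
conn.execute("INSERT INTO mirna_genes VALUES ('MIR_A', 'chr1', 100, 200)")
conn.executemany("INSERT INTO variants VALUES (?, ?, ?, ?, ?, ?, ?)", [
    ("tumor", "chr1", 150, "A", "G", 0.95, 0),   # novel, tumor-only
    ("tumor", "chr1", 160, "C", "T", 0.90, 0),   # also seen in normal
    ("normal", "chr1", 160, "C", "T", 0.92, 0),
    ("tumor", "chr1", 170, "G", "A", 0.40, 0),   # novel, low confidence
    ("tumor", "chr1", 180, "A", "C", 0.97, 1),   # already known
    ("tumor", "chr1", 500, "T", "C", 0.99, 0),   # outside any miRNA gene
])

QUERY = """
SELECT t.chrom, t.pos, t.ref_allele, t.alt_allele, g.name
FROM variants AS t
JOIN mirna_genes AS g
  ON g.chrom = t.chrom AND t.pos BETWEEN g.gene_start AND g.gene_end
WHERE t.sample_id = ? AND t.known = 0 AND t.confidence >= ?
  AND NOT EXISTS (
      SELECT 1 FROM variants AS n
      WHERE n.sample_id = ? AND n.chrom = t.chrom
        AND n.pos = t.pos AND n.alt_allele = t.alt_allele)
ORDER BY t.chrom, t.pos
"""


def tumor_only_mirna_snps(conn, tumor_id, normal_id, min_confidence):
    """Novel SNPs in miRNA genes present in the tumor but not the normal."""
    return conn.execute(QUERY, (tumor_id, min_confidence, normal_id)).fetchall()


strict = tumor_only_mirna_snps(conn, "tumor", "normal", 0.8)
relaxed = tumor_only_mirna_snps(conn, "tumor", "normal", 0.3)
print(strict)
print(relaxed)
```

Raising or lowering `min_confidence` mirrors the strict and relaxed filtering modes: the strict call keeps only the high-confidence tumor-specific variant, while the relaxed call also admits the lower-scoring candidate.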
Analyzing a human genome, which involves mapping hundreds of millions of short sequence reads, requires a significant amount of computing power. Transferring the raw sequencing data can also slow turnaround. Comparison of the software with other mapping software and variant-calling methods has shown that the MGRC approach to the analysis of whole-genome data is robust, cost-effective, and rapid, with good sensitivity and selectivity, says Dr. Rudd.
“It is especially useful for researchers who need to extract the essence from multiple human genomes or exomes but are not familiar with bioinformatics and the plethora of complex and not always interoperable open source solutions,” he says.
At MGRC, experts plan an integrated data analysis environment that could be provided globally—either as a cloud solution in a centralized data center, or as a standalone appliance that could be deployed in a hospital, university, or pharmaceutical company data center, according to Dr. Rudd. “Given that human genome sequencing can generate terabytes of data, our customers find it more convenient to ship us a hard disk drive, which we then return to them with the results of the analysis.”
The One-Click Pipeline and Hotspot were applied to the Malaysian MyGenome project involving 26 deeply sequenced human genomes, delivering the first systematic survey of human genetic variation to the Malaysian Ministry of Science, Technology and Innovation. Currently, the software suite is being used in projects offering genomics services for clinical and pharmaceutical customers in Europe and the U.S. to interpret the extensive data emerging from whole-genome sequencing.