“The future of next-generation computing is not in the cloud,” argued Matthew R. Keyser, NGS applications specialist, DNASTAR. “In the near future, genome sequencing will no longer be tied up in core facilities. NGS will be performed at the bedside, and analysis will be done on a laptop.”
DNASTAR provides a suite of genomic applications, including assembly and alignment algorithms that can be run efficiently on any personal computer. DNASTAR workflows are compatible with most next-gen sequencers, including those from Illumina, Roche, and Ion Torrent.
“Our proprietary software algorithms are built to maximize hardware and memory usage,” continued Keyser. “DNASTAR’s enhanced processing power means that a 4.6 Mb E. coli genome could be assembled in less than seven minutes on any Windows or Mac desktop computer. Most of the open-source assembly and analysis software requires investment in Linux. And DNASTAR’s processing speed is achieved without the need for multiparallel computation.”
A comparison of six different assemblers performed by the Institute of Evolutionary Biology at the University of Edinburgh found that the DNASTAR SeqMan assembler generated a large proportion of novel sequences and resulted in the best alignment to the reference sequences. SeqMan capabilities are readily exploited for metagenome analysis, as exemplified by a study of the viromes of three North American bat species.
SeqMan was one of the three assemblers used in this study that identified several novel coronaviruses out of a pool of viruses. The company has just been awarded an NIH grant to further enhance a metagenomics analysis pipeline for SeqMan. “Our next challenge is automation of microbial genome assembly,” said Keyser.
“Bacterial genomes are surprisingly difficult to complete, especially for a novel organism. Open-source assemblers are not capable of resolving the repetitive areas, meaning that many gaps have to be ‘closed’ manually.
“Moreover, most of the other assemblers provide text files as the output, whereas SeqMan produces a fully editable project file, which allows the end user to edit individual sequences, edit contigs (split, merge), order contigs into scaffolds, and use specialized alignment algorithms to close gaps. I am not aware of any other software that provides as complete an interface for microbial genome assembly, gap closure, and annotation.”
Ready for Medical Grade?
“One of the big problems with big data is lack of quality standards and, therefore, lack of performance metrics such as accuracy of assemblers, accuracy of genotyping calls, detection limits of variants, etc.,” said Justin Johnson, director of bioinformatics, EdgeBio. “How do we know that the next-gen sequencing results are, indeed, accurate?”
EdgeBio leads the development of the validation protocol underwritten by the X Prize Foundation, a nonprofit organization that creates and manages global competitions to solve challenges facing humanity. The Archon Genomics X Prize presented by Express Scripts, a $10 million award, will be given to the first team to sequence the genomes of 100 centenarians in 30 days cheaply, accurately, and completely.
The genomes must be sequenced with an error rate of no more than one in one million bases. At this level of quality, the resulting sequences are moving toward “medical-grade,” meaning that the data may be used in clinical care decisions. The purpose of the validation protocol is first to develop an answer key, and second to create an automated scoring system against the answer key.
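To put the one-in-a-million target in perspective, the short calculation below converts that error rate into a Phred-style quality score and into an expected number of errors per genome. It is only an illustrative sketch: the genome size of roughly 3 billion bases is an assumption added here (the approximate haploid human genome), not a figure from the validation protocol, and the variable names are hypothetical.

```python
import math

# Target from the competition rules as described above: one error per million bases.
error_rate = 1 / 1_000_000

# Phred-style quality score: Q = -10 * log10(error probability).
phred_quality = -10 * math.log10(error_rate)

# ASSUMPTION: roughly 3 billion bases in a haploid human genome (not from the article).
assumed_genome_bases = 3_000_000_000
expected_errors = error_rate * assumed_genome_bases

print(f"Phred-style quality: Q{phred_quality:.0f}")
print(f"Expected errors per genome: about {expected_errors:,.0f}")
```

Under these assumptions the target corresponds to a quality of about Q60, and even a genome that meets it would still carry on the order of 3,000 erroneous bases.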
To create the answer key, EdgeBio made 5,000 fosmids (cloned portions of the genome, about 200 Mb) from two well-known reference samples, Yoruba Male and CEU Female. The fosmids were sequenced by three different methods to reveal the extent of the bias due to a particular sequencing platform.
“About 15% of single nucleotide polymorphisms (SNPs) can be attributed to sequencing technologies,” continued Johnson. “We evaluated the discordance between platforms and used multiple statistical algorithms to annotate true positives and true negatives.”
Next, EdgeBio developed software to compare the answer key with other sequencing results from the same two reference samples. The algorithm scores the results and produces a quality report. The company integrated the upload of test sequences, comparison, scoring, and reporting into a workflow with an intuitive interface (www.validationprotocol.org).
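The idea of scoring a submission against an answer key can be illustrated with a minimal sketch. This is not EdgeBio's software or its scoring scheme: the function `score_calls`, the variant representation, and the sample data are all hypothetical, and the sketch assumes that both the answer key and the submission can be reduced to sets of simple variant calls.

```python
from typing import NamedTuple, Set, Tuple

# Hypothetical representation: (chromosome, 1-based position, reference allele, alternate allele).
Variant = Tuple[str, int, str, str]


class Score(NamedTuple):
    true_positives: int
    false_positives: int
    false_negatives: int
    precision: float
    recall: float


def score_calls(answer_key: Set[Variant], calls: Set[Variant]) -> Score:
    """Compare submitted variant calls with an answer key using set arithmetic."""
    tp = len(answer_key & calls)   # calls confirmed by the answer key
    fp = len(calls - answer_key)   # calls absent from the answer key
    fn = len(answer_key - calls)   # answer-key variants the submission missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return Score(tp, fp, fn, precision, recall)


# Made-up example data, for illustration only.
answer_key = {("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"), ("chr2", 75, "G", "A")}
calls = {("chr1", 100, "A", "G"), ("chr2", 75, "G", "A"), ("chr2", 900, "T", "C")}

result = score_calls(answer_key, calls)
print(result)
```

In this toy case two of the three submitted calls match the key, one call is spurious, and one true variant is missed. A real quality report would also need to handle insertions, deletions, genotype zygosity, and regions with no confident answer, none of which this sketch attempts.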
Even before the X Prize, EdgeBio was deeply invested in clinical sequencing and received CLIA certification in 2012.
“While the significance of whole-genome sequencing is still not quite established, medical-grade sequencing of exomes, targeted gene pools, or transcriptomes may provide clinically actionable information,” said Johnson. “Development of performance metrics will speed up the incorporation of next-gen technologies into clinical diagnostics.”
“Simply annotating and aligning DNA sequences is not enough to discover their biomedical value,” said Martin Seifert, Ph.D., CEO, Genomatix. “To perform meaningful analysis of their sequencing data, the researchers need to view it in combination with existing biological knowledge.”
The Genomatix Genome Analyzer (GGA) enables visualization of NGS data in the context of multiple databases containing a comprehensive compilation of information on transcriptional regulation, DNA binding sites, epigenomic spots, and signaling networks.
“Knowledge datasets are available for 33 different organisms, adding up to several terabytes of data. Cross-organism comparisons help assign meaning to genetic elements for which the function is not yet understood.”