New genomics techniques like next-generation sequencing produce overwhelmingly large datasets that present a challenge in data analysis. Some solutions are available for data storage and transfer, but analytical manipulation adds an extra computational burden. Common data-analysis techniques such as normalization, when applied to a huge dataset, output a dataset that overwhelms the memory on a typical computer, causing CPU performance to drop.
Marc Bouffard, senior bioinformatician, Montreal Heart Institute and Genome Quebec Pharmacogenomics Center, has developed a data-analysis system called CASTOR QC (comprehensive analysis and storage) that solves some of the most common problems in the analysis of genomic data: data redundancy, data formatting, and the use of flat files.
“The idea is to move away from the old-school way of doing things,” says Bouffard. “With smaller studies, people have been able to look at their data—open a file, page down. With the newer studies, that’s just not practical. You really need to move to a data format that is optimized for computer or automated processing.”
Bouffard began developing CASTOR QC because the Montreal Heart Institute wanted a faster way to analyze data from a project examining the toxicity of statins and other lipid-lowering drugs. The project included 7,000 patients with a half-million SNPs each. The conventional analysis methods tended to be memory bound, only processing properly if the file could fit into memory.
Supercomputing solutions such as cloud computing offer some relief from this bottleneck. However, many genomic datasets including SNP studies and next-generation sequencing produce datasets so large that the bandwidth limits the transfer of data into and out of the database.
“We basically started from the beginning and looked at the data to find a way to store it without losing any information in a format that is optimized for analysis,” says Bouffard. “So far the results are very good.” A side benefit of this more efficient method of analyzing data is that the datasets are much smaller—as little as 15% of the original size—which reduces the cost of data storage.
A presentation by Joshua Millstein, Ph.D., biostatistics, senior scientist, statistical genetics, Sage Bionetworks, will show how integrative genomics can be used to intensively characterize psychiatric phenotypes in sleep disorders in a genome-wide microarray expression study, with the ultimate goal of identifying targets and biomarkers for therapy. Dr. Millstein used a line of inbred mice to identify genes that associate with a sleep trait that is also associated with depression.
“We’re also placing these genes in the context of cell-wide networks of gene expression and identifying regions of these networks that are associated with clinical traits of interest. In this way we can place these genes in the context of global expression and determine whether they’re likely to cause cell-wide changes in expression or influence important genes,” says Dr. Millstein.
To perform the data analysis, the team built co-expression networks using soft thresholds to determine relationships between genes, and to identify co-expressed modules—groups of genes that tend to be expressed as a group between individuals. Bayesian modeling techniques incorporated genotype information. They also used a causal driver method, a network-based approach to identify nodes in the network that influence other nodes.
Collecting the data involved a labor-intensive quality-control methodology. The EEG and EMG sleep data was collected over a period of 48 hours for each mouse, but the experiment spanned the course of an entire year. This meant a scientist had to check each dataset to ensure that it mapped correctly, and follow up on any unusual patterns. RNA quality was crucial, because RNA degrades rapidly. “The standard expression in statistics is garbage in, garbage out. It’s very hard to generate a high-integrity dataset with so many components. Every component needs to be high quality for the end result to be reliable,” says Dr. Millstein.