As high-throughput screening technologies are producing increasingly vast quantities of data, correspondingly robust data-analysis tools are necessary to help scientists sift through it for usable information. One solution to the problem is to use bigger, more powerful computers. However, software techniques for data integration and analysis can make huge datasets more manageable for a typical desktop PC. These methods will take center stage at CHI’s upcoming “Bio-IT World” conference.
Developing better tools for the analysis of small-molecule screening data will be the subject of a presentation by the Broad Institute’s Raza Shaikh, Ph.D., associate director of informatics for the chemical biology platform. Broad received a $100 million, five-year grant as part of the Molecular Libraries Probe Production Centers Network (MLPCN) to develop data-analysis tools for PubChem data. The Molecular Libraries Program seeks to bring the tools of high-throughput screening, widely used in the pharmaceutical industry, to public-sector science.
One of the difficulties in dealing with such a volume of data is hit calling and decision making. If an assay is run against 300,000 compounds, thousands of positive hits can result. Many of these will be false positives or will ultimately prove biologically irrelevant. Smarter tools for triaging these hits streamline the decision-making process.
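A common first pass at hit calling, sketched below purely for illustration (this is not the Broad's actual pipeline), is to flag compounds whose measured activity deviates sharply from a plate's negative controls:

```python
# Illustrative hit-calling sketch: flag compounds whose activity z-score,
# computed against the plate's control wells, exceeds a cutoff.
import statistics

def call_hits(activities, controls, z_cutoff=3.0):
    """Return indices of compounds whose |z-score| vs. controls meets the cutoff."""
    mu = statistics.mean(controls)
    sigma = statistics.stdev(controls)
    return [i for i, a in enumerate(activities)
            if abs(a - mu) / sigma >= z_cutoff]

controls = [100, 98, 102, 101, 99, 100]       # hypothetical control readings
activities = [99, 100, 45, 101, 150, 98]      # compounds 2 and 4 deviate strongly
print(call_hits(activities, controls))        # -> [2, 4]
```

A simple cutoff like this is exactly what produces thousands of raw hits from 300,000 compounds; the follow-up work of separating true actives from assay artifacts is what the smarter tools described above aim to automate.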
“The motivation behind the small project that we did is to allow that knowledge search and decision making to happen quickly and in an automated fashion. Rather than manually searching each of these assays one by one in PubChem, our goal is to reduce two days’ worth of work to fifteen minutes,” says Dr. Shaikh.
The program produces a real-time report of compound inhibition in known assays. The greatest challenge, adds Dr. Shaikh, is the variety and inconsistency of the source data. “When people submit data to PubChem, there are countless variations in terminology that have to be individually addressed within the software.”
New genomics techniques like next-generation sequencing produce overwhelmingly large datasets that present a challenge in data analysis. Some solutions are available for data storage and transfer, but analytical manipulation adds an extra computational burden. Common data-analysis steps such as normalization, when applied to a huge dataset, can produce output that overwhelms the memory of a typical computer, forcing it to swap to disk and crippling performance.
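One standard way around that memory ceiling is to process the data out of core, making two streaming passes over a file so the full dataset never has to fit in memory at once. The sketch below illustrates the idea for a simple z-score normalization over one value per line; the file layout is hypothetical:

```python
# Out-of-core normalization sketch: pass 1 accumulates running sums to get
# the mean and standard deviation; pass 2 rewrites each value normalized.
# Memory use is constant regardless of file size.
import math

def streaming_mean_std(path):
    n, total, total_sq = 0, 0.0, 0.0
    with open(path) as f:
        for line in f:
            x = float(line)
            n += 1
            total += x
            total_sq += x * x
    mean = total / n
    std = math.sqrt(total_sq / n - mean * mean)
    return mean, std

def normalize_to(path_in, path_out):
    mean, std = streaming_mean_std(path_in)        # pass 1: statistics
    with open(path_in) as fin, open(path_out, "w") as fout:
        for line in fin:                            # pass 2: rewrite normalized
            fout.write(f"{(float(line) - mean) / std:.6f}\n")
```

The trade-off is a second read of the file in exchange for never materializing the dataset in RAM, which is the essence of the memory-bound problem described below.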
Marc Bouffard, senior bioinformatician, Montreal Heart Institute and Genome Quebec Pharmacogenomics Center, has developed a data-analysis system called CASTOR QC (comprehensive analysis and storage) that solves some of the most common problems in the analysis of genomic data: data redundancy, data formatting, and the use of flat files.
“The idea is to move away from the old-school way of doing things,” says Bouffard. “With smaller studies, people have been able to look at their data—open a file, page down. With the newer studies, that’s just not practical. You really need to move to a data format that is optimized for computer or automated processing.”
Bouffard began developing CASTOR QC because the Montreal Heart Institute wanted a faster way to analyze data from a project examining the toxicity of statins and other lipid-lowering drugs. The project included 7,000 patients with a half-million SNPs each. The conventional analysis methods tended to be memory-bound, processing properly only if the file could fit into memory.
Distributed computing solutions such as cloud computing offer some relief from this bottleneck. However, many genomic experiments, including SNP studies and next-generation sequencing, produce datasets so large that network bandwidth limits the transfer of data into and out of the remote system.
“We basically started from the beginning and looked at the data to find a way to store it without losing any information in a format that is optimized for analysis,” says Bouffard. “So far the results are very good.” A side benefit of this more efficient method of analyzing data is that the datasets are much smaller—as little as 15% of the original size—which reduces the cost of data storage.
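The article does not describe CASTOR QC's internal format, but the kind of lossless shrinkage Bouffard mentions is easy to illustrate. In a flat text file, each SNP call costs several bytes; in a binary encoding, a biallelic genotype (0, 1, 2, or missing) needs only 2 bits. The following is a hypothetical sketch of that idea, not the actual CASTOR QC format:

```python
# Hypothetical 2-bit genotype packing: four SNP calls per byte, with no
# information lost. A text file spending ~2 bytes per call ("0," etc.)
# shrinks by roughly 8x under this scheme.
def pack_genotypes(calls):
    """Pack a list of genotype calls (each 0-3) into bytes, 4 calls per byte."""
    out = bytearray()
    for i in range(0, len(calls), 4):
        byte = 0
        for j, c in enumerate(calls[i:i + 4]):
            byte |= (c & 0b11) << (2 * j)
        out.append(byte)
    return bytes(out)

def unpack_genotypes(packed, n):
    """Recover the first n genotype calls from packed bytes."""
    return [(packed[i // 4] >> (2 * (i % 4))) & 0b11 for i in range(n)]

calls = [0, 1, 2, 3, 2, 1]
assert unpack_genotypes(pack_genotypes(calls), len(calls)) == calls
```

Beyond saving storage, a fixed-width binary layout like this lets analysis code seek directly to any patient or SNP without parsing text, which is what "optimized for analysis" means in practice.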
A presentation by Joshua Millstein, Ph.D., senior scientist in statistical genetics at Sage Bionetworks, will show how integrative genomics can be used to intensively characterize psychiatric phenotypes in sleep disorders in a genome-wide microarray expression study, with the ultimate goal of identifying targets and biomarkers for therapy. Dr. Millstein used a line of inbred mice to identify genes that associate with a sleep trait that is also associated with depression.
“We’re also placing these genes in the context of cell-wide networks of gene expression and identifying regions of these networks that are associated with clinical traits of interest. In this way we can place these genes in the context of global expression and determine whether they’re likely to cause cell-wide changes in expression or influence important genes,” says Dr. Millstein.
To perform the data analysis, the team built co-expression networks using soft thresholds to determine relationships between genes and to identify co-expressed modules—groups of genes that tend to be expressed together across individuals. Bayesian modeling techniques incorporated genotype information. They also used a causal driver method, a network-based approach that identifies nodes in the network that influence other nodes.
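The soft-threshold step can be sketched as follows. In WGCNA-style network construction (a standard approach of this kind, though the article does not name the team's exact software), the absolute gene–gene correlation is raised to a power so that weak correlations shrink toward zero while strong ones are preserved, rather than applying a hard cutoff. The expression values below are invented for illustration:

```python
# Minimal soft-threshold adjacency sketch: adjacency = |corr|^beta.
# Raising to the power beta ("soft thresholding") suppresses weak links
# continuously instead of cutting the network at a hard correlation value.
import math

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def soft_adjacency(expr, beta=6):
    """expr: one expression vector per gene. Returns the adjacency matrix."""
    g = len(expr)
    return [[abs(pearson(expr[i], expr[j])) ** beta for j in range(g)]
            for i in range(g)]

# Genes 0 and 1 move together across samples; gene 2 is unrelated.
expr = [[1.0, 2.0, 3.0, 4.0],
        [1.1, 2.0, 2.9, 4.2],
        [3.0, 1.0, 4.0, 2.0]]
adj = soft_adjacency(expr, beta=6)
```

Clustering on an adjacency matrix like this is what yields the co-expressed modules: genes whose mutual adjacencies stay high form a module, while soft-thresholded weak links fall away.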
Collecting the data involved a labor-intensive quality-control methodology. The EEG and EMG sleep data were collected over a period of 48 hours for each mouse, but the experiment spanned an entire year. This meant a scientist had to check each dataset to ensure that it mapped correctly and follow up on any unusual patterns. RNA quality was crucial, because RNA degrades rapidly. “The standard expression in statistics is garbage in, garbage out. It’s very hard to generate a high-integrity dataset with so many components. Every component needs to be high quality for the end result to be reliable,” says Dr. Millstein.