Large-scale Data Integration
“One of the biggest challenges for systems biology is integration of data,” says Susie Stephens, Ph.D., principal product manager of life sciences at Oracle (www.oracle.com). “Many data types have to be correctly integrated into data models. Data models have to be frequently adapted to changing schema and new data sources.”
In recent years, the Semantic Web has emerged as a standards-based approach for flexible integration of heterogeneous data using explicit semantics. In 2004, the World Wide Web Consortium (W3C) approved two key Semantic Web technologies, the Resource Description Framework (RDF) and the Web Ontology Language (OWL), as standards for data integration and analysis.
RDF represents data in the form of triples: subject, predicate, and object. As applied to systems biology, a triple might assert, for example, that a particular protein (subject) interacts with (predicate) another protein (object). A unique identifier is assigned to each component of the triple. Triples are stored in a database and can be queried by relationship as well as by subject and object.
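A minimal sketch of the triple model, using plain Python rather than an RDF library, can show how statements keyed by subject, predicate, and object support querying by any component. The `bio:` identifiers and protein names here are hypothetical placeholders, not taken from the article:

```python
# Each RDF-style statement is a (subject, predicate, object) tuple of
# identifiers; a set of such tuples is a toy triple store.
triples = {
    ("bio:ProteinA", "bio:interactsWith", "bio:ProteinB"),
    ("bio:ProteinB", "bio:locatedIn", "bio:Nucleus"),
    ("bio:ProteinA", "bio:locatedIn", "bio:Cytoplasm"),
}

def query(s=None, p=None, o=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# Query by relationship (predicate) alone -- two matches here:
located = query(p="bio:locatedIn")
```

The point of the wildcard pattern is that the predicate is a first-class, queryable part of the data, which is what distinguishes triples from a fixed relational schema.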
SPARQL is the query language that W3C is finalizing for querying triples. “Oracle supports open standards and recognizes the need for flexible data integration in drug development and biological research,” notes Dr. Stephens. “It is the only enterprise software company to have support for RDF in the database.”
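To illustrate, a SPARQL query matches triple patterns in which variables (prefixed with `?`) stand in for unknown components. The namespace and identifiers below are hypothetical; a query for everything a given protein interacts with might look like:

```sparql
PREFIX bio: <http://example.org/bio/>
SELECT ?partner
WHERE {
  bio:ProteinA bio:interactsWith ?partner .
}
```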
Oracle Database 10g Release 2 provides support for a secure and highly available RDF data model. “Our support for RDF is characterized by scalable data storage and efficient querying,” adds Dr. Stephens.
The need for improved data analysis in computational biology is recognized by the Computational Biology Center at IBM (www.research.ibm.com).
“We focus on developing tools for LC-MS analysis that are able to deal with large quantities of data and large dynamic ranges, provide quantitative information on all peaks, and improve signal to noise resolution,” says Frank Suits, Ph.D., research staff member.
IBM’s approach emphasizes 2-D algorithms and ensemble methods in the analysis of LC-MS data. “We delay thresholding to avoid throwing away faint signals that may have biological significance. Proteins with low abundance may be important markers of biological processes,” adds Dr. Suits. The key method is converting the 2-D raw data that comes off a mass spectrometer into a mesh, in which time (t) and mass-to-charge ratio (m/z) form a regular grid.
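The mesh construction can be sketched as a binning step: irregular (t, m/z, intensity) readings are accumulated into cells of a regular grid so that 2-D algorithms can operate on it. This is a simplified illustration under assumed inputs, not IBM’s actual implementation; the bin edges and point values are invented for the example:

```python
from bisect import bisect_right

def to_mesh(points, t_edges, mz_edges):
    """Sum intensities into grid cells.

    points   -- iterable of (t, mz, intensity) readings
    t_edges  -- sorted bin boundaries along the time axis
    mz_edges -- sorted bin boundaries along the m/z axis
    """
    rows, cols = len(t_edges) - 1, len(mz_edges) - 1
    mesh = [[0.0] * cols for _ in range(rows)]
    for t, mz, intensity in points:
        i = bisect_right(t_edges, t) - 1   # time bin index
        j = bisect_right(mz_edges, mz) - 1  # m/z bin index
        if 0 <= i < rows and 0 <= j < cols:  # drop out-of-range readings
            mesh[i][j] += intensity
    return mesh
```

Because nearby readings accumulate into the same cell, faint signals are preserved on the grid rather than being thresholded away early, consistent with the delayed-thresholding strategy described above.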
After that, IBM applies a novel peak detection algorithm, which results in unambiguous identification of all peaks in the sample, each defined by its height, volume, local background, width, m/z, and t. Next, the meshes are aligned in time by a novel 2-D warping procedure that corrects for differences in liquid chromatography between samples.
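As a toy stand-in for that peak detector, the sketch below flags grid cells that strictly exceed their four neighbors and records their height and grid position. The real algorithm also reports volume, width, and local background per peak; those are omitted here, and `min_height` is an assumed parameter:

```python
def find_peaks(mesh, min_height=0.0):
    """Return cells that strictly exceed all in-bounds 4-neighbours."""
    peaks = []
    rows, cols = len(mesh), len(mesh[0])
    for i in range(rows):
        for j in range(cols):
            h = mesh[i][j]
            if h < min_height:
                continue
            neighbours = [mesh[x][y]
                          for x, y in ((i - 1, j), (i + 1, j),
                                       (i, j - 1), (i, j + 1))
                          if 0 <= x < rows and 0 <= y < cols]
            if all(h > n for n in neighbours):
                peaks.append({"t_bin": i, "mz_bin": j, "height": h})
    return peaks
```

Detecting peaks on the full 2-D mesh before any alignment is what lets each peak carry its own (t, m/z) coordinates into the subsequent warping step.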
“The critical distinction of IBM’s methods is the sequence of the analysis. We find all peaks in LC-MS data first and only then apply warping. It has clear advantages in resolving ambiguities in identification and correlation of individual peaks,” says Dr. Suits.
Once the multiple meshes are aligned in time, the ensemble of samples allows correlated peaks to be visualized as clusters that differ by color (prevalence of one type of sample over another) and by size (strength of contributing peaks). Peaks resulting from random fluctuations and electronic noise will show as “sand”. “At this point, we can discard orphaned peaks with greater confidence,” concludes Dr. Suits.
MedImmune (www.medimmune.com) utilizes systems biology approaches to simulate interactions of a virus with the host organism. Its computational biology team sifts through available data and extracts relevant information about cellular networks and responses to viral invasion.