Systems biology is the study of the interactions between the components of a biological system and how these interactions give rise to the function and behavior of that system. The objective is to model as many interactions as possible at the molecular, cellular, and whole-organism levels. Systems biology assembles and integrates complex data from a variety of sources and experimental techniques. Computational modeling plays a key role in the systems biology approach, as it allows observation of the ripple effects of biological, chemical, or genetic perturbations.
Large-scale Data Integration
“One of the biggest challenges for systems biology is integration of data,” says Susie Stephens, Ph.D., principal product manager of life sciences at Oracle (www.oracle.com). “Many data types have to be correctly integrated into data models. Data models have to be frequently adapted to changing schema and new data sources.”
In recent years, the Semantic Web has emerged as a standards-based approach for flexible integration of heterogeneous data using explicit semantics. In 2004, the World Wide Web Consortium (W3C) approved two key Semantic Web technologies, the Resource Description Framework (RDF) and the Web Ontology Language (OWL), as standards for data integration and analysis.
RDF represents data in the form of triples: subject, predicate, and object. As applied to systems biology, a triple may be . A unique identifier is assigned to each component of the triple. Triples are stored in a database and can be queried by relationships as well as by subject and object.
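The triple model can be illustrated with a minimal sketch in Python. It is not Oracle's implementation and it does not use SPARQL; the `ex:` identifiers, the `Triple` type, and the `match` helper are all hypothetical names chosen for illustration. The sketch only shows how a set of triples can be filtered by relationship as well as by subject or object.

```python
from typing import Iterable, Iterator, NamedTuple, Optional


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# Hypothetical identifiers, made up for illustration only.
TRIPLES = [
    Triple("ex:ProteinA", "ex:interactsWith", "ex:ProteinB"),
    Triple("ex:ProteinA", "ex:encodedBy", "ex:GeneA"),
    Triple("ex:ProteinB", "ex:participatesIn", "ex:PathwayX"),
]


def match(
    triples: Iterable[Triple],
    subject: Optional[str] = None,
    predicate: Optional[str] = None,
    obj: Optional[str] = None,
) -> Iterator[Triple]:
    """Yield triples matching every field that is not None (None is a wildcard)."""
    for t in triples:
        if subject is not None and t.subject != subject:
            continue
        if predicate is not None and t.predicate != predicate:
            continue
        if obj is not None and t.obj != obj:
            continue
        yield t


# Query by relationship: which statements use "ex:interactsWith"?
interactions = list(match(TRIPLES, predicate="ex:interactsWith"))
# Query by subject: everything stated about "ex:ProteinA".
about_protein_a = list(match(TRIPLES, subject="ex:ProteinA"))
```

A production triple store would index the three positions rather than scan a list, but the wildcard pattern above is the same idea that a triple query language expresses.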
SPARQL is the query language that W3C is finalizing for querying triples. “Oracle supports open standards and recognizes the need for flexible data integration in drug development and biological research,” notes Dr. Stephens. “It is the only enterprise software company to have support for RDF in the database.”
Oracle Database 10g Release 2 provides support for a secure and highly available RDF data model. “Our support for RDF is characterized by scalable data storage and efficient querying,” adds Dr. Stephens.
The need for improved data analysis in computational biology is recognized by the Computational Biology Center at IBM (www.research.ibm.com).
“We focus on developing tools for LC-MS analysis that are able to deal with large quantities of data and large dynamic ranges, provide quantitative information on all peaks, and improve signal-to-noise resolution,” says Frank Suits, Ph.D., research staff member.
IBM’s approach emphasizes 2-D algorithms and ensemble methods in the analysis of LC-MS data. “We delay thresholding to avoid throwing away faint signals that may have biological significance. Proteins with low abundance may be important markers of biological processes,” adds Dr. Suits. The key method is the conversion of 2-D raw data that comes off a mass-spec into a mesh, where time (t) and mass/charge ratio (m/z) form a regular grid.
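As a rough illustration of the gridding step, the following Python sketch accumulates raw readings onto a regular time by m/z grid. It is an assumption-laden stand-in, not IBM's code: the function name `to_mesh`, the tuple layout `(t, mz, intensity)`, and the bin sizes are all hypothetical. No threshold is applied, in keeping with the idea of retaining faint signals until later in the analysis.

```python
def to_mesh(points, t_min, t_step, n_t, mz_min, mz_step, n_mz):
    """Accumulate (t, mz, intensity) readings onto a regular n_t x n_mz grid.

    Readings that fall outside the grid are ignored. Intensities are summed
    per cell and no threshold is applied.
    """
    mesh = [[0.0] * n_mz for _ in range(n_t)]
    for t, mz, intensity in points:
        i = int((t - t_min) // t_step)     # row index along the time axis
        j = int((mz - mz_min) // mz_step)  # column index along the m/z axis
        if 0 <= i < n_t and 0 <= j < n_mz:
            mesh[i][j] += intensity
    return mesh


# Made-up readings for illustration: (time, m/z, intensity).
readings = [
    (0.2, 400.1, 5.0),
    (0.7, 400.3, 3.0),
    (1.5, 400.9, 2.0),
    (9.0, 400.0, 1.0),  # outside the 2-row time window, so it is skipped
]
mesh = to_mesh(readings, t_min=0.0, t_step=1.0, n_t=2,
               mz_min=400.0, mz_step=0.5, n_mz=2)
```

Summing into cells is only one possible choice; interpolation onto the grid would be another, and the article does not say which IBM uses.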
After that, IBM applies a novel peak detection algorithm, which results in unambiguous identification of all peaks in the sample, each defined by its height, volume, local background, width, m/z, and t. Next, the meshes are aligned in time by a novel 2-D warping procedure that corrects for differences in liquid chromatography between samples.
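IBM's peak detection algorithm is not described in detail here, so the sketch below substitutes a much simpler technique: it reports every grid cell that is strictly higher than all of its neighbors. The function name `find_peaks`, the dictionary keys, and the use of the mean neighbor value as a "local background" are assumptions made for illustration; volume, width, and the warping step are not modeled.

```python
def find_peaks(mesh):
    """Return cells strictly greater than all of their in-grid neighbours.

    Each peak records its grid position, its height, and the mean of the
    surrounding cells as a crude local background estimate.
    """
    peaks = []
    n_t = len(mesh)
    n_mz = len(mesh[0]) if mesh else 0
    for i in range(n_t):
        for j in range(n_mz):
            neighbours = [
                mesh[a][b]
                for a in range(max(0, i - 1), min(n_t, i + 2))
                for b in range(max(0, j - 1), min(n_mz, j + 2))
                if (a, b) != (i, j)
            ]
            if neighbours and all(mesh[i][j] > v for v in neighbours):
                peaks.append({
                    "t_index": i,
                    "mz_index": j,
                    "height": mesh[i][j],
                    "local_background": sum(neighbours) / len(neighbours),
                })
    return peaks


# Made-up grid: one strong peak at (1, 1) and one faint peak at (2, 3).
example_mesh = [
    [0, 1, 0, 0],
    [1, 9, 1, 0],
    [0, 1, 0, 2],
    [0, 0, 0, 0],
]
example_peaks = find_peaks(example_mesh)
```

Because nothing is thresholded, the faint peak is reported alongside the strong one; deciding which peaks matter is deferred to a later step.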
“The critical distinction of IBM’s methods is the sequence of the analysis. We find all peaks in LC-MS data first and only then apply warping. It has clear advantages in resolving ambiguities in identification and correlation of individual peaks,” says Dr. Suits.
Once the multiple meshes are aligned in time, the ensemble of samples allows correlated peaks to be visualized as clusters that differ by color (prevalence of one type of sample over another) and by size (strength of contributing peaks). Peaks resulting from random fluctuations and electronic noise will show as “sand”. “At this point, we can discard orphaned peaks with greater confidence,” concludes Dr. Suits.
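The idea of discarding orphaned peaks once samples are aligned can be sketched as follows. This is a simplified illustration rather than IBM's method: peaks are grouped by rounding their time and m/z values into tolerance-sized bins, and a cluster is kept only if at least two distinct samples contribute to it. The names `cluster_peaks` and `drop_orphans`, the tolerances, and the example values are hypothetical.

```python
from collections import defaultdict


def cluster_peaks(peaks, t_tol, mz_tol):
    """Group (sample, t, mz, height) peaks that fall in the same tolerance bin."""
    clusters = defaultdict(list)
    for sample, t, mz, height in peaks:
        key = (round(t / t_tol), round(mz / mz_tol))
        clusters[key].append((sample, t, mz, height))
    return list(clusters.values())


def drop_orphans(clusters, min_samples=2):
    """Keep only clusters supported by at least min_samples distinct samples."""
    return [c for c in clusters if len({p[0] for p in c}) >= min_samples]


# Made-up, already time-aligned peaks: (sample, time, m/z, height).
aligned = [
    ("s1", 10.0, 500.0, 7.0),
    ("s2", 10.1, 500.0, 6.0),
    ("s1", 30.0, 620.0, 0.4),  # seen in one sample only: an orphan
]
all_clusters = cluster_peaks(aligned, t_tol=1.0, mz_tol=0.5)
kept = drop_orphans(all_clusters)
```

Fixed bins can split two nearby peaks that straddle a bin edge, so a real pipeline would likely use a distance-based grouping instead.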
MedImmune (www.medimmune.com) utilizes systems biology approaches to simulate interactions of a virus with the host organism. Its computational biology team sifts through available data and extracts relevant information about cellular networks and responses to viral invasion.
The textual knowledge is then structured to construct large-scale predictive gene networks and protein interaction networks. Data is aggregated in MedImmune’s bioinformatics database, which serves to characterize the role of the gene or protein in the viral disease process, to elucidate integrative mechanisms that underlie viral diseases, and to identify potential viral vaccine targets.
“Our database presents an integrated view of the entire spectrum of virus-host interactions, including effects of viral mutations or stages of viral infection,” says Qing Yan, M.D., Ph.D., a bioinformatics scientist at MedImmune. “We apply computational modeling and simulation to visualize viral infection on molecular, cellular, and whole-organism levels.”
George W. Kemble, Ph.D., vp of research and development and general manager of California operations, adds, “Development of viral vaccines continues to be an important part of MedImmune’s development strategy. As we now understand, a virus can interact with the host using multiple mechanisms. For example, six segments of the influenza virus could interact with three human proteins, leading to the same end result, apoptosis.
“Such mechanisms preclude efficient development of a small molecule drug against a virus. However, our database would help us to understand how an attenuated virus affects the cell and could lead to the development of more efficient antiviral therapies.”
“Ultimately, it is all about improving patient outcomes and lowering healthcare costs,” says David W. Moskowitz, M.D., CEO of GenoMed (www.genomedics.com). “Most current chemotherapy drugs act on the downstream components in the cancer cascade. We propose to manage disease by specifically affecting the upstream components. Blocking an early step with only 50% efficiency may be more effective clinically than blocking a later step with 99% efficiency.”
The company’s research is based on the assumption that polygenic adult diseases result from subtle changes in regulatory regions. “The biggest fallacy of genomic research is applying the rules of Mendelian genetics to a complex disease and thus to keep looking for a critical mutation in the coding region,” adds Dr. Moskowitz.
GenoMed predicts that as many as 5,000–7,500 genes could be involved in development of a cancer. Its Healthchip® is a comprehensive database of SNPs unique to each type of common cancer: breast, colon, lung, ovarian, pancreas, and prostate in Caucasians. These SNPs are generally located within the first 2 kb upstream of the transcription start site.
The company applies a genomic epidemiology approach to determine the odds ratio, which compares the odds of carrying the variant in the diseased population with the odds in the normal population. An SNP with an odds ratio greater than one is associated with increased risk of the disease, whereas an odds ratio less than one is associated with protection against it. For each cancer, GenoMed identified over 400 SNPs with an odds ratio greater than five, which the company regards as the most attractive targets for drug development.
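For readers unfamiliar with the measure, an odds ratio can be computed from a two-by-two table of carriers and noncarriers among cases and controls. The Python sketch below uses the standard cross-product formula; the function name and the counts are made up for illustration and are not GenoMed data.

```python
def odds_ratio(case_carriers, case_noncarriers,
               control_carriers, control_noncarriers):
    """Odds of carrying the variant among cases divided by the odds among controls."""
    if case_noncarriers == 0 or control_carriers == 0:
        raise ValueError("odds ratio is undefined when a denominator count is zero")
    return (case_carriers * control_noncarriers) / (case_noncarriers * control_carriers)


# Made-up counts: 60 of 100 cases and 20 of 100 controls carry the variant.
example_or = odds_ratio(60, 40, 20, 80)  # (60 * 80) / (40 * 20) = 6.0
```

With these invented counts the odds ratio is 6.0, which would fall above the threshold of five mentioned in the text. In practice a confidence interval would accompany the point estimate.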
“The genes that our SNPs regulate make up the biological pathways of cancer causation, that is, the systems biology of cancer,” adds Dr. Moskowitz. “They are the earliest steps in the cancer cascade.” GenoMed’s first study identified angiotensin I-converting enzyme (ACE), and angiotensin II, the product of the angiotensin I conversion, as early steps in development of most cancers, kidney failure, HIV infectivity, neurodegenerative diseases, diabetes, and autoimmune diseases.
Treatment with angiotensin II receptor blockers demonstrated favorable clinical outcomes in patients with chronic psoriasis. “Our business model at this point is a peer-reviewed, virtual pharmaceutical company™. We involve academics and small biotech companies to help with drug design and development. Independent expertise and hundreds of high-quality drug targets increase our chances for success,” concludes Dr. Moskowitz.
“Most of today’s compounds are rarely aimed at the underlying causes of the disease,” agrees Jeff Gulcher, M.D., Ph.D., CSO of deCODE Genetics (www.decode.com). “The basic biology and pathogenesis of diseases common in the population, such as heart attack or asthma, are complex. Multiple pathways and environmental factors contribute to the disease. Subtle deviations from the baseline may be enough to result in pathology.”
deCODE Genetics utilizes population-wide genomic studies to identify key genes that tag critical pathways involved in modulating risk of common diseases. “We have established the largest genotyping laboratory in the world,” continues Dr. Gulcher.
“We scan thousands of genomes for microsatellite markers and overlap the findings with the family genealogy and medical history. Once the chromosomal region is identified, we proceed with association studies to identify common gene variants conferring significant risk of disease. Then, we develop an understanding of how these risk variants perturb the activity of the biological pathway.”
The company’s reference library is composed of over 100,000 DNA samples from the Icelandic population. The findings are replicated in other populations. To date, deCODE has isolated 15 genes and drug targets in 12 common diseases and mapped genes in some 16 more.
It also discovers small molecule compounds that can correct the activity of the pathway in question. DG051, a first-in-class small molecule drug developed by deCODE’s chemistry group, is an inhibitor of leukotriene A4 hydrolase (LTA4H). It is being developed for the prevention of heart attacks and is currently in Phase I testing.
“Our genetics program identified two potential targets in the leukotriene B4 (LTB4) pathway,” says Mark Gurney, Ph.D., senior vp of drug development. “LTB4 contributes to inflammation, increasing the propensity of the plaque to rupture. Certain noncoding polymorphisms result in the increase in expression of LTA4H, leading to overproduction of LTB4. Depending on the ethnic background, this increases the risk of heart attack by 20 to 40%, or even 100%.”
LTB4 is a part of the innate immunity response, and its overproduction was beneficial in protecting humans from pandemic bacterial infections. In the modern world, the same process seems to exacerbate the inflammation within atherosclerotic plaque. DG051 inhibits LTA4H and reduces LTB4 levels in a dose-dependent manner.
“There is no simple assay for measuring LTB4,” adds Dr. Gulcher. “However, genetic variants can be typed reliably. Physicians can use genotyping to determine whether a patient is at increased risk of heart attack due to LTB4 overproduction and then prescribe a drug like DG051 to specifically mitigate this risk.”