September 15, 2012 (Vol. 32, No. 16)
Standardization Is the Key to Deriving Deep Insights from Biological Data
The technological landscape of the past 50 years or so has been dominated by electronics in its various forms: digital computing, wired and wireless communication, the miniaturization of components, and the like, all at costs that decrease according to Moore's law: a reduction of roughly 50% every 18 months.
But it is now widely believed that the next 50 years will belong to biology. The publication of the draft human genome in 2001 captured the public imagination, and there was great anticipation that post-genomic biology would be radically different from what had gone before, leading to rapid advances in diagnostics and therapy.
In reality, much of that promise remains just that: promise. However, the dramatic cost reductions witnessed earlier in the world of electronics are now manifesting themselves in the world of biology. Whereas the Human Genome Project took about 10 years and around $3.5 billion to generate a rough draft that was only about 98% accurate, it is now possible to sequence individual human genomes at far higher accuracy for less than $5,000 each.
If one does not insist on sequencing an entire genome, but focuses on detecting mutations at specific locations in the DNA, the cost is even lower. This has encouraged the re-sequencing of a great many diseased tissues, especially in cancer.
For example, The Cancer Genome Atlas (TCGA) is an ambitious project to achieve comprehensive molecular characterization of every cancerous tissue currently preserved, including exome sequence, DNA copy number, promoter methylation, and expression analysis of messenger RNA and microRNA. The project has already produced a massive amount of information that the research community can use to fine-tune its diagnostic and therapeutic tools.
All of these advances have resulted in a subtle shift in the balance between data generation and data analysis. In earlier years biology was viewed primarily as an experimental science; nowadays it is just as much a computational science as an experimental one. In other words, data must be turned into information, and information into actionable knowledge and experimentally testable hypotheses.
Data analysis is a natural activity for the engineering community and affords an opportunity for engineers to work hand in hand with biologists to develop new insights into disease mechanisms, to identify biomarkers that can predict which patients will respond to which therapy, and to provide mechanistic (cause and effect) explanations as to why these are biomarkers.
Going forward, the landscape of cancer patients will resemble a mosaic consisting of groups that are highly coherent within themselves but substantially different across groups, and treatments will be customized to each coherent group. It would be appropriate to refer to this approach as targeted medicine, though this is often mislabeled as personalized medicine.
The analysis of massive datasets poses as much of a challenge to engineers as to biologists, because many of the currently popular methods in engineering will simply not work when applied to biological datasets. One important difference is that many engineering datasets are characterized by a very large number of samples and a far smaller number of features.
For example, in order to train a machine to recognize faces, it is easy to acquire a million facial images, whereas the number of features that are extracted from each image will be of the order of a hundred. The situation is just the inverse in biological datasets, where the number of features is a few orders of magnitude larger than the number of samples.
To illustrate, a whole genome expression study of tissues from a particular form of cancer will generate about 20,000 expression levels (features) for each tissue, while the number of tissues (samples) will at best be in the hundreds, and often in the dozens. The challenge in this situation is to identify a handful of key features that can serve to cluster the samples into coherent classes. It is necessary to develop new algorithms that are specifically tailored to address this inverted situation. My students and I have developed one such algorithm, somewhat whimsically named lone star, but much work remains to be done.
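A toy sketch of this inverted setting may help (synthetic data throughout; variance ranking and 2-means clustering are generic stand-ins for illustration, not the lone star algorithm, whose details are not given here). With 60 samples and 20,000 features, ranking features by variance can recover the handful that actually separate the samples into classes:

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 60, 20_000  # dozens of tissues, ~20k expression levels

# Synthetic expression matrix: two hidden sample classes differ in 5 genes.
X = rng.normal(size=(n_samples, n_features))
labels = np.repeat([0, 1], n_samples // 2)
X[labels == 1, :5] += 3.0  # the handful of informative features

# Rank features by variance across samples and keep only the top few.
top = np.argsort(X.var(axis=0))[::-1][:5]
X_small = X[:, top]

# Minimal 2-means clustering on the reduced data, seeded at two extremes.
centers = X_small[[np.argmin(X_small[:, 0]), np.argmax(X_small[:, 0])]]
for _ in range(20):
    assign = ((X_small[:, None] - centers) ** 2).sum(-1).argmin(1)
    centers = np.array([X_small[assign == k].mean(0) for k in range(2)])

# Agreement with the hidden classes, up to a relabeling of the clusters.
agreement = max((assign == labels).mean(), (assign != labels).mean())
```

Note that the same variance ranking applied directly to all 20,000 features of real data would be swamped by noise and batch effects; the point of the sketch is only the shape of the problem, a few informative features hidden among thousands.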
Any attempts to replicate machine learning algorithms across multiple biological datasets, even when they ostensibly pertain to the same form of cancer, will have to squarely confront the lack of standardization in the manner in which such data is generated. Multiple platforms are used to generate the data, and the numbers generated cannot always be harmonized with each other. Complicating the situation is the post-processing of the raw data, often described as normalization.
Different groups use different methods of normalization, and the papers do not always describe the methods used in sufficient detail. As a result, even when different datasets are allegedly generated using the same platform, there is no internal consistency to the numbers. The enunciation and universal adoption of standards is one of the great accomplishments of the engineering community, and if we are able to convey some of that philosophy to biologists, then we would have rendered a significant service.
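A toy sketch makes the normalization problem concrete (synthetic numbers; the two schemes below are generic stand-ins for the many procedures in use). The very same raw intensities, run through two reasonable normalizations, land on scales that cannot be pooled:

```python
import numpy as np

rng = np.random.default_rng(1)
raw = rng.lognormal(mean=5.0, sigma=1.0, size=1000)  # toy raw intensities

# Scheme A: log2-transform, then z-score the array.
log2 = np.log2(raw)
z = (log2 - log2.mean()) / log2.std()

# Scheme B: quantile-normalize against a fixed reference distribution.
ranks = raw.argsort().argsort()  # rank of each probe within the array
reference = np.sort(rng.lognormal(mean=5.0, sigma=1.0, size=1000))
q = np.log2(reference[ranks])

# Both outputs are monotone functions of the same raw data, yet their
# scales differ so much that mixing A- and B-normalized datasets in one
# analysis, without knowing which was applied, is meaningless.
scale_gap = abs(q.mean() - z.mean())
```

If a paper reports only "the data were normalized," a reader cannot tell which of these (or many other) transformations produced the published numbers, which is precisely why cross-dataset replication fails.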
Identifying key features using data analysis would not, by itself, satisfy biologists, and rightly so. The mathematician René Thom observed that the ability to make accurate predictions should not be confused with an understanding of the underlying structure. To put it more plainly, it is not enough to know which genes are biomarkers; one must also know why they are biomarkers.
Toward this end, the computational biology community has developed several algorithms for reverse-engineering context-specific, genomewide networks from expression data. Almost all of these methods make use of quite advanced methods from probability and statistics, including information theory.
It is interesting to note that information theory, originally developed in 1948 to analyze the propagation of electrical signals along telephone wires, is now being applied so effectively to problems in biology. The reverse-engineered networks can be used to derive deep insights from the wealth of mutation data being generated through TCGA.
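A minimal sketch of such network inference (synthetic data; the histogram estimator, bin count, and edge threshold are illustrative choices in the spirit of mutual-information relevance networks, not any specific published algorithm) estimates the mutual information between every pair of genes and draws an edge where it is large:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500

# Toy expression profiles: gene B is driven by gene A; gene C is independent.
a = rng.normal(size=n)
b = a + 0.3 * rng.normal(size=n)
c = rng.normal(size=n)
genes = {"A": a, "B": b, "C": c}

def mutual_info(x, y, bins=10):
    """Plug-in histogram estimate of mutual information I(X;Y), in bits."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    p = joint / joint.sum()
    px = p.sum(axis=1, keepdims=True)  # marginal of X
    py = p.sum(axis=0, keepdims=True)  # marginal of Y
    nz = p > 0  # only nonzero cells contribute to the sum
    return float((p[nz] * np.log2(p[nz] / (px * py)[nz])).sum())

# Draw an edge whenever the estimated MI clears a hand-picked threshold.
names = list(genes)
edges = {(u, v)
         for i, u in enumerate(names) for v in names[i + 1:]
         if mutual_info(genes[u], genes[v]) > 0.5}
```

Unlike correlation, mutual information also captures nonlinear dependence, which is one reason these information-theoretic methods have proved attractive for expression data.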
Available data have shown that mutations in cancerous tissues are far less common than the biology community had supposed; many genes were found to be mutated in just a single sample. Since it is widely accepted that mutations (or at least alterations in gene functioning) are key to understanding the onset and progression of cancer, one is led to look for genomic machines: collections of genes that act in concert to perform a common set of functions.
It can be hypothesized that a mutation in any one of the constituent genes of a machine can cause the functioning of the machine to go awry, potentially leading to cancer. Thus the genomewide networks give the community a powerful tool for developing a better understanding of the role of mutations in cancer.
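This hypothesis suggests a simple sanity check, sketched here on fabricated mutation calls (the 5-gene "machine," the mutation rates, and the matrix are all made up for illustration): each member gene is mutated only rarely, yet the machine as a whole is disrupted in a large fraction of samples.

```python
import numpy as np

rng = np.random.default_rng(3)
n_samples, n_genes = 200, 50

# Toy binary mutation calls: a 1% background rate everywhere, plus, in 40%
# of samples, a mutation in one member of a hypothetical 5-gene machine.
muts = rng.random((n_samples, n_genes)) < 0.01
machine = [0, 1, 2, 3, 4]
hit = rng.random(n_samples) < 0.4
muts[hit, rng.integers(0, len(machine), hit.sum())] = True

# Frequency of each member gene vs. frequency of the machine being hit at all.
per_gene = muts[:, machine].mean(axis=0)
machine_freq = muts[:, machine].any(axis=1).mean()
```

In this toy setting each member gene looks unremarkable on its own, while the machine-level mutation frequency is several times higher, which is exactly the signature one would scan the reverse-engineered networks for.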
Mathukumalli Vidyasagar is the Cecil & Ida Green Chair in Systems Biology Science, University of Texas at Dallas. He is also co-chair of the IEEE Life Sciences Initiative.