Patricia F. Fitzpatrick Dimond Ph.D. Technical Editor of Clinical OMICs President of BioInsight Communications

Data Mining Unearths Pathways and Networks That Yield Clinical Gold

During a panel discussion among scientists at the World Economic Forum in Davos, Switzerland, in January 2016, moderator U.S. Vice President Joe Biden asked for examples of obstacles researchers and clinicians face in the effort to cure cancer. While several topics emerged, the big issue was Big Data—more particularly, the collection, analysis, and application of Big Data.

The “Big” in Big Data may be taken to refer to the size of the datasets that are being amassed, or the importance of what these datasets, properly analyzed, might reveal. In either case, Big Data in practice amounts to the analysis of huge datasets to identify trends, find associations, and spot patterns.

Big Data is effective, some researchers say, because there is simply so much information available that can be analyzed. Large sample sizes, they point out, may reveal details that would normally go unnoticed in smaller sample sizes. Other researchers, however, contend that Big Data needs more than, well, lots of data. One such researcher is Keith Perry, the senior vice president and chief information officer at St. Jude Children’s Research Hospital in Memphis, TN.

When Mr. Perry was still working at the MD Anderson Cancer Center in Houston, TX, he was quoted in an institutional newsletter as follows: “Big Data is not just ‘big.’ The term also implies three additional qualities: multiple varieties of data types, the velocity at which the data is generated, and the [degree to which voluminous datasets are integrated].”

“Many of our databases currently don’t interface with each other because they’re generated by and housed in separate prevention, research, and clinical departments,” added Mr. Perry, contrasting the reality of these disparate structures with the potential of a centralized platform.

Another researcher who believes that size is not all that matters is Narayan Desai, Ph.D., a computer scientist at communications company Ericsson in San Jose, CA. He was quoted in a 2015 news article in Nature as follows: “Genomics will have to address the fundamental question of how much data it should generate.”

“The world has a limited capacity for data collection and analysis, and it should be used well,” he continued. “Because of the accessibility of sequencing, the explosive growth of the community has occurred in a largely decentralized fashion, which can’t easily address questions like this.”

Hidden Weaknesses

Recently, some scientists have proposed that more focused and creative use of existing data could provide clinical guidance now. For example, Nevan Krogan, Ph.D., a professor of cellular and molecular pharmacology at the University of California, San Francisco (UCSF), has argued that genomics has brought us closer to a revolution in cancer treatment than most geneticists even realize.

“The sequencers say we just need to pour more money into sequencing and the answer will become clear,” said Dr. Krogan. “We say no. We’re actually done. We’ve reached a point of saturation in terms of the information we can extract.”

This quote, from an interview for the University of California San Francisco News, was accompanied by references to a “tsunami” of genetic data about different cancers. Despite the rising tide of data, Dr. Krogan complained, breakthroughs in cancer treatment have been slow to materialize. Part of the problem, he suggested, was that the piles of new data have only gone to show cancer’s staggering diversity: “Even a single tumor can contain a unique profile of thousands of genetic mutations, leaving researchers to figure out which are disease drivers and which are just along for the ride.”

Dr. Krogan and colleagues believe that instead of amassing more data, researchers need to look harder at the connections hidden in the data they already have. In concert with researchers from the University of California, San Diego (UCSD), Dr. Krogan launched The Cancer Cell Map Initiative (CCMI). It was announced May 21, 2015 in Molecular Cell.

The CCMI is aimed at systematically detailing the complex interactions among cancer genes and how they differ between diseased and healthy states, thereby producing a “wiring diagram” of normal and mutated genes and proteins within a cancer cell. Other partners in the enterprise include the Gladstone Institutes in San Francisco, the Clinical and Translational Research Institutes at UCSD and UCSF, and Thermo Fisher Scientific.

Tumor Profiles

The CCMI combines the expertise at UCSD in extracting knowledge from big biomedical datasets with advances developed at UCSF for experimentally investigating the structure and function of cells. “We have the genomic information already,” asserted Trey Ideker, Ph.D., chief of medical genetics in the UCSD Department of Medicine and founder of the UCSD Center for Computational Biology and Bioinformatics. “The bottleneck is how to interpret the cancer genomes.”

In his presentation at the Festival of Genomics in 2015 at UCSD, Dr. Ideker pointed out that the massive DNA sequencing effort underway for cancer now covers more than 20,000 genomes. But, he added, it remains difficult to analyze cancer genomes without knowledge of gene networks, as “no two patient tumors look alike at the gene level.”

Dr. Ideker and his colleagues, along with many others, believe that bioinformatics will help address this complexity.  

In a paper that appeared September 2013 in Nature Methods, Dr. Ideker and colleagues noted that The Cancer Genome Atlas (TCGA) and the International Cancer Genome Consortium (ICGC) had already started systematically profiling thousands of tumors at multiple layers of genome-scale information, “including mRNA and microRNA expression, DNA copy number and methylation, and DNA sequence.” Efforts such as TCGA and ICGC, the authors suggested, should encourage even more ambitious work.

“There is now a strong need for informatics methods that can integrate and interpret genome-scale molecular information to provide insight into the molecular processes driving tumor progression,” the authors insisted. “Such methods are also of pressing need in the clinic, where the impact of genome-scale tumor profiling has been limited by the inability to derive clinically relevant conclusions from the data.” 

Subnetwork Analyses

To approach this need for integrative informatics methods, the UCSD group and others have integrated gene-expression measurements over sets of genes encoding proteins known to interact within protein subnetworks or pathway databases. Rather than listing individual genes or proteins, such profiles represent the aggregate expression of subnetworks of genes or proteins within a vast interaction network.

These subnetworks, researchers say, can identify gene-expression differences between different populations of patients that account for their diverse clinical behavior. Subnetwork analysis, unlike conventional analysis, is capable of interpreting gene-expression differences in the context of networks and pathways. While this approach requires considerable bioinformatics, statistical, and institutional muscle, it uses data that’s already out there.
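The idea of scoring a subnetwork as a whole, rather than gene by gene, can be illustrated with a small sketch. It is not the UCSD group's implementation. It assumes one simple aggregation scheme (z-score each gene across samples, then sum the z-scores of the subnetwork's members and divide by the square root of the member count), and the gene names, expression values, and patient labels are invented purely for illustration.

```python
from math import sqrt
from statistics import mean, pstdev


def zscore_rows(expression):
    """Standardize each gene's expression across samples (mean 0, SD 1)."""
    z = {}
    for gene, values in expression.items():
        mu, sd = mean(values), pstdev(values)
        # A gene with no variation carries no signal; score it as 0.
        z[gene] = [(v - mu) / sd if sd > 0 else 0.0 for v in values]
    return z


def subnetwork_activity(expression, subnetwork):
    """Return one aggregate activity score per sample for a set of genes."""
    z = zscore_rows(expression)
    members = [g for g in subnetwork if g in z]
    if not members:
        raise ValueError("no subnetwork genes found in expression data")
    n_samples = len(z[members[0]])
    scale = sqrt(len(members))
    return [sum(z[g][j] for g in members) / scale for j in range(n_samples)]


def group_difference(activity, labels):
    """Mean activity of samples labeled 1 minus mean activity of samples labeled 0."""
    pos = [a for a, lab in zip(activity, labels) if lab == 1]
    neg = [a for a, lab in zip(activity, labels) if lab == 0]
    return mean(pos) - mean(neg)


# Hypothetical toy data: three genes measured in six patient samples.
expression = {
    "GENE_A": [1.0, 1.2, 0.9, 2.1, 2.3, 2.0],
    "GENE_B": [0.5, 0.4, 0.6, 1.5, 1.4, 1.6],
    "GENE_C": [3.0, 3.1, 2.9, 3.0, 3.2, 2.8],
}
labels = [0, 0, 0, 1, 1, 1]        # hypothetical: 1 = progressed, 0 = did not
subnetwork = ["GENE_A", "GENE_B"]  # hypothetical interacting pair

activity = subnetwork_activity(expression, subnetwork)
print(group_difference(activity, labels))
```

In this toy case the two member genes rise together in the labeled group, so the aggregate score separates the groups more cleanly than a gene such as GENE_C would on its own. A real analysis would also need a search over candidate subnetworks and a statistical test of each score, which this sketch omits.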

Back in 2007, in an article that appeared in the journal Molecular Systems Biology, Dr. Ideker and colleagues Han-Yu Chuang, Ph.D., Yu-Tseung Lu, Ph.D., and Eunjung Lee, Ph.D., noted that although genes with known breast cancer mutations are typically not detected through analysis of differential expression, they play a central role in the protein network by interconnecting many differentially expressed genes. Dr. Ideker and colleagues reported that in breast cancer analyses, subnetwork markers proved more reproducible than individual marker genes that were selected without network information. These subnetwork markers, the investigators wrote, “achieve higher accuracy in the classification of metastatic versus non-metastatic tumors.”

The authors pointed out that for the majority of patients with intermediate-risk breast cancer, the traditional factors are not strongly predictive, and that approximately 70–80% of lymph node-negative patients may undergo unnecessary adjuvant chemotherapy. Moreover, Dr. Ideker and his fellow bioinformatics specialists indicated that many of the current risk factors are likely to be secondary manifestations rather than primary mechanisms of disease.

An ongoing challenge is to identify new prognostic markers that are more directly related to disease and that can more accurately predict the risk of metastasis in individual patients.

Prognostic Implications

Investigators have recently accumulated further proofs of principle supporting the idea that gene-network analysis can inform prognoses. For example, Dr. Chuang and colleagues at UCSD and other institutions, writing in a 2012 issue of the journal Blood, considered how gene-network analysis could anticipate outcomes for patients with chronic lymphocytic leukemia (CLL), a blood cancer characterized by accumulation of monoclonal B cells in the blood, marrow, and secondary lymphoid tissues.

Specifically, the investigators used subnetwork-based analysis of gene-expression profiles to discriminate between groups of patients with disparate risks for CLL progression. The clinical course of CLL varies greatly among patients. Some patients remain symptom-free for many years; others experience relatively aggressive disease and require therapy soon after they are diagnosed.

Accurate prognosis is critical because standard therapies are associated with significant toxicity, and current recommendations are to withhold treatment until the patient shows clear evidence of disease progression or disease-related complications.

Several microarray studies have reported sets of genes that are useful as surrogate markers for known prognostic factors in CLL, such as the IGHV mutational status. Other studies have instead correlated gene-expression levels directly with median time of patient survival or progression-free survival.

The UCSD researchers said that from an initial cohort of 130 CLL patients, they identified 38 prognostic subnetworks that could predict the relative risk for disease progression requiring therapy from the time of sample collection. Moreover, these subnetwork markers yielded more accurate predictions than established markers.

The prognostic power of these subnetworks was then validated on two other cohorts of patients, in which the authors noted reduced divergence in gene expression between leukemia cells of CLL patients classified at diagnosis with aggressive versus indolent disease over time. The predictive subnetworks vary in levels of expression over time but exhibit increased similarity at later time points before therapy, suggesting that degenerate pathways apparently converge into common pathways that are associated with disease progression.

The authors concluded that their analysis has implications for understanding cancer evolution and for the development of novel treatment strategies for patients with CLL and potentially other cancers. Such implications are the stuff of bioinformatics, which promises to make sense of vast data arrays and generate comprehensible findings.  


This article was originally published in the May 2016 issue of Clinical OMICs. For more content like this and details on how to get a free subscription to this digital publication, go to
