January 15, 2007 (Vol. 27, No. 2)
Although cancer remains a leading cause of death in the United States, our understanding of the disease has leapt dramatically forward in recent years with the development of new technologies that enable researchers to understand its molecular origins.
We now know that cancer is actually many different diseases, each distinguished by a complex set of molecular pathways that drive its emergence, growth, and metastasis. These pathways can vary from one patient to the next, requiring sub-classification of diseases at the molecular level to guide effective treatment. Molecular or personalized medicine holds great potential to improve patient outcomes.
The vast amount of data required for and generated by molecular cancer research has overwhelmed the existing information technology infrastructure found in most biomedical organizations. If molecular medicine is to realize its potential, it is necessary to find new ways to acquire, store, access, analyze, and share critical data. One such effort is the National Cancer Institute’s (NCI) cancer Biomedical Informatics Grid (caBIG™), an open-source informatics network that offers the possibility of managing and sharing increasingly complex data sets.
The Data Deluge
Information technology has spurred the growth of global collaborations throughout society by connecting people in meaningful ways and providing them with new information tools. Yet, by comparison, the life sciences have trailed in this regard. Cancer research, an undeniably vast and overwhelmingly intricate field, contains a wide variety of complex processes, technologies, and materials that must converge in order for researchers to have a complete and illustrative view of specific cancer types.
For example, clinical records on large patient populations can be combined with genomic and proteomic data to yield information leading to a new generation of diagnostic tests and targeted therapies. These and other biorepositories are now at the nexus of clinical and basic research.
Today, biomedical research is generating copious genomic, proteomic, medical record, and biospecimen datasets at disparate locations and in varied formats; this makes it difficult, at best, to leverage the data’s full value through integration and comparison. Individual biomedical research labs can generate up to 100 terabytes of data, akin to one million encyclopedias of information. This data, which is doubling in volume every 15 months, is limited in its utility because of a lack of systems and technology for properly managing it.
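To make the 15-month doubling figure concrete, the short Python sketch below projects how a data holding would grow if that rate held steady. The 100-terabyte starting point is the upper figure quoted above for a single lab; the function name and the assumption of smooth exponential growth are illustrative only and are not drawn from any caBIG tool.

```python
def projected_volume_tb(initial_tb: float, months: float,
                        doubling_months: float = 15.0) -> float:
    """Return the projected volume in terabytes after `months`.

    Assumes steady exponential growth with a fixed doubling period,
    which is a simplification of how real data holdings grow.
    """
    return initial_tb * 2 ** (months / doubling_months)


# Illustrative starting point: the 100 TB upper figure for one lab.
for years in (5, 10):
    volume = projected_volume_tb(100.0, 12 * years)
    print(f"after {years} years: {volume:,.0f} TB")
```

Under these assumptions, 100 terabytes becomes 1,600 terabytes after five years (four doublings) and 25,600 terabytes after ten years (eight doublings).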
In addition to the technical complexity, the culture of biomedical scientific investigation itself presents a challenge. Academic and industrial research in the life sciences is charged with fierce independence, creativity, and entrepreneurialism, necessary ingredients for achieving the innovations that have benefited many patients, but potential obstacles to a synchronized effort to compile comparable data from many sources for analysis.
Although a growing population of bench and clinical researchers within the wider life sciences community has expressed a willingness and desire to share and combine data, as well as the tools used to analyze that data, the large number of different formats, conventions, and applications makes it an overwhelming task for any individual or group to undertake.
Responding to the need to create greater connectivity throughout the cancer research community, the NCI established caBIG to help align the many disparate parts of the cancer research process. caBIG includes software tools and systems to enable integration of clinical information with molecular information.
To help create a networked community of researchers, caBIG incorporates tools to bridge existing databases; establishes common data standards to ensure that the entire caBIG community is speaking the same language; provides a common set of statistical and visualization tools to analyze data from microarrays and other high-throughput technologies; and facilitates the exchange of biospecimens, reagents, and ideas. Collectively, these technologies and the growing community of users are establishing caBIG as a World Wide Web of cancer research.
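The idea of a common data standard can be sketched in a few lines of Python. In the example below, two sites record the same patient attributes under different local field names, and a per-site mapping translates each record into one shared vocabulary before the records are pooled. Every site name, field name, and function here is hypothetical and chosen for illustration; this is not caBIG's actual data model or API.

```python
from typing import Dict, List

# Hypothetical mappings from each site's local field names to shared names.
FIELD_MAPS: Dict[str, Dict[str, str]] = {
    "site_a": {"pt_id": "patient_id", "dx": "diagnosis", "age_yrs": "age_years"},
    "site_b": {"PatientID": "patient_id", "Diagnosis": "diagnosis", "Age": "age_years"},
}


def harmonize(site: str, record: Dict[str, object]) -> Dict[str, object]:
    """Rename a record's fields to the shared vocabulary.

    Raises KeyError if the record has a field with no agreed mapping,
    so unmapped data is flagged rather than silently dropped.
    """
    mapping = FIELD_MAPS[site]
    unmapped = set(record) - set(mapping)
    if unmapped:
        raise KeyError(f"unmapped fields from {site}: {sorted(unmapped)}")
    return {mapping[name]: value for name, value in record.items()}


def pool(records_by_site: Dict[str, List[Dict[str, object]]]) -> List[Dict[str, object]]:
    """Harmonize every record and tag it with the site it came from."""
    pooled = []
    for site, records in records_by_site.items():
        for record in records:
            row = harmonize(site, record)
            row["source_site"] = site
            pooled.append(row)
    return pooled


combined = pool({
    "site_a": [{"pt_id": "A-001", "dx": "glioblastoma", "age_yrs": 54}],
    "site_b": [{"PatientID": "B-017", "Diagnosis": "glioblastoma", "Age": 61}],
})
```

Once both records share the same field names, a single query or statistical routine can run across them without knowing which site produced which row.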
caBIG at Work
An early implementation of caBIG illustrates the extent to which collaborative research is tackling even brain cancer, the most vexing of diseases to study and treat. A set of tools and databases for the collaborative study of brain cancer was developed by the neuro-oncology branch of the NCI using caBIG technology. The project was dubbed REMBRANDT, which stands for Repository for Molecular Brain Neoplasia Data.
Because brain tumors are both rare and diverse, individual researchers, or even entire research institutions, might not have a statistically significant number of specimens to study for any particular class of tumor. REMBRANDT creates a shared environment to overcome this hurdle. A neurosurgeon in Boston can share a tissue sample with a neuropathologist in New York; together they can then record clinical data about the patient’s progression of disease and response to therapy. Meanwhile, a molecular biologist at NCI conducts gene expression studies to obtain a molecular classification of the tumor that correlates with the disease state or response.
The statistical power of the study is enhanced by enabling multiple clinical centers to coordinate patient recruitment and manage a virtual repository of tumor specimens and data. REMBRANDT has been used in support of the NCI-sponsored Glioma Molecular Diagnostic Initiative (GMDI), the largest genetic/clinical corollary study ever conducted on gliomas. By closing the loop among bench scientists, clinicians, and their patients, collaborative networks, such as REMBRANDT and its parent, caBIG, will make many more new discoveries possible.
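A rough way to see why pooling specimens across centers adds statistical power is the standard error of a mean, which shrinks with the square root of the sample size. The Python sketch below uses made-up numbers (a unit standard deviation, 25 specimens at one center, 100 across four centers) and assumes the pooled measurements are comparable; it is an illustration of the general principle, not an analysis from REMBRANDT or GMDI.

```python
import math


def standard_error(sd: float, n: int) -> float:
    """Standard error of the mean for n independent measurements."""
    return sd / math.sqrt(n)


# Illustrative numbers only: one center versus four centers pooled.
single_center = standard_error(1.0, 25)
four_centers = standard_error(1.0, 100)
print(f"single center: {single_center:.2f}, pooled: {four_centers:.2f}")
```

With these figures, quadrupling the number of specimens halves the standard error, from 0.20 to 0.10, which is the kind of gain a shared repository makes possible for a rare tumor class.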
Connecting Data Sources
Another caBIG technology demonstrates the value in simultaneously connecting data sources while also providing intuitive software to access the information contained within the databases. Cancer Molecular Pages is a caBIG software application that provides a virtual catalog of proteins.
The catalog is continually updated, easy to search, and includes visual images of proteins that are of special interest to cancer research. The catalog integrates data from sources across the country and reflects the most up-to-date research on particular proteins, giving scientists immediate and comprehensive access to vital information.
caBIG also serves as the supporting infrastructure for several other advanced technology cancer research initiatives. For example, caBIG is the information technology platform supporting The Cancer Genome Atlas, the large-scale genome sequencing initiative that is exploring the universe of genomic changes involved in all types of human cancer, starting with brain (glioblastoma), lung, and ovarian cancers.
The Clinical Proteomics Technology Initiative, a multiyear, multi-institutional effort to standardize protein biomarker discovery and validation, has already developed several caBIG-compliant shared information tools and continues to build on those so that huge mass spectrometry datasets can be shared and compared across platforms and laboratories.
Finally, efforts are under way to employ caBIG as a major connectivity and data-management platform for the NCI’s Alliance for Nanotechnology in Cancer.
A Model for the Future
The culture of research in the life sciences has already begun to shift. There is a growing recognition that team science is the community’s best bet in materially advancing the delivery of more effective patient therapies. The realization of this vision will require both a dynamic technology network and a more dynamic workflow.
Today, more than 80 cancer centers and research organizations and 900 individuals work on caBIG projects in all areas of cancer. As caBIG continues to grow and more research centers become part of the grid, it will create a semantically intelligent network that can serve as a model for other biomedical areas. Such a model is urgently needed, because the need to manage data more effectively and to collaborate is not unique to cancer.
As the infrastructure of molecular medicine, the seamless informatics of caBIG will continue to enable researchers to spend less time building tools and chasing resources and more time testing hypotheses, problem solving in groups, and, ultimately, improving quality of life for patients.
Kenneth H. Buetow, Ph.D., is associate director, bioinformatics and information technology, for the National Cancer Institute and director of the NCI’s Center for Bioinformatics.
Web: www.cancer.gov. E-mail: buetowke@mail.nih.gov.