With analytical talent in short supply, the Big Data initiative has its eye on graduate programs for data scientists and engineers. [© Sergej Khackimullin - Fotolia.com]
It’s been a long-time lament among researchers ever since the human genome was first sequenced: What is to be done with all that data? President Barack Obama’s administration announced last month its answer: $200 million in commitments intended to improve tools and techniques for accessing, organizing, and gleaning discoveries from all this data, now dubbed big data.
The “Big Data Research and Development Initiative” is intended to boost not only the nation’s biomedical research and scientific discovery efforts through big data, but environmental research, education, and national security as well. For the Obama administration, the initiative is not only a blueprint for more science-focused federal spending, no matter the hand-wringing about trillion-dollar deficits on both sides of Pennsylvania Avenue, but a clarion call for biopharma companies and academia to join Washington in capitalizing on big data.
“Clearly, the government can’t do this on its own. We need what the President calls an ‘all hands on deck’ effort,” Tom Kalil, deputy director for policy at the White House Office of Science and Technology Policy, wrote in a post on the White House blog.
Why such interest in big data? The White House expects big data to generate big jobs, no small concern in an election year like this one. Job creation, which weakened last month, is among the major issues separating President Obama from his likely Republican challenger, Mitt Romney. How many jobs? A study last year by the McKinsey Global Institute projected that by 2018, the U.S. alone “faces a shortage of 140,000 to 190,000 people with deep analytical skills as well as 1.5 million managers and analysts with the know-how to use the analysis of big data to make effective decisions.”
“In short, the United States will need an additional supply of this class of talent of 50 to 60%,” McKinsey concluded. “Addressing the talent shortage will not happen overnight, and the search for deep analytical talent that has already begun can only intensify.”
Programs to Boost Informatics
To that end, as part of the initiative, NSF will work with universities to develop interdisciplinary graduate programs to train students for careers as data scientists and engineers. NSF will also award a $2 million grant for a research group to help train undergraduates in using graphical and visualization techniques for complex data.
On the research side, NSF will spend $1.4 million to support a group of statisticians and biologists who will collaborate to discover the structures of proteins and biological pathways. NSF will also award a $10 million “Expeditions in Computing” grant to researchers at the University of California, Berkeley, whose AMPLab applies machine learning, cloud computing, and crowdsourcing to tackle projects.
Additionally, NSF and NIH will jointly award grants under a new program to promote core techniques and technologies for managing, analyzing, visualizing, and extracting useful information from large and diverse datasets. NIH is especially interested in imaging, molecular, cellular, electrophysiological, chemical, behavioral, epidemiological, clinical, and other datasets related to health and disease, Karin Remington, Ph.D., director of the division of biomedical technology, bioinformatics, and computational biology at NIH’s National Institute of General Medical Sciences (NIGMS), told GEN.
The agencies will award “mid-scale” grants for groups of three or more investigators ranging from $250,001 to $1 million per year for up to five years as well as smaller-scale project grants for one or two investigators of up to $250,000 per year for up to three years. Application deadlines are June 13 for the mid-scale grants and July 11 for the smaller grants.
In another key project of the initiative, NIH opted to store the 200 terabytes of data so far yielded from the 1000 Genomes project on the Amazon Web Services cloud and allow the public free access to all that data.
At present, with two phases of work completed, the 1000 Genomes project consists of DNA sequenced from about 1,700 individuals, a number set to grow to 2,661 individuals in 26 populations by year’s end, Lisa D. Brooks, Ph.D., program director for the Genetic Variation Program at the National Human Genome Research Institute, told GEN. Researchers will be charged for downloading the data or computing with the data, Dr. Brooks added. It’s still a bargain, she said, compared with the hundreds of thousands of dollars that universities would have to spend on computing equipment for the needed capacity.
Another piece of the big data initiative has CDC’s Special Bacteriology Reference Laboratory (SBRL) developing tools for new species identification designed to allow multiple analyses on a new or rapidly emerging pathogen to occur in hours, rather than days or weeks.
CDC will also upgrade its nearly decade-old BioSense program, a national public health surveillance system for early detection and rapid assessment of potential bioterrorism-related illness. BioSense 2.0 will be expanded to connect with state and local health departments as well as to contribute information for public health awareness, routine public health practice, and improved health outcomes and public health.
The administration’s big data initiative is a partial response to a 2010 report by the President’s Council of Advisors on Science and Technology that recommended Washington support more cross-agency projects and spend more on networking and IT research.
In addition to NSF, NIH, and CDC, the initiative also involves the defense department and its Defense Advanced Research Projects Agency; the energy and homeland security departments; and the U.S. Geological Survey. More than a half-dozen federal departments and agencies working on 80-some projects hold more than a little potential for duplicative projects that respect agency turf more than taxpayer dollars, Doug Henschen, executive editor of InformationWeek, recently observed.
Concern for waste is well founded. According to the Chief Information Officers Council, the federal government quadrupled its number of data centers between 1998 and 2010. On average these centers were using only 27% of their computing power, a percentage Washington hopes to raise through the Federal Data Center Consolidation Initiative (FDCCI).
In November, officials unveiled plans to cut the number of federal data centers nearly in half, from 2,094 to 1,132, by 2015, at a $5 billion savings. These include 31 NIH data centers and one center each for CDC and FDA. But just last month, the Office of Management and Budget changed the plan by redefining data centers to include facilities smaller than 500 square feet. The new number of 3,133 data centers will be sliced by “at least 1,200” by 2015, representing a roughly 40% cutback.
Consolidation should entail more cross-disciplinary projects such as the Open Science Grid funded by NSF and the energy department. The grid supports some 30 virtual organizations at 80 sites worldwide, enabling some 8,000 scientists across the globe to support projects in structural biology as well as astrophysics, high-energy physics, and nanoscience.
Henschen cited a successful academic example of cross-disciplinary data collaboration with a life science application: Over the past two years, Johns Hopkins University has created a “Data-Scope,” a data supercomputer funded with a $2.2 million NSF grant. Data-Scope combines storage, fast input-output, and stream-processing capability with racks of graphics processing units.
The big data initiative offers another opportunity not only to ramp up some beneficial projects but also to use savings creatively. Consolidation savings should be plowed into additional big data projects; perhaps more data could join the 1000 Genomes data on the Amazon cloud or another cloud service. Or perhaps the initiative’s healthcare delivery piece could be expanded; McKinsey estimated the nation could save $200 billion, or about 8% of total healthcare expenditure, through big data.