June 1, 2012 (Vol. 32, No. 11)
Zachary N. Russ, Bioengineering graduate student, UC Berkeley
Complete and Open Reporting of Data Would Paint a More Accurate Picture of Results
When President Barack Obama requested $30.86 billion for the NIH in 2013, he matched precisely the amount allocated in 2012. Unfortunately, as with most things, standing still means you’re falling behind. Inflation continues to eat away at budgets, and, while we’ve seen prices for DNA synthesis and sequencing decline, the prices of labor and consumables (pipette tips, reagents, growth media) hold steady.
The 2012 budget allocation featured a cut for the NIH as well as the creation of the National Center for Advancing Translational Sciences, which strives “to identify and overcome hurdles that slow the development of effective treatments and cures.” But here’s a question: why is this center necessary? Shouldn’t the FDA-guaranteed incentives of market and/or data exclusivity (and more for orphan drugs) entice private R&D to jump on every new lead as soon as it’s published?
Apparently not. The costs of approval are significant, and the number of potential drugs falling short on their trials is staggering as well. Outside of the FDA, there are few agencies that have the capabilities to address this problem, and the NIH is one of them.
Good News, Bad Information
Even before these drugs get to trials, we see a serious lack of reproducibility. Amgen and Bayer found they could confirm published results in only 6 of 53 and 13 of 67 cases, respectively. It’s hard to estimate what fraction of published data is misleading, but it is easy to see where such errors could stem from.
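To put those two counts on a common scale, the short sketch below converts them to confirmation rates. It uses only the figures quoted above; the function and variable names are illustrative choices, not anything from the Amgen or Bayer reports.

```python
# Confirmation counts as quoted in the text: (confirmed, attempted).
reported = {"Amgen": (6, 53), "Bayer": (13, 67)}


def confirmation_rate(confirmed, attempted):
    """Return the fraction of attempted replications that were confirmed."""
    return confirmed / attempted


for company, (confirmed, attempted) in reported.items():
    rate = confirmation_rate(confirmed, attempted)
    print(f"{company}: {confirmed}/{attempted} = {rate:.1%}")
# Prints roughly 11.3% for Amgen and 19.4% for Bayer.
```

By these counts, roughly one published result in nine held up for Amgen and about one in five for Bayer.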
“Publish or perish” holds strong for most publicly funded researchers, and positive results make for a better story than negative ones. Publication bias is best summed up with this quote from the Amgen team’s inquiry: “He said they’d done it six times and got this result once, but put it in the paper because it made the best story.”
Furthermore, labs tend to specialize in particular cell lines and phenomena. While this is necessary to delve into the more esoteric parts of the science, it also leads to situations where a single lab is driving the publications on a particular chemical or cell line. This means that confounding variables such as the lab’s water supply or experimental bias are not addressed until a tower of publications and grants stands on that foundation.
The results gleaned from these research grants are considerably more expensive when the cost of verification is included. Fortunately, we’ve already paid for that verification; we’ve just been throwing it away. Many labs will apply a variety of approaches in their initial screen for an experiment. For instance, in the service of a grant to find ways to treat disease X using pathway Y, a lab may try out some 10 or 20 compounds that have previously been identified from the literature. But if some of those compounds don’t function the way they’re supposed to, those results are as likely to disappear into a lab notebook as they are to be published. The experiments were done, the data analyzed, and the results stored, but not shared.
While this data might not be generated with the same precision as publication-grade data, it may serve as a valuable early-warning system for results from confounding variables. Requiring cataloguing of “failed” trials would also provide some resistance against publication bias. By adding incentives for sharing experimental data and also metadata (“left the plates out in the hot sun” or “had the crazy undergrad do that experiment”) we can recapture all of that experimental data acquired (and discarded) in the pursuit of those positive, published results.
The NIH is the one handing out the money. It has both the power to reward experiments in reproducibility and the ability to dictate what happens to the data produced.
Just as steam co-generation allows us to capture energy that would otherwise be dissipated, full and open reporting of results allows capture of data that would be lost to the filing cabinets. Whether it’s heat or data, an investment in infrastructure is necessary to mediate that capture.
For the experiments, that would be a versatile and accountable database where labs can easily upload the raw data acquired on the path to publication. Everything (margin scrawl included) would be accepted and encouraged, but not necessarily required. The data would be linked to the relevant subjects by a handful of tags, many of them automatically filled in.
For instance, if you were to upload a Western blot from a knockdown experiment, the database upload tool would remember what cell line you said you were using, which lab, and which location. It’s more than an open notebook; imagine trying to find three labs’ experiences with the same protein, when they all made their own plasmids to express it.
The same thing could be called pAH007 or pBjh1601CK-znr104b depending on the lab. Maybe one lab had a mutation in their plasmid and didn’t notice; how could you find a consensus? It’s also more than a new journal just for errors, because its links run deeper than citations.
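One way to picture the two ideas above, automatically filled tags and links that do not depend on a lab’s private naming scheme, is the minimal sketch below. Everything in it is an assumption made for illustration: the `UploadSession` class, the tag names, the toy sequences, the cell line, and the third plasmid name are hypothetical, and a real system would need far more (circular rotation and reverse-strand handling for plasmids, alignment to catch near-matches, authentication, and so on).

```python
import hashlib
from collections import defaultdict
from dataclasses import dataclass, field


def sequence_key(sequence):
    """Hash a DNA sequence after removing whitespace and case differences."""
    normalized = "".join(sequence.split()).upper()
    return hashlib.sha256(normalized.encode("ascii")).hexdigest()


@dataclass
class UploadSession:
    """Details a lab enters once; every later upload reuses them as tags."""
    lab: str
    location: str
    cell_line: str
    records: list = field(default_factory=list)

    def upload(self, filename, plasmid_name, plasmid_sequence, notes=""):
        record = {
            "file": filename,
            # Auto-filled from the session, not retyped per upload.
            "lab": self.lab,
            "location": self.location,
            "cell_line": self.cell_line,
            # The lab's own label, kept for reference only.
            "plasmid_name": plasmid_name,
            # Content-based key: identical sequences match across labs.
            "plasmid_key": sequence_key(plasmid_sequence),
            "notes": notes,
        }
        self.records.append(record)
        return record


def group_by_plasmid(records):
    """Map each sequence key to the (lab, plasmid name) pairs that share it."""
    groups = defaultdict(list)
    for record in records:
        groups[record["plasmid_key"]].append(
            (record["lab"], record["plasmid_name"])
        )
    return dict(groups)


# Toy data: made-up sequences and hypothetical labs.
lab_a = UploadSession("Lab A", "Site 1", "HEK293")
lab_b = UploadSession("Lab B", "Site 2", "HEK293")
lab_c = UploadSession("Lab C", "Site 3", "HEK293")

lab_a.upload("blot_a.tif", "pAH007", "ATGGCTAGCAAAGGAGAAGAA")
lab_b.upload("blot_b.tif", "pBjh1601CK-znr104b", "atggctagc aaaggagaagaa")
# Hypothetical unnoticed mutation: the last base differs.
lab_c.upload("blot_c.tif", "pXYZ1", "ATGGCTAGCAAAGGAGAAGAT",
             notes="left the plates out in the hot sun")

all_records = lab_a.records + lab_b.records + lab_c.records
groups = group_by_plasmid(all_records)
for key, members in groups.items():
    print(key[:12], members)
```

In this toy run, Lab A and Lab B land in the same group despite their different plasmid names, while Lab C’s single-base difference puts it in a group of its own. An exact hash can only flag that the sequences differ; deciding which version is the consensus would take sequence alignment and more than three submissions.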
So how do you build such a thing? I can think of only one organization that can: the NIH’s National Library of Medicine, or really its subdivision, the National Center for Biotechnology Information (NCBI). NCBI brought us PubMed, Entrez, and GenBank. Perhaps it’s time to ask them to finish the job. Building such a database, one that handles the myriad forms of research data and file formats and links everything together, is a Herculean effort. But the alternative, the status quo, is a Sisyphean ordeal.
Zachary N. Russ (firstname.lastname@example.org) is a bioengineering graduate student studying synthetic biology at UC Berkeley.