Interconnected, Open Data
With the publication of the ENCODE analysis results, we wanted to go beyond simply releasing the data and provide additional resources that would allow readers to explore the data, reproduce the analysis, and make connections between the set of papers published together.
The goal of open data reuse is part of the larger open-access literature debate, in which substantial technical and legal considerations remain. On the technical side, the burden on authors is already substantial in terms of manuscript and data preparation alone. Online submission systems are groaning under the strain, and doing more with the data will require standardized formats and interfaces to be created.
Just shoehorning the 450-odd authors of the ENCODE consortium's integrated paper into the Nature online submission system was a taxing experience. Although a trivial example, my point is that most authors have neither the time nor the inclination to get to grips with even minor obstacles to open data use, so the path needs to be made as smooth as possible.
One way to advance toward reusable data now is simply to see what you can provide, and this is what we tried to do. Ewan and I were able to interest Nature in enhancing the ENCODE publications with interactive graphics for some of the main figures, and with the “thread” concept that would bring together the related science woven throughout the many papers, providing it in one place accessible via the Internet or the associated app.
In a sense, the threads constitute our own meta-review of the papers. These publication embellishments were well executed by the Nature team and were, we hope, fun and informative.
A much more important development was the provision of working versions of the computer code used in the main ENCODE integrative paper, offered in the form of a virtual machine (VM). We had initially prepared annotated versions of our code along with the manuscript, and these acted essentially as the experimental methods.
However, in the review process we realized that we could go a step further and provide working versions of the analyses, so that anyone with a reasonable computer could directly reproduce the results and work with the code to develop their own analyses.
As anyone who works in bioinformatics will know, moving software to a new infrastructure can be hairy, but migration to a Linux-based VM worked well because most of the analysts were using standard, open-source software and libraries.
We were able to test rapidly that everything would run and that all the required data was on board. This is not a perfect solution by any means. The VM is large (18 GB) and expands to an even larger footprint when run, but it does provide a transportable, functioning distillation of our analysis methods, and I was very excited to see people downloading and using it.
In the long run, I envisage a publication environment in which research papers are openly accessible and the data and analyses within each paper can be directly accessed for reuse across the Internet. I can imagine a future where one doesn't just publish and forget, but where each publication element contributes to a highly interconnected, worldwide research resource. Many people are already working toward this goal in different areas, but a push must come from the research community to really make it happen.