#1. R&D Data Integration for Collaboration
The ability to manage, integrate, and link data across R&D stages in pharma could enable comprehensive data search and mining that identifies better leads, related applications, and potential safety issues. However, the practical challenges of dealing with big data are substantial. The amount of data generated is growing rapidly: in 2007 a single run of a next-generation sequencing machine could produce a maximum of one gigabyte of data, but by 2011 a run could produce nearly a terabyte, a roughly 1,000-fold increase. The sequence alone is of limited use; it becomes much more useful when it is correlated with phenotypes and other types of data. This has naturally affected the way companies think about data storage and structure, with centralized data repositories and cloud solutions becoming more popular. The two leaders in next-gen sequencing technologies, Illumina and Life Technologies, both now offer cloud solutions for data storage and analysis to meet this growing need.
No less important is the enormous opportunity in data from images. A few years ago, high-content analysis drove the need for simple storage solutions. Today, digital pathology is leading the way with pioneering solutions for datafication of tissue, so that it can be mined and correlated with other types of data such as clinical outcomes or genomic data. While in the past it was impossible to effectively mine image data, researchers such as Andy Beck at Harvard have used image analysis solutions to analyze thousands of image features to discover new biomarkers that correlate with clinical outcomes.
In the case of both next-generation sequencing and image analysis, the most value is achieved when researchers are able not only to merge different datasets, but also to conduct advanced analytics and find correlations between data types. For example, advanced statistics might show that when a particular gene of interest is mutated, the level of its phenotypic effect can be correlated with a particular tissue marker, and conclusions might then be drawn about the increased effectiveness or safety of a compound. Or, when statistics are performed across lead compound studies, they might show that a particular tissue lesion is always associated with a safety risk. It is at this level of integration and correlation that big data will provide the most benefit.
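To make the idea of merging datasets and correlating across data types concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the sample IDs, the three small tables (`genomic`, `phenotype`, `tissue_marker`), the made-up values, and the helper names `merge_by_sample` and `pearson`. It joins the tables on a shared sample ID, keeps only samples where the gene is mutated, and computes a Pearson correlation between the phenotypic effect and the tissue marker score. Real studies would need far more samples and appropriate statistical controls.

```python
from math import sqrt

# Illustrative, made-up records keyed by a shared (hypothetical) sample ID.
genomic = {"S1": True, "S2": True, "S3": False, "S4": True, "S5": True}  # gene mutated?
phenotype = {"S1": 1.0, "S2": 2.0, "S3": 5.0, "S4": 3.0, "S5": 4.0}      # effect level
tissue_marker = {"S1": 2.0, "S2": 4.0, "S3": 9.0, "S4": 6.0, "S5": 8.0, "S6": 1.0}


def merge_by_sample(*tables):
    """Inner-join dicts on their keys; return {sample_id: tuple_of_values}."""
    shared = set(tables[0]).intersection(*tables[1:])
    return {sid: tuple(t[sid] for t in tables) for sid in sorted(shared)}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Step 1: merge the three data types on sample ID (S6 drops out: no genomic data).
merged = merge_by_sample(genomic, phenotype, tissue_marker)

# Step 2: restrict to samples where the gene of interest is mutated.
mutated = [(effect, marker) for is_mut, effect, marker in merged.values() if is_mut]

# Step 3: correlate phenotypic effect with the tissue marker score.
effects = [effect for effect, _ in mutated]
markers = [marker for _, marker in mutated]
r = pearson(effects, markers)
print(f"samples merged: {len(merged)}, mutated: {len(mutated)}, r = {r:.2f}")
```

The inner join is a deliberate choice in this sketch: a sample contributes to the correlation only if every data type has a record for it, which is the simplest way to avoid comparing values from different samples.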