The last decade has witnessed unprecedented growth in the volume and relevance of data generated by large-scale bioresearch, which encompasses high-throughput screening and high-content analysis as well as all the “omics.” In other words, bioresearch data is becoming ever more comprehensive, contextualized, and valuable. Not only is bioresearch data revealing the molecular patterns behind health and disease, it is also advancing personalized medicine.

For the biotech industry, the technology-driven increase in data volume is being amplified by a decrease in technology costs (Figure). For example, the cost of DNA sequencing has plummeted since 2001, when sequencing a single genome cost about $1 billion. Today, it costs around $100. Little wonder, then, that decreasing technology costs account for the exponential growth in the volume of both private and public data.

Figure. The cost per raw megabase of DNA sequence has dropped precipitously. [National Human Genome Research Institute]

Big Data creates big challenges

Satnam Surae, PhD, Chief Product Officer, Aigenpulse

Biotech companies invest heavily in generating the data that underpins their R&D. Data is, essentially, their most valuable asset, and it should inform all business and pipeline decisions. New experimental approaches that harness microchips, immunoassays, and other biomolecular and imaging technologies are complemented by massive genome sequencing initiatives such as the 100,000 Genomes Project and the 500,000-participant UK Biobank.

The massive datasets generated mean that biotech is faced with a data husbandry problem. Storing all this data is no longer an issue; however, deriving usable insight from datasets drawn from internal, third-party, and public sources is a huge challenge. In-house data management responsibilities are commonly fragmented across multiple locations, and the data itself is often dispersed, easily lost, and stored as “flat” PDFs or Excel spreadsheets.

Rapid growth requires tools that scale

The ability to scale quickly and smoothly hinges on integrating and, critically, leveraging increasing amounts and multiple types of data. Any IT infrastructure must be scalable, flexible, accessible, and intuitive, so that scientists can search for and query data and related metadata, whatever its format.

Maximizing the use of data from different sources is about making connections between them. (This task may be simplified through the visualization of data networks.) For example, in early drug discovery and development, marrying data derived from internal R&D with external gene expression datasets and toxicity data from outsourced CROs can identify the risk of failure of preclinical drug candidates in silico. Project leaders can then make informed decisions on how, or whether, to progress their candidates.
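To make the connection-making concrete, here is a minimal Python sketch (pandas assumed) in which internal potency results are joined to external toxicity data on a shared compound identifier, and a simple rule flags at-risk candidates. All column names, values, and the risk threshold are hypothetical.

```python
# A minimal sketch of joining internal assay results with external
# toxicity data to flag at-risk preclinical candidates.
# All column names and the threshold below are hypothetical.
import pandas as pd

# Internal R&D results: one row per candidate compound.
internal = pd.DataFrame({
    "compound_id": ["C-001", "C-002", "C-003"],
    "target_gene": ["KRAS", "EGFR", "KRAS"],
    "potency_ic50_nm": [12.0, 85.0, 430.0],
})

# External/CRO toxicity data keyed on the same compound IDs.
external_tox = pd.DataFrame({
    "compound_id": ["C-001", "C-002", "C-003"],
    "hepatotox_flag": [False, True, False],
})

# Connect the two sources on a shared key, then apply a simple risk rule:
# a toxicity flag or weak potency (IC50 above 100 nM) marks a candidate.
merged = internal.merge(external_tox, on="compound_id", how="left")
merged["at_risk"] = merged["hepatotox_flag"] | (merged["potency_ic50_nm"] > 100)

print(merged[["compound_id", "target_gene", "at_risk"]])
```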

Shifting workflows demand flexibility

Imagine having access to data from all the genomes that have ever been sequenced. Would it be overwhelming or inspiring? The question is not as theoretical as it seems. Projects such as Genomics England will soon give us access to millions of sequenced human genomes. While this data represents an immensely valuable resource for research into rare diseases, and for target and drug development, it will be spread across many locations. A single platform on which all sequenced human genomes could be stored without loss of perspective or depth, accessed securely, and interrogated alongside other experimental, analytical, and epidemiological data would transform healthcare research.

Stepping into the digital age has proven challenging for the biotech sector, as it has for other industrial sectors, but pushing the boundaries of biomedical research relies heavily on data assets that are not only large but also diverse. That diversity, in storage locations and formats alike, complicates attempts to derive maximum value from information held within a single organization, let alone across platforms. Biotech must remain agile enough to integrate new technologies and data types seamlessly.

The ease with which digital transformations are negotiated depends on the industry. Retailers, for example, process data about their products and their clients, including data about when and where purchases have occurred. Conveniently for retailers, much of this data is static and based on text and numbers. It poses no data format or contextual issues, and it is easily transferred from spreadsheet programs to digital archiving programs.

Contrast this with the data-related issues faced by the biotech sector, which must deal with both structured and unstructured data pertaining to diverse sample types—genes, proteins, cells, tissues, etc. Biotech research is inherently dynamic. Although bioanalytical labs typically execute the same tests and workflows day after day, most biotech labs execute workflows that are neither constant nor narrowly focused. That is, they do not yield single points of data. Data context, type, and direction are constantly changing, complicating any digital transformation process that would establish data-driven systems for informing research and business decisions.
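For illustration, here is a minimal Python sketch of that contrast, using hypothetical records: a structured assay row with fixed, queryable fields sits alongside an unstructured record of free-text notes and an image file, tied together only by a shared sample identifier.

```python
# A minimal sketch contrasting the structured and unstructured data a
# biotech lab might hold. The sample records and paths are hypothetical.

# Structured: fixed fields, easy to tabulate and query.
structured_row = {
    "sample_id": "S-1007",
    "gene": "TP53",
    "expression_fold_change": 2.4,
}

# Unstructured: free text and binary artifacts that need context to interpret.
unstructured_record = {
    "sample_id": "S-1007",
    "notebook_note": "Cells looked stressed after passage 12; repeated assay.",
    "image_file": "microscopy/S-1007_day3.tiff",  # hypothetical file path
}

# Only a shared key (here, sample_id) ties the two views together.
assert structured_row["sample_id"] == unstructured_record["sample_id"]
```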

Traditional IT infrastructures rely heavily on static processes and long-term planning. The rigidity of this framework is misaligned with the need for agility in biomedical research. Only during the last decade has it become feasible, and more common, to build flexibility into IT platforms. The ability to seat dynamic R&D on an informatics framework that supports testing, integration, deployment, and compliance has been instrumental in helping biotechs transform digitally.

Biotechs need data awareness

Among biotech companies, capabilities for managing and leveraging data vary. All too often, a biotech has only vague knowledge of what data it holds, where that data is stored, how to access it, or how to extract its value.

Organizations must ensure that experimental results, positive or negative, are properly recorded. “Properly” means that raw results must be accompanied by related metadata, such as contextual information about experimental conditions. Combinations of data and metadata are better than data alone at informing decisions about which research directions should be followed, or which development projects should be discontinued. The simple act of being able to search for and compare previous experiments also reduces repetition, costs, and wasted time. But data must also be secure, traceable, and auditable.
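As an illustration, the following Python sketch records raw results together with their metadata and makes past experiments searchable by condition; the timestamp also gives each record a simple trace. The record fields and the search helper are hypothetical, not a prescribed schema.

```python
# A minimal sketch of recording a raw result together with its metadata,
# so earlier experiments can be found and compared instead of repeated.
# The fields shown are illustrative, not a prescribed schema.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ExperimentRecord:
    experiment_id: str
    raw_result: float                             # e.g., a measured signal
    metadata: dict = field(default_factory=dict)  # conditions, instrument, operator
    recorded_at: datetime = field(                # timestamp for traceability
        default_factory=lambda: datetime.now(timezone.utc))

records = [
    ExperimentRecord("EXP-041", 0.82, {"cell_line": "HEK293", "temp_c": 37}),
    ExperimentRecord("EXP-042", 0.15, {"cell_line": "HeLa", "temp_c": 37}),
]

def search(records, **conditions):
    """Return records whose metadata matches every given condition."""
    return [r for r in records
            if all(r.metadata.get(k) == v for k, v in conditions.items())]

# Find and compare all previous runs on a given cell line.
for r in search(records, cell_line="HEK293"):
    print(r.experiment_id, r.raw_result, r.recorded_at.isoformat())
```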

How biotechs can seize the initiative

Even the most basic devices in our homes use software. Kitchen appliances, security cameras, thermostats, and more are getting “smart”—both as standalone gadgets and as components of integrated systems. The next stage in orchestrating our devices is the implementation of software that can draw data-based inferences.

Whereas home systems may predict when the milk will go sour, commercial systems will alert us to business threats and opportunities. In finance, banks and hedge funds learned hard lessons from the 2008 crash. Since then, they have progressed from managing billions of dollars in spreadsheets to using high-security software and adopting sophisticated practices such as TDD (test-driven development), CD (continuous delivery), and DevOps (a combination of software development and IT operations that can shorten systems development life cycles).

Many of the data technologies that have been adopted in finance and other industries are applicable to the biotech industry. Biotech companies that apply data-driven decision-making tools, for example, may gain a competitive edge. One approach is to build a centralized, integrated data platform and to equip R&D teams with a single intelligence solution that can be interrogated with respect to, say, a therapeutic gene target. The ability to collate and analyze in context all experimental data about a target will help companies make informed decisions on whether to progress that target. Platforms that do this automatically as soon as new data is entered into the system are becoming a priority.
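One way such a platform might behave is sketched below in Python: a hypothetical TargetDossier class collates every record held for a gene target and refreshes a summary the moment new data is ingested. The class, its methods, and the records are illustrative assumptions, not any particular product's API.

```python
# A minimal sketch of a centralized platform interface that collates all
# records about a therapeutic target and re-evaluates them whenever new
# data arrives. TargetDossier and its methods are hypothetical.
from collections import defaultdict

class TargetDossier:
    def __init__(self):
        self._by_target = defaultdict(list)  # target gene -> list of records

    def ingest(self, record):
        """Store a record and immediately refresh the target's summary."""
        target = record["target_gene"]
        self._by_target[target].append(record)
        self._refresh(target)

    def query(self, target):
        """Collate every record held for one target, across all sources."""
        return list(self._by_target[target])

    def _refresh(self, target):
        # Automatic re-evaluation on every ingest, per the text above.
        records = self._by_target[target]
        sources = {r["source"] for r in records}
        print(f"{target}: {len(records)} record(s) from {len(sources)} source(s)")

dossier = TargetDossier()
dossier.ingest({"target_gene": "KRAS", "source": "internal_screen", "value": 0.9})
dossier.ingest({"target_gene": "KRAS", "source": "public_expression", "value": 4.2})
print(dossier.query("KRAS"))
```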
