Bioinformatics beyond Genome Crunching


November 1, 2015 (Vol. 35, No. 19)

DeeAnn Visk, Ph.D., Founder and Principal Writer, DeeAnn Visk Consulting

Flow Cytometry, Workflow Development, and Other Information Stores Can Become Treasure Troves If You Use the Right IT Tools and Services

Advances in bioinformatics are no longer limited to just crunching through genomic and exomic data. Bioinformatics, a discipline at the interface between biotechnology and information technology, also has lessons for flow cytometry and experimental design, as well as database searches, for both internal and external content.

One company offering variations on traditional genome crunching is DNAnexus. With the advent of the $1,000 genome, researchers find themselves drowning in data. To analyze the terabytes of information, they must contract with an organization to provide the computing power, or they must perform the necessary server installation and maintenance work in house.

DNAnexus offers a platform that takes the raw sequence directly from the sequencing machine, builds the genome, and analyzes the data, and it is able to do all of this work in the cloud. The company works with Amazon Web Services to provide a completely scalable system of nucleic acid sequence processing.

“No longer is it necessary to purchase new computers and put them in the basement,” explains George Asimenos, Ph.D., director of strategic projects, DNAnexus. “Not only is the data stored in the cloud, but it is also processed in the cloud.”

The service provided by DNAnexus allows users to run their own software. Most users choose open source programs created by academic institutions.

DNAnexus does not write the software to process and analyze the data. Instead, the company provides a service to its customers. It enables customers to analyze and process data in the cloud rather than buying, maintaining, and protecting their own servers.

“Additionally, collaboration is simplified,” states Dr. Asimenos. “One person can generate the data, and others can perform related tasks—mapping sequence reads to the reference genome, writing software to analyze the data, and interpreting results. All this is facilitated by hosting the process, data, and tools on the web.”

“When a customer needs to run a job, DNAnexus creates a virtual computer to run the analysis, then dissolves the virtual computer once the analysis is complete,” clarifies Dr. Asimenos. “This scalability allows projects to be run expeditiously regardless of size. The pure elasticity of the system allows computers to ‘magically appear’ in your basement and then ‘disappear’ when they are no longer being used. DNAnexus takes care of IT infrastructure management, security, and clinical compliance so you can focus on what matters: your science.”
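The per-job lifecycle Dr. Asimenos describes can be sketched in a few lines. The class below is a conceptual illustration only, not DNAnexus's actual API: a worker is provisioned when an analysis starts and released as soon as it finishes, so capacity tracks demand.

```python
# Conceptual sketch (not DNAnexus's real API) of elastic, per-job compute:
# a virtual worker "appears" for each analysis and "disappears" afterward.
class ElasticCluster:
    def __init__(self):
        self.active_workers = 0

    def run_job(self, analysis, data):
        self.active_workers += 1      # provision a virtual machine
        try:
            return analysis(data)     # run the analysis in the cloud
        finally:
            self.active_workers -= 1  # release it once the job completes

cluster = ElasticCluster()
result = cluster.run_job(lambda reads: len(reads), ["ACGT", "TTGA"])
print(result, cluster.active_workers)  # 2 0
```

The `try`/`finally` pattern guarantees the worker is released even if an analysis fails, which is what keeps an elastic system from leaking idle (and billable) machines.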


Shown here is the FlowJo platform’s visualization of surface activation marker expression (CD38) on live lymphocyte CD8+ T cells. Colors represent all combinations of subsets positive and negative for interferon gamma (IFNγ), perforin (Perf), and phosphorylated ERK (pERK).

Merging IT and Flow Cytometry

Technical advances in flow cytometry allow the labeling of individual cells with up to 50 different markers, and 12,000 cells can be counted per second. This flood of information overwhelms traditional methods for data processing in flow cytometry.
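A back-of-envelope calculation shows why these figures overwhelm manual analysis. The byte size per measurement below is our assumption (a 32-bit value), not a figure from the article:

```python
# Rough data-rate estimate for the cytometry figures quoted above:
# 12,000 cells/second x 50 markers/cell, assuming 4 bytes per measurement.
CELLS_PER_SECOND = 12_000
MARKERS_PER_CELL = 50
BYTES_PER_MEASUREMENT = 4  # assumed 32-bit value; not stated in the article

bytes_per_second = CELLS_PER_SECOND * MARKERS_PER_CELL * BYTES_PER_MEASUREMENT
print(f"{bytes_per_second / 1e6:.1f} MB/s")            # 2.4 MB/s
print(f"{bytes_per_second * 3600 / 1e9:.2f} GB/hour")  # 8.64 GB/hour
```

Even under these conservative assumptions, a single instrument produces gigabytes per hour of raw measurements, before any gating or derived statistics.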

“FlowJo software offers a solution to this problem,” asserts Michael D. Stadnisky, Ph.D., CEO, FlowJo. “With an open architecture, our software serves as a platform that lets researchers run whatever program or algorithm they wish. Scientists can focus on the biological questions without having to become computer programmers.”

FlowJo presents an intuitive and simple user interface to facilitate the visualization of complex datasets.

FlowJo is also offering plug-ins, which are still in development (beta testing). Some of them are free, and others are for sale. They include software components for automatic data analysis, the discovery of trends and identification of outliers, and the centralization of data for all researchers to access. Applications for FlowJo range from traditional immunology to environmental studies, such as assessments of aquatic stream health based on analyses of single-cell organisms.

“Ultimately, FlowJo wants to offer real-time analysis of data,” discloses Dr. Stadnisky. “Presently, we have the capacity to process a 1,536-well plate in 15 minutes.”

FlowJo’s platform has benefitted users such as the University of California, San Francisco. Here, researchers in the midst of a Phase I clinical trial were facing 632 clinical samples across 12 acquisition runs and 12 different time points. By employing FlowJo, the researchers realized a 10-fold reduction in the time spent analyzing the data.

Clients have also integrated other data types, such as polymerase chain reaction (PCR), sequencing, and patient information, with data from FlowJo, facilitating cross-functional teamwork. The data output from FlowJo, the company maintains, is easily accessible by other scientists. The platform is available as a standalone system that can be installed on a company’s computers or be hosted in the cloud.


Life scientists are being overwhelmed by the huge amounts of data they generate for specialized projects. They not only look for solutions within their own organizations but also increasingly enlist the help of service companies to help them with Big Data overload. [iStock/IconicBestiary]

Optimizing Experiments

One dilemma facing large pharmaceutical companies is the need to optimize conditions with a very limited supply of a precious reagent. Determining the best experimental design is crucial to avoid wasting valuable resources.

Roche has used a commercially available electronic tool to build a workflow support tool. “This application allows scientists to set up their experiments more efficiently,” declares Roman Affentranger, Ph.D., head of small molecule discovery workflows, Roche. “The tool assists scientists in documenting and carrying out their work in the most effective manner.”

“Frequently, a quick formulation of a peptide is necessary to hand over to a toxicologist for animal testing,” continues Dr. Affentranger. “The formulation of the peptide needs to be optimized for the pH, the type of buffer, and the surfactants, for example. The tool we developed evaluates the design of the scientist’s experiment to use the minimum amount of the precious resource, the peptide in question.

“Testing these various conditions rapidly turns into a combinatorial problem with hundreds of tubes required, using more and more of the small sample. Our system assists scientists in documenting and carrying out work, taking the place of finding a colleague to evaluate your experimental design.”
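The combinatorial growth Dr. Affentranger mentions is easy to quantify. The factor levels below are illustrative values chosen for this sketch, not Roche's actual screening ranges:

```python
# Why formulation screening "rapidly turns into a combinatorial problem":
# a full factorial design multiplies every factor level together.
# These pH values, buffers, and surfactants are illustrative only.
from itertools import product

ph_levels = [4.0, 5.0, 6.0, 7.0, 8.0]
buffers = ["acetate", "citrate", "phosphate", "histidine"]
surfactants = ["none", "polysorbate 20", "polysorbate 80"]
replicates = 3

conditions = list(product(ph_levels, buffers, surfactants))
print(len(conditions))               # 60 unique conditions
print(len(conditions) * replicates)  # 180 tubes of precious peptide
```

Even this modest design consumes 180 aliquots; adding one more factor at three levels would triple it, which is why software that prunes the design before pipetting saves real material.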

“The data is entered electronically rather than printed out as hardcopy and glued into a notebook,” points out Dr. Affentranger. “Consequently, the information is readily accessible within the lab, across labs, and across the global environment we all work in today.”


Indexing Internal Content

Another issue facing large, multinational pharmaceutical companies is finding material that they previously acquired. This could be as simple as a completed experiment, an expert in a content area, or an archive-bound business strategy analysis.

To address this issue, a company could index its internal content, much the way Google indexes the Internet. At a large company, however, such a task would be onerous.

Enter Sinequa, a France-based company that provides an indexing service. The company can convert more than 300 file formats, such as PDFs, Word documents, emails, email attachments, and PowerPoint presentations, into a format that its computers can “read.”

According to Sinequa, a large enterprise, such as a pharmaceutical company, may need to cope with 200 to 500 million highly technical documents and billions of data points. This predicament is akin to the situation on the web in 1995. It was necessary to know the precise address of a website to access it. This unnecessary complication was eliminated by Google, which indexed everything on the web. Analogously, Sinequa offers the ability to index the information inside a company so that searches can yield information without requiring inputs that specify the information’s exact location.
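The core data structure behind this kind of enterprise search is an inverted index: a map from each term to the documents containing it, so a query never needs to know where a document lives. The toy example below illustrates the idea; it is a sketch, not Sinequa's implementation, and the document names are invented:

```python
# Minimal inverted index: term -> set of documents containing it.
# Queries then intersect posting sets instead of scanning documents.
from collections import defaultdict

docs = {
    "exp-001": "peptide formulation buffer pH surfactant",
    "memo-17": "drug repositioning strategy analysis",
    "exp-002": "peptide stability in acetate buffer",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

def search(*terms):
    """Return documents containing every query term."""
    hits = [index[t] for t in terms]
    return sorted(set.intersection(*hits)) if hits else []

print(search("peptide", "buffer"))  # ['exp-001', 'exp-002']
```

Building the index once up front is what makes each subsequent lookup cheap, the same trade-off Google made for the web in the 1990s, applied here to a company's internal content.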

With this kind of search ability, a company can turn its information trove into a treasure trove. Put another way, information can be made to flow, keeping applications turning like turbines, generating the “data power” needed to reposition drugs, reduce time to market, and identify internal and external experts and thought leaders.

“Sinequa offers a kind of Google algorithm customized for each customer,” details Xavier Pornain, vice president of sales and alliances at Sinequa. “At least 20,000 people use the technology generated by Sinequa. Modern companies create lots of data; we make it searchable.”

The data searched is not limited to internal documents. Sinequa can also add in external databases or indexing sites such as PubMed, Medline, and Scopus. The search engine is also flexible: one version can run inside a company firewall while another runs in the cloud.


Emulating Intelligence Approaches

A different search approach, one that leverages the experience of the intelligence community, is taken by the Content Analyst Company. With this approach, a company can comb through internal and external content stores to find relevant information that has value not only as output, but as input. That is, the information can cycle through the search engine, turning its machine learning gears.

“By adapting to the voice of the user, our software package, Cerebrant, has been very successful in the intelligence and legal communities,” says Phillip Clary, vice president, Content Analyst. “For typical indexing services, such as Google and PubMed, people do huge searches using a long list of key words. A simpler scenario is to write a few sentences, enter the text, and get all the related relevant items returned. Cerebrant can take the place of an expert to sift through all the results to find the relevant ones.”

Typical searches often yield confounding results. For example, if a user were to ask Google to generate results for the word “bank,” the top results would be financial institutions. Then there would be results for a musical band/person named Bank. Eventually, long past the first page of results, there would be information about the kind of bank that borders a stream or river course. Such results would frustrate a scientific user interested in tissue banks or cell line repositories.

“In the past, companies have approached the problem of obtaining germane results by attempting to create databases with curation and controlled vocabulary,” notes Clary. “This is how Google works. All those misspelled words have to be entered into the code.

“Cerebrant functions by learning how the information relates to itself. This was a powerful tool for the intelligence community, because the program can look at all kinds of information (emails, texts, metadata) and make connections within the unstructured data, even when users attempt to veil their meanings by using code words.”

Search requests composed on Cerebrant can consist of a single sentence or a paragraph describing what sort of information the user wishes to find. This is much more efficient than determining the 30 to 40 keywords you need to use to locate all the information on a complex topic. Then there is still the task of removing the irrelevant finds.
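One simple way to rank documents against a whole descriptive sentence, rather than a keyword list, is to compare word-count vectors by cosine similarity. This toy sketch illustrates the principle only; Cerebrant's actual algorithm is proprietary, and the documents below (including the "bank" example from earlier) are invented:

```python
# Sentence-as-query retrieval sketch: rank documents by cosine
# similarity between bag-of-words vectors. Not Cerebrant's algorithm.
from collections import Counter
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two Counter word-count vectors."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

docs = {
    "tissue-banks": "protocols for tissue bank and cell line repositories",
    "finance": "bank interest rates and financial institutions",
}
query = "where can I find a repository or bank of cell lines"

vectors = {d: Counter(text.lower().split()) for d, text in docs.items()}
q = Counter(query.lower().split())
ranked = sorted(vectors, key=lambda d: cosine(q, vectors[d]), reverse=True)
print(ranked[0])  # 'tissue-banks'
```

Because the query shares several terms with the tissue-bank document ("bank", "cell") and only one with the financial one, the scientific sense of "bank" wins, without the user enumerating 30 to 40 keywords.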

Cerebrant is a cloud-based application. Generally, it takes only about a day to a week to get it up and running. Because it is scalable, Cerebrant can be used by an individual consultant or a multinational conglomerate.

Given the enormous amount of time, energy, and money invested by the intelligence community, it is refreshing to see a novel application of the wisdom gained from all this work, just as we saw innovative uses of the technology that was developed by the space program. 


























