It is now hard to imagine a biological laboratory without bioinformatics tools. In the last 10 years, bioinformatics has outgrown its original definition as a set of software technologies for annotating and manipulating genetic information. Today's bioinformatics sits at the intersection of the biological and information sciences and is absolutely necessary for managing, processing, and understanding large amounts of data.
The explosive growth of biological techniques necessitates the acquisition, processing, analysis, and integration of data generated by experiments as diverse as cloning, expression analysis, mass spectrometry, genotyping, and cellular analysis. Pioneers of the new bioinformatics make forays as far afield as predictive toxicology and patient stratification for clinical trials.
Integration is the buzzword of the new bioinformatics era: between experimental data from different sources, between results and supporting products, and between laboratories and collaborators.
The World Bio-informatics Market (2005-2010) report published by Research and Consultancy Outsourcing Services (RNCOS) estimates the global bioinformatics market at roughly $1,434 million by the end of 2005. Many would argue that this number is too low and largely based on the traditional definition of bioinformatics as a gene and protein sequence-analysis tool.
The RNCOS report also predicts that spending for bioinformatics will not grow beyond 10 percent per annum, meaning that, for all the noise, bioinformatics should remain a niche industry for the foreseeable future. Again, however, the bioinformatics community strongly disagrees with the niche label.
Applications of bioinformatics that go beyond traditional sequence analysis are already experiencing significant growth. For example, the market for laboratory information management systems (LIMS) is growing at close to 16% annually. Applications designed for large-scale search and retrieval of a variety of data are just entering the market. Software for running multiple simultaneous applications is still in the proof-of-concept stage. In general, enterprise infrastructures are poised to become a major force in the bioinformatics market, driving growth well beyond current predictions.
Downstream Business Drivers
Rampant accumulation of comparative gene-expression data in the early days of genomic discovery overflowed drug development pipelines. Even though the data yielded just a handful of qualified drug targets, the experience became the foundation for a new set of bioinformatics tools for analyzing the interaction between drug candidates and living organisms.
Gene Logic (www.genelogic.com) systematically collects and analyzes tissue samples from animals treated with various drug candidates. Cross-referencing gene-expression profiles with drug structures became the foundation of Gene Logic's predictive algorithms, incorporated into its ToxExpress System, which aims to predict the classical toxicities caused by particular chemistries.
"We have the most comprehensive database of gene expression across species and across tissues," emphasized Larry Mertz, Ph.D., vp of R&D and product management. To date, the company has accumulated over 14,000 tissue samples, has catalogued the effects of at least 200 different reference compounds, and has run at least 400 customer compounds through its predictive models. According to the company, even without prior knowledge of a compound, organ toxicities, such as those in the liver, can be predicted with over a 90% success rate.
Gene Logic places special emphasis on early data points, within hours of administering the compound, and on mild stress induction, which is often overlooked during classical toxicity screening. Clients can license the database, the predictive algorithms, or both, or contract the company for custom analysis of their compounds.
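The prediction approach described above, matching a new compound's expression profile against profiles of reference compounds with known toxicities, can be sketched as a nearest-centroid classifier. This is a minimal illustration, not Gene Logic's actual algorithm; all class names and numbers below are invented.

```python
# Hypothetical sketch of expression-based toxicity prediction: compare a
# compound's expression profile against reference-class centroids.

def centroid(profiles):
    """Element-wise mean of a list of equal-length expression profiles."""
    n = len(profiles)
    return [sum(vals) / n for vals in zip(*profiles)]

def predict_toxicity(profile, reference_classes):
    """Assign the class whose centroid is nearest in Euclidean distance."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    centroids = {name: centroid(ps) for name, ps in reference_classes.items()}
    return min(centroids, key=lambda name: dist(profile, centroids[name]))

# Toy reference compounds grouped by known liver outcome (invented data).
reference = {
    "hepatotoxic":     [[2.1, 0.3, 1.8], [2.4, 0.2, 1.6]],
    "non-hepatotoxic": [[0.9, 1.1, 0.4], [1.0, 0.9, 0.5]],
}
print(predict_toxicity([2.2, 0.4, 1.7], reference))  # prints "hepatotoxic"
```

A production system would of course use thousands of genes, cross-validated models, and confidence estimates rather than a raw distance rule.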
Gene Logic also lends its predictive algorithms to pharmaceutical companies for stratifying populations for clinical trials. The BioExpress database contains over 18,000 clinically annotated patient tissue and blood samples along with their gene-expression profiles. The sample population spans a wide range of normal, diseased, and drug-treated specimens and focuses on major disease areas, including cardiovascular, oncology, central nervous system, inflammatory, and metabolic diseases. Clients can use this reference database to identify drug targets and biomarkers that enable selection of a more suitable subset of subjects for clinical trials.
"The real strength of our approach comes from the breadth and depth of our sample collection, as well as from our high standards of acquisition, processing, and data analysis," concludes Dr. Mertz. "We will continue expanding our database with more samples and improving our predictive algorithms by expanding our long-standing collaboration with pharmaceutical companies that began in 2000 with our Tox Consortium and continues today."
"Demand for data management and integration is expanding beyond the research domain," adds John Bainbridge, general manager, LIMS, Applied Biosystems (www.appliedbiosystems.com). "Downstream applications, from product development to manufacturing and QA/QC, are emerging drivers for bioinformatics development." Regulated industries, such as forensics, food and beverage, and environmental testing, also demand streamlining and traceability of data. These industries are particularly concerned about audits, certifications, compliance, and integration with ERP and MRP systems.
Applied Biosystems' SQL*LIMS software integrates and manages an organization's laboratory business processes. The system is integrated with Oracle technology and can be run off a single server. "There are two major advantages to our system," continues Bainbridge.
"First, SQL*LIMS is a web-based application. This feature makes the system easy to deploy, maintain, upgrade, and integrate into enterprise business systems. Global companies get enhanced system performance without incurring significant implementation costs. Many companies also demand multilanguage capabilities, which we have incorporated into our system. Second, we provide professional services to tailor the LIMS application to customer-specific business needs."
For example, when a drug company is ready to enter the manufacturing stage, SQL*LIMS helps trace the origins of raw materials, lots and batches, formulations, the personnel performing the tasks, and test results, according to the company. Results from analytical instrumentation come directly into the LIMS and are recorded in electronic notebooks. The end result may be a certificate of analysis and approval of the product release.
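The traceability chain just described can be pictured as a batch record that ties every test result back to its raw-material lots, formulation, and analyst, and releases the batch only when all tests pass. The sketch below is an invented illustration of that idea, not Applied Biosystems' actual data model.

```python
# Minimal, hypothetical sketch of a LIMS-style batch record with traceability.
from dataclasses import dataclass, field

@dataclass
class TestResult:
    test_name: str
    value: float
    spec_min: float
    spec_max: float
    analyst: str          # who performed the task

    @property
    def passed(self):
        return self.spec_min <= self.value <= self.spec_max

@dataclass
class BatchRecord:
    batch_id: str
    raw_material_lots: list          # lot numbers feeding this batch
    results: list = field(default_factory=list)

    def certificate_of_analysis(self):
        """Approve release only if every recorded test is within spec."""
        status = "RELEASED" if all(r.passed for r in self.results) else "REJECTED"
        return {"batch": self.batch_id, "lots": self.raw_material_lots,
                "tests": len(self.results), "status": status}

batch = BatchRecord("B-1042", ["LOT-77", "LOT-78"])
batch.results.append(TestResult("assay %", 99.2, 98.0, 102.0, "jdoe"))
batch.results.append(TestResult("moisture %", 0.3, 0.0, 0.5, "jdoe"))
print(batch.certificate_of_analysis()["status"])  # prints "RELEASED"
```

The point is that every result carries its full provenance, which is what auditors and regulators in the industries named above ask for.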
The current trend in LIMS is toward preconfigured, out-of-the-box solutions for specific application areas, such as human identification in forensics. Such applications typically process samples based on industry-standard practices. Preconfigured solutions decrease an organization’s implementation costs and save time.
Integration of Data
"Graphical visualization of a pathway is only the beginning of the integration trend," comments Jordan Stockton, Ph.D., marketing manager, informatics, Agilent Life Sciences and Chemical Analysis (www.chem.agilent.com). "What everyone wants to see is the overlap of various types of measurements, such as gene expression, metabolic profiles, DNA-protein binding events, and chromatin remodeling. We know how to run these experiments, but the informatics is a real bottleneck. Software providers are still a step behind the demand for bridging these technologies."
Agilent provides instrumentation and tools for various types of genomic and proteomic analysis and integrates the findings via its GeneSpring Analysis platform. The workgroup-enabled component facilitates the exchange of data between users. Plug-in modules for GX (gene expression), GT (genotyping), CGH (comparative genomic hybridization), and MS (mass spectrometry) can analyze and cross-reference data on a large scale. All modules share a similar interface, making the software easier to learn.
"There are a number of reasonable home-brew analysis programs," contends Dr. Stockton. "However, none of them measures up to the level of processing and integration that Agilent provides."
BioDiscovery (www.biodiscovery.com) helps companies integrate data via GeneDirector, a comprehensive microarray data-management solution. The enterprise software package covers the entire microarray workflow, from sample management and tracking through automated image analysis and results generation. The program ensures high-quality data by using an Oracle-based data-management platform that maintains and enforces relationships among all the data generated in an experiment.
BioDiscovery provides software modules compatible with the most popular microarray platforms, such as Agilent's (www.agilent.com) and Affymetrix's, and plans to come out with CGH and MS tools in the near future. "Many companies offer exploratory desktop analysis tools," says Soheil Shams, Ph.D., founder and president of BioDiscovery. "It is a crowded space with limited potential for qualitative software improvement. In contrast, we come in with an infrastructure for systematic exploration using standardized quality-control and analysis tools."
"Our upcoming ARM (Array Result Manager) System provides a novel interface enforcing analytical SOPs. A company will be able to perform routine data analysis according to its own SOPs and store, analyze, and retrieve the analysis results in a uniform, easily traceable, and compatible format."
BioDiscovery offers flexible licensing terms, depending on the number of modules licensed or on the use of in-house analysis software. "Many companies believe that they can write the best analysis algorithms themselves. You could say that if these algorithms were cars, we are providing the roads for them to ride on," adds Dr. Shams.
"The key issue now faced by bioinformatics providers is the integration of different algorithms and data in today's bioinformatics environment," says Darryl Gietzen, Ph.D., product manager for bioinformatics at SciTegic (www.scitegic.com), a wholly owned subsidiary of Accelrys (www.accelrys.com). The company's Pipeline Pilot platform organizes streams of data coming from different sources (microarray, sequence, chemistry) and in different formats (text, database, binary, numerical).
Pipeline Pilot software enables processing, analysis, and mining of large volumes of data via a user-defined computational protocol. The data is piped in real time through a network of modular computational components. The data path can be easily changed by shuffling the computational components in the graphic interface.
Pipeline Pilot technology eliminates the boundaries of individual databases. For instance, a sequence query processed by Pipeline Pilot may contain the following modules: read Affymetrix ID, map to a gene ID, collect Gene Ontology (GO) information, collect KEGG (Kyoto Encyclopedia of Genes and Genomes) information, map to a chromosome, map SNPs, retrieve the gene sequence, perform a BLAST search against a patent database, collect the best hit, and create a cumulative report. The results of the query are summarized in customizable reports.
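The pattern described above, modular components chained into a protocol with each component consuming the previous one's output, can be sketched in a few lines. This is a generic pipeline illustration, not Pipeline Pilot's actual API; the lookup tables are invented stand-ins for real annotation databases.

```python
# Sketch of a data-flow pipeline: components are chained in order, and the
# protocol can be rearranged simply by reordering the component list.

def run_pipeline(data, components):
    """Pipe data through a list of components, each transforming the record."""
    for component in components:
        data = component(data)
    return data

# Hypothetical lookup tables; real components would query annotation sources.
AFFY_TO_GENE = {"1007_s_at": "DDR1"}
GENE_TO_GO = {"DDR1": ["GO:0004714"]}

def map_to_gene(record):
    record["gene"] = AFFY_TO_GENE.get(record["affy_id"], "unknown")
    return record

def collect_go(record):
    record["go_terms"] = GENE_TO_GO.get(record["gene"], [])
    return record

def make_report(record):
    return f"{record['affy_id']} -> {record['gene']} {record['go_terms']}"

report = run_pipeline({"affy_id": "1007_s_at"}, [map_to_gene, collect_go, make_report])
print(report)
```

Because each component has the same shape, swapping, removing, or reordering steps changes the protocol without touching the components themselves, which is the appeal of the visual drag-and-drop interface described above.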
The web client for Pipeline Pilot enables anyone on the intranet to process his or her data through the predesigned pipelines. Accelrys is implementing Pipeline Pilot to manage and integrate the data flow through its numerous software products.
The Vector NTI software package from Invitrogen (www.invitrogen.com) is a well-recognized benchtop tool for DNA analysis and manipulation. "The current research environment demands more and more integration. Software providers have to adapt their products to ensure smooth integration with existing online tools, IT infrastructure, and the data stream from various applications. Whatever the plans and results of the day-to-day bench work, they have to be managed through a single portal," says James Caffrey, Ph.D., marketing manager, bioinformatics.
Integration of Data and Reagents
Invitrogen addresses integration needs by incorporating a broad package of DNA and protein analysis tools under a single, easy-to-use interface. Vector NTI software gives users the ability to easily exchange data between online tools and their desktops and to perform all the required analyses without switching between applications.
Furthermore, Invitrogen says that it links the results of in silico exploratory work with the online ordering of reagents, such as primers, vectors, cloning kits, transfection and detection reagents, RNAi molecules, and LUX real-time PCR primers. Members of the online Vector NTI User Community reportedly can download software, manage licenses, review FAQs, and read through use cases. The community is free for nonprofit researchers.
"If we consider search and findability functions, what we provide to users is just the tip of the iceberg," comments Siamak Baharloo, Ph.D., eScience marketing manager. "There is tremendous potential in providing search algorithms that are more attuned to customer needs." Invitrogen's iPath tool allows just that: finding relationships between the protein of interest and other components of a biological pathway. Users can access 225 iPath human regulatory and metabolic pathway maps by searching with a keyword or an accession number and can also review and order Invitrogen products relevant to the analysis.
The International HapMap Project has developed a haplotype map of the human genome, which describes the common patterns of human DNA sequence variation, in particular single nucleotide polymorphisms (SNPs). The HapMap is expected to be a key resource for finding genes affecting health, disease, and responses to drugs and environmental factors. The human genome is thought to contain over 10 million SNPs. By choosing tagSNPs that serve as proxies for groups of correlated SNPs, the time and cost of disease-association studies are reduced.
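The tagSNP idea can be made concrete with a toy example: if SNPs fall into bins of mutually correlated markers (high linkage disequilibrium), genotyping one tag per bin recovers the rest. The greedy selection below is a simplified illustration of the principle, not the algorithm used by the HapMap Project or Illumina; the SNP IDs and groupings are invented.

```python
# Hypothetical sketch of greedy tagSNP selection over a toy LD map.

def pick_tag_snps(ld_groups):
    """Greedily choose tags until every SNP is proxied by some chosen tag.

    ld_groups maps each SNP to the set of SNPs it proxies for (itself included).
    """
    untagged = set(ld_groups)
    tags = []
    while untagged:
        # Pick the SNP that proxies for the most still-untagged SNPs.
        best = max(untagged, key=lambda s: len(ld_groups[s] & untagged))
        tags.append(best)
        untagged -= ld_groups[best]
    return tags

# Toy example: five SNPs falling into two LD bins.
ld = {
    "rs1": {"rs1", "rs2", "rs3"},
    "rs2": {"rs1", "rs2", "rs3"},
    "rs3": {"rs1", "rs2", "rs3"},
    "rs4": {"rs4", "rs5"},
    "rs5": {"rs4", "rs5"},
}
print(pick_tag_snps(ld))  # two tags are enough to cover all five SNPs
```

Scaled up, the same compression is what lets roughly 550,000 tagSNPs stand in for millions of HapMap SNPs while retaining most of the genome coverage, as described below.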
Integrating Data to Study Health
Illumina (www.illumina.com) was one of the principal investigators, as well as a major contributor of the equipment and technology to the HapMap project. The company reports that it used its software algorithms and Infinium assay to reduce the millions of SNPs from the HapMap Project to 550,000 tagSNPs, while retaining 90% of genome coverage. Illumina’s Sentrix BeadChips contain hundreds of thousands of tagSNPs and can be used for large-scale genome interrogation studies, according to the company.
"For genome-wide association studies, thousands of individual samples have to be tested on hundreds of thousands of markers. The goal is to use these data to find the handful of markers involved in disease," says Pauline Ng, Ph.D., senior bioinformatics scientist. "Given the sheer volume of data generated, our LIMS plays an essential role in tracking multiple samples and data points."
Illumina utilizes a positive sample-tracking method, which means that each step is validated and checked independently. In addition, the company says that its database contains precise, annotated records of all data for each sample at each step, thus minimizing processing errors.
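The essence of positive sample tracking, validating a sample's identity independently at every step rather than trusting the chain, can be sketched as follows. This is an invented illustration of the concept, not Illumina's implementation.

```python
# Hypothetical sketch of positive sample tracking: each processing step
# re-reads the sample's barcode and checks it against the expected ID before
# recording a result, so a swapped plate fails immediately, not silently.

class SampleMismatch(Exception):
    pass

def run_step(step_name, expected_id, scanned_id, audit_log):
    """Validate sample identity at this step, then append an audit entry."""
    if scanned_id != expected_id:
        raise SampleMismatch(f"{step_name}: expected {expected_id}, got {scanned_id}")
    audit_log.append((step_name, scanned_id))

log = []
for step in ["extraction", "amplification", "hybridization"]:
    run_step(step, "S-001", "S-001", log)
print(len(log))  # one audit entry per validated step: prints 3
```

The audit log doubles as the precise, per-step record of each sample that the company describes, which is what makes processing errors traceable after the fact.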