January 1, 2011 (Vol. 31, No. 1)
Firm Offers an Array of Bioinformatics Tools for Analyzing Genomic Information
The bioinformatics tools offered by Denmark’s CLC Bio are like “a Swiss Army knife for genomic data analysis,” says Thomas Knudsen, CEO. The company develops and markets software for the analysis of high-throughput sequencing (HTS) data and says its bioinformatics algorithms are wrapped in a user-friendly graphical interface.
Brothers Bjarne Knudsen, Ph.D., and Thomas Knudsen started CLC Bio in 2005 by offering a free software program known as CLC Sequence Viewer. Dr. Knudsen, a bioinformatics expert, created the technical aspects of the software, while Thomas Knudsen serves as CEO.
The Sequence Viewer software remains free and can be downloaded from the company’s website. It was downloaded 100,000 times in its first year, and by 2008 the number of downloads had passed one million. When first launched, the software was “a powerful and intuitive way to show people how to do bioinformatics,” says Thomas Knudsen. Although the software was designed for first-generation sequence data, he believes that it makes a good teaching tool for students in molecular biology.
In 2007, the company switched its focus to the analysis of next-generation sequencing data. The CLC Genomics Workbench, released in 2008, analyzes data from second-generation HTS instruments. Whereas first-generation sequencing machines typically generate 0.1 megabases of data per run, second-generation instruments can spew out up to 40,000 megabases in a single run, explains Knudsen. In fact, the amount of genomic sequencing data increases 10-fold every 18 months. “So there’s a critical demand for solutions that are really adept at handling and analyzing these huge amounts of data,” he says.
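To put those figures in perspective, the arithmetic can be sketched in a few lines of Python. The numbers are the ones quoted above; the function names are hypothetical and chosen only for illustration, and the projection simply assumes the quoted 10-fold-per-18-months rate holds steady.

```python
# Figures quoted in the article (megabases per run).
FIRST_GEN_MB_PER_RUN = 0.1
SECOND_GEN_MB_PER_RUN = 40_000

# Quoted growth rate: 10-fold every 18 months.
GROWTH_FACTOR = 10
GROWTH_PERIOD_MONTHS = 18


def fold_increase(old: float, new: float) -> float:
    """Return how many times larger `new` is than `old`."""
    return new / old


def projected_growth(months: float) -> float:
    """Return the multiplicative growth after `months`,
    assuming the quoted rate stays constant (an assumption)."""
    return GROWTH_FACTOR ** (months / GROWTH_PERIOD_MONTHS)


if __name__ == "__main__":
    per_run = fold_increase(FIRST_GEN_MB_PER_RUN, SECOND_GEN_MB_PER_RUN)
    print(f"Per-run output, second vs. first generation: {per_run:,.0f}-fold")
    print(f"Projected data growth over 36 months: {projected_growth(36):,.0f}-fold")
```

On these figures, a single second-generation run yields roughly 400,000 times the data of a first-generation run, and three years of growth at the quoted rate would multiply the total volume of sequencing data about 100-fold.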
The CLC Genomics Workbench is a comprehensive package that analyzes and visualizes data from all major next-generation HTS platforms, such as SOLiD by Applied Biosystems, 454 GS FLX by Roche, and Solexa by Illumina. When first launched, “no other companies were doing this,” says Knudsen. “We had a head start in the market, making us a premier solution provider.”
Users of CLC Bio’s software are not locked into a single platform, but can work with data from any or all HTS machines. Because different sequencing instruments offer different advantages, it makes sense to mix datasets into hybrid assemblies, Knudsen notes. This overarching strategy extends to software in development to handle the hundreds of thousands of reads generated by upcoming third-generation sequencers. “A key to our success is that our customers can mix data, and that will continue with new platforms,” he says.
Early in 2010, the company released version 2.0 of the CLC Genomics Server, an enterprise platform for next-generation sequencing data analysis. CLC Bio describes the Genomics Server as a bioinformatics solution built on a three-tier server architecture. The company says that the server provides flexible options for executing centralized services, easy integration with other applications and services, powerful database communication and data integration, and a secure access control framework with central action logging.
Version 2.0 of the CLC Genomics Server includes a wider range of features for handling HTS data, says Knudsen. He notes some key improvements to this version, including capabilities for parallel job executions on multiple computers through multiple job nodes, integration of third-party command-line tools and algorithms, support for file sharing and data management, and additional HTS analyses such as digital gene expression for RNA sequencing, SNP and DIP detection, and ChIP-seq analysis.
Improving on Intel
In March 2010, CLC Bio released a de novo assembler that constructs whole genomes of any size, including human and plant genomes, on a single workstation computer. CLC Bio says that its de novo assembler algorithm runs 50 times faster than existing products, and deciphers complete datasets in just a few hours.
The company adds that its assembler requires 48 gigabytes of RAM, compared with the 300 gigabytes that other assemblers need. The software engineers at CLC Bio accomplished this by creating new data-compression algorithms that take advantage of computing power that is built into Intel microprocessors but generally lies dormant.
All Intel microprocessors contain the MMX™ technology that runs many calculations in parallel. The MMX technology, added in 1996, was intended for handling complex graphics but never caught on. However, “our skilled computer programmers realized it was an ideal technology for bioinformatics, which also runs lots of parallel calculations,” says Knudsen. The company then harnessed this built-in function to speed bioinformatics calculations. “If you use our algorithm, you can crunch a lot more data,” he says.
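The general idea of running many small calculations in one step can be illustrated with a short Python sketch. It is not CLC Bio’s algorithm, which the company has not described here, and it does not use MMX instructions. It assumes a 2-bit code per DNA base (a hypothetical encoding chosen for illustration) and packs a whole sequence into one integer, so that a single XOR compares every base position at once, much as a packed instruction operates on several values held in one register.

```python
# Hypothetical 2-bit encoding of DNA bases, chosen for illustration only.
BASE_CODES = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def pack(seq: str) -> int:
    """Pack a DNA string into one integer, 2 bits per base."""
    value = 0
    for base in seq:
        value = (value << 2) | BASE_CODES[base]
    return value


def count_mismatches(a: str, b: str) -> int:
    """Count differing positions between two equal-length sequences
    using whole-word bit operations instead of a per-base loop."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    if not a:
        return 0
    # One XOR compares every base at once; a nonzero 2-bit field
    # marks a mismatch at that position.
    diff = pack(a) ^ pack(b)
    # Fold each 2-bit field onto its low bit, then keep only low bits.
    low_bits = int("01" * len(a), 2)
    collapsed = (diff | (diff >> 1)) & low_bits
    # The number of set bits equals the number of mismatched bases.
    return bin(collapsed).count("1")


if __name__ == "__main__":
    print(count_mismatches("ACGTACGT", "ACGAACGT"))
```

Python integers are arbitrary-precision, so this only mimics the approach; real packed instructions work on fixed-width registers (64 bits in the case of MMX) and process one register’s worth of values per instruction.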
CLC Bio’s de novo assembler can be run through an intuitive, user-friendly graphical interface or from the command line, according to Knudsen. The de novo assembler also combines datasets generated by different HTS instruments, including those sold by Illumina, Roche, and Applied Biosystems.