The Sequence Analysis Component Collection contains over 70 bioinformatics algorithms and utilities available for drag-and-drop data-pipeline creation. The example in Figure 1 illustrates some simple components of the Sequence Analysis Component Collection and the versatility of combining algorithms from different sources and code bases.
This component collection is not intended to be an all-encompassing set of bioinformatics components, but a fundamental set of tools for sequence data analysis. To add in-house or third-party algorithms as components, integration tools are available that make wrapping algorithms a straightforward process, typically via Perl/BioPerl, Java/BioJava, and Python adaptors. Additionally, remote execution of algorithms on different servers can be achieved using SOAP and Run Program components via FTP, TELNET, SSH, or SCP.
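To make the wrapping idea concrete, the sketch below shows one way an external command-line algorithm could be exposed as a reusable pipeline component in Python. This is a hypothetical adaptor for illustration only; `wrap_algorithm`, the record layout, and the stand-in GC-content script are assumptions, not the actual Pipeline Pilot integration API.

```python
import subprocess
import sys

def wrap_algorithm(field, argv, parse=str.strip):
    """Hypothetical adaptor: expose a command-line algorithm as a pipeline
    component. The component feeds a record's sequence to the command on
    stdin and stores the parsed result under `field`."""
    def component(record):
        out = subprocess.run(argv, input=record["sequence"],
                             capture_output=True, text=True, check=True)
        record[field] = parse(out.stdout)
        return record
    return component

# Stand-in "external algorithm": a tiny GC-content script run as a
# subprocess, playing the role of a real third-party tool.
gc_script = ("import sys; s = sys.stdin.read().upper(); "
             "print(sum(c in 'GC' for c in s) / len(s))")
gc_component = wrap_algorithm("gc_content",
                              [sys.executable, "-c", gc_script],
                              parse=float)

record = {"id": "seq1", "sequence": "ACGT"}
gc_component(record)
# record["gc_content"] is now 0.5
```

Because the wrapper only sees stdin, stdout, and a record dictionary, the same pattern would apply whether the underlying algorithm is local or invoked remotely.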
The end result is a large collection of algorithms from diverse sources, represented as interoperable components. This algorithm integration provides the bench scientist with a consistent and flexible environment to create, test, and modify data-processing pipelines.
Data integration continually poses a problem in biology. In spite of efforts to create data standards, effective data-processing pipelines must be able to read a variety of data formats. Pipeline Pilot achieves data integration by adapting existing algorithms and code to work with the Pipeline Pilot data-object model. With this unification, different types of data can be integrated within a single data object. Data integration occurs when reading in data of different formats, and also as inputs and outputs are converted to and from other software code objects within components, for example converting BioPerl or BioJava objects dynamically to and from the Pipeline Pilot object.
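The underlying idea, a single data object that accumulates both parsed input and downstream results, can be sketched minimally as follows. This is an assumption-laden illustration; the `read_fasta_entry` and `annotate_length` helpers are hypothetical and do not reflect the actual Pipeline Pilot data-object model.

```python
def read_fasta_entry(text):
    """Parse a single FASTA entry into a record dict (hypothetical reader;
    a real pipeline would support many formats)."""
    lines = text.strip().splitlines()
    return {"id": lines[0].lstrip(">").split()[0],
            "sequence": "".join(lines[1:])}

def annotate_length(record):
    """A downstream component adds its result to the same data object."""
    record["length"] = len(record["sequence"])
    return record

record = read_fasta_entry(">seq1 demo\nACGT\nACGT\n")
record = annotate_length(record)
# One record now integrates the parsed input and the component's result:
# {"id": "seq1", "sequence": "ACGTACGT", "length": 8}
```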
Figure 1 shows not only the integration of file formats but also how, as each data record is passed from component to component, the data object accumulates the results of each component as it progresses.
Data integration can also occur at a higher level of organization. Components of the Reporting Collection in Pipeline Pilot are used to create reports containing text, charts, and graphs, typically to condense and summarize information about a dataset that has passed through the data-processing pipeline (Figures 2 and 3). These reports are easily customizable, and their content can be updated in real time. Because they are so easily manipulated, reports built with the Reporting Collection can eliminate the need for ad-hoc spreadsheets, sparing the scientist significant manual data manipulation.
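The kind of condensation such reports perform can be sketched as a simple summary step over the records emerging from a pipeline. The `summarize` helper below is hypothetical and only illustrates the idea; the actual Reporting Collection produces formatted reports with charts and graphs rather than a plain dictionary.

```python
def summarize(records):
    """Condense a processed dataset into a few summary statistics,
    in the spirit of a pipeline-end report (illustrative only)."""
    lengths = [r["length"] for r in records]
    return {"count": len(records),
            "min_length": min(lengths),
            "max_length": max(lengths),
            "mean_length": sum(lengths) / len(lengths)}

records = [{"id": "a", "length": 100}, {"id": "b", "length": 300}]
report = summarize(records)
# {"count": 2, "min_length": 100, "max_length": 300, "mean_length": 200.0}
```

Keeping such summaries inside the pipeline, rather than in a separate spreadsheet, is what lets the report content refresh automatically as the dataset changes.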