Study published in Nature Biotechnology found that issues were amplified when comparing distantly related species.
Software used to align genomes from different species has quality-control issues, according to a group of researchers. The problems are most pronounced when comparing distantly related species and in regions of the genome that do not code for a protein.
“We discovered that there’s a disturbingly low level of agreement between genome alignments produced by different tools,” says corresponding author Martin Tompa, a professor of computer science and engineering and of genome sciences at the University of Washington. “What this should suggest to biologists is that they should be very cautious about trusting these alignments in their entirety.” Details are published online this week in Nature Biotechnology.
Aligning genomes, while simple in theory, is difficult in practice, Tompa points out. The problem becomes much harder with every sequence added beyond two. At the scale of a mammal’s entire genome, finding the optimal alignment of many genomes is far beyond the capabilities of any computer, he adds.
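To give a rough sense of that growth, here is a minimal sketch in Python. It assumes the textbook exact method, dynamic programming over a grid with one axis per sequence, which the article does not name. The function name and the sequence length are illustrative only.

```python
def exact_alignment_cells(seq_length: int, num_sequences: int) -> int:
    """Count the cells in the grid an exact alignment would have to fill.

    Assumption (not from the study): the textbook dynamic-programming
    method, with one grid axis per sequence and each axis running from
    0 to seq_length inclusive.
    """
    return (seq_length + 1) ** num_sequences


if __name__ == "__main__":
    # Illustrative length only; a mammalian genome is billions of bases long.
    for k in range(2, 6):
        print(k, "sequences:", exact_alignment_cells(1000, k), "cells")
```

Under this assumption, each additional sequence multiplies the work by roughly the sequence length, which is why practical tools rely on shortcuts rather than an exhaustive search.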
Various software tools instead use strategic shortcuts. “At a high level the tools are very similar,” Tompa notes. “They make different decisions at the lower, more detailed levels, and those decisions seem to have widespread effect on the outcome.”
Tompa compared the alignments from a previous study in which four research teams each took the same 1% of the human genome and aligned it to the genomes of 27 other vertebrate animals, ranging from mouse to elephant. “This is a marvelous dataset,” Tompa says. “It’s a very large-scale multiple sequence alignment, done by four expert teams using four different tools, all of them working on the same input sequences.” The four tools were Pecan, Threaded Blockset Aligner (TBA), Multiple Limited Area Global Alignment of Nucleotides (MLAGAN), and Mavid. All four are free programs developed by academic institutions.
However, the new study found that the resulting alignments were quite different. The authors also compared each tool’s coverage, meaning how much of the human DNA it was able to match to each of the other species, and the fraction of its alignments that were suspiciously close to a random match.
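As an illustration of how coverage and agreement might be measured, the sketch below uses Python. The article does not describe the study’s actual calculations, so the data representation, the function names, and the definition of agreement are all assumptions made for this example. It does not attempt the third measure, the test for suspiciously random matches.

```python
from typing import Dict, Optional

# Hypothetical representation: human position -> position in the other
# species, or None where the tool left the human base unaligned.
Alignment = Dict[int, Optional[int]]


def coverage(alignment: Alignment, region_length: int) -> float:
    """Fraction of human positions matched to some base in the other species."""
    aligned = sum(1 for pos in alignment.values() if pos is not None)
    return aligned / region_length


def agreement(a: Alignment, b: Alignment) -> float:
    """Of the human positions aligned by either tool, the fraction that
    both tools align to the same base (an assumed, simplified definition)."""
    aligned_a = {h for h, p in a.items() if p is not None}
    aligned_b = {h for h, p in b.items() if p is not None}
    either = aligned_a | aligned_b
    if not either:
        return 1.0  # nothing aligned by either tool, so nothing to disagree on
    same = sum(1 for h in either if a.get(h) is not None and a.get(h) == b.get(h))
    return same / len(either)


if __name__ == "__main__":
    # Made-up toy data for a four-base human region.
    tool_a = {0: 10, 1: 11, 2: None, 3: 13}
    tool_b = {0: 10, 1: 12, 2: 20, 3: None}
    print(coverage(tool_a, 4), coverage(tool_b, 4), agreement(tool_a, tool_b))
```

In this toy case both tools cover the same share of the region yet agree on only one of the four positions, which shows why coverage alone says little about reliability.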
The best-performing tool was the latest one, Pecan, developed by the European Bioinformatics Institute. “Our study pretty clearly points to Pecan as being the highest-quality alignment of the four tools we compared,” according to Tompa. It aligned as much of the human genome to other species as any of the other tools, and its matches were considerably more reliable, especially between more distantly related species, he explains.