Genome deposits in public archives may be less reliable than researchers imagine. Although genomes submitted to public archives are screened for erroneous sequences, mistakes can slip through, including instances of cross-species contamination.

According to a study by scientists at Johns Hopkins University, the public database GenBank contains draft assemblies of animal and plant genomes that have been contaminated by bacterial and viral sequences. In particular, the scientists learned that a draft assembly of the domestic cow, Bos taurus, contained 173 small contigs that appeared to derive from microbial contaminants. Surprised by this finding, the scientists delved deeper and discovered cow and sheep DNA in the supposedly finished genome of a pathogenic bacterium, Neisseria gonorrhoeae.

These findings appeared November 18 in the journal PeerJ, in an article entitled “Unexpected cross-species contamination in genome sequencing projects.”

The researchers assert that their findings illustrate the need to carefully validate findings of anomalous DNA that rely on comparisons to either draft or finished genomes. They are particularly concerned that anomalous DNA could cause problems for the rapidly growing field of microbiome analysis.

“The accuracy of microbiome analysis is critically dependent on the accuracy of the previously sequenced microbial genomes,” wrote the authors. “The vast majority of these sequences are accurate, but any errors may be amplified by efforts to search for the presence of unusual or unexpected species.”

The authors added that contamination from other species may masquerade as lateral gene transfer, an event that is relatively common between some bacteria but extremely rare otherwise. In the case of Neisseria gonorrhoeae, the authors noted, the cow and sheep sequences could be mistaken for evidence of lateral gene transfer. The correct explanation, contamination, is more mundane.

“Throughout the process of DNA isolation and sequencing, contamination remains a possibility. Computational filters applied to the raw sequencing reads are usually effective at removing common laboratory contaminants such as E. coli, but other contaminants may be more difficult to identify,” the authors cautioned.
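
To make the idea of a computational contaminant filter concrete, here is a minimal Python sketch of one common style of screen: comparing the k-mers (short fixed-length substrings) of each contig against those of known contaminant sequences. This is an illustration only, not the method used in the study. The function names, the k-mer length, and the 0.5 threshold are all hypothetical choices, and real screens run against large reference databases with dedicated alignment tools.

```python
from typing import Dict, Iterable, List, Set

# Illustrative sketch only: names, k, and threshold are assumptions,
# not taken from the study described in the article.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of all length-k substrings of seq."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_contaminant_index(references: Iterable[str], k: int) -> Set[str]:
    """Collect k-mers from both strands of every contaminant reference."""
    index: Set[str] = set()
    for ref in references:
        index |= kmers(ref, k)
        index |= kmers(reverse_complement(ref), k)
    return index


def contaminant_fraction(contig: str, index: Set[str], k: int) -> float:
    """Fraction of the contig's distinct k-mers found in the index."""
    contig_kmers = kmers(contig, k)
    if not contig_kmers:
        return 0.0  # contig shorter than k: nothing to compare
    return len(contig_kmers & index) / len(contig_kmers)


def flag_contigs(contigs: Dict[str, str], index: Set[str], k: int,
                 threshold: float = 0.5) -> List[str]:
    """Return names of contigs whose contaminant fraction meets the threshold."""
    return [name for name, seq in contigs.items()
            if contaminant_fraction(seq, index, k) >= threshold]
```

A flagged contig is a candidate for closer inspection, not proof of contamination. As the authors point out, a sequence that matches another species could in principle also reflect lateral gene transfer, so a screen like this only narrows down which sequences need careful validation.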

“If scientists cannot assume that the sequence of a species truly comes from that species, then analyses that use this data may be fundamentally flawed,” the authors concluded. “These findings highlight the importance of careful screening of DNA sequence data both at the time of release and, in some cases, for many years after publication.”