Producing reference genomes for all known eukaryotic species (~1.8 million) over the next decade is a daunting task. But the Earth BioGenome Project is hoping to do exactly that. Two separate projects—the Vertebrate Genomes Project (VGP) and the European Reference Genome Atlas (ERGA)—came together to develop a tool that may help with the Herculean task.
The group created a pipeline that combines PacBio high-fidelity (HiFi) reads with Hi-C or optical maps to generate nearly complete assemblies. The pipeline runs entirely within the Galaxy ecosystem, open-source software that allows users to execute complicated workflows on thousands of datasets and terabytes of data. The researchers developed novel algorithms and software that cut the time needed to assemble a genome from months to days.
The team then mapped the genomes of 51 species, including cats, dolphins, kangaroos, penguins, sharks, and turtles, prioritizing animals that are useful models for understanding human evolution. Their discoveries deepen our understanding of evolution and the links between humans and animals.
This work is published in Nature Biotechnology in the paper, “Scalable, accessible and reproducible reference genome assembly and evaluation in Galaxy.”
“Being able to access that genetic information will have huge implications for understanding human health and evolution,” said Michael Schatz, PhD, professor of computer science and biology at Johns Hopkins University. “A lot of work on drug compounds starts in mice and other animal models, so understanding their genomes and the genomes of other animals directly benefits us.”
Mammals share 50% to 99% of their DNA and nearly all of their genes, inherited from a common ancestor that lived roughly 200 million years ago. By comparing the complete genomes of these species, researchers can start to identify when and where DNA sequences diverged and what those differences mean for humans. But, researchers say, this work has been limited by the number and quality of vertebrate genomes available, because sequencing efforts have focused on a few key species.
“Have you ever done a massive jigsaw puzzle where at some point all that’s left is blue sky, and you don’t think you’ll ever be able to fit the right pieces together? The old software would basically give up on these hard parts of the genome. That’s the problem with genome assembly,” Schatz said. “Our new program, using the latest sequencing data and the latest assembly algorithms, knows how to work through those parts to get a more complete picture.”
To test their technology, the researchers mapped the genome of the zebra finch, a songbird whose genome had already been sequenced to study brain development. The new technology was far better at reassembling segments of the genome, producing a more accurate and complete map.
The open-source software is freely available to the public online through Galaxy, a web-based platform. “In the past, only a handful of elite research groups would have had access to the resources needed to assemble these genomes. Now, anyone on the planet with access to the internet can visit the website and, with a few clicks of the button, run multiple scientific tools,” said Alex Ostrovsky, a Johns Hopkins software engineer on the Galaxy team who was responsible for making the tools easy to use for noncoders.
The team will continue working with the Vertebrate Genomes Project to sequence the genome of at least one species from each of the 275 vertebrate orders.
“In some ways, we’re building an evolutionary time machine,” Schatz said. “We can trace how vertebrates evolved over time and eventually gave rise to genes and sequences that are uniquely found in humans. Having the genes of our evolutionary cousins mapped out will help us better understand ourselves.”