Although short-read sequencing serves an important purpose in both research and clinical genomic analysis, short reads are poorly suited to some tasks: detecting structural variations (SVs), including large indels and base-level resolved copy number variations; resolving phasing relationships; and generating highly contiguous de novo genome assemblies. For these tasks, long-read sequencing technologies have overcome the limitations of short reads.
It has been three years since University of California, Santa Cruz (UCSC), researchers proved that long-read human genome assembly can be done, using a nanopore-based technology developed on campus. The method was then improved upon, using the PromethION nanopore sequencer, shortening the time of the genome assembly to about a week.
Now, UCSC researchers have collaborated on an algorithm designed to accurately and precisely assemble individual, complete human genomes from long-read sequencing data in about six hours and for about $70.
The work is published in Nature Biotechnology in an article titled “Nanopore sequencing and the Shasta toolkit enable efficient de novo assembly of eleven human genomes.” In the paper, the researchers described how Shasta not only yields accuracy comparable to or better than that of its contemporaries but also has the lowest number of misassemblies.
To enable rapid human genome assembly, the researchers developed Shasta, a de novo long-read assembler, and polishing algorithms named MarginPolish and HELEN.
Shasta relies on in-memory computing and can help complete a de novo human genome assembly in under six hours, the authors said, at an average cost of $70 per sample.
They developed a nanopore long-read sequencing protocol that consistently yields ~60X coverage (~200 gigabases) of a human genome at unprecedented lengths (median read N50 of 42 kb) using three PromethION flow cells. Additionally, ~7X coverage of the genome is in reads exceeding 100 kb in length. This method is highly scalable, both in terms of cost and the number of genomes that can be processed simultaneously. Using a single PromethION nanopore sequencer, they assembled 11 highly contiguous human genomes de novo in nine days.
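For readers unfamiliar with the metrics quoted above, the following is a minimal sketch of how read N50 and fold coverage are typically computed. It is not taken from the Shasta codebase; the function names, the toy read lengths, and the genome size of roughly 3.1 gigabases are illustrative assumptions.

```python
from typing import Iterable

# Assumption: approximate haploid human genome size, used only for illustration.
HUMAN_GENOME_BP = 3_100_000_000


def read_n50(lengths: Iterable[int]) -> int:
    """Return the length L such that reads of length >= L contain at least half of all bases."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one read length is required")
    half_of_bases = sum(ordered) / 2
    running_total = 0
    for length in ordered:
        running_total += length
        if running_total >= half_of_bases:
            return length
    raise AssertionError("unreachable: the running total always reaches half")


def fold_coverage(total_bases: int, genome_size: int = HUMAN_GENOME_BP) -> float:
    """Average number of times each genome position is covered by sequenced bases."""
    return total_bases / genome_size


def coverage_in_long_reads(lengths: Iterable[int], min_length: int,
                           genome_size: int = HUMAN_GENOME_BP) -> float:
    """Fold coverage contributed only by reads longer than min_length."""
    return sum(length for length in lengths if length > min_length) / genome_size


# Hypothetical toy data, not real sequencing output.
toy_reads = [120_000, 80_000, 42_000, 30_000, 20_000, 8_000]
toy_n50 = read_n50(toy_reads)

# About 200 gigabases over the assumed genome size.
approx_coverage = fold_coverage(200_000_000_000)
```

Under the assumed genome size, 200 gigabases works out to a little over 60-fold coverage, in line with the approximate figure the authors report.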
The authors noted that they are now “improving this method for higher read lengths and throughput, which will further facilitate our goal of achieving complete, phased, reference-quality genomes.” The researchers said they hope their assembler will increase the pace of genomics research and open new opportunities, including making pangenome research that represents the true scale of human diversity a decidedly more practical pursuit.
Until recently, genomic research has relied exclusively on the reference genome from a single individual selected to represent an entire species. To reflect true human diversity, UCSC has embarked on a pangenomic initiative to sequence 350 new, individual human genomes.
This large inflow of data necessitated the development of highly efficient software tools, starting with an assembler. “Our new assembler was designed to be cheap and quick, with the goal to be on the cloud,” said Benedict Paten, PhD, assistant professor of biomolecular engineering at UCSC. “It gives us the power to scale nanopore sequencing. Now, I’m confident that we’ll be easily assembling hundreds of de novo genomes in the next couple of years.”
“To improve the base-level quality of the assemblies, we used a sequence polisher based on a deep neural network as the final assembly step,” explained lead author Kishwar Shafin, a graduate student at UCSC. “This brought the total cost of the assembly process to less than $200 and 37 hours, which further reduced the computational overhead of generating long-read assemblies dramatically, by a factor of five.”
The researchers assessed the precision of the assemblies and validated their accuracy, noting that they had achieved 99.9% assembly accuracy using only nanopore data, a first for the human genome. They also generated chromosome-level scaffolds for these polished assemblies using Hi-C sequencing data.
Co-author Karen Miga, PhD, assistant research scientist at UCSC, pointed out the significance of the team’s achievements in improved accuracy. “Our aim is not only to expand the diversity of the reference genome but also to resolve the hundreds of gaps that persist across the genome,” Miga explained. “Now that we can routinely include these uncharted regions, we have a truly complete assembly of a human genome, and we can begin to explore variations of unknown consequence.”