Long-read platforms that can sequence RNA molecules over 10,000 bases in length end-to-end hold great potential for characterizing variation in the transcriptome. But while these technologies do not require RNA molecules to be broken up before they are sequenced, they do exhibit a much higher per-base error rate (typically between 5% and 20%) than short-read technologies. This limitation has severely hampered the widespread adoption of long-read RNA sequencing. In particular, the high error rate has made it difficult to determine the validity of novel, previously unknown RNA molecules discovered in a particular condition or disease.

Researchers at Children’s Hospital of Philadelphia (CHOP) have now developed a computational tool, called ESPRESSO (Error Statistics PRomoted Evaluator of Splice Site Options), that can more accurately discover and quantify RNA molecules from these error-prone long-read RNA sequencing data, without relying on short-read RNA-seq data. They suggest that the new tool could enable better diagnosis of rare genetic diseases caused by disrupted RNA and the discovery of potential therapeutic targets in diseases like cancer.

“ESPRESSO addresses a long-standing problem of long-read RNA sequencing and could usher in new opportunities of discovery,” said Yi Xing, PhD, director of the Center for Computational and Genomic Medicine at CHOP and senior author of the team’s study in Science Advances. “We envision that ESPRESSO will be a useful tool for researchers to explore the RNA repertoire of cells in various biomedical and clinical settings.”

Xing and colleagues describe the development and evaluation of ESPRESSO in a paper titled “ESPRESSO: Robust discovery and quantification of transcript isoforms from error-prone long-read RNA-seq data,” in which they concluded, “ESPRESSO and its companion dataset provide a useful resource for studying the RNA repertoire of eukaryotic transcriptomes.”

On the journey from gene to protein, a nascent RNA molecule can be cut and joined, or spliced, in different ways, creating different RNA isoforms, before being translated into a protein. This process, known as alternative splicing, allows a single gene to encode several different proteins. Alternative splicing occurs in many biological processes, such as when stem cells mature into tissue-specific cells. In the context of disease, however, alternative splicing can be dysregulated. Therefore, it is important to examine the transcriptome (that is, all the RNA molecules that might stem from genes) to understand the root cause of a disorder. “Switches between transcript isoforms and their underlying RNA processing events occur in many biological processes, such as cellular differentiation, and are known to be dysregulated in the context of human diseases, including cancer,” the authors commented. “Consequently, it is important to examine the transcriptome diversity of cells not only at the gene level but also at the isoform level.”

Historically, however, it has been difficult to read RNA molecules in their entirety because they are usually thousands of bases long. Instead, researchers have relied on so-called short-read RNA sequencing, which breaks RNA molecules up and sequences them in much shorter pieces, somewhere between 200 and 600 bases depending on the platform and protocol. Computer programs are then used to reconstruct the full sequences of the RNA molecules. “… short-read RNA sequencing (RNA-seq) has become a widely used approach for profiling eukaryotic transcriptomes, and numerous tools have been developed and optimized to analyze short-read RNA-seq data,” the team continued.

Short-read RNA sequencing can give highly accurate sequencing data, with a low per-base error rate of approximately 0.1% (meaning one base is incorrectly determined for every 1,000 bases sequenced). Nevertheless, it is limited in the information that it can provide due to the short length of the sequencing reads. In many ways, short-read RNA sequencing is like breaking a large picture into many jigsaw pieces that are all the same shape and size and then trying to piece the picture back together. As the investigators noted in their paper, “… despite having high sequencing quality and throughput, short-read RNA-seq is inherently limited in its ability to discover and quantify transcript isoforms because its limited read lengths often cannot cover more than one splice junction (SJ), let alone full-length transcripts.”

Recently, “long-read” platforms that can sequence RNA molecules over 10,000 bases in length end-to-end have become available. “… rapidly developing single-molecule long-read RNA-seq technologies are capable of generating reads longer than 10 kb, which can span the entirety of almost all eukaryotic transcripts, and therefore have emerged as a potentially powerful solution to analyzing transcriptome variation at the isoform level,” the scientists noted. But while such platforms do not require RNA molecules to be broken up before they are sequenced, they have a much higher per-base error rate, typically between 5% and 20%. This well-known limitation has severely hampered the widespread adoption of long-read RNA sequencing. In particular, the high error rate has made it difficult to determine the validity of novel, previously unknown RNA molecules discovered in a particular condition or disease.
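To put the two error rates side by side, the arithmetic can be sketched in a few lines of Python. This is only an illustration: it assumes errors are independent and uniformly distributed along a read, which real sequencing errors are not, and the function names, the 200-base short read, and the 20-base window around a splice junction are hypothetical choices made for the example rather than values from the paper.

```python
def expected_errors(read_length: int, error_rate: float) -> float:
    """Expected number of miscalled bases in a single read."""
    return read_length * error_rate


def prob_window_error_free(window: int, error_rate: float) -> float:
    """Probability that `window` consecutive bases contain no error.

    Assumes independent, uniformly distributed errors (a simplification).
    """
    return (1.0 - error_rate) ** window


if __name__ == "__main__":
    # (label, read length in bases, per-base error rate)
    scenarios = [
        ("short read, 0.1% error", 200, 0.001),
        ("long read, 5% error", 10_000, 0.05),
        ("long read, 20% error", 10_000, 0.20),
    ]
    window = 20  # hypothetical number of bases inspected around a junction
    for label, length, rate in scenarios:
        errors = expected_errors(length, rate)
        clean = prob_window_error_free(window, rate)
        print(f"{label}: ~{errors:.1f} expected errors per read, "
              f"{clean:.1%} chance a {window}-base window is error-free")
```

Under these assumptions, a 10,000-base long read would carry on the order of 500 to 2,000 miscalled bases, compared with well under one for a 200-base short read, and a 20-base stretch around a splice junction would be error-free in roughly 98% of short reads but only about 36% of long reads at a 5% error rate and about 1% at 20%. That gap is why a junction seen in a single long read is hard to trust on its own.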

“Long-read RNA sequencing is a powerful technology that will allow us to uncover RNA variation in rare genetic diseases and other conditions, like cancer,” said Xing. “We are probably at an inflection point in how we discover and analyze RNA molecules. The transition from short-read to long-read RNA sequencing represents an exciting technological transformation, and computational tools that reliably interpret long-read RNA sequencing data are urgently needed.” The authors continued, “Given the increasingly broad adoption of long-read RNA-seq technologies and the rapid accumulation of error-prone long-read RNA-seq data in public repositories, there is an urgent need to develop robust computational tools for transcript isoform discovery and quantification using error-prone long-read RNA-seq data alone.”

The CHOP team’s newly developed ESPRESSO tool has been designed to enable accurate discovery and quantification of RNA isoforms using error-prone long-read RNA sequencing data alone. To achieve this, the computational tool compares all long RNA sequencing reads of a given gene to its corresponding genomic DNA, and then uses the error patterns of individual long reads to confidently identify splice junctions—places where the nascent RNA molecule has been cut and joined—as well as their corresponding full-length RNA isoforms.

By finding areas of perfect matches between long RNA sequencing reads and genomic DNA, as well as borrowing information across all long RNA sequencing reads of a gene, the tool can identify highly reliable splice junctions and RNA isoforms, including those that have not been previously documented in existing databases. “Therefore, ESPRESSO jointly considers alignments of all long reads aligned to a gene and uses the error profiles of individual reads to improve the identification of SJs and quantification of transcript isoforms,” the scientists explained. “The core innovation of ESPRESSO lies in its ability to correct putative SJs found in individual long reads by borrowing information from other long reads aligned to the same genomic region.”
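The general idea of borrowing information across reads can be illustrated with a toy Python sketch. It is not ESPRESSO's actual algorithm, which relies on a statistical treatment of read error profiles; it is a simplified stand-in under stated assumptions. Here a junction counts as reliable if at least `min_support` reads show it with perfectly matching flanking sequence, and a junction from a noisier read is moved to the nearest reliable junction within `max_shift` bases. All names, thresholds, and coordinates are hypothetical.

```python
from collections import Counter
from typing import Iterable, NamedTuple, Optional, Set


class Junction(NamedTuple):
    donor: int      # genomic coordinate where the intron starts
    acceptor: int   # genomic coordinate where the intron ends


class Observation(NamedTuple):
    read_id: str
    junction: Junction
    flank_mismatches: int  # alignment errors in the bases flanking the junction


def reliable_junctions(observations: Iterable[Observation],
                       min_support: int = 2) -> Set[Junction]:
    """Junctions seen with perfectly matching flanks in >= min_support reads."""
    counts = Counter(o.junction for o in observations if o.flank_mismatches == 0)
    return {junction for junction, n in counts.items() if n >= min_support}


def correct_junction(obs: Observation, reliable: Set[Junction],
                     max_shift: int = 10) -> Optional[Junction]:
    """Return the reliable junction this observation most likely represents.

    Returns None when no reliable junction lies within max_shift bases of
    both the donor and the acceptor, i.e. the junction stays unresolved.
    """
    if obs.junction in reliable:
        return obs.junction
    nearby = [
        j for j in reliable
        if abs(j.donor - obs.junction.donor) <= max_shift
        and abs(j.acceptor - obs.junction.acceptor) <= max_shift
    ]
    if not nearby:
        return None
    # Pick the closest candidate; ties are broken by coordinate for determinism.
    return min(nearby, key=lambda j: (abs(j.donor - obs.junction.donor)
                                      + abs(j.acceptor - obs.junction.acceptor), j))


# Hypothetical reads aligned to one gene.
observations = [
    Observation("r1", Junction(1000, 2000), 0),
    Observation("r2", Junction(1000, 2000), 0),
    Observation("r3", Junction(1003, 1998), 2),  # noisy copy of the same junction
    Observation("r4", Junction(5000, 6000), 1),  # no reliable junction nearby
]
reliable = reliable_junctions(observations)
corrected = {o.read_id: correct_junction(o, reliable) for o in observations}
```

In this example the junction in read `r3` is shifted onto the one supported by `r1` and `r2`, while the junction in `r4` is left unresolved because no other read backs it up. A hard distance cutoff is the crudest possible way to pool evidence, and it stands in here for the error-profile-based approach the authors describe.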

They evaluated the performance of ESPRESSO using simulated data and data on real biological samples. They found that ESPRESSO performed better than multiple currently available tools, both in terms of discovering RNA isoforms and quantifying them. The researchers also generated and analyzed over one billion long RNA sequencing reads covering 30 human tissue types and three human cell lines, providing a useful resource for studying human transcriptome variation at the resolution of full-length RNA isoforms.

“Given the increasingly wide adoption of long-read RNA-seq in biomedical research, we envision that ESPRESSO will be a useful tool for researchers to explore the RNA repertoire of eukaryotic cells in diverse settings,” the authors concluded.