In the middle of 2022, the world saw the first image from the James Webb Space Telescope, a cutting-edge instrument certain to unlock new secrets about the universe. That first infrared image showed a tiny sliver of the universe containing thousands of galaxies, rendered with a brightness and clarity never before possible.
Biology is full of these moments of clarity and brightness. The telescope image reminded me of a Sydney Brenner quote: “Progress in science depends on new techniques, new discoveries, and new ideas, probably in that order.” Brenner was just getting his scientific start when Rosalind Franklin’s Photograph 51 helped unveil the double helix. Over the next 70 years, advances in sequencing technology took our view of DNA from a fuzzy, striated X to the telomere-to-telomere (T2T) genome, the clearest and most complete picture of a human genome to date.
Along the way, the discovery of RNA processing complicated the central dogma. Introns and alternative splicing meant that one gene could encode many RNA isoforms, and as RNA techniques evolved, we learned that most human genes encode more than one. Today, an estimated 95% of human genes are thought to undergo alternative splicing.
Splicing isn’t the only complication as the information flows from genome to transcriptome. A transcript can begin at any of several transcription start sites and end at any of several polyadenylation sites. Splicing isn’t always complete, resulting in retained or detained introns. Individual bases in the RNA can also be edited into different ones.
These alternative possibilities can be multiplicative, such that a single gene can make thousands of isoforms, influenced by factors such as tissue type and temporal or developmental stage. As each gene’s transcription fires and a new RNA is born, these alternative choices can even affect one another. In turn, this remarkable diversity is often reflected in the proteome where alternative isoforms can tune the encoded protein’s properties, including activity, subcellular localization, and stability.
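To make the multiplication concrete, here is a minimal Python sketch. The gene is a toy and every count in it is hypothetical, chosen only for illustration. It also treats each choice as independent, which the paragraph above notes is not always true, so the result is best read as a rough ceiling on what is possible rather than a tally of isoforms a cell actually makes.

```python
from math import prod


def count_possible_isoforms(start_sites, polya_sites, cassette_exons, retained_introns=0):
    """Ceiling on isoform count if every alternative choice were independent.

    Each cassette exon or retainable intron is a binary choice (in or out),
    so it contributes a factor of 2. Start sites and polyadenylation sites
    contribute one factor each, equal to the number of alternatives.
    """
    options = [start_sites, polya_sites] + [2] * (cassette_exons + retained_introns)
    return prod(options)


# Hypothetical toy gene: 3 start sites, 2 polyadenylation sites, 10 cassette exons.
toy_gene = count_possible_isoforms(start_sites=3, polya_sites=2, cassette_exons=10)
print(toy_gene)  # 3 * 2 * 2**10
```

Even this modest toy gene lands in the thousands, and each additional binary choice doubles the total.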
After the discovery of mRNA diversity, cataloging efforts set out to annotate all isoforms in humans and mice by cloning full-length cDNAs and sequencing cloned isoforms individually. This was far from complete since the labor and cost prohibited a survey of every tissue type across many individuals. Then entered the more affordable telescope for RNA gazing, short-read RNA-seq, which in turn enabled the single-cell and spatial RNA-seq revolution of today, which looks closer at where and when RNAs are made.
Ironically, the short-read revolution blurred our view of mRNAs in some respects versus that era of sequencing individual cDNA clones. Borrowing a play from shotgun genomics, short-read transcriptomic approaches blast apart RNA transcripts into small pieces. Then, if isoform inference is the goal, reassembling the original transcript isoform happens in a computer. Yet, many isoforms are impossible to reassemble back to the single mRNA molecules born in the cell. This all comes back to those alternative possibilities that muddle our sought-after clarity.
Consider an example: neurexin-2-a (NRXN2), a gene encoding a family of presynaptic transmembrane proteins in the brain. With multiple promoters, alternative splice site choices, and polyadenylation sites, the possible NRXN2 isoforms number in the thousands. Taking a brain RNA sample and blasting it apart into short sequences makes it impossible to assemble the individual mRNA isoforms, since these alternative possibilities are spread across thousands of bases, farther apart than any single short read can span.
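A toy simulation shows why distance between alternative sites matters. In this sketch, which is an illustration under simplifying assumptions and not a model of NRXN2 or of any real pipeline, each mRNA molecule is a pair of flags recording whether two far-apart alternative segments are included. The function names and the two populations are hypothetical. A short read is assumed to cover only one site at a time, while a long read covers the whole molecule.

```python
from collections import Counter


def short_read_view(molecules):
    """Per-site inclusion fractions, all that reads covering one site at a time can report."""
    n = len(molecules)
    n_sites = len(molecules[0])
    return tuple(sum(m[i] for m in molecules) / n for i in range(n_sites))


def long_read_view(molecules):
    """Counts of complete isoforms, available when one read spans every site."""
    return Counter(molecules)


# Two hypothetical populations of 100 molecules each.
# (1, 1) means both segments included; (0, 0) means both skipped, and so on.
coupled = [(1, 1)] * 50 + [(0, 0)] * 50
swapped = [(1, 0)] * 50 + [(0, 1)] * 50

print(short_read_view(coupled), short_read_view(swapped))  # identical per-site fractions
print(long_read_view(coupled))
print(long_read_view(swapped))
```

Both populations show each segment included half the time, so the short-read summaries match exactly, yet the two populations share no isoform at all. Only the full-length view tells them apart.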
The path to clarity is the same one that delivered the T2T genome: single-molecule, long-read sequencing. Now, the telescope of choice for adding isoform annotations to new genomes for consortiums such as the T2T Consortium and the Vertebrate Genomes Project is isoform sequencing, or Iso-Seq, from PacBio. The long read lengths and the high accuracy erase the isoform assembly conundrum. The Genotype-Tissue Expression (GTEx) Consortium, harboring one of the largest short-read human data sets, also added long reads to get a better picture of the biology. After all, the cells and tissues synthesize isoforms, not genes. Why assemble RNA transcripts when you don’t have to?
Some may argue that the importance of these alternative transcripts and of understanding mRNA isoforms from end to end is overblown, yet even the short-read evidence demonstrates that there’s new clarity and discovery ahead with long-read transcriptomics.
In 2019, reanalysis of 18,468 cancer and normal RNA-seq datasets showed that alternative promoter usage is a major factor shaping which transcripts are made in every type of cancer analyzed. Fast forward a few years, and a study using the PacBio Iso-Seq method to characterize gastric cancer mRNAs found the same theme: new promoters, rather than those in annotation databases, drive transcription of many key genes in the disease. Fast forward again, and a study using Iso-Seq reads for the deep characterization of breast cancer samples uncovered many novel isoforms and even doubled the tally of isoforms produced by known oncogenes compared to prior annotations.
But are all these isoforms important? Maybe not. Yet all three of those cancer studies made a discovery that justifies pointing our best telescope at RNA: alternative isoform expression is common, and some novel isoforms correlate with patient survival.
It has been a decade since PacBio introduced long-read RNA-seq technology. Ever since, long reads have enabled scientists to obtain otherwise unavailable insights in applications such as the identification of gene fusions; the definition of transcripts from nearly identical paralogs, immune genes, or repeat-rich RNAs; the recognition of haplotype-specific isoforms; and the study of genetic disorders. In the future, long reads will enable additional applications and generate many more insights, especially now that long-read technology is far more accessible and affordable.
PacBio recently introduced a new long-read system called Revio. It is designed to dramatically increase throughput and lower costs while leveraging PacBio’s HiFi sequencing technology, which offers greater accuracy than short-read technology. It is easy to recognize why accuracy is important for RNA analysis given a mature mRNA’s next destination, the ribosome. Indeed, several studies have shown that Iso-Seq reads help to generate sample-specific protein predictions, improving the utility of mass spectrometry proteomics.
A single-cell Iso-Seq method is poised to combine the single-cell revolution with the superior resolution of long-read transcriptomics. The recently released MAS-Seq method will usher in a dramatic change in what is possible. Most single-cell methods capture only a tiny portion of each transcript by short-read sequencing, which is enough for gene-level counting but little more. Isoform-level analysis, which reads each transcript from end to end, is more comprehensive.
Prior studies already found that isoform-level information unlocks new differences between neighboring cells. And now, with the MAS-Seq method, which concatenates cDNAs for even higher throughput, every single-cell biologist is granted the wish of more reads and lower costs. New discoveries in single-cell biology will be unlocked from this isoform-level resolution, and spatial transcriptomics approaches will surely benefit as well.
Seeing a human genome from end to end was a milestone that in many ways is akin to that first Webb telescope image. Biology will have many more comparable moments. Today, the space telescope is aimed at a new spot in the universe, while back on Earth, work is underway to examine human genomes from diverse ancestries with the same level of clarity and completeness.
Will this work reveal differences that affect the genes and isoforms? There are already clues that the answer is yes. For example, recent analyses of data from the GTEx Consortium show that ancestry influences alternative splicing more than overall expression level. Beyond cancer, other diseases influence which possibilities are realized as well. Whether we are healthy or sick, the RNA universe within each of us is ready for a clearer look.
Jason Underwood, PhD, is a principal scientist at PacBio.