A new study from the Salk Institute has identified over 2,000 new genes. They are known as small open reading frames, or smORFs, and they encode microproteins, which appear to participate in diverse processes, including immune function, cell stress, and muscle development. According to Salk scientists, smORFs could have big implications. Specifically, they could lead scientists to new biomarkers and drug targets for human diseases.
Previously, scientists had identified about 25,000 genes that code for biologically important proteins. This total, however, does not include the diminutive smORFs, which may expand the number of genes by about 10%.
Despite their large numbers, smORFs and the microproteins they encode are hard to find. They stayed pretty much out of sight until a Salk team led by Thomas F. Martinez, PhD, postdoctoral fellow, and Alan Saghatelian, PhD, professor, developed a smORF-specific version of Ribo-seq, a proteogenomic technique.
Details appeared December 9 in Nature Chemical Biology, in an article titled, “Accurate annotation of human protein–coding small open reading frames.”
“Here, we integrate de novo transcriptome assembly and Ribo-seq into an improved workflow that overcomes obstacles with previous methods, to more confidently annotate thousands of smORFs,” the article’s authors wrote. “By including additional validation into our smORF annotation workflow, we accurately identify thousands of unannotated translated smORFs that will provide a rich pool of unexplored, functional human genes.”
Ribo-seq is routinely used for detecting the production of larger proteins, but it is less consistent for detecting smORFs. That is, ordinary Ribo-seq does a poor job of identifying which smORFs actually encode proteins in cells. So, the Salk team optimized Ribo-seq so that it more reliably detects smORFs and yields the most robust estimates of the number of smORFs in the human genome.
This work was led by Martinez, one of the article’s corresponding authors. Subsequently, the smORF-optimized Ribo-seq technique was used to find smORFs in three human cell lines: one derived from leukemia, one from ovarian cancer, and one consisting of immortalized kidney cells. Around 7,500 smORFs showed up in at least one cell line. Of those, around 1,500 appeared in at least two cell lines and kept showing up when the researchers repeated their experiments. The reproducibility of the results gave the researchers confidence that these newly spotted genes really existed.
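The counting logic described above (how many smORFs appear in at least one cell line, in at least two, or in all of them) can be illustrated with a minimal sketch. This is not the Salk team's pipeline; the function name `count_by_reproducibility`, the cell line labels, and the smORF identifiers are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def count_by_reproducibility(detections):
    """Summarize how many smORFs recur across cell lines.

    `detections` maps a cell line name to the set of smORF identifiers
    detected in that line (hypothetical input format).
    """
    # Count, for each smORF, the number of cell lines it was detected in.
    seen = Counter()
    for smorfs in detections.values():
        seen.update(smorfs)

    n_lines = len(detections)
    return {
        "at_least_one": len(seen),
        "at_least_two": sum(1 for c in seen.values() if c >= 2),
        "all_lines": sum(1 for c in seen.values() if c == n_lines),
    }


# Toy data with made-up identifiers, not real results.
detections = {
    "leukemia": {"smORF_a", "smORF_b", "smORF_c"},
    "ovarian_cancer": {"smORF_b", "smORF_c", "smORF_d"},
    "kidney": {"smORF_c", "smORF_e"},
}
summary = count_by_reproducibility(detections)
print(summary)
```

In this toy example, five smORFs appear in at least one line, two in at least two lines, and one in all three, mirroring the kind of tiered tallies the researchers report.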
“We finally have reliable information that the human genome contains at least 2,500 to 3,500 smORFs,” said Saghatelian.
The challenge now is to figure out which smORFs are involved in disease—and whether the microproteins they code for could be disease targets. Already, the researchers have identified around 500 smORFs that show up in all three cell lines, suggesting they could have important biological functions.
“Right now, our methods can tell us if a smORF exists or doesn’t exist, but it doesn’t give us a lot of information on what is actually related to disease,” noted Saghatelian. “Going forward, the lab will start doing more research to find smORFs that may be specific to diseases like cancer or diabetes.”
Saghatelian pointed out that the science of smORFs is still in its early days, so the researchers hope other labs around the world will use their methods to hunt for smORFs in their own cell lines.
“This is really an unexplored area,” said Martinez. “At the end of the day, you want to know what all the parts are in the genome.” Now that smORFs are among the parts that can be scrutinized, scientists may achieve a better understanding of human biology. In addition, as suggested by Saghatelian, smORF-optimized Ribo-seq “may eventually have implications for diseases ranging from cancer to diabetes.”