Researchers at Pennsylvania State University report that essentially all coding and noncoding RNA originates at the same types of locations along the human genome.
The scientists believe their work is especially intriguing because noncoding RNA, aka “genomic dark matter,” is not involved in making proteins but still comprises more than 95% of the human genome. They think their findings may eventually help to pinpoint exactly where complex-disease traits reside, since the genetic origins of many diseases lie outside the coding region of the genome.
The Penn State group published their study (“Genomic organization of human transcription initiation complexes”) as an advance online publication in this week’s Nature.
B. Franklin Pugh, Ph.D., who holds the Willaman Chair in Molecular Biology at Penn State, and postdoc Bryan Venters, Ph.D., who now holds a faculty position at Vanderbilt University, set out to identify the precise location of the beginnings of transcription.
“The human genome is pervasively transcribed, yet only a small fraction is coding,” wrote the investigators. “Widespread coding and noncoding transcription across the human genome arises from discrete transcription initiation complexes assembled at four core promoter elements [BREu, TATA, BREd, and INR, in constrained positions].”
Dr. Pugh said that, in their quest to learn just where transcription begins, other scientists had looked directly at RNA. He and Dr. Venters instead determined where along human chromosomes the proteins that initiate transcription of the noncoding RNA were located.
“We took this approach because so many RNAs are rapidly destroyed soon after they are made, and this makes them hard to detect,” explained Dr. Pugh. “So rather than look for the RNA product of transcription we looked for the ‘initiation machine’ that makes the RNA. This machine assembles RNA polymerase, which goes on to make RNA, which goes on to make a protein.”
Dr. Pugh added that he and Dr. Venters were amazed to find 160,000 of these initiation machines, because humans have only about 30,000 genes. “This finding is even more remarkable, given that fewer than 10,000 of these machines actually were found right at the site of genes,” continued Dr. Pugh. “Since most genes are turned off in cells, it is understandable why they are typically devoid of the initiation machinery.”
The remaining 150,000 initiation machines, i.e., those Drs. Pugh and Venters did not find right at genes, remained somewhat mysterious.
“These initiation machines that were not associated with genes were clearly active since they were making RNA and aligned with fragments of RNA discovered by other scientists,” noted Dr. Pugh. “In the early days, these fragments of RNA were generally dismissed as irrelevant since they did not code for proteins.”
He also pointed out that it was easy to dismiss these fragments because they lacked a feature called polyadenylation (a long string of adenosine bases) that protects the RNA from being destroyed. Drs. Pugh and Venters further validated their surprising findings by determining that these noncoding initiation machines recognized the same DNA sequences as the ones at coding genes, indicating that the noncoding RNAs have a specific origin and that their production is regulated, just as it is at coding genes.
“These noncoding RNAs have been called the ‘dark matter’ of the genome because, just like the dark matter of the universe, they are massive in terms of coverage [of the human genome]. However, they are difficult to detect and no one knows exactly what they all are doing or why they are there,” said Dr. Pugh. “Now at least we know that they are real, and not just ‘noise’ or ‘junk.’ Of course, the next step is to answer the question, ‘what, in fact, do they do?’”
Dr. Pugh thinks this research could represent one step toward solving the problem of “missing heritability,” a concept that describes how most traits, including many diseases, cannot be accounted for by individual genes and seem to have their origins in regions of the genome that do not code for proteins.