New data reveal that at least 80% of the human genome encodes elements that have some sort of biological function.
Far from containing vast amounts of junk DNA between its protein-coding genes, at least 80% of the human genome encodes elements that have some sort of biological function, according to newly released data from the Encyclopedia of DNA Elements (Encode) project, a five-year initiative that aims to delineate all functional elements within human DNA. The massive international project, data from which are published in 30 different papers in Nature, Genome Research, Genome Biology, the Journal of Biological Chemistry, Science, and Cell, has identified four million gene switches, effectively regulatory regions in the genome where proteins interact with the DNA to control gene expression.
Overall, the Encode data define tens of thousands of genes and hundreds of thousands of regulatory switches that are scattered all over the three billion nucleotides of the genome. In fact, the data suggest that the regions between gene-coding sequences contain a wealth of previously unrecognized functional elements, including nonprotein-coding RNA transcribed sequences, transcription factor binding sites, chromatin structural elements, and DNA methylation sites. The combined results suggest that 95% of the genome lies within 8 kb of a DNA-protein interaction, and 99% lies within 1.7 kb of at least one of the biochemical events measured by Encode, the researchers say.
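To make a proximity figure such as "95% of the genome lies within 8 kb" concrete, the short Python sketch below shows one way such a fraction could be computed from a list of element coordinates. It is an illustration only, not the Encode consortium's method: the function name `fraction_within`, the half-open coordinate convention, and the toy numbers are all assumptions made for the example.

```python
def fraction_within(elements, genome_length, max_distance):
    """Return the fraction of positions within max_distance bases of any element.

    elements: iterable of (start, end) intervals, 0-based and half-open
    (an assumed convention for this sketch).
    """
    # Pad every element by max_distance on both sides, clipped to the genome.
    padded = sorted(
        (max(0, start - max_distance), min(genome_length, end + max_distance))
        for start, end in elements
    )

    covered = 0
    cur_start = cur_end = None
    for start, end in padded:
        if cur_end is None or start > cur_end:
            # Gap before this interval: close the previous merged block.
            if cur_end is not None:
                covered += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or touching interval: extend the current block.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start

    return covered / genome_length


# Toy "genome" of 100 bases with two hypothetical elements.
toy_elements = [(10, 20), (50, 55)]
for distance in (0, 5, 20):
    print(distance, fraction_within(toy_elements, 100, distance))
```

Merging the padded intervals before summing keeps positions near two elements from being counted twice, which matters once the distance is large enough for neighboring windows to overlap.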
Importantly, given the complex three-dimensional nature of DNA, it’s also apparent that a regulatory element for one gene may be located quite some ‘linear’ distance from the gene itself. “The information processing and the intelligence of the genome reside in the regulatory elements,” explains Jim Kent, director of the University of California, Santa Cruz Genome Browser project and head of the Encode Data Coordination Center. “With this project, we probably went from understanding less than 5% to now around 75% of them.”
The Encode results also identified SNPs within regulatory regions that are associated with a range of diseases, providing new insights into the roles that noncoding DNA plays in disease development. “As much as nine out of 10 times, disease-linked genetic variants are not in protein-coding regions,” comments Mike Pazin, Encode program director at the National Human Genome Research Institute. “Far from being junk DNA, this regulatory DNA clearly makes important contributions to human disease.”
The NHGRI-sponsored Encode project included hundreds of researchers from across the U.S., U.K., Spain, Singapore, and Japan, who analyzed 147 different cell types. The initiative generated over 15 trillion bytes of raw data and more than 1,500 datasets, and used the equivalent of over 300 years of computer time. The likelihood that it would overturn the premise that most of the human genome is junk DNA became apparent in 2007, when data were published from a pilot phase of the project that looked at just 1% of the genome.
Having completed their analysis of the whole genome, the Encode researchers now summarize their methodologies and results in a paper in Nature titled “An integrated encyclopedia of DNA elements in the human genome.” Five additional papers in Nature and 24 associated papers in other journals provide more detailed contextual ‘themed’ results. Reporting in one of these papers in Science, a team led by University of Washington researchers details its studies using DNase I and massively parallel sequencing to create comprehensive maps of all regulatory DNA in many cell types. The maps were analyzed to help identify connections between disease-associated genetic variations and specific regulatory regions.
As well as finding that some 76% of disease-associated variants in nongene regions are located within or linked to regulatory DNA, the data suggest that many seemingly unrelated diseases share common regulatory circuitry. “Genes occupy only a tiny fraction of the genome, and most efforts to map the genetic causes of disease were frustrated by signals that pointed away from genes,” comments John A. Stamatoyannopoulos, Ph.D., associate professor of genome sciences and medicine at the University of Washington. “Now we know that these efforts were not in vain, and that the signals were in fact pointing to the genome’s operating system, the instructions for which are hidden in millions of locations around the genome.” Incredibly, Dr. Stamatoyannopoulos’ team separately found that over 90% of the millions of protein-docking regulatory elements they mapped are slight variants of just 683 different DNA sequences.
In one of the Nature papers, Yale University’s Mark Gerstein, Ph.D., and colleagues report on their work to trace the cascade of a half-million molecular interactions triggered by 119 transcription factors. Their resulting model indicates that these transcription factors are wired together in a hierarchical fashion, with some factors operating like top-level executives, and some as middle managers or shop foremen. Together they regulate the 20,000 or so genes in the human genome.
This hierarchical structure creates information-flow bottlenecks at the level of the “middle managers,” which Gerstein’s team showed work together to regulate target genes more efficiently and ease the bottlenecks. This means that the human genome is organized much more democratically than, say, the top-down command system of the military, Gerstein says.
However, the “executive-level” transcription factors do tend to have the most influence in key functions such as driving gene expression, and also have better connections with other genes in different molecular networks. Attesting to their importance to survival, these “executives” tend to be more conserved across populations.
“The Encode catalog is like Google Maps for the human genome,” says Elise Feingold, a program director at the NHGRI who helped to start the Encode Project. “The Encode maps allow researchers to inspect the chromosomes, genes, functional elements, and individual nucleotides in the human genome in much the same way.”
All Encode data are being made freely available, providing a resource for genome analysis and interrogation at a scale never before achieved. “Encode data can be used by any disease researcher, whatever pathology they may be interested in,” adds Ian Dunham of the EMBL-European Bioinformatics Institute, which co-led the project with NHGRI. “In many cases you may have a good idea of which genes are involved in your disease, but you might not know which switches are involved. Encode gives us a set of very valuable leads to follow to discover key mechanisms at play in health and disease. Those can be exploited to create entirely new medicines, or to repurpose existing treatments.”
The full Encode consortium datasets can be freely accessed through the Encode project portal as well as at the University of California, Santa Cruz Genome Browser, the National Center for Biotechnology Information, and the European Bioinformatics Institute.