A team of researchers at the Massachusetts Institute of Technology (MIT) has published what it claims is the most comprehensive map yet of human noncoding DNA. The EpiMap (Epigenome Integration across Multiple Annotation Projects) resource offers an in-depth annotation of epigenomic marks (modifications indicating which genes are turned on or off in different types of cells) across 833 tissues and cell types. This represents a significant increase over prior coverage, the team said.
The researchers also identified groups of regulatory elements that control specific biological programs, and uncovered candidate mechanisms of action for about 30,000 genetic variants linked to 540 specific traits. They have made all of their data publicly available for use by the broader scientific community.
“What we’re delivering is really the circuitry of the human genome,” said Manolis Kellis, PhD, a professor of computer science, a member of MIT’s Computer Science and Artificial Intelligence Laboratory and of the Broad Institute of MIT and Harvard. Kellis is senior author of the team’s published report in Nature. He added, “Twenty years later, we not only have the genes, we not only have the noncoding annotations, but we have the modules, the upstream regulators, the downstream targets, the disease variants, and the interpretation of these disease variants.”
MIT graduate student Carles Boix is the first author of the paper, which is titled “Regulatory genomic circuitry of human disease loci by integrative epigenomics.” Co-authors include MIT graduate student Benjamin James and former MIT postdocs Yongjin Park, PhD, and Wouter Meuleman, PhD, who are now principal investigators at the University of British Columbia and the Altius Institute for Biomedical Sciences, respectively.
Twenty years ago this month, the first draft of the human genome was publicly released, the MIT team pointed out. One of the biggest surprises to emerge from the project was that only 1.5% of the human genome consists of protein-coding genes. It has since become apparent that the noncoding stretches of DNA, originally dubbed “junk DNA,” play critical roles in development and gene regulation.
Layered atop the sequence of nucleotides that makes up the genetic code of the human genome is the epigenome. The epigenome consists of chemical marks that help to determine which genes are expressed at different times, and in different cells. These marks include histone modifications, DNA methylation, and how accessible a given stretch of DNA is. “Epigenomics directly reads the marks used by our cells to remember what to turn on and what to turn off in every cell type, and in every tissue of our body,” Kellis said. “They act as post-it notes, highlighters, and underlining. Epigenomics allows us to peek at what each cell marked as important in every cell type, and thus understand how the genome actually functions.”
Mapping these epigenomic annotations can reveal genetic control elements, and the cell types in which different elements are active. These control elements can be grouped into clusters or modules that function together to control specific biological functions. Some of these elements are enhancers, which are bound by proteins that activate gene expression, while others are repressors that turn genes off.
The new EpiMap derived by the MIT team builds on and combines data from several large-scale mapping consortia, including ENCODE, Roadmap Epigenomics, and Genomics of Gene Regulation. The researchers assembled a total of 833 biosamples, representing diverse tissues and cell types. Each biosample had been mapped with a slightly different subset of epigenomic marks, which made it difficult to fully integrate data across the multiple consortia. The team then filled in the missing datasets by combining available data for similar marks and biosamples, and used the resulting compendium of 10,000 marks across 833 biosamples to study gene regulation and human disease. “The resulting compendium of 833 high-quality reference epigenomes, grouped into 33 tissue categories, represents a major increase in biological space coverage, with 75% (624 of 833) of biosamples corresponding to new biological specimen,” the investigators wrote.
The researchers also annotated more than two million enhancer sites, which cover only 0.8% of the genome in each biosample and 13% of the genome collectively. “Our high resolution enhancer annotations provide a highly concentrated view of the noncoding landscape, yielding many gene-regulatory insights but covering only 0.8% of the genome in each sample, and only 13% across all samples,” they wrote.
They grouped them into 300 enhancer modules—“… including 290 tissue-specific modules (1.8 million enhancers, 88% of enhancers cumulatively, active in 2% of biosamples on average) and 10 broadly active modules (251,079 enhancers, 12% of enhancers, active across 77% of sample categories on average)—based on their activity patterns, and linked them to the biological processes they control, the regulators that control them, and the short sequence motifs that mediate this control.” The researchers also predicted 3.3 million links between control elements and the genes that they target based on their coordinated activity patterns, representing the most complete circuitry of the human genome to date. “Our linking revealed the high number of enhancers that control each gene and the high tissue specificity of long-range enhancer–gene links.”
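The idea of linking control elements to genes by coordinated activity can be illustrated with a toy calculation. The following is a minimal sketch, not the method used in the paper: it assumes a plain Pearson correlation between an enhancer's activity and a gene's expression across biosamples, a hypothetical distance window of one million nucleotides, and a hypothetical correlation cutoff of 0.8. All names, positions, and values are invented for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def link_enhancers_to_genes(enhancers, genes, max_distance=1_000_000, min_corr=0.8):
    """Return (enhancer, gene, r) for nearby pairs whose activity tracks together.

    max_distance and min_corr are illustrative assumptions, not published values.
    """
    links = []
    for enh_name, (enh_pos, enh_activity) in enhancers.items():
        for gene_name, (gene_pos, gene_expr) in genes.items():
            if abs(enh_pos - gene_pos) > max_distance:
                continue  # too far apart to consider in this toy model
            r = pearson(enh_activity, gene_expr)
            if r >= min_corr:
                links.append((enh_name, gene_name, round(r, 3)))
    return links


# Hypothetical data: position on one chromosome, then a signal in four
# biosamples (liver, heart, cortex, hippocampus).
enhancers = {
    "enh_liver": (1_200_000, [0.9, 0.1, 0.0, 0.1]),
    "enh_brain": (1_250_000, [0.0, 0.1, 0.9, 0.8]),
}
genes = {
    "GENE_A": (1_000_000, [8.0, 1.0, 0.5, 1.2]),
    "GENE_B": (1_300_000, [0.3, 0.5, 7.5, 6.9]),
    "GENE_FAR": (9_000_000, [8.0, 1.0, 0.5, 1.2]),
}

links = link_enhancers_to_genes(enhancers, genes)
for enh_name, gene_name, r in links:
    print(f"{enh_name} -> {gene_name} (r = {r})")
```

In this toy data the liver-specific enhancer pairs with the liver-expressed gene and the brain-specific enhancer with the brain-expressed gene, while the distant gene is never considered. That mirrors, in miniature, the tissue specificity of enhancer-gene links that the authors describe.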
Since the final draft of the human genome was completed in 2003, researchers have performed thousands of genome-wide association studies (GWAS), revealing common genetic variants that predispose their carriers to a particular trait or disease. These studies have yielded about 120,000 variants, but only 7% of these are located within protein-coding genes, leaving 93% that lie in regions of noncoding DNA. “Genome-wide association studies (GWAS) have been successful in discovering more than 100,000 genomic loci that contain common single-nucleotide polymorphisms (SNPs) associated with complex traits and disease-related phenotypes, providing a very important starting point for the systematic investigation of the molecular mechanism of human disease,” the authors noted. “… the vast majority of these genetic associations remain devoid of any mechanistic hypothesis underlying their molecular and cellular functions, as more than 90% lie outside protein-coding exons and probably have noncoding roles in gene-regulatory regions with circuitry that remains unresolved.”
There are several reasons why it is hard to resolve how noncoding variants act. First, genetic variants are inherited in blocks, making it difficult to pinpoint causal variants among dozens of variants in each disease-associated region. Moreover, noncoding variants can act at large distances, sometimes millions of nucleotides away, making it difficult to find their target gene of action. They are also extremely dynamic, making it difficult to know which tissue they act in. Finally, understanding their upstream regulators remains an unsolved problem.
For their newly reported work, the researchers were able to address these questions and provide candidate mechanistic insights for more than 30,000 of these noncoding GWAS variants. The researchers found that variants associated with the same trait tended to be enriched in specific tissues that are biologically relevant to the trait. For example, genetic variants linked to intelligence were found to be in noncoding regions active in the brain, while variants associated with cholesterol level are in regions active in the liver. “Our epigenomic enrichments and enhancer–gene links yielded new biological insights on disease loci, with many compelling examples,” they stated.
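One common way to quantify this kind of tissue enrichment is a hypergeometric test, which asks how surprising it is that so many trait-associated enhancers fall among those active in a given tissue. The sketch below is an illustration under that assumption, not the paper's actual pipeline. The function name and all counts are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(total_enhancers, tissue_active, trait_hits, trait_hits_in_tissue):
    """One-sided p-value P(X >= trait_hits_in_tissue).

    Null model: the trait_hits enhancers carrying trait-associated variants are
    drawn at random from total_enhancers, of which tissue_active are active in
    the tissue of interest.
    """
    upper = min(tissue_active, trait_hits)
    denominator = comb(total_enhancers, trait_hits)
    tail = sum(
        comb(tissue_active, k) * comb(total_enhancers - tissue_active, trait_hits - k)
        for k in range(trait_hits_in_tissue, upper + 1)
    )
    return tail / denominator


# Hypothetical counts: 1,000 enhancers, 100 active in liver, 20 carrying
# cholesterol-associated variants, 8 of which are liver-active.
# Under the null model, about 2 of the 20 would be expected to be liver-active.
p_value = hypergeom_enrichment_p(1000, 100, 20, 8)
print(f"enrichment p-value: {p_value:.2e}")
```

A small p-value here would suggest that the trait's variants concentrate in liver-active enhancers more than chance allows. A real analysis would also need to correct for the hundreds of tissues and traits tested.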
The researchers also showed that some traits or diseases are affected by enhancers active in many different tissue types. For example, they found that genetic variants associated with coronary artery disease (CAD) were in regions active in adipose tissue, coronary arteries, and the liver, among many other tissues.
“In this work, we presented a comprehensive map of the human epigenome, EpiMap, encompassing approximately 15,000 epigenomic tracks across 833 distinct biological samples that greatly expand the coverage of both embryonic and adult tissues and cells,” the scientists concluded. While they acknowledge some limitations of the collection, they also say the work will open the way to future studies: “…hierarchical and multi-resolution tree-based analyses of gene regulation and GWAS; machine learning-based gene circuitry and combinatorial regulatory motif analyses; more sophisticated network analyses of our tissue–trait, trait–trait and tissue–tissue relationships; and guiding the experimental prioritization, methodological development and validation experiments, which can continue to further our understanding of gene regulation and human disease circuitry.” The team has generated an interactive website to explore their data, at http://compbio.mit.edu/epimap.
Kellis’ lab is now working with diverse collaborators to pursue these leads in specific diseases, guided by the genome-wide predictions. They are profiling heart tissue from patients with coronary artery disease, microglia from Alzheimer’s patients, and muscle, adipose, and blood from obesity patients. These tissues are predicted mediators of the respective diseases, based on the current paper and the lab’s previous work.
Many other labs are already using the EpiMap data to pursue studies of diverse diseases. “We hope that our predictions will be used broadly in industry and in academia to help elucidate genetic variants and their mechanisms of action, help target therapies to the most promising targets, and help accelerate drug development for many disorders,” Kellis said.