Investigators say results of five-year project will aid in the search for disease-related genetic variants.
A comparison of the genome sequences of 29 placental mammals has found that some 4.2% of the human genome consists of evolutionarily constrained elements. An international team led by researchers at the Broad Institute, the Genome Institute at Washington University, Baylor College of Medicine Human Genome Sequencing Center, and Uppsala University in Sweden spent five years sequencing and comparing the genomes of disparate mammalian species.
Their results, published in Nature, include the identification of nearly 4,000 previously undetected exons, 220 candidate RNA structural families, 2.7 million predicted targets of transcription factors, and close to a million elements that overlap potential promoter, enhancer, and insulator regions. The Broad Institute’s Eric S. Lander, Ph.D., Manolis Kellis, Ph.D., and Kerstin Lindblad-Toh, Ph.D., and colleagues report their findings in a paper titled “A high-resolution map of human evolutionary constraint using 29 mammals.”
Studies to date indicate that just 1.5% of the human genome carries protein-coding sequence. Estimates suggest that about 5% of genome content in total is probably functional, with about 3.5% of the genome consisting of noncoding elements with probable regulatory roles, the authors note. Hints about which regions are most important can be gleaned by comparing the genomes of different species to identify evolutionarily constrained sequences. Previous comparative mammalian studies, however, have focused on just the top 5% of sequences that have been evolutionarily constrained between species, which covers less than 0.2% of the genome.
In 2005, the Broad Institute-led team began its work to sequence a large collection of mammalian genomes and to identify and characterize functional elements on the basis of their evolutionary signatures. The results have been assembled from the systematic characterization of constrained sequences in the 29 placental mammal genomes that were sequenced and compared.
The authors report that, at 12 base-pair resolution, some 3.6 million elements spanning 4.2% of the genome are constrained between species. The mean element size of 36 bp is much shorter than the 123 bp mean element size detected in a prior human–mouse–rat–dog (HMRD) comparison.
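As a rough consistency check on these figures, the element count and mean element size can be multiplied out and compared with the reported 4.2% coverage. The sketch below is illustrative only: the genome size of about 3.1 billion base pairs is an assumption not stated in the article, and the function name is hypothetical.

```python
# Figures quoted in the article
N_ELEMENTS = 3.6e6        # constrained elements
MEAN_ELEMENT_BP = 36      # mean element size in base pairs

# Assumption (not from the article): approximate human genome size
GENOME_SIZE_BP = 3.1e9


def constrained_fraction(n_elements, mean_bp, genome_bp):
    """Fraction of the genome covered by n_elements of mean length mean_bp."""
    return n_elements * mean_bp / genome_bp


total_bp = N_ELEMENTS * MEAN_ELEMENT_BP
fraction = constrained_fraction(N_ELEMENTS, MEAN_ELEMENT_BP, GENOME_SIZE_BP)

# prints "129.6 Mb constrained, 4.2% of genome"
print(f"{total_bp / 1e6:.1f} Mb constrained, {fraction:.1%} of genome")
```

Under that assumed genome size, roughly 130 million constrained bases correspond to about 4.2% of the genome, in line with the reported figure.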
The analyses indicated that about 1.3% more of the human genome than previously thought comprises constrained elements, and 22% of the nucleotides contained within them lie in newly identified elements and are enriched in noncoding regions. Interestingly, the results also indicated that the more constrained the element, the less likely it is to harbor SNPs. “Not only are constrained regions less likely to exhibit polymorphism in humans,” the team notes, “but when such polymorphisms are observed, the derived alleles in humans tend to match the alleles present in nonhuman mammals, indicating a preference for the same alleles across both mammalian and human evolution.”
Initial analyses of the 3.6 million evolutionarily constrained elements also indicated that about 30% were associated with protein-coding transcripts, but the majority were associated with noncoding intronic (29.7%) and intergenic (38.6%) regions.
This finding led the researchers to study how these elements overlap with evolutionary signatures characteristic of specific types of features and with public large-scale experimental data. The analysis detected 3,788 new exon candidates, 54% of which reside outside transcripts of protein-coding genes, 19% within introns, and 13% in the UTRs of known coding genes.
Thirty-one percent of intronic and 13% of intergenic predictions appear to extend known transcripts, and of these, 5% and 11%, respectively, reside in new transcript models, the team continues. Interestingly, the analysis detected coding regions with a very low synonymous substitution rate, which indicates additional sequence constraints beyond the amino acid level, they note. “We found >10,000 such synonymous constraint elements (SCEs) in more than one-quarter of all human genes. Initial analysis indicates potential roles in splicing regulation (34% span an exon–exon junction), A-to-I editing, microRNA (miRNA) targeting, and developmental regulation.”
Using evolutionary signatures characteristic of conserved RNA secondary structures, the team additionally identified 37,381 candidate structural elements, including 1,192 novel families of structural RNAs. “Noteworthy examples include: a glycyl-tRNA family, including a new member in POP1, involved in tRNA maturation and probably involved in feedback regulation of POP1; three intronic families of long hairpins in ion-channel genes known to undergo A-to-I RNA editing and possibly involved in regulation of the editing event; an additional member of a family of 5′ UTR hairpins overlapping the start codon of collagen genes; and potential new miRNA genes that extend existing families.”
They similarly found patterns of conservation within core promoters, which were subdivided into three categories: those with uniformly “high” constraint (7,635 genes, 13,996 transcripts); uniformly “low” constraint (2,879 genes, 4,135 transcripts); and “intermittent” constraint, consisting of alternating peaks and troughs of conservation (14,271 genes, 29,814 transcripts). The high-constraint promoters were more commonly associated with development, the intermittent-constraint promoters were largely associated with genes involved in basic cellular functions, and the low-constraint promoters were enriched in genes associated with immunity, reproduction, and perception.
Turning to regulatory motifs, the authors found that the comparison of 29 mammalian genomes significantly improved the ability to detect individual motif instances, thus making it possible to predict specific target sites for 688 regulatory motifs corresponding to 345 transcription factors.
“With just a few species, we didn’t have the power to pinpoint individual regions of regulatory control,” notes Dr. Kellis, who is associate professor of computer science at MIT. “This new map reveals almost 3 million previously undetectable elements in noncoding regions that have been carefully preserved across all mammals, and whose disruptions appear to be associated with human disease.”
Overall, about 30% of constrained elements were associated with protein-coding transcripts, some 27% overlapped specific enriched chromatin states, roughly 1.5% overlapped novel RNA structures, and about 3% overlapped conserved regulatory motif instances. “Together, about 60% of constrained elements overlap one of these features, with enrichments ranging from 1.75-fold for chromatin states (compared to unannotated regions) up to 17-fold for protein-coding exons (compared to the whole genome),” the authors state.
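The per-feature shares above can be summed to see how they relate to the combined figure of about 60%. The sketch below is only an indicative check on rounded numbers, not the authors' method. It treats the sum of the shares as an upper bound on the combined share, which would be reached only if no element fell into more than one category. The variable names are hypothetical.

```python
# Shares of constrained elements per feature, as quoted in the article (percent)
feature_share = {
    "protein-coding transcripts": 30.0,
    "enriched chromatin states": 27.0,
    "novel RNA structures": 1.5,
    "conserved regulatory motif instances": 3.0,
}

# "about 60%" from the authors' quote; a rounded figure, so only indicative
REPORTED_COMBINED = 60.0

# If no element belonged to two categories, the combined share would equal the sum
upper_bound = sum(feature_share.values())

# Any shortfall relative to the sum reflects elements counted in more than one
# category (or simply rounding in the quoted figures)
excess = upper_bound - REPORTED_COMBINED

print(f"sum of shares: {upper_bound}%, reported combined: {REPORTED_COMBINED}%")
print(f"difference attributable to overlap or rounding: {excess} points")
```

The shares sum to 61.5%, so the quoted figures are mutually consistent if the categories overlap only slightly.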
They maintain that the identification of constrained elements will be particularly useful in the search for disease-associated genetic variation, and will complement experimental studies that require prior knowledge of biochemical activity. “Most of the genetic variants associated with common diseases occur in non-protein-coding regions of the genome,” comments Dr. Lindblad-Toh, scientific director of vertebrate genome biology at the Broad Institute and professor of comparative genomics at Uppsala University in Sweden. “In these regions it is often difficult to find the causal mutation. This catalog will make it easier to decipher the function of disease-related variation in the human genome.”