Offering a glimpse of the bright gleam of victory, the 1000 Genomes Project Consortium has announced that it has accomplished its goal, the creation of a comprehensive catalog of human genomic variation. When it first mustered an international team of scientists back in 2008, the project planned to build a reference dataset that would show how rare genomic variants were distributed among populations around the world—or at least within a microcosm of 1000 individuals. Now, seven years later, the project has reconstructed the genomes of 2,504 individuals from 26 populations across Africa, East and South Asia, Europe, and the Americas.
The culmination of the project was described in a pair of papers that appeared September 30 in Nature, along with an editorial that carried a Churchillian title, “Human genomics: The end of the start for population sequencing.” Presumably, the beginning of the end of population sequencing would see researchers and clinicians leveraging genomic variant information to develop improved diagnostics and treatments, in addition to new methods of prevention.
Formerly, genomic variant information was too scant to support such gains. Going forward, it may be considerably easier to use variant information in a wide range of studies of human biology and medicine. This may be what the end of the beginning looks like: the availability of a genomic catalog that can provide the basis for a new understanding of how inherited differences in DNA can contribute to disease risk and drug response.
In one of the Nature articles—“A global reference for human genetic variation”—the 1000 Genomes Project identified “over 88 million variants (84.7 million single nucleotide polymorphisms, 3.6 million short insertions/deletions, and 60,000 structural variants), all phased onto high-quality haplotypes.” This information contributes to a resource encompassing more than 99% of SNP variants with a frequency of more than 1% for a variety of ancestries.
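As a quick check on the figures quoted above, the three variant classes can be tallied to confirm that they add up to "over 88 million." The snippet below is only an arithmetic sketch: the counts are the rounded values from the quotation, and the variable names are chosen here for illustration.

```python
# Rounded counts taken from the quoted passage (not exact call-set totals).
variant_counts = {
    "single nucleotide polymorphisms": 84_700_000,
    "short insertions/deletions": 3_600_000,
    "structural variants": 60_000,
}

total_variants = sum(variant_counts.values())
print(f"total: {total_variants:,}")

# Share of each class in the rounded total.
for variant_class, count in variant_counts.items():
    print(f"{variant_class}: {count / total_variants:.2%}")
```

On these rounded numbers, single-base changes make up the overwhelming majority of the catalog, which is consistent with the observation that most identified variants are small.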
Most of the identified variants are small, affecting only a single base. Nonetheless, small but complex changes are also evident.
“About one-quarter of these variants are common and occur in many or all populations, while about three-quarters occur in only 1 percent of people or are even more rare,” said Lisa Brooks, Ph.D., program director in the NHGRI Genomic Variation Program. “The 1000 Genomes Project data are a resource for any study in which scientists are looking for genomic contributions to disease, including the study of both common and rare variants.”
In the other Nature article—“An integrated map of structural variation in 2,504 human genomes”—differences in the structure of the genome were examined. Nearly 69,000 structural variants were found. These genomic differences, many of which affected genes, include deletions (loss of DNA), insertions (added DNA), and duplications (extra DNA copies). The researchers created a map of eight classes of structural variants that potentially contribute to disease.
“Analysing this set [of eight structural variant classes], we identify numerous gene-intersecting structural variants exhibiting population stratification and describe naturally occurring homozygous gene knockouts that suggest the dispensability of a variety of human genes,” the article indicated. “We demonstrate that structural variants are enriched on haplotypes identified by genome-wide association studies and exhibit enrichment for expression quantitative trait loci.”
This study also uncovered appreciable levels of structural variant complexity at different scales, including genic loci subject to clusters of repeated rearrangement and complex structural variants with multiple breakpoints likely to have formed through individual mutational events.
“Structural variation is responsible for a large percentage of differences in the DNA among human genomes,” said Jan Korbel, Ph.D., an investigator at the European Molecular Biology Laboratory and senior author of the structural variation article. “No study has ever looked at genomic structural variation with this kind of broad representation of populations around the world.”
One of the more immediate uses of 1000 Genomes Project data is for genome-wide association studies (GWAS), which compare the genomes of people with and without a disease to search for regions of the genome that contain genomic variants associated with that disease. Such studies generally find several genomic regions associated with a disease and many variants in each of those regions. Scientists can now combine GWAS data with the more detailed 1000 Genomes Project data to home in on regions affecting disease more precisely. Instead of sequencing the genomes of all the people in a study, which remains expensive, researchers can use the 1000 Genomes Project data to find most of the variants in those regions that are associated with the disease.
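The case-versus-control comparison at the heart of a GWAS can be illustrated with a minimal sketch. The example below runs a basic Pearson chi-square test on allele counts for two variants. It is not the method used by the 1000 Genomes Project or any particular study: the variant names and counts are hypothetical, and a real analysis would also correct for multiple testing and population structure.

```python
import math


def allelic_chi_square(case_alt, case_ref, control_alt, control_ref):
    """Pearson chi-square test (1 degree of freedom) on a 2x2 allele-count table.

    Rows are cases and controls; columns are alternate and reference alleles.
    Returns the chi-square statistic and its p-value.
    """
    table = [[case_alt, case_ref], [control_alt, control_ref]]
    row_totals = [sum(row) for row in table]
    col_totals = [case_alt + control_alt, case_ref + control_ref]
    total = sum(row_totals)
    if 0 in row_totals or 0 in col_totals:
        raise ValueError("every row and column needs at least one observation")

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected

    # For 1 degree of freedom, the chi-square survival function is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical allele counts: (case_alt, case_ref, control_alt, control_ref).
variants = {
    "variant_A": (300, 700, 200, 800),  # alternate allele enriched in cases
    "variant_B": (250, 750, 245, 755),  # nearly identical in both groups
}

results = {name: allelic_chi_square(*counts) for name, counts in variants.items()}

# List variants from strongest to weakest association.
for name, (chi2, p) in sorted(results.items(), key=lambda item: item[1][1]):
    print(f"{name}: chi2={chi2:.2f}, p={p:.3g}")
```

In this toy setup, only the first variant shows a clear difference between cases and controls. A reference panel such as the 1000 Genomes data comes into play after a step like this, when researchers want to know which other variants lie in the same associated region.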
To Gonçalo Abecasis, Ph.D., chair of biostatistics at the University of Michigan in Ann Arbor and co-principal investigator for the global reference study, the value of the 1000 Genomes Project extends far beyond the data. Advances in DNA sequencing and bioinformatics were vital to completing the project. For example, the global reference study relied on a combination of low-coverage whole-genome sequencing, deep exome sequencing, and dense microarray genotyping. In the structural variation study, the integrated map of variant classes was constructed using short-read DNA sequencing data, which was statistically phased onto haplotype blocks.
“We've learned a great deal about how to do genomics on a large scale,” said Dr. Abecasis. “Over the course of the 1000 Genomes Project, we developed new, improved methods for large-scale DNA sequencing, analysis, and interpretation of genomic information, in addition to how to store this much data. We learned how to do quality genomic studies in different contexts and parts of the world.”