Kevin Mayer, Senior Editor, Genetic Engineering & Biotechnology News

Pooled, De-Identified, and Richly Annotated Data Resource Promises “Insights through Diversity”

The Allele Frequency Community (AFC), an initiative that is compiling ethnically diverse genomic information, intends to improve the interpretation of gene variants in personalized medicine. The more data the AFC amasses, and the more curatorial activity the group stimulates, the more confidence it builds. Just two days after it was launched, the AFC reported impressive growth.

At its inception, on March 2, the AFC indicated that it had collected 70,100 diverse human exome- and genome-wide variant call datasets spanning over 100 countries of origin. These datasets included 8,000 whole genomes. By March 4, the resource had grown to 82,717 sequences, an 18% increase, and the number of whole genomes had reached 13,000.

Quantity, the AFC realizes, isn’t everything. When asked about quality, an AFC representative noted that the organization was prepared to apply a variety of quality control measures: simple quality thresholds that could exclude inferior data, that is, data failing to meet community standards; algorithms that could ensure genetic information from the same individual isn’t counted multiple times; and more complex algorithms, for example, that could mitigate potential biases caused by family/cohort studies.
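The three kinds of quality control described above can be pictured with a minimal Python sketch. It is an illustration only, not the AFC's actual pipeline: the `Sample` fields, the 0.95 call-rate threshold, and the one-sample-per-family rule are all assumptions made for the example, and the family rule is a crude stand-in for the "more complex algorithms" the representative mentioned.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical threshold; the article does not state the community standard.
MIN_CALL_RATE = 0.95


@dataclass(frozen=True)
class Sample:
    sample_id: str
    individual_id: str                # assumed de-identified key for the person sequenced
    call_rate: float                  # fraction of sites with a confident genotype call
    family_id: Optional[str] = None   # assumed label for family/cohort membership


def quality_filter(samples: List[Sample]) -> List[Sample]:
    """Apply three illustrative checks in order: threshold, dedup, relatedness."""
    kept = []
    seen_individuals = set()
    seen_families = set()
    for s in samples:
        # 1. Simple quality threshold: drop data below the assumed standard.
        if s.call_rate < MIN_CALL_RATE:
            continue
        # 2. Do not count the same individual more than once.
        if s.individual_id in seen_individuals:
            continue
        # 3. Crude relatedness control: keep one sample per family.
        if s.family_id is not None and s.family_id in seen_families:
            continue
        seen_individuals.add(s.individual_id)
        if s.family_id is not None:
            seen_families.add(s.family_id)
        kept.append(s)
    return kept
```

In this sketch the first passing sample from each individual and each family is the one retained; a real resource would more likely detect duplicates and relatives from the genotypes themselves rather than from submitted labels.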

Such measures could help the AFC maintain quality over time, a key challenge for any group that intends to build a crowdsourced resource. Meeting this challenge is especially important here, since the idea behind the AFC is to distinguish between truly rare genetic variants and genetic variants that merely appear rare, for example, because they happen to be rare in an isolated data collection.

Gene variants are usually considered more or less likely to serve as harbingers of disease depending on their rarity, that is, their allelic frequency. Yet allelic frequency itself can be hard to assess. It can vary not only by ethnicity, but also from subpopulation to subpopulation within an ethnic group or by geographic region.

To help medical researchers and diagnosticians account for these potentially confounding factors, the 13 founding members of the AFC agreed to pool their variant call datasets in a secure, anonymized fashion to create what they characterize as the most ethnically diverse, freely accessible, hosted community database of allele frequencies available. Until now, laboratories often collected their own, private allele frequency libraries, but did not have the infrastructure and incentives to integrate their resources into a freely available community asset.

AFC participants include leading life sciences and diagnostic organizations: Columbia University Institute for Genomic Medicine, Emory Genetics Laboratory, Erasmus University Medical Center, Icahn Institute for Genomics and Multiscale Biology at Mount Sinai, The Institute for Systems Biology, Inova Translational Medicine Institute, Laboratory Corporation of America, New York Genome Center, Partners Healthcare, Qiagen, University of British Columbia, the University of Washington, and Weill Cornell Medical Center. More information about AFC participants is available at

One of the participants, Qiagen, is providing secure bioinformatics infrastructure and software that let research and clinical laboratories contribute to, and gain insights from, the AFC. An internal study by Qiagen has already demonstrated the utility of the AFC resource. According to Qiagen, the AFC database helped reduce the false-positive rate in diagnostic odyssey studies by an average of 43%.

“Community members have already seen disease research cases with up to 90% reduction in false-positive rates using information in the AFC database,” added Douglas E. Bassett, Ph.D., Qiagen’s vice president of translational research and CSO of bioinformatics. “[This] has exciting implications for the future of sequence-based research in the field.”

GEN followed up with Dr. Bassett, who clarified how the AFC fits in with existing variant databases, and how the AFC could demonstrate increasing utility as it grows.

GEN: How might the Allele Frequency Community (AFC) complement the work of large, clinically oriented variant databases such as ClinVar?

Dr. Bassett: The purpose of the Allele Frequency Community is to provide a diverse resource of allele frequency information that enables the community to accurately identify clinically relevant variants in next-generation sequencing (NGS) studies and tests. Over time, availability of the Allele Frequency Community should help us as a community improve the accuracy and consistency of interpretations submitted to resources like ClinVar.

GEN: How does the AFC relate to other broad allele frequency efforts, such as ALFRED and the Allele Frequencies Net Database (AFND)?

Dr. Bassett: These are complementary resources. The focus of the AFC is to ensure streamlined sharing of allele frequency information that empowers interpretation of human NGS data for human health benefit. [Going forward, the AFC could] also provide anonymized, pooled statistics for ethnic subpopulations. For example, a clinical researcher studying a rare disease wants to know if a given variant has been observed at a high frequency in any ethnic subpopulation, regardless of the frequency in the overall population. We have plans to enable this, but only once the dataset for a particular subpopulation reaches a sufficient size to ensure that patient privacy is protected.
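The idea of releasing pooled subpopulation statistics only once a group is large enough can be sketched in a few lines of Python. This is a hedged illustration, not the AFC's method: the function name, the input format, and the `MIN_GROUP_SIZE` value are assumptions, since the interview gives no specific size.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical minimum group size; the interview does not give a number.
MIN_GROUP_SIZE = 1000


def subpopulation_frequencies(
    genotypes: Iterable[Tuple[str, int]],
    min_group_size: int = MIN_GROUP_SIZE,
) -> Dict[str, float]:
    """Pooled allele frequency of one variant per subpopulation.

    Each item is (subpopulation label, alternate-allele count), where the
    count is 0, 1, or 2 for a diploid individual. Groups with fewer than
    min_group_size individuals are withheld from the result.
    """
    alt_alleles = defaultdict(int)
    individuals = defaultdict(int)
    for group, alt_count in genotypes:
        alt_alleles[group] += alt_count
        individuals[group] += 1
    return {
        group: alt_alleles[group] / (2 * individuals[group])
        for group in individuals
        if individuals[group] >= min_group_size
    }
```

With a toy threshold of four individuals, a group of four people carrying three alternate alleles between them would be reported at a frequency of 3/8, while a group with a single member would be left out entirely.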

GEN: Does the AFC anticipate that it will help laboratories and researchers avoid duplication of effort? Or is it part of the AFC’s value to compile instances of duplication?

Dr. Bassett: Beyond duplication of effort, the Allele Frequency Community really helps us avoid duplication of error, reducing the false-positive rate and putting the “precision” in precision medicine. What I mean by that is, every time we see a patient in a lab from an ethnic group that is underrepresented in the public sequence databases, there is increased risk of calling a variant potentially disease-causing that in fact is a common polymorphism in that patient’s ancestry.

In an ideal world, every lab would have a database at hand that provides allele frequency information representing thousands of individuals from every ethnic subpopulation on the planet. Since that dataset does not yet exist, and labs are not funded to create it for themselves, labs are forced to do the best they can, using resources that are publicly available and drawing upon their case history.

[To date, a lab that] has dealt with a large number of patients from a particular ethnic subpopulation has had a much easier time differentiating between a potential disease-causing variant and a more common polymorphism within that population. Now, with the AFC, we can reduce the false-positive rates across ancestries, for the benefit of patients.
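The false-positive problem Dr. Bassett describes can be made concrete with a small Python sketch. It assumes a lab already has per-group allele frequencies for each candidate variant; the function names, the variant labels, and the 1% cutoff are hypothetical choices for illustration, not values taken from the AFC or Qiagen.

```python
from typing import Dict, List


def is_plausibly_rare(freq_by_group: Dict[str, float], max_allowed: float = 0.01) -> bool:
    """True if the variant is at or below the cutoff in every group on record.

    A variant with no frequency data is treated as unobserved, so it is kept.
    """
    return max(freq_by_group.values(), default=0.0) <= max_allowed


def filter_candidates(
    candidates: Dict[str, Dict[str, float]], max_allowed: float = 0.01
) -> List[str]:
    """Keep candidate variants that are not common in any single group."""
    return [
        variant
        for variant, freqs in candidates.items()
        if is_plausibly_rare(freqs, max_allowed)
    ]


# Illustrative numbers only: var1 looks rare overall but is common in one group.
candidates = {
    "var1": {"overall": 0.002, "group_x": 0.08},
    "var2": {"overall": 0.0005, "group_x": 0.001},
}
remaining = filter_candidates(candidates)
```

Here a lab relying on the overall frequency alone would keep var1 as a candidate, whereas checking the highest frequency across groups removes it as a likely common polymorphism in one ancestry.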

GEN: Could we clarify what the AFC means by “open” and “freely accessible”?

Dr. Bassett: The AFC is free to join, and the content is freely accessible and hosted within Qiagen’s Ingenuity Variant Analysis. There are no minimum sample requirements to join. The interesting twist is, the founders wanted to set up the AFC such that it would actually grow in benefit as it is used.

There was a desire here to create a real incentive for the community to share, to benefit the future of precision medicine. So, if a lab wants to annotate a given sample with statistics from the AFC database, it pays nothing, but it agrees that that sample can then be used in a de-identified way to compute anonymized, pooled statistics, pooled with all the rest of the samples from the community. So the database grows a bit more useful to the community each time it is used.

Researchers can download the statistics from the AFC database and use them in any tool they’d like, but they lose the benefit of updates and annotations that take place after their download.