January 15, 2007 (Vol. 27, No. 2)
Achieving Sufficient Power to Detect Disease Genes with the Quebec Founder Population
Genome-wide association studies (GWAS) are a powerful method for identifying susceptibility genes for common diseases, offering the promise of novel targets for therapeutic intervention that act on the root cause of disease. GWAS involve scanning thousands of samples, either as case-control cohorts or in family trios, utilizing hundreds of thousands of SNP markers located throughout the human genome. Algorithms are applied that compare the frequencies of single SNP alleles, genotypes, or multimarker haplotypes between disease and control cohorts.
This analysis identifies regions (loci) with statistically significant differences in allele or genotype frequencies between cases and controls, pointing to their role in disease. A schematic of the process is shown in Figure 1.
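The single-marker comparison described above can be illustrated with a Pearson chi-square test on a 2x2 table of allele counts. This is a minimal sketch of one common textbook test, not necessarily the test used in any particular study; the function name and the example counts are hypothetical.

```python
import math

def allelic_chi_square(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Pearson chi-square (1 degree of freedom) on a 2x2 allele-count table.

    Returns (chi2, p_value). Counts are alleles, i.e. two per subject.
    """
    n = case_alt + case_ref + ctrl_alt + ctrl_ref
    row_case = case_alt + case_ref
    row_ctrl = ctrl_alt + ctrl_ref
    col_alt = case_alt + ctrl_alt
    col_ref = case_ref + ctrl_ref
    denom = row_case * row_ctrl * col_alt * col_ref
    if denom == 0:
        raise ValueError("every row and column of the table must be non-empty")
    chi2 = n * (case_alt * ctrl_ref - case_ref * ctrl_alt) ** 2 / denom
    # For 1 degree of freedom the chi-square tail equals erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical SNP: 1,000 cases and 1,000 controls (2,000 alleles each),
# with a 30% allele frequency in cases and 25% in controls.
chi2, p = allelic_chi_square(600, 1400, 500, 1500)
print(f"chi2 = {chi2:.2f}, p = {p:.2e}")
```

In a genome-wide scan this test would be repeated for every marker, which is why the significance threshold has to be far stricter than for a single test.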
Novel Susceptibility Genes
GWAS have several advantages over alternative disease gene discovery methods. In contrast to candidate gene studies, which select genes for study based on known or suspected disease mechanisms, GWAS permit a comprehensive scan of the genome in an unbiased fashion and thus have the potential to identify totally novel susceptibility factors.
In comparison to family linkage-based approaches, association studies have two key advantages. First, they are able to capitalize on all meiotic recombination events in a population, rather than only those in the families studied. Because of this, association signals are localized to small regions of the chromosome containing only one to a few genes, enabling rapid detection of the actual disease susceptibility gene. Second, GWAS can identify disease genes that confer only modest increases in risk. Linkage studies are severely limited in detecting such genes, yet they are the very type of gene one expects for common disorders.
Due to these advantages, GWAS can identify multiple interacting disease genes and their respective pathways, providing a comprehensive understanding of the etiology of disease.
The power to detect association between genetic variation and disease is a function of several factors, including the frequency of the risk allele or genotype, the relative risk conferred by the disease-associated allele or genotype, the correlation between the genotyped marker and the risk allele, sample size, disease prevalence, and genetic heterogeneity of the sample population. While the first three factors are unknown prior to a specific GWAS, their impact can be influenced by the study design.
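How some of these factors interact can be sketched with a rough power calculation for a single-marker allelic test. The sketch below rests on several simplifying assumptions that are not from the article: a multiplicative risk model, a disease rare enough that the control allele frequency approximates the population frequency, a marker in perfect correlation with the risk allele, and a normal approximation to the two-proportion test. The function name and the input values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def allelic_power(risk_allele_freq, relative_risk, n_cases, n_controls, alpha):
    """Approximate power of a two-sided allelic test at significance level alpha.

    Assumes a multiplicative model: each copy of the risk allele multiplies
    disease risk by `relative_risk`.
    """
    p_ctrl = risk_allele_freq
    # Expected risk-allele frequency among cases under the multiplicative model.
    p_case = relative_risk * p_ctrl / (1 + p_ctrl * (relative_risk - 1))
    # Each subject contributes two alleles.
    se = sqrt(p_case * (1 - p_case) / (2 * n_cases)
              + p_ctrl * (1 - p_ctrl) / (2 * n_controls))
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(abs(p_case - p_ctrl) / se - z_crit)

# Illustrative inputs only: 20% risk-allele frequency, relative risk of 1.3,
# and a stringent per-marker threshold to allow for genome-wide testing.
for n in (750, 2000, 5000):
    print(n, round(allelic_power(0.20, 1.3, n, n, alpha=1e-7), 3))
```

Under these assumptions, power rises with sample size and with effect size, and falls as the significance threshold is tightened to account for more markers.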
Key Success Factors
While a powerful approach, GWAS are not without challenges. Critical to success is the development of robust study designs to ensure high power to detect genes of modest risk while minimizing the potential of false association signals due to testing large numbers of markers. Key components include sufficient sample sizes, rigorous phenotypes, comprehensive maps, accurate high-throughput genotyping technologies, sophisticated IT infrastructure, rapid algorithms for data analysis, and rigorous assessment of genome-wide signatures.
Population Resources: Critical to success is the collection of sufficient numbers of rigorously phenotyped cases and matched control groups or family trios to have sufficient power to detect disease genes conferring modest risk. Power studies have shown that at least 2,000 to 5,000 samples for both case and control groups are required when using general populations.
This large number of samples makes the collection of rigorously consistent clinical phenotypes across all cases quite challenging. In addition, matching of cases and controls with respect to geographic origin and ethnicity is critical for minimizing false positive signals due to population substructure.
SNP Maps and Genotyping: A second key success factor is having a comprehensive map of hundreds of thousands of carefully selected SNPs. Currently there are several groups offering SNP arrays for genotyping, with Affymetrix (www.affymetrix.com) and Illumina (www.illumina.com) both providing products containing more than 500,000 SNPs. Achieving high call rates and genotyping accuracy is also critically important, because small decreases in accuracy or increases in missing data can result in relatively large decreases in the power to detect disease genes.
IT and Analytic Tools: Genotyping instruments now have sufficient capacity to enable genotyping of thousands of subjects in only a few weeks. A study of 1,000 cases and 1,000 control subjects using a 550,000 SNP array produces over 1 billion genotypes. To properly store, manage, and process the enormous data sets arising from GWAS, a highly sophisticated IT infrastructure is needed, including computing clusters with sufficient CPUs and automated, robust pipelines for rapid data analysis.
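The genotype count quoted above follows from simple arithmetic, sketched here. The 2-bits-per-genotype packing is an assumption added for illustration and covers only the final genotype calls; raw instrument output and analysis results are typically much larger.

```python
def genotype_count(n_cases, n_controls, n_snps):
    """Total genotypes produced when every subject is typed at every SNP."""
    return (n_cases + n_controls) * n_snps

total = genotype_count(1_000, 1_000, 550_000)

# Assumption: a biallelic genotype call (three states plus "missing")
# fits in 2 bits when stored in packed form.
packed_bytes = total * 2 / 8

print(f"{total:,} genotypes")                                # 1,100,000,000 genotypes
print(f"{packed_bytes / 1e6:.0f} MB at 2 bits per genotype")  # 275 MB at 2 bits per genotype
```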
Given this wealth of genotypic data, the availability of efficient analytical tools for performing association analyses is critical to the successful identification of disease-associated signals. Primary genome-wide analyses include a comparison of allele and genotype frequencies between case and control cohorts or, for child-affected trios, a comparison of the frequencies of transmitted (case) and nontransmitted (control) alleles. An alternative test of association when using child-affected trios is the transmission disequilibrium test for the overtransmission of alleles to affected offspring.
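The transmission disequilibrium test mentioned above reduces, in its standard form, to a McNemar-type statistic on transmissions from heterozygous parents. The following is a minimal sketch with a hypothetical function name and made-up counts.

```python
import math

def tdt(transmitted, untransmitted):
    """Transmission disequilibrium test for one allele at one marker.

    `transmitted` is the number of times heterozygous parents passed the
    allele to an affected child; `untransmitted` is the number of times
    they passed the other allele. Returns (chi2, p_value) with 1 df.
    """
    informative = transmitted + untransmitted
    if informative == 0:
        raise ValueError("no heterozygous parents: the marker is uninformative")
    chi2 = (transmitted - untransmitted) ** 2 / informative
    # For 1 degree of freedom the chi-square tail equals erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical counts: 60 transmissions versus 40 non-transmissions.
chi2, p = tdt(60, 40)
print(f"chi2 = {chi2:.1f}, p = {p:.4f}")
```

Only heterozygous parents contribute, because a homozygous parent transmits the same allele regardless of the child's disease status.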
Since these analyses require considerable computing power to handle terabytes of data, genome-wide analyses are often limited to single SNPs, with haplotype analyses performed once candidate regions are identified.
GWAS in the Quebec Founder Population
Genizon BioSciences (www.genizon.com) faces the challenges of GWAS in unique ways. All samples collected come from the French Canadian population of Quebec, known as the Quebec Founder Population (QFP), which is genetically less heterogeneous than general populations. Thus, fewer resources are required for well-powered GWAS. In Genizon’s experience, when using the QFP, only 500 trios or 750 case-controls provide good power for less complex diseases, such as Crohn’s disease (Figure 2) and psoriasis. On the other hand, 1,000–1,500 cases and controls are required for more complex diseases, like type II diabetes and schizophrenia. For general populations, it is usually believed that at least 2,000–5,000 cases and controls may be required.
Use of genetic data from the QFP has also enabled the development of a comprehensive genome-wide SNP map, representative of linkage disequilibrium in this population. Genizon has customized an Illumina HumanHap300 BeadChip (317,000 SNPs) by adding 57,000 SNPs distributed according to the specific genetic profile of this population. Using the Illumina BeadStation genotyping platform, the company routinely achieves call rates and accuracy rates that exceed 99.9%.
To enhance association signals and increase the distance at which they can be detected, Genizon has developed multimarker haplotype analyses that are efficient enough to run at the genome-scan stage. Haplotype patterns are estimated with high accuracy from family trio or case-control data using a modified EM algorithm. A high-speed algorithm, LDSTATS, is then used to rapidly compare the frequency of single markers and haplotypes between patients and controls across the entire genome. The use of haplotype analyses can result in more significant association signals and was found to be critical to the discovery of many of the signals identified in Genizon’s studies.
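The article does not describe Genizon's modified EM algorithm or LDSTATS. As general background only, the sketch below shows the textbook EM approach to estimating haplotype frequencies for two biallelic SNPs from unphased genotypes; the function name and the example counts are hypothetical. The only ambiguous genotype is the double heterozygote, whose phase is apportioned according to the current frequency estimates.

```python
def em_haplotypes(geno_counts, iterations=50):
    """Estimate two-SNP haplotype frequencies from unphased genotype counts.

    geno_counts maps (i, j) to the number of subjects carrying i copies of
    allele 1 at the first SNP and j copies at the second (i, j in 0..2).
    Returns a dict keyed by haplotype (a, b), where a and b are 0 or 1.
    """
    haplotypes = [(0, 0), (0, 1), (1, 0), (1, 1)]
    known = {h: 0.0 for h in haplotypes}
    for (i, j), n in geno_counts.items():
        if (i, j) == (1, 1):
            continue  # phase-ambiguous, handled in the E-step
        first = (0, 0) if i == 0 else (1, 1) if i == 2 else (0, 1)
        second = (0, 0) if j == 0 else (1, 1) if j == 2 else (0, 1)
        known[(first[0], second[0])] += n
        known[(first[1], second[1])] += n
    n_double_het = geno_counts.get((1, 1), 0)
    total = 2 * sum(geno_counts.values())
    freq = {h: 0.25 for h in haplotypes}
    for _ in range(iterations):
        # E-step: probability that a double heterozygote is 00/11 vs 01/10.
        cis = freq[(0, 0)] * freq[(1, 1)]
        trans = freq[(0, 1)] * freq[(1, 0)]
        w = cis / (cis + trans) if cis + trans > 0 else 0.5
        # M-step: re-estimate frequencies from expected haplotype counts.
        counts = dict(known)
        counts[(0, 0)] += n_double_het * w
        counts[(1, 1)] += n_double_het * w
        counts[(0, 1)] += n_double_het * (1 - w)
        counts[(1, 0)] += n_double_het * (1 - w)
        freq = {h: c / total for h, c in counts.items()}
    return freq

# Hypothetical sample of 100 subjects.
example = {(0, 0): 40, (1, 1): 20, (2, 2): 40}
print(em_haplotypes(example))
```

With more than two markers the number of possible haplotypes grows quickly, which is one reason genome-wide haplotype analysis calls for efficient algorithms.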
Of primary importance, due to the large number of markers tested, Genizon assesses the genome-wide significance of the study by permutation analyses. In addition, methods are used to identify gene-gene interactions, including conditional analyses and a multifactorial approach, Random Forests. Subphenotypes and gender-specific analyses are routinely used to provide more homogeneous subsets of patients, thereby increasing effect size and statistical significance. These additional analyses have resulted in multiple additional discoveries of important disease-susceptibility genes.
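The article does not specify how the permutation analyses are carried out. One widely used scheme, sketched below as an assumption, shuffles case and control labels, records the largest test statistic across all markers for each shuffle, and reports how often that maximum matches or exceeds the observed one. Function names and data are hypothetical, and the allelic chi-square is used only as an example statistic.

```python
import random

def allelic_chi2(genotypes, labels):
    """Allelic chi-square for one SNP; genotypes are 0/1/2, labels 1=case, 0=control."""
    case_alt = sum(g for g, y in zip(genotypes, labels) if y == 1)
    ctrl_alt = sum(g for g, y in zip(genotypes, labels) if y == 0)
    case_total = 2 * sum(labels)
    ctrl_total = 2 * (len(labels) - sum(labels))
    case_ref = case_total - case_alt
    ctrl_ref = ctrl_total - ctrl_alt
    denom = case_total * ctrl_total * (case_alt + ctrl_alt) * (case_ref + ctrl_ref)
    if denom == 0:
        return 0.0  # monomorphic SNP or empty group: no evidence either way
    n = case_total + ctrl_total
    return n * (case_alt * ctrl_ref - case_ref * ctrl_alt) ** 2 / denom

def max_stat_permutation_p(snp_matrix, labels, n_perm=1000, seed=0):
    """Empirical genome-wide p-value for the best single-SNP statistic.

    snp_matrix is a list of per-SNP genotype lists, all in the same
    subject order as labels.
    """
    rng = random.Random(seed)
    observed = max(allelic_chi2(snp, labels) for snp in snp_matrix)
    shuffled = list(labels)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # breaks any genotype-phenotype link
        if max(allelic_chi2(snp, shuffled) for snp in snp_matrix) >= observed:
            exceed += 1
    # Adding 1 to both terms keeps the estimate away from exactly zero.
    return observed, (exceed + 1) / (n_perm + 1)

# Toy data: 40 cases, 40 controls, one perfectly associated SNP, one null SNP.
labels = [1] * 40 + [0] * 40
snps = [[2] * 40 + [0] * 40, [0, 1] * 40]
print(max_stat_permutation_p(snps, labels, n_perm=200))
```

Because whole subjects are permuted, the correlation between neighboring markers is preserved in every shuffle, so the resulting threshold is less conservative than a Bonferroni correction over all markers.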
In Genizon’s experience, the use of the QFP, which derives from a small group of founders 10–20 generations ago and has expanded in relative isolation to a large population today, coupled with advanced genetic analysis algorithms, has had a dramatic impact on the power to discover disease genes.
To date, Genizon has successfully completed eight GWAS, including Crohn’s disease, psoriasis, asthma, endometriosis, ADHD, schizophrenia, longevity, and male pattern baldness. These studies have resulted in the identification of disease-susceptibility genes, their associated pathways, and potential new therapeutic targets.