Next-generation sequencing technologies are also being applied to study evolutionary changes. “There has never been a better time to analyze molecular variation data from natural populations,” notes Paul Marjoram, Ph.D., assistant professor, preventative medicine, Keck School of Medicine, University of Southern California (Los Angeles).
“We are examining mutation and recombination rates between individuals. It is a rite of passage for a computational biologist to develop methods to determine the number of mutations in a data set and to then calculate the rate at which mutations happen. The ultimate goal is to design association studies.”
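As a rough illustration of the kind of calculation Dr. Marjoram describes, the sketch below counts variable sites in a small alignment and converts that count into Watterson's estimator of the population-scaled mutation rate. This is one standard textbook estimator, chosen here as an assumption, and not necessarily the method his group uses. The function names and the toy sequences are hypothetical.

```python
def segregating_sites(seqs):
    """Count alignment columns in which the sequences are not all identical."""
    return sum(1 for column in zip(*seqs) if len(set(column)) > 1)


def watterson_theta(seqs):
    """Watterson's estimator: S / a_n, where a_n = sum_{i=1}^{n-1} 1/i."""
    n = len(seqs)
    if n < 2:
        raise ValueError("need at least two sequences")
    a_n = sum(1.0 / i for i in range(1, n))
    return segregating_sites(seqs) / a_n


# Toy, made-up alignment of four equal-length sequences (illustration only).
sample = ["ACGTACGT", "ACGTACGA", "ACCTACGT", "ACGTACGT"]

if __name__ == "__main__":
    print("segregating sites:", segregating_sites(sample))
    print("Watterson's theta:", watterson_theta(sample))
```

Dividing the result by the alignment length would give a per-site value. Real data would also require handling of gaps, missing calls, and sequencing error, which this sketch ignores.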
“Genome-wide association studies interrogate the genome in a set of individuals and look for polymorphisms that differentiate two populations, for example, those that have a disease and those that don’t,” explains Dr. Marjoram. “The problem is that often you can only derive partial information. There are holes and gaps in the coverage of the genome. These gaps can be filled by inferring sequences and imputing the missing data. This can be made easier by referring to an external library of data for related individuals in which you already know what falls in the missing regions.
“The key question to ask when starting association studies is ‘how big a sample do I need?’ So, the first step is to do a power calculation. There are a number of ways to do this, but the bottom line is that you have to divide the coverage across samples. We have found that it is better to divide coverage equally across individuals. But, even given that knowledge, you still need to decide whether to use, for example, 100 individuals and 20-fold coverage, or 500 individuals with fourfold coverage.”
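The two designs in that example spend the same sequencing budget: 100 × 20 and 500 × 4 both come to 2,000 individual-fold of coverage. The sketch below makes the trade-off concrete by listing equal-budget designs and, for each, the chance that a given site in a given individual is covered by at least two reads. It assumes read depth at a site follows a Poisson distribution, a common simplification that is not stated in the article, and the function names are hypothetical. It is not a full power calculation, which would also depend on effect size, allele frequency, and error rates.

```python
import math


def prob_at_least(k, depth):
    """P(X >= k) for X ~ Poisson(depth): chance a site gets at least k reads."""
    below = sum(math.exp(-depth) * depth**i / math.factorial(i) for i in range(k))
    return 1.0 - below


def designs(total_budget, depths):
    """List (individuals, depth) pairs that spend exactly the same budget."""
    return [(total_budget // d, d) for d in depths if total_budget % d == 0]


if __name__ == "__main__":
    budget = 100 * 20  # same total as 500 individuals at fourfold coverage
    for n, d in designs(budget, [4, 10, 20]):
        print(f"{n} individuals at {d}-fold: "
              f"P(>=2 reads at a site) = {prob_at_least(2, d):.3f}")
```

Under this assumption, deeper coverage makes each individual's genotype calls more complete, while shallower coverage buys more individuals; which matters more depends on the study.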
Ultimately, it is like the race of the tortoise and the hare. The hare, in this case, will use inexact methods to more quickly produce a best guess, while the tortoise will perform slow, steady, and difficult annotation of genomic sequences. “Both approaches will produce results in their own time, but faster and more useful methods are available right now by using simplified models or summaries of the data,” concludes Dr. Marjoram.