October 15, 2017 (Vol. 37, No. 18)

Better Methods to Analyze and Organize DNA Fragments Are Emerging

The evolution of the MUC7 gene, which encodes a saliva protein, was studied by scientists at the University at Buffalo. To capture the gene’s proline-, threonine-, and serine-rich tandem repeat copy number variations (CNVs), the scientists analyzed more than 2,500 genomes. This extensive work revealed that MUC7 subexonic CNVs fall into eight divergent haplotype clusters. It also showed that the version of the gene found in sub-Saharan African populations was very different from the versions found in other populations. This finding could be due to archaic introgression, the introduction of genetic material from an unknown human relative. [Bob Wilder/University at Buffalo]

An overarching challenge has been to accurately identify and describe CNVs, particularly because many of them occur in repetitive genomic regions, which are relatively resistant to sequencing with the short reads that are routinely used. Moreover, no single sequencing approach can accurately identify all CNV types.

Developing Better Methods

“We tried to develop more powerful and less expensive ways of identifying copy number variants,” says Michael H. Wigler, Ph.D., professor at Cold Spring Harbor Laboratory. Investigators in Dr. Wigler’s lab recently developed SMASH (short multiply aggregated sequence homologies), a new protocol to measure CNV. In this approach, genomic DNA is randomly sheared into fragments with a mean length of about 40 base pairs that are subsequently joined together into chimeric stretches to build next-generation sequencing libraries. Compared to whole-genome sequencing, SMASH generates multiple independent mappings in each sequence, increasing the information density per read.

“SMASH was our best offering on how to reduce the cost of large-scale copy number measurement by sequence,” says Dr. Wigler. Another advantage of SMASH is that the mechanical and/or enzymatic shearing process eliminates GC bias. “Sequencing platforms may change and may get cheaper, but this will always be the cheapest way to do it, and that is because we break the DNA up into minimum size pieces,” says Dr. Wigler.
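The information-density argument behind SMASH can be illustrated with a small sketch. It is not the published SMASH pipeline: the 150-base read length, the fixed-width bins, the median normalization, and the function names are all assumptions chosen for illustration. The only figure taken from the description above is the mean fragment length of about 40 base pairs.

```python
from collections import Counter
from statistics import median


def mappings_per_read(read_length, fragment_length=40):
    """Rough count of independently mappable fragments in one chimeric read."""
    return read_length // fragment_length


def bin_counts(map_positions, bin_size):
    """Count mapped fragment positions falling in each fixed-width genomic bin."""
    return Counter(pos // bin_size for pos in map_positions)


def copy_ratio(counts, n_bins):
    """Divide each bin's count by the median bin count (1.0 = typical coverage)."""
    values = [counts.get(b, 0) for b in range(n_bins)]
    mid = median(values)
    if mid == 0:
        raise ValueError("median bin count is zero; use larger bins or more data")
    return [v / mid for v in values]


# Illustrative numbers only: a hypothetical 150-base read and six mapped positions.
print(mappings_per_read(150))   # 3 mappings instead of 1 per read
counts = bin_counts([5, 15, 25, 1005, 2005, 2010], bin_size=1000)
print(copy_ratio(counts, n_bins=3))   # [1.5, 0.5, 1.0]
```

In this toy example, a bin with a ratio above 1.0 would hint at a gain and a bin below 1.0 at a loss. A real analysis would need many more mappings per bin and a more careful choice of bin boundaries.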

Another effort in Dr. Wigler’s lab led to the development of single nucleus sequencing (SNS), a method that allows the characterization of the copy number profile of a single cell. In a proof-of-principle analysis that used breast cancer cells, Dr. Wigler and colleagues showed that SNS can identify clonal expansions during tumor growth and also compare the copy number profile of metastatic cells with that of the primary tumor. Building on this work, Dr. Wigler and colleagues recently proposed an approach for the early detection of cancer using a blood test. The proposal is that screening for cancer cells in the blood, to detect disease before clinical onset, should rely on a universal signal from the genomic DNA of cancer cells: their shared CNV profile, which is acquired during the clonal expansion of the malignancy. After enrichment for atypical rare cells from the blood, the cells are separated based on surface proteins, and their copy number is profiled using a computational approach. Analysis of the individual cells can inform investigators about the tissue of origin and help guide diagnosis and treatment. The alternative approach, currently, is to profile sequence variants using cell-free DNA in the blood.

“Our method is somewhat orthogonal to that, and we don’t know which one will work better yet,” says Dr. Wigler. One of the advantages of examining individual circulating tumor cells is what Dr. Wigler calls “phase.” “When we see a lot of mutations in single-cell analysis, we know that they are all on the same cell, and that provides certainty that the cell is malignant,” notes Dr. Wigler. Mutations that are seen on cell-free DNA isolated from the blood might originate from the same or from different normal cells that have undergone somatic mutations. “There are reasons why one might want to do both methods,” says Dr. Wigler. The method promises accurate risk assessment for early cancer detection as well as a way to examine response to therapy. “I think in the future this will take survival beyond the metastatic stage, and doing this within the next few years is very plausible,” says Dr. Wigler.

In-Depth Analysis of CNVs

“For the first time, we are getting near-complete genomes that are phased with respect to structural variations,” says Charles Lee, Ph.D., professor and director of The Jackson Laboratory for Genomic Medicine and president of the Human Genome Organization. A recent effort by the Human Genome Structural Variation Consortium, which Dr. Lee cochairs, has focused on placing structural variants on distinct haplotypes, which is required for their proper genotyping. Structural variants often mutate faster than single nucleotide polymorphisms, “which makes structural variants difficult to impute, since linkage disequilibrium between them and other structural variants and single nucleotide variants breaks down faster,” says Dr. Lee. As a result, many CNVs have to be identified directly and then be placed on the proper haplotypes.

“When the 1000 Genomes Project first came out, we were picking up only deletions, which was the state of the art, but now we are able to also see duplications, mobile element insertions, and even segments of mitochondrial DNA that became inserted into the chromosomal DNA,” says Dr. Lee. In a study that examined over 2,500 human genomes (across 26 human populations) from the 1000 Genomes Project, Dr. Lee and colleagues used short-read DNA sequencing fragments—approximately 125 base-pairs long—and cataloged over 68,000 structural variants falling into eight classes.

“Over the past two years, we have been working on a new project, in which we looked at a smaller number of individuals but in a much more comprehensive way with respect to structural variation, using orthogonal genomic technologies, including long-read technologies. We are finding so much more structural variation,” adds Dr. Lee. Many structural variants occur in parts of the genome that have tandem repeats surrounding them, and this partly explains the difficulties in annotating them.

“Short reads have significant limitations in terms of what they are able to find,” says Dr. Lee. Long-read technologies, as well as technologies such as strand-seq and optical mapping, provide information on a larger scale and are ideally positioned to identify and characterize structural variants in chromosomal regions that were much more challenging to identify and analyze in the past. A better insight into the structural complexity of the CNVs may shed light on their functional relevance. “Once our results are released, people will be able to use the information to go into their gene or region of interest and delve into the functional aspects in much more detail than we will be doing with the scope of the current study,” says Dr. Lee.

Paralleling the growing interest in understanding chromatin structure-function relationship, a topic of interest in recent years has been to interrogate the way in which structural variants can change chromatin structure locally. “We are just starting to scrape the surface of how structural variation changes may cause a local change in chromatin structure and how that alters the expression of a gene nearby and leads to a phenotype, and a lot more work is needed to look at this,” says Dr. Lee.

Evolution and Links to the Microbiome

“We have always been proponents of the involvement of copy number variants in shaping human phenotype, disease, and evolution, but that is a very hard thing to test,” says Omer Gokcumen, Ph.D., assistant professor of biological sciences at State University of New York at Buffalo. Most inter-individual genomic variation, when the number of base pairs is considered, is generated by CNVs. “The problem is that the mechanisms that create these copy number variants as compared to single nucleotide variants are different,” says Dr. Gokcumen. CNVs may be created through repair errors from nonhomologous recombination, other types of recombination errors, or gene conversions. “This leads to differences in where they occur in the genome and, also, to differences in their size, and generates many subcategories of copy number variants, with potentially different implications in disease and type of disease,” says Dr. Gokcumen.

Salivary MUC7, an abundant human protein encoded by the mucin-7 gene, is a good example of such variation in the types of functional CNVs and a primary focus in Dr. Gokcumen’s lab. The mucin-7 gene carries 69-base-pair subexonic repeats that occur in various copy numbers among individuals. These repeats, which affect protein size, encode densely O-glycosylated proline-, threonine-, and serine-rich (PTS) tandem repeat domains. The protease-resistant glycan regions provide binding sites for microbes.

“The 1000 Genomes Project, which is the ultimate and most accurate database that we have right now, is completely missing the copy number variation in this gene there because it is so hard to discover them with short reads,” says Dr. Gokcumen. To capture the PTS-repeat CNV in mucin-7, Dr. Gokcumen and colleagues genotyped 251 random samples from the 1000 Genomes Project and revealed that CNV in this gene occurred independently in at least two branches of the human phylogenetic tree. Subsequently, an analysis of more than 2,500 genomes revealed that MUC7 subexonic CNVs fell into eight divergent haplotype clusters. Two haplogroups had the 5 PTS-repeat allele and the others harbored the 6 PTS-repeat allele. The gene in a group of genomes from sub-Saharan Africa was very different from the versions that were found in other humans.

Three independent studies have previously associated CNV in PTS repeats of MUC7 with asthma susceptibility.1–3 However, one shortcoming of these studies was that individual single nucleotide polymorphisms do not accurately predict the copy number state, leaving genome-wide association studies unable to accurately predict repeat numbers. To address this challenge, Dr. Gokcumen and colleagues identified several single nucleotide polymorphisms that tag independently evolved PTS-repeat alleles and found combinations that predicted the copy number status. This analysis revealed that the number of subexonic repeats in mucin-7 was not associated with asthma, but it uncovered a correlation between haplotypic variation in this gene and oral microbiome composition.
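A toy sketch shows why a combination of tag SNPs can predict repeat number when no single SNP can. Everything in it is hypothetical: the two-SNP haplotypes and their alleles are placeholders, not the variants identified by Dr. Gokcumen's group. The table is built so that the 5-repeat allele sits on two unrelated backgrounds, echoing the finding that it arose independently on at least two branches.

```python
from typing import Optional, Tuple

# Hypothetical lookup: (allele at tag SNP 1, allele at tag SNP 2) -> PTS repeats.
# Neither SNP alone separates 5 from 6; only the pair does.
TAG_HAPLOTYPE_TO_REPEATS = {
    ("A", "G"): 5,
    ("A", "T"): 6,
    ("C", "G"): 6,
    ("C", "T"): 5,
}


def predict_repeat_allele(haplotype: Tuple[str, str]) -> Optional[int]:
    """Return the predicted repeat count for one phased haplotype, or None."""
    return TAG_HAPLOTYPE_TO_REPEATS.get(haplotype)


def predict_genotype(hap1, hap2):
    """Predict the sorted pair of repeat counts for a diploid individual."""
    calls = [predict_repeat_allele(hap1), predict_repeat_allele(hap2)]
    if None in calls:
        return None  # an unseen haplotype means no prediction is made
    return tuple(sorted(calls))


print(predict_genotype(("A", "G"), ("A", "T")))  # (5, 6)
```

The sketch assumes phased haplotypes as input. With unphased genotypes, some combinations would be ambiguous and would need statistical phasing first.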

“Where we are heading is not necessarily individual copy number variants any longer, but more a haplotype-level understanding of genetic contribution,” says Dr. Gokcumen. Individual variants may change the function of multiple genes in a manner that is shaped by other variants somewhere else in the genome. “On top of that and even further up, we have a large contribution from the transcriptome and microbiome, and now the question becomes how a copy number variant and the associated single nucleotide polymorphisms are behaving given a particular diet, under a particular microbiome composition, or under a specific immune pathogenic pressure, rather than just making a simple association,” says Dr. Gokcumen.

CNVs in Genitourinary Malignancies

“We looked at several genitourinary malignancies in The Cancer Genome Atlas,” says Mark W. Ball, M.D., clinical fellow in urologic oncology at the National Cancer Institute. In a recent analysis of datasets for several genitourinary malignancies from The Cancer Genome Atlas, conducted at the Johns Hopkins University School of Medicine, Dr. Ball and colleagues examined the correlation of somatic mutations and CNVs with pathological features and survival outcomes. “We did not find this for every malignancy, but the overall trend was that a higher rate of copy number variation was associated with advanced pathological features and worse survival outcomes,” says Dr. Ball.

One of the limitations in understanding the involvement of CNVs in genitourinary malignancies is the scarcity of data. “Previously, we did not really have a lot of data, and currently The Cancer Genome Atlas is all that we have,” says Dr. Ball. Even then, much less data are available as compared to other malignancies, such as colorectal cancers, for which copy number changes have been studied to a much greater extent. “There is a push and an enthusiasm on multiple fronts towards an era of personalized medicine, and if we perform more sequencing and copy number variation assays we will gain a lot of understanding of how these endpoints, which are currently mostly theoretical, translate into picking a treatment strategy for a patient,” says Dr. Ball.

Cytochromes and Medication Metabolism

“We have been using the Pacific Biosciences long-read third-generation sequencing platform to interrogate genomic regions that would typically not be accessible using short-read technologies,” says Stuart Scott, Ph.D., associate professor of genetics and genomic sciences at the Icahn School of Medicine at Mount Sinai. A key focus in Dr. Scott’s lab is CYP2D6, a highly polymorphic gene that encodes a cytochrome P450 involved in the metabolism of approximately a quarter of all common medications. Predicting CYP2D6 metabolizer status requires knowledge of both the copy number and the sequence of each allele, because duplication of a functional allele will lead to an increased amount of active enzyme, whereas duplication of a nonfunctional allele will still not yield any functional enzyme.
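The reasoning about duplications can be made concrete with a minimal sketch. The function labels and the per-copy values of 1.0 and 0.0 are simplifying assumptions for illustration; they are not a clinical scoring scheme and are not taken from Dr. Scott's assay.

```python
# Assumed per-copy contribution to enzyme activity (illustrative values only).
FUNCTION_VALUE = {"functional": 1.0, "nonfunctional": 0.0}


def total_activity(alleles):
    """Sum activity over both alleles.

    alleles: list of (function_label, copy_number) pairs, one per allele.
    """
    return sum(FUNCTION_VALUE[label] * copies for label, copies in alleles)


# Duplicated functional allele plus one functional allele: activity rises.
print(total_activity([("functional", 2), ("functional", 1)]))     # 3.0
# Duplicated nonfunctional allele plus one functional allele: no gain.
print(total_activity([("nonfunctional", 2), ("functional", 1)]))  # 1.0
```

Both individuals carry three gene copies, yet the predicted activity differs. This is why copy number alone, without knowing which allele was duplicated, cannot determine metabolizer status.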

Using short-read sequencing to interrogate the CYP2D6 gene is challenging due to the high sequence homology between CYP2D6 and its neighboring pseudogenes. Consequently, Dr. Scott and colleagues developed a single-molecule real-time (SMRT) sequencing assay, which allowed for full gene characterization and the discovery of duplicated and novel alleles. “For samples that had more than two copies, we would also amplify and barcode the duplicated region and sequence it,” says Dr. Scott. The size of the gene, approximately 5 kb, allowed sequencing of the entire gene in one long multiplexed read. “On the analytical side, which is the genetic testing side, we need to be able to generate high-resolution sequencing data for these difficult-to-interrogate regions and genes, and have high confidence in the data that we derive from sequencing,” says Dr. Scott. On the clinical utility side, an important consideration is to identify whether any therapeutic recommendations or changes can be made for an individual, based on the results. “It is critical to determine whether these changes would lead to any therapeutic benefits, and that is currently very highly debated,” says Dr. Scott.

1. H.J. Kirkbride et al., “Genetic Polymorphism of MUC7: Allele Frequencies and Association with Asthma,” Eur. J. Hum. Genet. 9(5), 347–354 (May 2001).
2. A.M. Watson et al., “MUC7 Polymorphisms are Associated with a Decreased Risk of Being Diagnosed with Asthma in an African-American Population,” J. Invest. Med. 57(8), 882–886 (2009), doi:10.231/JIM.0b013e3181c0466d.
3. K. Rousseau et al., “MUC7 Haplotype Analysis: Results from a Longitudinal Birth Cohort Support Protective Effect of the MUC7*5 Allele on Respiratory Function,” Ann. Hum. Genet. 70(Pt 4), 417–427 (July 2006).
