Investigators at the Johns Hopkins Kimmel Cancer Center have developed a machine learning strategy that has shown the potential to predict cases of early-stage lung or liver cancers in humans, by detecting repetitive genetic sequences in the genome in cancerous tissue, as well as in cell-free DNA (cfDNA). The team suggests that the new method could provide a noninvasive means of detecting and characterizing cancers, or monitoring response to anticancer therapy.

In laboratory tests, the method, called ARTEMIS (Analysis of RepeaT EleMents in diSease), examined over 1,200 types of repeat elements comprising nearly half of the human genome, and found that a large number of repeats not previously known to be associated with cancer were altered in tumor formation. The investigators also were able to identify changes in these elements in cfDNA (fragments shed from tumors that are present in the bloodstream), providing a way to detect cancer and determine where in the body it originated.

“When you think about existing cancer genes and the DNA sequences around them, they’re just chock full of these repeats,” said study co-lead Victor E. Velculescu, MD, PhD, a professor of oncology and co-director of the Cancer Genetics and Epigenetics Program at the Johns Hopkins Kimmel Cancer Center. “Until ARTEMIS, this dark matter of the genome was essentially ignored, but now we’re seeing that these repeats are not occurring randomly,” Velculescu said. “They end up being clustered around genes that are altered in cancer in a variety of different ways, providing the first glimpse that these sequences may be key to tumor development.”

Velculescu, together with colleagues, including co-lead Akshaya Annapragada, an MD/PhD student at the Johns Hopkins University School of Medicine, and Robert Scharpf, PhD, an associate professor of oncology at Johns Hopkins, reported on the development and testing of ARTEMIS, in a paper in Science Translational Medicine titled “Genome-wide repeat landscapes in cancer and cell-free DNA.” In their report they concluded that their analyses “… reveal widespread changes in repeat landscapes of human cancers and provide an approach for their detection and characterization that could benefit early detection and disease monitoring of patients.”

Repeats of DNA sequences, often referred to as “junk DNA” or “dark matter,” are found throughout the human genome, and are “a hallmark of cancer and other diseases,” the authors wrote. “Genomic repeats comprise more than half the human genome and include a diverse set of elements that vary widely between individuals and exert key influences on genome structure and function.” However, they continued, characterizing these repetitive sequences has been challenging using standard sequencing approaches.

“Because of technical limitations of short-read alignment and a reliance on incomplete genome assemblies, repeats have historically been neglected,” they wrote. The development of liquid biopsies for the detection and genome-wide characterization of human cancers has allowed scientists to start analyzing repeated sequences in cell-free DNA (cfDNA). Yet, the authors noted, “… no systematic analysis of the compendium of repeat sequences has been performed in tissue or cfDNA of any human cancer, largely due to the inability to identify and quantify repeat sequences in a genome-wide fashion.”

To address these challenges, the team developed ARTEMIS, which they described as an alignment-free, genome-wide approach to analyzing repeat landscapes in short-read sequencing data. In a series of laboratory tests, the researchers first examined the distribution of 1.2 billion kmers (short sequences of DNA) defining unique repeats, finding them enriched in genes commonly altered in human cancers.
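To make the idea of an alignment-free, kmer-based tally concrete, here is a minimal sketch in Python. It is not the ARTEMIS implementation, which the article does not describe in detail. It only assumes that each repeat-defining kmer maps to one repeat element; the lookup table `kmer_to_element`, the element names, the kmer length of 5, and the toy reads are all hypothetical.

```python
from collections import Counter


def count_repeat_kmers(reads, kmer_to_element, k):
    """Tally, per repeat element, how many kmers in the reads match a
    known repeat-defining kmer. No alignment to a reference is done."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        # Slide a window of length k across the read.
        for i in range(len(read) - k + 1):
            element = kmer_to_element.get(read[i:i + k])
            if element is not None:
                counts[element] += 1
    return counts


# Hypothetical lookup table: kmer -> repeat element it uniquely defines.
kmer_to_element = {"ACGTA": "element_A", "TTAGG": "element_B"}

# Hypothetical short reads standing in for sequencing data.
reads = ["GGACGTACC", "TTAGGTTAGG", "CCCCCCCCC"]

counts = count_repeat_kmers(reads, kmer_to_element, k=5)
print(dict(counts))
```

In a real analysis the per-element counts would presumably need to be normalized, for example by sequencing depth, before samples are compared; that step is omitted here.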

For example, they reported, of 736 genes known to drive cancers, 487 contained, on average, a 15-fold higher than expected number of repeat sequences. These repeat sequences also were significantly increased in genes involved in cell signaling pathways that are commonly dysregulated in cancers. “… these observations of repeat kmer localization suggest that alterations in key genes affecting oncogenic pathways in human cancer may be selected for during tumorigenesis using repeat-related genomic changes,” the team noted.

An overview of the ARTEMIS method, which revealed 1.2 billion unique kmers spanning 1,280 distinct repeat elements in samples from patients with cancer. [Annapragada et al., Sci. Transl. Med. 16, adj9283 (2024)]
Using next-generation sequencing technology that allows researchers to rapidly examine the sequences of entire genomes, the researchers also looked to see if repeat sequences were directly altered in cancers.

They used ARTEMIS to analyze over 1,200 distinct types of repeat elements in tumor and normal tissues from 525 patients with different cancers participating in the Pan-Cancer Analysis of Whole Genomes (PCAWG). The analysis found a median of 807 altered elements in each tumor. Nearly two-thirds of these elements had not previously been observed as being altered in human cancers. “A median of 807 repeat elements (range, 246 to 1280) had increased or decreased kmer counts in tumors compared to their matched normal tissues,” the team reported. “Nearly two-thirds of altered elements (820 of 1280) had not been previously observed as being altered in human cancer.”

Then, they used a machine-learning model to generate an ARTEMIS score for each sample, summarizing the genome-wide repeat element changes that were predictive of cancer. ARTEMIS scores distinguished the 525 PCAWG participants’ tumors from normal tissues with high performance across all cancer types analyzed, with an overall area under the curve (AUC) of 0.96, where 1 is a perfect score. Increased ARTEMIS scores were associated with shorter overall and progression-free survival regardless of tumor type.
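For readers unfamiliar with the metric, the AUC can be read as the probability that a randomly chosen tumor sample receives a higher score than a randomly chosen normal sample. The short Python sketch below computes it that way. The scores are made-up illustrative numbers, not data from the study, and the function name `auc` is hypothetical.

```python
def auc(scores_pos, scores_neg):
    """Fraction of (positive, negative) pairs in which the positive
    sample scores higher; ties count as half a win."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Hypothetical scores for illustration only.
tumor_scores = [0.9, 0.8, 0.4]
normal_scores = [0.1, 0.3, 0.5]

example_auc = auc(tumor_scores, normal_scores)
print(round(example_auc, 3))
```

A value of 1 means every tumor sample outscored every normal sample, while 0.5 is what random scoring would be expected to produce.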

“Despite germline variability of repeat elements among different individuals, cross-validated ARTEMIS scores distinguished 525 PCAWG tumors from normal tissue with high performance across all cancer types analyzed, regardless of the race of patients [overall area under the curve (AUC) = 0.96],” they stated. “Given that the ARTEMIS score captures genome-wide changes to repeat landscapes, our observations are consistent with previous analyses indicating that reactivation and increase in repeat elements in cancer genomes may lead to increased immune responses or genomic instability, both mechanisms that could reduce tumor cell fitness and lead to improved patient outcomes.”

The investigators next evaluated ARTEMIS’ potential for noninvasive detection of cancer. They applied the tool to blood samples from 287 individuals with and without lung cancer participating in the Danish Lung Cancer Screening Study (LUCAS). ARTEMIS classified patients with lung cancer with an overall AUC of 0.82. When used with another method called DELFI (DNA evaluation of fragments for early interception), the combination model classified patients with lung cancer with an AUC of 0.91. DELFI is an assay previously developed by Velculescu, Scharpf, and other members of their group that detects changes in the size and distribution of cfDNA fragments across the genome.

Similar performance was observed in a group of 208 individuals at risk for liver cancer, in which ARTEMIS detected individuals with liver cancer among others with cirrhosis or viral hepatitis, with an AUC of 0.87. When combined with DELFI, the AUC increased to 0.90.

Finally, the team evaluated whether the ARTEMIS blood test could identify where in the body a tumor originated in patients with cancer. When trained with information from the PCAWG participants, the tool could classify the source of tumor tissues with an average 78% accuracy among 12 tumor types.

The investigators then combined ARTEMIS and DELFI to assess blood samples from a group of 226 individuals with breast, ovarian, lung, colorectal, bile duct, gastric or pancreatic tumors. Here, the model correctly classified patients among the different cancer types with an average accuracy of 68%, which improved to 83% when the model was allowed to suggest two possible tumor types instead of a single cancer type. “Despite the small number of samples available for training, we found that ARTEMIS-DELFI correctly categorized detected patients among the different cancer types with an average of 68 or 83% accuracy, for the highest or top two predictions, respectively,” they stated.
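The difference between the "highest" and "top two" accuracy figures can be illustrated with a small Python sketch. It assumes a classifier that outputs a probability for each candidate tumor type. The function name `top_k_accuracy`, the three tumor types, and all probabilities below are hypothetical and are not taken from the study.

```python
def top_k_accuracy(predicted_probs, true_labels, k):
    """Share of samples whose true label is among the k highest-ranked
    predictions for that sample."""
    hits = 0
    for probs, truth in zip(predicted_probs, true_labels):
        # Rank candidate labels from most to least probable.
        ranked = sorted(probs, key=probs.get, reverse=True)
        if truth in ranked[:k]:
            hits += 1
    return hits / len(true_labels)


# Hypothetical per-sample class probabilities (illustration only).
predicted_probs = [
    {"lung": 0.6, "breast": 0.3, "ovarian": 0.1},
    {"lung": 0.5, "breast": 0.4, "ovarian": 0.1},
    {"lung": 0.2, "breast": 0.3, "ovarian": 0.5},
    {"lung": 0.45, "breast": 0.15, "ovarian": 0.4},
]
true_labels = ["lung", "breast", "ovarian", "breast"]

top1 = top_k_accuracy(predicted_probs, true_labels, 1)
top2 = top_k_accuracy(predicted_probs, true_labels, 2)
print(top1, top2)
```

Allowing two guesses can only keep accuracy the same or raise it, which is why the top-two figure reported by the authors is the higher of the pair.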

“Our study shows that ARTEMIS can reveal genome-wide repeat landscapes that reflect dramatic underlying changes in human cancers,” Annapragada said. “By illuminating the so-called ‘dark genome,’ the work offers unique insights into the cancer genome and provides a proof-of-concept for the utility of genome-wide repeat landscapes as tissue and blood-based biomarkers for cancer detection, characterization and monitoring.”

The authors further wrote, “Repeat landscape analyses for cfDNA-based detection of lung, liver, and other cancers suggest that ARTEMIS alone or in combination with other genome-wide features may provide an avenue for noninvasive detection, monitoring, and tissue of origin determination of cancer… ARTEMIS may improve early-stage diagnosis by identifying genome-wide changes that would perhaps not be evident in other liquid biopsy approaches when tumor features such as mutations or chromosomal arm changes are not detected.”

Next steps, suggests Velculescu, whose competing interests, among those of other authors, are outlined in the paper, will be to evaluate the approach in larger clinical trials. “You can imagine this could be used for early detection for a variety of cancer types, but also could have uses in other applications such as monitoring response to treatment or detecting recurrence,” Velculescu commented. “This is a totally new frontier.”

Acknowledging limitations of their study, the authors concluded in their report, “Given the size, diversity, and potential clinical relevance of these regions of the genome, our study offers unique insights into the cancer genome and provides a proof of concept for the utility of genome-wide [sequence] repeat landscapes as tissue and blood-based biomarkers… In addition, the expansion or contraction of repeat elements that can now be comprehensively identified provides a new way to detect and examine mechanisms affecting cancer and other diseases.”
