Sponsored content brought to you by Illumina

As sequencing costs decrease, the volume of whole genome sequencing (WGS) and whole exome sequencing (WES) continues to rise. But sequencing is just the first step. Getting the best results requires analyzing the sequencing data with accelerated compute, data science, and AI to read and understand the genome, from base calls to variant interpretation. The challenge is substantial.

Human genomes are complex. The current understanding according to the National Human Genome Research Institute is that compared to a reference human genome, on average, an individual’s ~3B-nucleotide genome sequence will have ~4M SNVs, ~600K insertion/deletion variants, and ~25K structural variants that involve greater than 20M nucleotides.1 As of now, the clinical impact of most of these variants is unknown. Can genomic AI help us to identify the handful of clinically significant genetic variants from this vast ocean of data?

Genomic AI

AI methods excel when large amounts of structured data can be paired with validated outcomes for training. Recent population-level sequencing efforts, as well as validation data sets like NIST Genome in a Bottle, have spurred a new category of AI: genomic AI. Genomic AI has the potential to dramatically reduce the time it takes to analyze, decipher, and interpret sequencing data, but only if the data is carefully assembled across the full breadth of the challenge, from alignment to interpretation.

DNA sequencing has substantial promise to guide healthcare and treatment if the needed tools become more accurate, easier to use, and more cost-effective. Illumina believes that genomic AI is an emerging tool that complements traditional analysis methods and known biology, and that it can further improve accuracy and help provide a fully featured genome, including annotation and interpretation. To achieve this, the company is using its access to large data sets and world-class AI talent to integrate genomic AI into Illumina’s software products.

Three examples illustrate the utility of this technology: variant calling, annotation and prioritization, and interpretation.

Improving Variant Calling Accuracy using AI

The upstream DRAGEN™ secondary analysis pipeline improves variant calling accuracy over a larger portion of the human genome, while ensuring that these improvements generalize to a wide and diverse population of samples. Hardware-accelerated DRAGEN analysis won the 2020 PrecisionFDA germline accuracy competition in the Difficult-to-Map Regions and All Benchmark Regions categories.2

Building on that success, Illumina added powerful and efficient machine learning (ML) algorithms that drive significant performance improvements.

“DRAGEN-ML integrates closely with the existing Bayesian Variant Calling pipeline, driving germline accuracy to new heights and addressing challenges in the most difficult genomic regions. Sophisticated and efficient machine learning enables improvement in sensitivity and genotyping accuracy, recovering low-confidence false negative calls and filtering over 50% of false positive calls. Access to deep internal data and numerous collaborations have allowed us to model how Illumina sequencing reads map to a genomic reference,” says Rami Mehio, Head of Software and Informatics, Illumina. “Machine learning has been critical to how our engineers and their algorithms continually improve mapping sensitivity in DRAGEN.”

The latest DRAGEN release, DRAGEN v4.2, adds enhanced machine learning trained on a vast amount of data. It detects variants with an analytical accuracy of 99.84%, reducing both false positive and false negative rates.* This extends Illumina’s lead in providing the most accurate secondary analysis in all benchmark regions compared to other solutions on PrecisionFDA Truth Challenge V2 benchmark data.3
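The article does not define how "analytical accuracy" is calculated. As a rough illustration only, variant-calling benchmarks are commonly summarized with precision, recall, and F-measure computed from true positive, false positive, and false negative counts against a truth set. The sketch below assumes that framing; the function name and all counts are hypothetical and are not DRAGEN results.

```python
def calling_metrics(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, f1) for a set of variant calls.

    tp: calls that match the truth set
    fp: calls that are absent from the truth set
    fn: truth-set variants that were not called
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical counts, chosen only to show the effect of filtering.
    before = calling_metrics(tp=9960, fp=40, fn=40)
    # Filtering half of the false positives raises precision; recall is unchanged.
    after = calling_metrics(tp=9960, fp=20, fn=40)
    print("before:", before)
    print("after: ", after)
```

In this framing, filtering false positives improves precision and recovering false negatives improves recall, which is why a caller that does both moves the combined F-measure upward.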

To deliver a comprehensive platform for genomic analysis, the team continues to invest in machine learning algorithms for RNA analysis, somatic pipelines, methylation analysis, and large variant calling, for release in future versions of the DRAGEN platform.

Predicting Variant Pathogenicity

Out of the tens of millions of protein-coding variants in the human genome, only 0.1% are presently annotated in clinical variant databases, while the vast majority remain variants of unknown significance (VUS).

To address this challenge, Illumina scientists have developed PrimateAI-3D, a three-dimensional convolutional neural network for variant effect prediction, trained using primate variants and 3-D protein structure. PrimateAI-3D leverages the premise that common variants from non-human primates are unlikely to cause human disease, and has been validated to identify disease-causing variants with superior accuracy across six clinical benchmarks based on real-world patient cohorts.

Published in Science, the PrimateAI-3D project helped drive a massive international collaborative effort to sequence 809 individuals from 233 primate species and create a catalog of common missense variants. Importantly, the species selected for sequencing represent close to half of Earth’s 521 extant primate species and cover all major primate families.4 These WGS data were used to train PrimateAI-3D with millions of primate variants.

In a related Science publication, PrimateAI-3D was used to estimate the pathogenicity of rare coding variants in over 450K UK Biobank individuals in order to improve rare-variant association tests and genetic risk prediction for common diseases and complex traits. Stratification of the missense variants using PrimateAI-3D enabled discovery of 73% more significant gene-phenotype associations in rare variant burden tests, outperforming other existing variant interpretation algorithms.5
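To make the idea of stratifying a burden test by predicted pathogenicity concrete, here is a deliberately simplified sketch. It is not the method used in the publication: the function names, the score threshold, and the toy data are all hypothetical, and a real analysis would use a proper statistical test with covariates rather than a difference of means.

```python
from statistics import mean
from typing import Optional


def gene_burden(genotypes: dict[str, list[str]],
                scores: dict[str, float],
                threshold: float) -> dict[str, int]:
    """Count, per individual, the rare variants in one gene whose
    (hypothetical) pathogenicity score is at or above the threshold."""
    return {
        person: sum(1 for v in variants if scores.get(v, 0.0) >= threshold)
        for person, variants in genotypes.items()
    }


def carrier_effect(burden: dict[str, int],
                   phenotype: dict[str, float]) -> Optional[float]:
    """Mean phenotype of carriers minus mean phenotype of non-carriers.
    Returns None if either group is empty."""
    carriers = [phenotype[p] for p, b in burden.items() if b > 0]
    others = [phenotype[p] for p, b in burden.items() if b == 0]
    if not carriers or not others:
        return None
    return mean(carriers) - mean(others)


if __name__ == "__main__":
    # Toy data: v1 is scored as damaging, v2 as tolerated.
    genotypes = {"a": ["v1"], "b": ["v2"], "c": [], "d": ["v1", "v2"]}
    scores = {"v1": 0.9, "v2": 0.2}
    phenotype = {"a": 5.0, "b": 3.0, "c": 3.0, "d": 5.0}

    unfiltered = carrier_effect(gene_burden(genotypes, scores, 0.0), phenotype)
    stratified = carrier_effect(gene_burden(genotypes, scores, 0.8), phenotype)
    print("all rare variants:  ", unfiltered)
    print("high-score variants:", stratified)
```

In the toy data, counting only high-scoring variants removes a carrier of a tolerated variant from the carrier group, so the difference between groups becomes larger. That is the intuition behind using a pathogenicity score to sharpen rare-variant association signals.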

PrimateAI-3D also enables rare-variant polygenic risk scores (PRS), which are substantially more portable to cohorts and ancestry groups not used during model training.5 This outcome is highly relevant because existing PRS algorithms are most often trained on data from individuals of European descent and generalize poorly to individuals from other populations.

The PrimateAI-3D deep learning scores and the primate population variant database, which enables classification of 4.3M missense variants as likely benign, are publicly available to the genomics community for research use, in addition to being made available through Illumina software products.
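The premise that common primate variants are unlikely to cause human disease can be pictured as a simple catalog lookup. The sketch below is an illustration of that premise only, not the PrimateAI-3D network and not the format of the published database: the variant keys, the frequency cutoff, and the label strings are all hypothetical.

```python
from typing import Mapping


def classify_with_primate_catalog(variant: str,
                                  catalog: Mapping[str, float],
                                  min_allele_freq: float = 0.001) -> str:
    """Label a missense variant using a catalog of primate allele frequencies.

    catalog maps a variant key to the allele frequency observed in a
    non-human primate population (hypothetical structure).
    A variant that is common in primates is labeled likely benign;
    absence from the catalog says nothing, so the label stays uncertain.
    """
    freq = catalog.get(variant)
    if freq is not None and freq >= min_allele_freq:
        return "likely_benign"
    return "uncertain"


if __name__ == "__main__":
    # Hypothetical catalog entries keyed by "GENE:protein_change".
    catalog = {"GENE1:p.Ala10Val": 0.12, "GENE2:p.Arg55Gln": 0.0002}
    for key in ("GENE1:p.Ala10Val", "GENE2:p.Arg55Gln", "GENE3:p.Gly7Ser"):
        print(key, "->", classify_with_primate_catalog(key, catalog))
```

Note the asymmetry in the design: presence at common frequency supports a benign label, but a rare or missing variant is left as uncertain rather than called pathogenic. Scoring those remaining variants is the job of a trained model.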

Complementary to PrimateAI-3D’s role for protein-coding variants, Illumina scientists earlier released SpliceAI, a deep learning model for identifying pathogenic variants in the non-coding genome. Currently, clinical exome sequencing for rare disease patients is only able to detect a pathogenic variant in around one third of cases by examining the 1% of the genome that is protein-coding. Improving identification of disease-causing variants in the non-coding genome extends clinical sequencing beyond the exome to the whole genome, marking an important step towards helping patients and their families.6

Accelerating Variant Interpretation with AI

Explainable AI (XAI), created by and integrated in Emedgene™ tertiary analysis software, prioritizes variants that are most likely to solve a case.

Emedgene’s XAI allows users to fully map the logic behind, and comprehend the results produced by, its genomic AI algorithms, while keeping the geneticist in full control. By definition, XAI must be accurate, secure, transparent, and efficient.

Emedgene supports hereditary disease data interpretation across genomes, exomes, targeted panels, and virtual panels. It combines XAI with a full suite of automation capabilities so that users can streamline their end-to-end germline analysis workflows and minimize touchpoints. As a variant interpretation research platform for rare genetic disease, hereditary cancer, other genetic diseases, and large-scale screening projects, it significantly reduces time per case.

The use of genomic XAI in Emedgene mimics the work performed by a scientist and provides a full causal explanation of the most relevant variants with accompanying linked and curated evidence. Significant time savings of 50-75% are achieved per case. “Emedgene’s Explainable AI (XAI) simplifies the highly complex task of variant prioritization, allowing us to handle more tests every day,” relates Ray Louie, PhD, Associate Director, Greenwood Genetic Center.

In addition, a study performed by Baylor Genetics showed that, in a 180-sample cohort, Emedgene accurately pinpointed the manually reported variants as candidates to resolve the case. The reported variants were ranked among the top 10 candidate variants in 98.4% of trio cases, 93.0% of single-proband cases, and 96.7% of all cases. Where the model was less accurate, the cause was incomplete variant calling or an incomplete phenotypic description.7 The study demonstrated that Emedgene can help genetic laboratories prioritize candidate variants effectively, thereby helping to streamline lab operations.
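A "ranked in the top 10" figure of this kind is a top-k rate: the share of cases in which the manually reported variant appears at rank k or better in the prioritized list. A minimal sketch of that calculation follows; the function name and the example ranks are hypothetical and are not data from the study.

```python
from typing import Optional, Sequence


def top_k_rate(ranks: Sequence[Optional[int]], k: int = 10) -> float:
    """Fraction of cases whose reported variant is ranked within the top k.

    ranks holds one entry per case: the 1-based rank of the reported
    variant, or None if the variant was not in the candidate list at all
    (for example because it was never called).
    """
    if not ranks:
        raise ValueError("at least one case is required")
    hits = sum(1 for r in ranks if r is not None and r <= k)
    return hits / len(ranks)


if __name__ == "__main__":
    # Hypothetical ranks for six cases; one variant was missed entirely.
    example_ranks = [1, 3, 12, 2, None, 7]
    print(f"top-10 rate: {top_k_rate(example_ranks):.1%}")
```

Treating a missing variant as a miss rather than dropping the case keeps the denominator honest, which matters when incomplete variant calling is one of the reasons a case is not solved.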

Decades of internal development and multiple population-level collaborations give Illumina access to massive amounts of data for training new genomic AI algorithms. This data, in combination with Illumina’s world-class products and talent, can help speed genomic AI on its path towards providing a better genome.

 

References

  1. National Human Genome Research Institute, www.genome.gov/about-genomics/educational-resources/fact-sheets/human-genomic-variation
  2. www.illumina.com/science/genomics-research/articles/dragen-wins-precisionfda-challenge-accuracy-gains.html
  3. Catreux S, et al. DRAGEN sets new standard for data accuracy in PrecisionFDA benchmark data. Optimizing variant calling performance with Illumina machine learning and DRAGEN graph. Published January 12, 2022.
  4. Gao H, et al. The landscape of tolerated genetic variation in humans and primates. Science 2 Jun 2023 Vol 380, Issue 6648 doi: 10.1126/science.abn8197
  5. Fiziev PP, et al. Rare penetrant mutations confer severe risk of common diseases. Science. 2023 Jun 2;380(6648):eabo1131. doi: 10.1126/science.abo1131
  6. Jaganathan K, et al. Predicting splicing from primary sequence with deep learning. Cell 2019 Jan 24;176(3):535-548.e24. doi: 10.1016/j.cell.2018.12.015
  7. Meng L, et al. Evaluation of an automated genome interpretation model for rare disease routinely used in a clinical genetic laboratory. Genet Med. 2023 Jun; 25(6): 100830. doi: 10.1016/j.gim.2023.100830

 

* Secondary analysis run times on HG002 Illumina sequencing data from PrecisionFDA Truth Challenge V2 with 34.46X coverage. DRAGEN was run on a DRAGEN v4 server with a U200 FPGA card and Machine Learning enabled. BWA GATK 4.1.4.0 was run on a local 2x Intel Xeon Gold 6126 (48 threads) with 394 GB RAM and 2TB NVME SSD using BCBIO for parallelization.

 


For Research Use Only.

Not for use in diagnostic procedures.

Learn more illumina.com

 
