Michael D. Lesh, MD, FACC, Syntegra’s co-founder and CEO

The NIH is partnering with a San Francisco startup to address a key challenge posed by the agency collecting the largest set of COVID-19 patient records: How can access to that repository be broadened for researchers without compromising the privacy of patients who contributed all that data?

Syntegra plans to tackle that challenge by applying its synthetic data engine to the NIH’s National COVID Cohort Collaborative (N3C). The company uses machine learning to create validated “synthetic data”—replicas of healthcare data that are designed to precisely duplicate its statistical properties, with patient privacy protected by removing all links to the original. Syntegra markets its algorithm to a customer base that includes large health systems, life science companies, insurance providers, data scientists, and clinical research organizations.

Syntegra said this week that it will generate and validate a non-identifiable synthetic version of the entire N3C dataset: All 2.6 billion rows of data collected from more than 2.7 million screened individuals, including more than 413,000 COVID-19 positive patients.

Announced by the NIH in June 2020, N3C is designed to be a centralized, secure enclave to store and study medical data from people diagnosed nationwide with COVID-19. N3C is a partnership of the NIH’s Clinical and Translational Science Awards (CTSA) Program hubs and the National Center for Data to Health (CD2H), both supported by the agency’s National Center for Advancing Translational Sciences (NCATS), which oversees the Collaborative.

“NCATS wanted to figure out a way to aggregate all of this COVID data that was coming in, in a way that could both harmonize the data among many healthcare systems, but also make it available very quickly to researchers,” Syntegra’s co-founder and CEO, Michael D. Lesh, MD, told GEN.

The synthetic dataset that Syntegra will generate for the NIH will enable the agency to widen access to N3C data, the agency’s largest available repository of patient-level COVID-19 electronic medical records, as well as lay the foundation for greater access to data for life sciences researchers studying other diseases or drug development.

“Democratizing Healthcare Data”

Syntegra is focused on democratizing healthcare data, says Carter Prince, Syntegra’s Head of Business Development.

“We’re learning from an entire dataset and all the billions of relationships between every data point within an underlying dataset,” Prince said. “And that allows us, using our own advanced machine learning techniques, to generate down a new synthetic dataset that has all of the statistical properties of the original dataset.”

More than 70 healthcare organizations worldwide have contributed data to N3C, a public-private collaboration supported by the Bill and Melinda Gates Foundation through the COVID-19 Therapeutic Accelerator. The Foundation in March 2020 joined Wellcome and Mastercard in committing $125 million to launch the Accelerator.

The Gates Foundation has awarded Syntegra contracts of approximately $150,000 and $175,000 toward the development of COVID-19 synthetic data for sharing. The Foundation first connected with Syntegra before the pandemic about creating synthetic versions of clinical trials for its HIV and Maternal, Newborn & Child Health programs, to allow sharing of data without breaching global privacy laws.

“When COVID hit, both the Gates Foundation and us pivoted towards trying to develop the synthetic COVID data,” Lesh recalled.

Syntegra was spun out last year from the University of California, San Francisco (UCSF), where Lesh is a professor of medicine and formerly served as Executive Director of Health Technology Innovation in the Office of the Vice Chancellor.

Lesh, a cardiac electrophysiologist, began as an entrepreneur more than two decades ago by co-founding Atrionix, a company created in 1997 to commercialize a catheter he invented to treat atrial fibrillation. Atrionix was acquired in 2000 by Johnson & Johnson’s Cordis unit—now part of Cardinal Health—for $63 million.

After years establishing and leading several medical device companies, in 2017 Lesh returned to UCSF to jumpstart its tech commercialization effort, with a focus on data.

“Universities know how to patent and license molecules or devices,” Lesh said, “but I thought, There’s all this data in our silos. Wouldn’t it be great if we could just take all that data, which is really where I would say the wisdom of patient care sits (all the interactions between the patients and the health system), and we could share that while upholding our ethical obligation to maintain privacy?”

Lesh thought about what could be learned and the new drugs that could be developed, but acknowledges this was “a bit of a naive concept” because “it turns out it’s really, really difficult to share data.”

Privacy Challenge

Privacy laws like the Health Insurance Portability and Accountability Act of 1996 (HIPAA) compelled healthcare stakeholders such as providers and insurers to invest in de-identification technologies that too often failed, leaving them reluctant to share data.

Lesh confronted the problem by pursuing the idea of applying synthetic data to healthcare. The concept had been used with success in settings ranging from manufacturing to autonomous vehicles tested by engineers using simulated “synthetic” miles.

Together with co-founder and Chief Technology Officer Ofer Mendelevitch, Lesh developed a method of creating synthetic data for healthcare. “At that point, we decided to start this company and spin it out from UCSF,” Lesh recalled.

Syntegra’s approach to synthetic data begins with real-world patient-level datasets gathered from “journeys” or sources such as clinical trials, electronic medical records, genomics tests, and claims. The company applies deep generative models trained on data “at rest,” meaning data housed in storage in any digital form, to detect billions of embedded statistical patterns.

Syntegra’s platform includes a pattern encoder that sits within the data provider’s security zone: a large deep neural network (NN) that effectively encodes the relations between all features in the original data. Only the NN weights are passed over the firewall, where a random-number generator samples from the very high-dimensional probability distribution to produce any number of synthetic equivalents.

Once the model is trained, synthetic data generators need only their model parameters to generate patient data that Syntegra describes as “realistic but not real.” The resulting synthetic data contains no identifiable information on real individuals, and thus does not fall under HIPAA or other privacy regulations, such as the European Union’s General Data Protection Regulation (GDPR) or California Consumer Privacy Act (CCPA).
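The split described above (fit a model inside the provider's zone, hand over only its parameters, then sample outside) can be illustrated with a deliberately tiny stand-in. This sketch is not Syntegra's method: it swaps the deep neural network for a chain of conditional frequency tables, and the column names, records, and function names are all hypothetical. A toy like this also carries no privacy guarantee, since frequency tables fitted to a small dataset can reproduce rare real records.

```python
import random
from collections import Counter, defaultdict


def fit_parameters(records, columns):
    """Runs inside the data provider's zone (hypothetical stand-in for the NN).

    Learns P(col_k | col_0..col_{k-1}) as frequency tables, one per column.
    """
    tables = []
    for k, column in enumerate(columns):
        counts = defaultdict(Counter)
        for rec in records:
            prefix = tuple(rec[c] for c in columns[:k])
            counts[prefix][rec[column]] += 1
        table = {}
        for prefix, ctr in counts.items():
            total = sum(ctr.values())
            table[prefix] = {value: n / total for value, n in ctr.items()}
        tables.append(table)
    # Only this parameter object leaves the zone, never the records.
    return {"columns": list(columns), "tables": tables}


def sample_synthetic(params, n, seed=0):
    """Runs outside the zone: draws n rows using the parameters alone."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        row = []
        for table in params["tables"]:
            dist = table[tuple(row)]  # distribution given values drawn so far
            values = list(dist)
            weights = [dist[v] for v in values]
            row.append(rng.choices(values, weights=weights, k=1)[0])
        rows.append(dict(zip(params["columns"], row)))
    return rows


# Hypothetical toy records, invented for illustration only.
toy_records = [
    {"sex": "F", "age_band": "40-59", "covid": "pos"},
    {"sex": "F", "age_band": "60+", "covid": "neg"},
    {"sex": "M", "age_band": "40-59", "covid": "neg"},
    {"sex": "M", "age_band": "60+", "covid": "pos"},
    {"sex": "M", "age_band": "60+", "covid": "neg"},
]
params = fit_parameters(toy_records, ["sex", "age_band", "covid"])
synthetic = sample_synthetic(params, 500, seed=1)
```

Because `sample_synthetic` takes only `params`, any number of synthetic rows can be drawn without touching the source records again, which mirrors the "any number of synthetic equivalents" point in the text.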

According to Syntegra, the synthetic data can be immediately utilized for statistical analysis, reporting, and building predictive models with full accuracy and no re-identification risk—all while maintaining individual-level statistical fidelity.

Medical Utility

In “Beyond Differential Privacy: Synthetic Micro-Data Generation with Deep Generative Neural Networks,” an article published last September as part of an open-access peer-reviewed book, Lesh and Mendelevitch offered an overview of synthetic data, noting that three types of medical data can be transformed into synthetic data: tabular data such as data from clinical trials and observational studies; medical imaging diagnostics using MRI, CT and other types of scanning; and electronic medical records.

In medical settings, Lesh and Mendelevitch wrote, synthetic data can be useful by providing an excellent alternative to data that is limited or restricted, enabling a much larger dataset that improves the accuracy of predictive models, and reducing bias in original data. For example, if original data was 60% male and 40% female, gender distribution could be controlled to generate a 50%-50% synthetic dataset.
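The rebalancing example can be sketched as conditional sampling: if a generator can produce records for a given gender, the share of each gender in the output becomes a setting rather than a property of the source data. The code below is a minimal illustration under that assumption. The function name, the `conditional` table, and its outcome rates are hypothetical and are not drawn from the article or from any real dataset.

```python
import random
from collections import Counter


def sample_balanced(conditional, target_mix, n, seed=0):
    """Draw about n rows whose group shares follow target_mix.

    conditional: {group: {outcome: probability}}, a hypothetical stand-in
                 for a generator conditioned on the group.
    target_mix:  {group: desired share of the output}.
    """
    rng = random.Random(seed)
    rows = []
    for group, share in target_mix.items():
        dist = conditional[group]
        values = list(dist)
        weights = [dist[v] for v in values]
        # Fix the row count per group so the target mix is met exactly
        # (after rounding) instead of only on average.
        for _ in range(round(n * share)):
            outcome = rng.choices(values, weights=weights, k=1)[0]
            rows.append({"sex": group, "outcome": outcome})
    rng.shuffle(rows)
    return rows


# Hypothetical conditional outcome rates, invented for illustration.
conditional = {
    "male": {"hospitalized": 0.2, "not_hospitalized": 0.8},
    "female": {"hospitalized": 0.1, "not_hospitalized": 0.9},
}
balanced = sample_balanced(conditional, {"male": 0.5, "female": 0.5}, 1000)
sex_counts = Counter(row["sex"] for row in balanced)
```

The outcome distribution within each gender is left unchanged; only the mix between genders is controlled.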

The co-authors also detailed two of the most common types of generative models used in synthetic data generation. One is variational auto-encoders, types of neural networks that use a form of probability modeling called variational inference to generate a representation of data that is latent or not directly seen, then impose a distribution over latent variables and the data itself.
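For readers who want the formula behind that description, the standard training objective for a variational auto-encoder is the evidence lower bound shown here. The notation is an assumption of this sketch rather than the co-authors' own: x is a data record, z the latent variable, q_phi the encoder, p_theta the decoder, and p(z) the distribution imposed on the latent variables.

```latex
\log p_\theta(x) \;\ge\;
\mathbb{E}_{q_\phi(z \mid x)}\!\left[ \log p_\theta(x \mid z) \right]
\;-\; \mathrm{KL}\!\left( q_\phi(z \mid x) \,\|\, p(z) \right)
```

The first term rewards accurate reconstruction of the data from its latent representation; the second keeps the encoder's output close to the imposed latent distribution, which is what allows new records to be generated by sampling z from p(z).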

The other type, generative adversarial networks (GANs), can produce data such as images, text, and music by incorporating competing models: A generator designed to mimic real data, and a discriminator designed to distinguish between real and synthetic data.
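The competition between the two models is usually written as the minimax objective from the original GAN formulation, shown here as a reference sketch. The symbols are assumptions of this sketch, not the co-authors' notation: G is the generator, D the discriminator, p_data the distribution of real records, and p_z the noise distribution fed to the generator.

```latex
\min_{G} \max_{D} \;
\mathbb{E}_{x \sim p_{\mathrm{data}}}\!\left[ \log D(x) \right]
+ \mathbb{E}_{z \sim p_{z}}\!\left[ \log\!\left( 1 - D(G(z)) \right) \right]
```

The discriminator is trained to raise this value by telling real from synthetic, while the generator is trained to lower it by producing samples the discriminator accepts as real.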

“As research in the space of generative models continues at a neck-break pace at companies like OpenAI, Google, Facebook, Microsoft and others, we expect to see tremendous progress in this field on the research side as well as in applications of synthetic data across many areas of industry,” Lesh and Mendelevitch concluded.

Syntegra has received $3.1 million in seed funding from investors including Hike Ventures, Impact Venture Capital, Innovation Global Capital, Village Global, Wisconn Valley Ventures, Sweat Equity Ventures, and other unnamed investors.

In-Kind Talent

Sweat Equity Ventures, established by LinkedIn co-founder Reid Hoffman, has also helped Syntegra by supplying approximately $800,000 of in-kind talent, allowing it to tap the talents of software engineers and other professionals without having to hire costly staff early on. Of Syntegra’s dozen staff, half are full-time, the rest hired through Sweat Equity.

In an October 28 presentation posted on YouTube by Impact Venture Capital, Lesh said Syntegra’s 2021 “roadmap” included completion of a Series A round in the latter half of 2021.

Other plans for this year include closing “software-as-a-service,” or SaaS, deals with multiple life sciences customers; securing a partnership with a large healthcare system to access its medical records in order to generate synthetic versions of the data; improving the technology to enable acquisition of larger, more complex datasets and containerized deployment of the software across multiple platforms; publishing validations of the company’s technology in academic publications; and continuing work with government agencies.

In addition to the NIH, Syntegra is working with the FDA to assess how synthetic data can be applied in and outside of COVID-19. That work is still exploratory and stems from the interest of FDA Principal Deputy Commissioner Amy Abernethy, MD, PhD, in the potential of synthetic data to help inform agency decisions on approving drugs for new indications or subpopulations.

Since the start of the pandemic, Abernethy and FDA statisticians have met with Syntegra and engaged with N3C. “She has been very interested in synthetic data since well before COVID,” Lesh said. “We’ve been meeting with her and the FDA to see whether synthetic data would be something that they could use in compliance and approval.”

“The goal is—not just for COVID—does this synthetic data work well enough as something that the FDA would consider in their regulatory decisions?”
