By Fay Lin, PhD
In a new study, “Illuminating protein space with a programmable generative model,” published in Nature, researchers present Chroma, a generative artificial intelligence (AI) model that creates novel proteins not previously found in nature, with programmable properties aimed at therapeutic applications. The researchers also report experimental success with the designs in the lab.
The work comes from Generate:Biomedicines, a company based in Somerville, MA, which works at the intersection of machine learning, biological engineering and medicine with an emphasis on protein design.
“We’ve been working on generative models of proteins since Day 1. That’s why our name is Generate!” exclaimed Gevorg Grigoryan, PhD, co-founder and chief technology officer of Generate:Biomedicines.
Prior to the AI revolution, protein design approaches were restricted to generating designs based on nature’s existing proteins, a significant constraint because nature has sampled only a small subset of the possible protein landscape. In contrast, generative AI approaches emphasize de novo protein design (designing new proteins from scratch) to expand the repertoire of functions and desirable attributes beyond what nature has achieved.
The study reports that Chroma can design proteins under external constraints, which can include symmetries, substructure, shape, and even natural-language prompts. Experimental characterization of 310 proteins generated by Chroma yielded proteins that express, fold, and possess favorable biophysical properties.
Grigoryan noted that programmability was integral to Chroma’s framework from the get-go, as developing therapeutics requires more than generating structures that can be experimentally validated. Evaluating protein function, such as binding, allosteric control, and enzymatic activity, is critical to assessing therapeutic potential.
In addition, Grigoryan highlighted one novelty of the study: a shift in how experimental validation is approached in a protein design campaign.
“Instead of the goal being ‘I want the protein to work,’ our goal was to characterize the model. We wanted to understand how much of what Chroma learned was real versus not real,” Grigoryan told GEN.
When deciding which computational designs to validate experimentally, a common approach includes a filtering step in which protein designers critique designs based on their understanding of biophysical structure, for example by penalizing an overrepresentation of hydrophobic regions because of solubility concerns.
Grigoryan told GEN that the 310 proteins chosen for experimental validation were taken directly from the model output and not filtered in this way.
“From those proteins, we saw a massively high success rate, which of course is very exciting because it suggests that this large protein space parameterized by Chroma is real [and allows for more effective protein design],” Grigoryan continued.
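To make the conventional filtering step concrete, here is a minimal Python sketch of the kind of hydrophobicity check described above, which the Chroma designs chosen for testing were not subjected to. Everything in it is an illustrative assumption rather than anything taken from the study: the residue set, the two thresholds, the function names, and the example sequences are all hypothetical.

```python
# Illustrative heuristic only: the residue set and thresholds below are
# assumptions for this sketch, not values from the Chroma study.
HYDROPHOBIC = set("AVILMFWY")  # one common, but not universal, choice


def hydrophobic_fraction(sequence: str) -> float:
    """Return the fraction of residues that are in the hydrophobic set."""
    if not sequence:
        return 0.0
    hits = sum(residue in HYDROPHOBIC for residue in sequence.upper())
    return hits / len(sequence)


def longest_hydrophobic_run(sequence: str) -> int:
    """Return the length of the longest unbroken stretch of hydrophobic residues."""
    longest = current = 0
    for residue in sequence.upper():
        current = current + 1 if residue in HYDROPHOBIC else 0
        longest = max(longest, current)
    return longest


def passes_filter(sequence: str, max_fraction: float = 0.5, max_run: int = 7) -> bool:
    """Keep a design only if it is not dominated by hydrophobic residues."""
    return (
        hydrophobic_fraction(sequence) <= max_fraction
        and longest_hydrophobic_run(sequence) <= max_run
    )


# Hypothetical candidate sequences, made up for this example.
designs = {
    "design_a": "MKTEELLKKAEELSKRGDSEEALKLA",
    "design_b": "MVVLLIIAAFFWWLLVVIIAAKE",
}
kept = [name for name, seq in designs.items() if passes_filter(seq)]
print(kept)
```

In this toy example the second sequence is rejected because nearly all of its residues are hydrophobic. Filters used in practice tend to be far more elaborate, which is part of why skipping them makes the reported success rate notable.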
Making data work for you
The protein design field’s traditional “bottom-up” approach, which simulates protein behavior from the biophysical dynamics of atoms, is logically “fine and consistent” but has not led to the advances that are now achievable with machine learning.
Rather than starting with first principles and evaluating whether simulations are accurate, machine learning approaches start with observations and infer the principles that led to those observations.
“Machine learning tools can make data work for you,” stated Grigoryan.
Specifically, Chroma leverages diffusion models, a class of machine learning models that has seen considerable success in image generators such as Midjourney, DALL-E 2 from OpenAI, and Stable Diffusion from Stability AI. These generative models learn the patterns of their training data and generate new outputs with similar characteristics.
Grigoryan emphasized that this framework makes Chroma readily adaptable to new programmable conditions.
“It’s very easy to create a model for new properties and plug it into Chroma. Similar to DALL-E image generators, you don’t have to create a separate model of images for animals, the beach, and mountains. You can just tell the model, ‘I would like a panda dancing on the beach with a sombrero’ and it can generate that for you,” described Grigoryan.
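For readers curious about the mechanics, the core idea of a diffusion model can be sketched in a few lines of Python. The example below shows only the generic "forward" noising step used in standard image-style diffusion models, in which data are gradually blended with random noise. A trained model learns to reverse that corruption to produce new samples. This is a textbook-style illustration under stated assumptions, not Chroma's actual formulation: the linear noise schedule, the parameter values, the function names, and the toy data are all chosen here for illustration.

```python
import math
import random


def alpha_bar_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Return the cumulative signal fraction after each step of a linear noise schedule."""
    alpha_bars = []
    running = 1.0
    for t in range(num_steps):
        # Noise added at step t grows linearly from beta_start to beta_end.
        beta = beta_start + (beta_end - beta_start) * t / max(num_steps - 1, 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(x0, alpha_bar, rng):
    """Sample x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * noise."""
    signal_scale = math.sqrt(alpha_bar)
    noise_scale = math.sqrt(1.0 - alpha_bar)
    return [signal_scale * x + noise_scale * rng.gauss(0.0, 1.0) for x in x0]


rng = random.Random(0)           # fixed seed so the sketch is repeatable
coords = [1.0, -2.0, 0.5]        # toy stand-in for real data (assumption)
alpha_bars = alpha_bar_schedule(1000)

slightly_noisy = add_noise(coords, alpha_bars[0], rng)   # still close to the data
mostly_noise = add_noise(coords, alpha_bars[-1], rng)    # almost pure noise
print(slightly_noisy, mostly_noise)
```

Generation runs this process in reverse, starting from noise and denoising step by step. Conditions such as the "panda on the beach" prompt in the quote above steer that reverse process.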
Chroma is not the only generative AI tool leveraging diffusion models for protein design. In July, the lab of David Baker, PhD, professor of biochemistry and director of the University of Washington (UW) Institute for Protein Design (IPD), published its diffusion model, RoseTTAFold diffusion (RFdiffusion), in Nature, demonstrating strong experimental validation and ease of use.
“So far, [Chroma] has only been experimentally demonstrated to design new structures, but could likely be adapted to design new protein, peptide and small molecule interactions as has been demonstrated by RFdiffusion,” Baker told GEN.
In this vein, Grigoryan noted that an effective protein design model is only one piece of the broader process of therapeutic discovery.
“Chroma is a model and not a drug printer. There’s so much more that goes into making therapeutics, which can be resource intensive and involves a very close integration between the wet and dry lab,” said Grigoryan.
Open to all
Generate:Biomedicines has made the code behind Chroma available as open-source software for all researchers across academia and industry.
“Our intention was to go open source before we posted the preprint. From a societal perspective, it would not feel right to stand in the way of what [Chroma] can do for advancing biomedical science, but also other applications, such as nanotechnology and material science,” explained Grigoryan.
From a company perspective, Grigoryan also noted that the ability to remain at the forefront of science is linked to a company’s ability to attract and retain the best talent. Sharing this work is a key way of contributing to the research community.
“It’s great that Generate:Biomedicines is making Chroma available to the scientific community!” said Baker. Baker also indicated that the community will benefit from having multiple generative protein design models to explore.
While diffusion models are the “flavor of the moment,” new protein design tools are expected to keep emerging in this rapidly growing field.
“Now that the code is available, [the community] is certainly free to build on it and create better versions. I expect and hope that’s exactly what happens,” stated Grigoryan.
Fay Lin, PhD, is senior editor for GEN Biotechnology.