The “ChatGPT moment” for biology continues to unfold as protein language models, machine learning tools trained on large databases of protein sequences, work to decode the language of life with the goal of generating new proteins for applications across therapeutics, sustainability, and more. Profluent, an artificial intelligence (AI) protein design company based in Berkeley, CA, has now moved one step closer to steering these models toward specific functional tasks with atomistic-level control.
“There’s been a lot of work right now on building foundational models for biology as a whole,” said Ali Madani, PhD, CEO of Profluent, in an interview with GEN Edge. “How do we train these large generative models to learn from the underlying patterns that nature has provided us?”
In a preprint posted on bioRxiv, Profluent introduces a new method that incorporates structural and functional context into protein language models for conditioned design. The approach, termed proseLM (protein structure-encoded language model), was experimentally validated to improve gene editing activity and therapeutic antibody binding affinity, two complex functional protein design tasks with broad applications across biotechnological research.
“We’re moving away from discovery-based techniques and toward precise, steerable control, and intentional design for the challenges that we see in society today,” said Madani.
Jeffrey Ruffolo, PhD, lead author of the proseLM preprint and head of protein design at Profluent, said the team aimed to evaluate how proseLM compared with traditional approaches, such as directed evolution or, in the case of antibodies, manual optimization.
“We found that even with just one round of optimization, we can match some of the best base editors out there. For antibodies, we can even get better binding than nivolumab, which is a clinically approved antibody therapeutic,” Ruffolo told GEN Edge.
In proseLM, structural and functional information, including non-protein interactions with nucleic acids, ligands, and ions, is introduced into a pre-trained language model through a set of added layers called adapters. Notably, these adapter layers have far fewer parameters than the language model itself, making the resulting models efficient to train and run.
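To give a rough sense of why adapters are cheap relative to the model they attach to, the sketch below counts parameters for a generic bottleneck adapter, a common parameter-efficient design. This is an illustration only: the layer sizes, the layer count, and the bottleneck design are assumptions chosen for the example, not details of proseLM's architecture.

```python
# Illustrative sketch only. All sizes and the bottleneck-adapter design are
# assumptions for this example, not proseLM's published architecture.

def linear_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out

def transformer_layer_params(d_model, d_ff):
    """Rough count for one transformer layer (layer norms ignored)."""
    attention = 4 * linear_params(d_model, d_model)  # Q, K, V, output projections
    feed_forward = linear_params(d_model, d_ff) + linear_params(d_ff, d_model)
    return attention + feed_forward

def bottleneck_adapter_params(d_model, d_bottleneck):
    """Down-projection to a small width, then up-projection back."""
    return linear_params(d_model, d_bottleneck) + linear_params(d_bottleneck, d_model)

# Hypothetical model dimensions.
D_MODEL, D_FF, D_BOTTLENECK, N_LAYERS = 1024, 4096, 64, 24

base_params = N_LAYERS * transformer_layer_params(D_MODEL, D_FF)
adapter_params = N_LAYERS * bottleneck_adapter_params(D_MODEL, D_BOTTLENECK)
adapter_fraction = adapter_params / (base_params + adapter_params)

print(f"base model layers: {base_params:,} parameters")
print(f"adapter layers:    {adapter_params:,} parameters")
print(f"adapter share:     {adapter_fraction:.2%}")
```

Under these assumed sizes, the adapters account for roughly 1% of the total parameters, which is the kind of gap that keeps training and inference costs low when the pre-trained weights are left untouched.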
Toward broader functionalities
Profluent launched with a $9 million seed round in 2023 and secured an additional $35 million in financing in 2024. The company was founded on the principle of using AI to decode the language of life and support a paradigm shift in protein engineering from accidental discovery to intentional design. Madani, who led machine learning research initiatives at Salesforce Research prior to founding Profluent, emphasizes the company’s evolutionary approach to protein design, which learns the patterns of natural sequences evolved for similar functions to inform the design space.
Profluent has pointed its design platform toward CRISPR and gene editing. In April, the company demonstrated successful precision editing with a programmable gene editor designed with AI, named OpenCRISPR-1. OpenCRISPR-1 has been released publicly for broad and ethical usage across research and commercial applications. ProseLM now expands Profluent’s toolkit from designing within specialized protein families to broader functionalities.
Profluent is not the only player leveraging language models for protein design. Earlier this summer, EvolutionaryScale, a biology AI company founded by former Meta AI researchers, came out of stealth with a $142 million seed round and announced ESM3, a language model shown to generate a new green fluorescent protein (GFP) with only 58% similarity to the closest known fluorescent protein.
Two sides of the same coin
Evolutionary approaches for protein design contrast with structure-based methods, where a protein structure is given and the goal is to find a sequence that folds onto the structure. Structure-based design algorithms often require explicit instructions for defining function, which allows for more fine-tuned control.
“[In structure-based approaches], if you want a protein to bind a target, you need to figure out what the structure will look like exactly,” said Ruffolo. “That’s restrictive for applications like gene editors, where you have these large proteins that have many different functions that they need to do in sequence.”
Ruffolo describes both approaches as “two sides of the same coin.” While one side is “reading” biology by taking the sequence and determining the structure, the other is “writing” biology by generating a new protein that fits into a specific context.
“[With proseLM], we can take the fine control of structure-based approaches and the broad scope of sequence-based approaches to look at the best of both worlds,” Ruffolo continued.
ProseLM is one example of the field’s ongoing movement from designing proteins in a vacuum toward broader integration of biological context. In May, Google DeepMind, in collaboration with Isomorphic Labs, published AlphaFold 3 in Nature. This update expands the renowned protein structure prediction algorithm’s predictive capabilities from proteins to a broad spectrum of biomolecular interactions, including DNA, RNA, ligands, and more. To the public’s disappointment, AlphaFold 3 was released without open-source code and is only accessible as a web server with limited functionality.
Madani stated that proseLM has proven to be a powerful tool in the company’s hands and will be a strong addition to Profluent’s toolkit going forward. Profluent is releasing proseLM to the public for non-commercial use and looks forward to the community’s feedback. With this new tool to write the language of biology, time will tell what applications lie in the next chapter.
Fay Lin, PhD, is senior editor for GEN Biotechnology.