Including prediction stability as a parameter when using machine learning to design functional biological sequences can significantly broaden the selection pool of candidate antibodies, according to a recent paper.
Scientists outlined a whole-spectrum approach to black-box optimization that balances predicted activity against prediction stability while also considering other development parameters. Including prediction stability enables the machine learning algorithms to identify promising sequences that are dissimilar to the training data, creating a more diverse pool of potential antibodies that still satisfy key criteria.
This whole-spectrum approach to prediction uncertainty may be particularly applicable to automatic design in situations with limited data.
“When prediction uncertainty is ignored, machine learning creates any sequences that achieve high scores, but they are unlikely to be successful,” Koji Tsuda, PhD, professor, department of computational biology and medical sciences, University of Tokyo, tells GEN. “When the uncertainty is respected too much, machine learning creates sequences close to the training sequences [to enhance] safety. Our method takes the balance between the two extremes.”
Average activity and standard deviation
Led by Tsuda and Mitsuo Umetsu, PhD, professor of biomolecular engineering, Tohoku University, the scientists first set out to design VHH antibodies (the antigen-binding fragments of heavy-chain-only antibodies) against galectin-3, which is implicated in cancer and other diseases. They trained multiple prediction models by subsampling the training set and formulated a multi-objective optimization problem whose objectives were the average predicted activity and its standard deviation.
A multi-objective optimization solver then evaluated candidate sequences against three objectives: the average of the prediction scores, the standard deviation of those scores, and solubility as predicted by NetSolP.
“Our approach is reminiscent of bagging, where multiple prediction models are created by subsampling the training data set, and the average of prediction scores are used for making decisions for new examples,” they elaborated. The main difference between this approach and bagging is that their method improves black-box optimization, while bagging improves prediction accuracy.
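The bagging-like scheme the authors describe can be sketched in a toy form. The paper's actual activity predictors are not detailed in the article, so the similarity-weighted scoring model below, and all function names, are illustrative assumptions; the point is only how subsampling the training data yields a mean score and a standard deviation to use as separate objectives.

```python
import random
import statistics

# Hypothetical toy setup: a "model" here is just the subsample of
# (sequence, activity) pairs it was built on; it scores a candidate by
# similarity-weighted activity. This stands in for the real predictors,
# whose details are not given in the article.

def hamming_similarity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def fit_model(training_subsample):
    """Return a scoring function built from one subsample of the data."""
    def score(candidate):
        # Similarity-weighted average of the subsample's activities.
        pairs = [(hamming_similarity(candidate, seq), act)
                 for seq, act in training_subsample]
        total = sum(w for w, _ in pairs)
        return sum(w * act for w, act in pairs) / total
    return score

def ensemble_objectives(candidate, training_data, n_models=10, frac=0.7, seed=0):
    """Bagging-style uncertainty: mean and standard deviation of scores
    across models trained on random subsamples of the training set."""
    rng = random.Random(seed)
    k = max(2, int(frac * len(training_data)))
    scores = []
    for _ in range(n_models):
        subsample = rng.sample(training_data, k)
        scores.append(fit_model(subsample)(candidate))
    return statistics.mean(scores), statistics.stdev(scores)

# Toy training set of short "sequences" with made-up activity labels.
training = [("AAAA", 0.9), ("AAAT", 0.8), ("AATT", 0.5),
            ("ATTT", 0.3), ("TTTT", 0.1), ("AATA", 0.6)]

mean_score, score_std = ensemble_objectives("AAAA", training)
```

A high mean with a low standard deviation indicates a sequence the ensemble scores well and agrees on; a high standard deviation flags the kind of far-from-training candidate that a single model would accept or reject with false confidence.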
From the original 19,778 sequences they designed, five were chosen for wet lab validation. “One sequence, 16 mutations away from the closest training sequence, was successfully expressed and found to possess desired binding specificity,” they noted.
This research used the multi-objective optimization by quantum annealing (MOQA) software developed in-house, although, Tsuda points out, “One can use any off-the-shelf multi-objective optimizer and include prediction uncertainty as one of the objectives.”
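Tsuda's point, that uncertainty can simply be treated as one more objective in any multi-objective optimizer, can be illustrated without MOQA. The sketch below, with entirely made-up numbers, uses a plain Pareto-dominance filter over the three objectives the article names: mean predicted activity (maximize), standard deviation of the scores (minimize), and predicted solubility (maximize).

```python
# A minimal sketch of treating prediction uncertainty as one objective.
# MOQA itself is not shown; this is a generic Pareto-dominance filter over
# (mean activity, score standard deviation, solubility) triples.

def dominates(a, b):
    """a dominates b if it is no worse on every objective and strictly
    better on at least one (maximize mean and solubility, minimize std)."""
    no_worse = a[0] >= b[0] and a[1] <= b[1] and a[2] >= b[2]
    strictly = a[0] > b[0] or a[1] < b[1] or a[2] > b[2]
    return no_worse and strictly

def pareto_front(candidates):
    """Keep every candidate not dominated by any other candidate."""
    return [c for c in candidates
            if not any(dominates(o, c) for o in candidates if o is not c)]

# (mean activity, std of scores, solubility) for four hypothetical sequences.
scored = [(0.9, 0.30, 0.6),   # high activity but uncertain
          (0.7, 0.05, 0.8),   # moderate, confident, soluble
          (0.6, 0.10, 0.5),   # dominated by the second entry
          (0.5, 0.40, 0.4)]   # dominated as well

front = pareto_front(scored)
```

The surviving front keeps both the high-scoring-but-uncertain sequence and the safer, more soluble one, which mirrors the balance Tsuda describes between chasing raw scores and hugging the training data.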