During the lead optimization phase of biologics drug discovery, researchers use the design, make, test, analyze, and report cycle to make incremental changes to the project’s candidate compounds. As part of this cycle, researchers use experimental data from screening assays to analyze the performance of a range of related candidate compounds. This analysis shows how the modifications made to the candidates in the previous iteration of the cycle altered their performance in the ensuing screening assays. The interpretation of these results then informs the next round of design changes, and so the cycle continues, refining the candidates progressively (one hopes!) until the properties of the best compounds reach the level required to progress to the next stage of development.

The nexus of data analysis and interpretation, leading to design changes for the next round of experimentation, is a key point at which the scientific knowledge and experience of researchers are at a premium. In some cases, particularly early in the lead optimization phase, it may be possible to relate results directly to the prior set of design changes. In many cases, however, the effects of multiple variables and subtle design changes cannot be fully appreciated by simple inspection or basic statistical analyses. In these cases, advanced analytics methods are necessary to extract full meaning from the data.

Advanced analytics methods vary in the questions they ask and in what they attempt to determine. For biologics, these methods are still relatively new, but at least in the cases described here, they are derived from analogous methods that have been successfully applied in small-molecule drug discovery. In this article, we describe some of these methods and what they aim to achieve.

Principal Components Analysis

Principal components analysis (PCA) is a general-purpose statistical technique used widely to analyze high-dimensional data and to visualize structures or sequences in “likeness” space. It is often used for data exploration because it reduces the dimensionality of the data, highlighting the combinations of variables that account for most of the variation. In drug discovery, it is typically used to group compounds (small molecules or biological sequences) based on concepts of “likeness.” Strictly speaking, PCA is a dimensionality-reduction method rather than a clustering algorithm, but in this context it is used in much the same way: to reveal groups of similar compounds.

In this mode, PCA is used to create groups of compounds that are “like” one another based on their chemical structure or biological sequence. The theory is that successful on-market drugs tend to share structural or sequence features. Therefore, if a given candidate compound occupies an area of PCA space close to on-market drugs, it might be a stronger candidate than one that occupies an area of PCA space away from any on-market drugs. Factoring in assay data can further strengthen the analysis by showing which areas in PCA space are correlated with good performance in target assays (Figure 1).

Figure 1. Principal components analysis (PCA) plot showing species variation of CRYAB gene sequences. CRYAB codes for the α-crystallin B chain. Mutations in it have been associated with cancer as well as neurodegenerative diseases, such as Alzheimer’s and Parkinson’s diseases. Note that the human samples cluster together (with one outlier) in an area of PCA space that is distinct from a loose cluster of other species.
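As a rough illustration of the grouping idea described above, the following Python sketch one-hot encodes a handful of pre-aligned sequences, projects them onto two principal components, and measures how far each lies from a set of reference sequences in PCA space. The sequences, the names (`ref_1`, `cand_A`, and so on), and the helper functions are all hypothetical and invented for this example; it assumes NumPy is available and is not the implementation used by any particular product.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def one_hot(seq):
    """Encode an aligned sequence as a flat 0/1 vector, 20 slots per position."""
    vec = np.zeros(len(seq) * len(AMINO_ACIDS))
    for pos, residue in enumerate(seq):
        vec[pos * len(AMINO_ACIDS) + AMINO_ACIDS.index(residue)] = 1.0
    return vec


def pca_scores(X, n_components=2):
    """Project the rows of X onto their leading principal components via SVD."""
    centered = X - X.mean(axis=0)
    U, S, _ = np.linalg.svd(centered, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]
    explained = S**2 / np.sum(S**2)  # fraction of variance per component
    return scores, explained[:n_components]


# Hypothetical, pre-aligned 10-residue sequences (invented for illustration).
sequences = {
    "ref_1": "MDIAIHHPWI",
    "ref_2": "MDIAIHHPWL",
    "ref_3": "MDIALHHPWI",
    "cand_A": "MDIAIHHPWV",
    "cand_B": "MKVSTYGRWE",
}
reference_names = ["ref_1", "ref_2", "ref_3"]  # stand-ins for known drugs

names = list(sequences)
X = np.array([one_hot(seq) for seq in sequences.values()])
scores, explained = pca_scores(X)

# Distance of every sequence from the centroid of the reference group.
ref_rows = [names.index(name) for name in reference_names]
reference_centroid = scores[ref_rows].mean(axis=0)
distance = {
    name: float(np.linalg.norm(xy - reference_centroid))
    for name, xy in zip(names, scores)
}

for name, (pc1, pc2) in zip(names, scores):
    print(f"{name}: PC1={pc1:.2f} PC2={pc2:.2f} distance={distance[name]:.2f}")
```

In this toy data set, `cand_A` differs from the references at a single position and so lands near them, whereas `cand_B` differs at most positions and lands far away. The signs of the principal components are arbitrary, which is why the sketch compares distances rather than raw coordinates.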

It should be remembered, though, that because the type of analysis described above is based on what is already known (that is, on-market drugs), it may be hard to find new classes of compounds. A candidate compound that is “out on its own” in PCA space might not be a bad candidate, but rather the first member of a new class of wonder compounds.

So, as always, while it is not necessary to understand exactly how an analysis is performed, it is important to understand how to interpret the results.

Matched Pair Analysis

Matched pair analysis (MPA; or matched molecular pairs [MMP] when used for small molecules) is a highly localized analysis method that compares two compounds differing by a single identifiable change. For small molecules, this means the pairs differ by a single chemical transformation. For sequences, this means a single residue change, insertion, or deletion. These differences can then be correlated with relative assay performance or another drug development-relevant property. Because the differences are identifiable and localized, their effects are easy to interpret.

In most cases, because the differences between the paired compounds are small, the resulting difference in activity is commensurately small. However, in some cases, there can be very significant differences (referred to as activity cliffs) that indicate highly relevant substitutions and/or sensitive positions within the paired compounds. These can inform design choices going forward.
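The pairing logic can be sketched in a few lines of Python. The example below finds every pair of equal-length sequences that differ by exactly one substitution, reports the change in activity, and flags pairs whose change exceeds a threshold as activity cliffs. The sequences, activity values (taken here to be pIC50), variant names, and the `CLIFF_THRESHOLD` cutoff are all hypothetical; for brevity the sketch handles substitutions only, not insertions or deletions.

```python
from itertools import combinations


def single_substitution(seq_a, seq_b):
    """Return (position, residue_a, residue_b) if the sequences differ at
    exactly one position (1-based), otherwise None."""
    if len(seq_a) != len(seq_b):
        return None  # insertions and deletions are out of scope for this sketch
    diffs = [
        (i + 1, a, b)
        for i, (a, b) in enumerate(zip(seq_a, seq_b))
        if a != b
    ]
    return diffs[0] if len(diffs) == 1 else None


# Hypothetical variants: name -> (sequence, pIC50). Values are invented.
assay = {
    "v1": ("ACDEFGHIK", 6.1),
    "v2": ("ACDEFGHLK", 6.3),
    "v3": ("ACDKFGHIK", 8.0),
    "v4": ("ACDKFGHLK", 8.1),
}
CLIFF_THRESHOLD = 1.0  # assumed cutoff, in pIC50 units

pairs = []
for (name_a, (seq_a, act_a)), (name_b, (seq_b, act_b)) in combinations(assay.items(), 2):
    change = single_substitution(seq_a, seq_b)
    if change is None:
        continue  # not a matched pair
    position, res_a, res_b = change
    delta = act_b - act_a
    pairs.append({
        "pair": (name_a, name_b),
        "mutation": f"{res_a}{position}{res_b}",
        "delta": delta,
        "cliff": abs(delta) >= CLIFF_THRESHOLD,
    })

for p in pairs:
    flag = "  <- activity cliff" if p["cliff"] else ""
    print(f"{p['pair'][0]} -> {p['pair'][1]}: {p['mutation']} delta={p['delta']:+.1f}{flag}")
```

With this toy data, the I8L substitution barely changes activity in either background, while E4K produces a large jump in both, which is the kind of consistent, position-specific signal that can guide the next design round.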

Although MPA is, in general, a new and emerging technique for biologics drug discovery, the established technique of alanine scanning may be thought of as a form of MPA. In alanine scanning, the amino acid residues of a native protein are mutated to alanine one at a time, progressively along the sequence (Figure 2).

Figure 2. Alanine scan, a form of matched pair analysis. A set of mutated sequences (sequence index: 1 to 20) shows how a single residue has been mutated to alanine (sequence offset: 2 to 21). Points are sized by IC50 value. The chart shows that positions 6 (histidine), 3 (glutamic acid), and, to a lesser extent, 7 (cysteine) contained residues that were the most sensitive to the mutation.

That is, in the first mutated protein, position 1 is mutated to alanine; in the second mutated protein, position 2 is mutated to alanine (and position 1 remains as it is in the native protein); and so on until there is a variant of the native protein for each (non-alanine) position. Each variant is then compared to the native protein (the two forming a matched pair) to assess the impact on structure and function. Alanine is typically chosen because its small side chain is chemically inert, so substituting it reveals which residues in the native protein make significant functional or structural contributions.
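The procedure just described is simple to express in code. The Python sketch below generates one variant per non-alanine position and then ranks the variants by fold change in IC50 relative to the native protein. The six-residue `native` sequence and all IC50 values are hypothetical and exist only to make the example runnable.

```python
def alanine_scan(native):
    """Return (label, sequence) for every single-residue alanine variant.

    Positions that are already alanine are skipped, since mutating them
    would reproduce the native sequence.
    """
    variants = []
    for i, residue in enumerate(native):
        if residue == "A":
            continue
        mutated = native[:i] + "A" + native[i + 1:]
        variants.append((f"{residue}{i + 1}A", mutated))  # e.g. "E2A"
    return variants


native = "MEHCAK"  # hypothetical native sequence
variants = alanine_scan(native)

# Hypothetical assay results in nM (invented for illustration).
native_ic50 = 10.0
variant_ic50 = {"M1A": 12.0, "E2A": 150.0, "H3A": 400.0, "C4A": 45.0, "K6A": 9.0}

# Each variant and the native protein form a matched pair; a large fold
# change in IC50 marks a position that is sensitive to the mutation.
fold_change = {label: variant_ic50[label] / native_ic50 for label, _ in variants}
ranked = sorted(fold_change, key=fold_change.get, reverse=True)

for label in ranked:
    print(f"{label}: {fold_change[label]:.1f}x native IC50")
```

Here the scan yields five variants because position 5 is already alanine, and the ranking puts H3A and E2A at the top as the positions most sensitive to substitution.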

Biological Sequence-Activity Relationships

Biological sequence-activity relationship (BSAR) analysis is the sequence equivalent of quantitative structure-activity relationship (QSAR) analysis. In QSAR, which is well established in small-molecule drug discovery, the relationship between functional groups of small molecules and performance in screening assays is assessed, with the fundamental assumption being that compounds sharing common properties or features should give similar assay results. Established QSAR trends can then be used to guide compound design. QSAR may be thought of as a generalization of MPA, summarizing the contributions of multiple changes.

BSAR is an emerging analogous method for sequence-based compounds. Point mutations, insertions, or deletions, especially in critical areas of a sequence such as the complementarity-determining regions (CDRs) or surrounding framework regions of an antibody chain, will likely affect the performance of sequences in screening assays. BSAR can be used to assess the collective impact of identified mutations in specific regions of the sequence, providing input to design considerations in the next cycle of lead optimization.
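One simple way to picture “the collective impact of identified mutations” is an additive model in the spirit of Free-Wilson analysis, a classic QSAR formulation: each mutation is assumed to contribute a fixed amount to activity, and the contributions are estimated by least squares. The Python sketch below takes that approach as an assumption; it is not presented as how any specific BSAR implementation works. The mutation labels and activity values are hypothetical, and the data were constructed to be exactly additive so that the fit is easy to check by eye.

```python
import numpy as np

# Hypothetical point mutations relative to a parent sequence.
mutations = ["S31T", "Y52F", "G99A"]

# Hypothetical variants: (mutations present, measured activity as pIC50).
# Values are invented and deliberately additive for illustration.
variants = [
    (set(), 7.0),  # parent sequence
    ({"S31T"}, 7.5),
    ({"Y52F"}, 6.0),
    ({"G99A"}, 7.2),
    ({"S31T", "Y52F"}, 6.5),
    ({"S31T", "G99A"}, 7.7),
    ({"Y52F", "G99A"}, 6.2),
]

# Design matrix: an intercept column plus one 0/1 indicator per mutation.
X = np.array([
    [1.0] + [1.0 if m in present else 0.0 for m in mutations]
    for present, _ in variants
])
y = np.array([activity for _, activity in variants])

# Least-squares fit: coef[0] is the parent baseline, the rest are the
# estimated per-mutation contributions.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
baseline = float(coef[0])
contribution = {m: float(c) for m, c in zip(mutations, coef[1:])}


def predict(present):
    """Predicted activity for a variant carrying the given mutations."""
    return baseline + sum(contribution[m] for m in present)


for m in mutations:
    print(f"{m}: {contribution[m]:+.2f}")

# Prediction for an untested combination of all three mutations.
triple_mutant = predict({"S31T", "Y52F", "G99A"})
print(f"Predicted triple mutant: {triple_mutant:.2f}")
```

Real assay data are noisy and mutations often interact, so an additive fit of this kind is best read as a first approximation that suggests which combinations are worth making and testing next.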

Accessing the Methods

The techniques described above are nonproprietary and have been published, both for small-molecule and biologics discovery. They can be implemented from scratch, with suitable open-source code, or with commercial products. Dotmatics provides all three techniques for biologics drug discovery within its Vortex product for data analysis and visualization.


The ability to make informed design decisions is critical to successful drug discovery programs. The techniques described here are valuable additions to other analytical methods and statistics, allowing complex and high-dimensional data to be interpreted in an objective manner. This increases the likelihood of successful design changes for the next cycle of lead optimization.


Andrew LeBeau, Ph.D., is senior manager of biologics marketing at Dotmatics.