In the original hit TV series Lost in Space, a human-like machine called Robot would execute commands only if they made sense. Otherwise, it would flail its arms and exclaim, “That does not compute!” Fortunately, such outbursts are rare among today’s biocomputational platforms, which are spared the difficulties of working with naïfs and rogues. (Remember the wide-eyed Will Robinson and the nefarious Dr. Smith?) Instead, biocomputational platforms can focus on molecules and cells.

Questions about molecules and cells “compute” because biocomputational platforms have access to prodigious statistical power and vast troves of high-quality data. Such resources and new ways to exploit them were discussed at the Pacific Symposium on Biocomputing. The event was held January 3–7 in Waimea, HI, and it showcased the latest work in “databases, algorithms, interfaces, visualization, modeling, and other computational methods, as applied to biological problems, with emphasis on applications in data-rich areas of molecular biology.”

Some of the event’s most interesting discussions are revisited here. Specifically, this article covers the following developments: Protein structural data is being leveraged to reveal how proteins bind small-molecule drugs. Electrostatic field interactions are being assessed to classify protein binding preferences. A new scoring method is being used to improve the mining of single-cell RNA-sequencing (scRNA-seq) data—a challenging task, given that scRNA-seq data is often murky. Finally, better integration of genetic and pharmacogenomic data is revealing sex-specific differences in gene expression and advancing precision medicine generally.

Modeling protein-ligand interactions

Most FDA-approved drugs are protein-binding small molecules. Discovering new candidates can be painfully slow, but computational modeling may identify them more efficiently. Yun S. Song, PhD, professor of computer science and statistics at the University of California, Berkeley, is investigating whether data on the known three-dimensional geometric patterns of protein self-interactions can be transferred to protein-ligand interactions.

“Machine learning models of protein-ligand interactions typically struggle because of the relative lack of protein-ligand complex structures to train on,” Song told GEN. “However, proteins interact with small molecules using the same kinds of interactions they use to fold, and there is much more data available on how proteins fold. We wanted to see how easily interactions from folded proteins could be used to understand how proteins bind small-molecule drugs.”

Song and colleagues built on proof-of-concept work that had been carried out at the University of California, San Francisco, by Nicholas Polizzi, PhD, a postdoctoral researcher, and William DeGrado, PhD, a professor of pharmaceutical chemistry. Song’s group examined different types of atomic interactions important for both protein folding and protein-ligand interactions.

“For each interaction type, we described the observed interactions from protein folding as probability distributions, using geometric parameters and methods for density estimation,” Song elaborated. “We then developed statistical tests to assess which interaction types behave similarly in protein-ligand interactions. We applied our findings to the problem of protein-ligand docking, which is a key component of drug discovery.”
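The workflow Song describes, estimating a density from folding data and then testing whether ligand data follows it, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the single geometric parameter (a donor-acceptor distance in ångströms), the simulated samples, the Gaussian kernel, and the two-sample Kolmogorov-Smirnov statistic, which stands in for the statistical tests the group developed but which the article does not name.

```python
import math
import random

def gaussian_kde(samples, bandwidth):
    """Return a function that estimates the probability density of `samples`."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))
    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)
    return density

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d

# Simulated (not real) distances for one hypothetical interaction type.
rng = random.Random(0)
folding = [rng.gauss(2.9, 0.2) for _ in range(500)]         # protein self-interactions
ligand_similar = [rng.gauss(2.9, 0.2) for _ in range(500)]  # little distribution shift
ligand_shifted = [rng.gauss(3.3, 0.2) for _ in range(500)]  # noticeable shift

folding_density = gaussian_kde(folding, bandwidth=0.05)
ks_similar = ks_statistic(folding, ligand_similar)
ks_shifted = ks_statistic(folding, ligand_shifted)
print(f"KS vs. similar ligand data: {ks_similar:.3f}")
print(f"KS vs. shifted ligand data: {ks_shifted:.3f}")
```

In this toy setting, an interaction type with a small statistic would be a candidate for transfer from folding data to docking, whereas one with a large statistic would call for the caution Song describes.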

Researchers at the Center for Computational Structural Biology are using computational docking to study protein-ligand interactions and contribute to drug discovery and development. For example, the researchers developed AutoDock, open source software for the computational docking and virtual screening of small molecules to macromolecular receptors. This AutoDock image shows a drug docking with a protein.

Not every interaction type transferred cleanly. For some, the geometric distributions observed in folded proteins were shifted relative to those observed in protein-ligand complexes. “Many geometric properties of protein self-interactions can directly inform how proteins should bind to small-molecule drugs, but there are still some distribution shifts that should warrant caution,” Song noted. “By focusing on the interaction types with little shift, we improved the accuracy of a widely used protein-ligand docking method.”

While this approach may not yet be ready for prime time, it promises to provide a flexible statistical framework for further evaluations. Song concluded, “We would like to understand where these shifts come from, so that we can leverage protein folding data for a larger list of interaction types.”

Electrostatic influences

Brian Y. Chen, PhD
Associate Professor
Lehigh University

Only a small group of amino acids play central roles in selective binding. Identifying those interactions and, in particular, their mechanisms is critical for understanding processes such as how genetic variations impact pathogenicity. “Proteins achieve selective binding through numerous mechanisms—steric hindrance, electrostatic fields, hydrogen bonding, hydrophobic interactions, and so on,” detailed Brian Y. Chen, PhD, an associate professor of computer science and engineering at Lehigh University.

Chen and his group developed an algorithm called DeepVASP-E that reveals how specificity is influenced by electrostatic mechanisms. “Our method,” he explained, “works by using a neural network to classify an input protein into different binding preference categories, and then by identifying the region of the electrostatic field that was most salient for the category selected.

“The hypothesis of our study was that the region of the field that was most salient for classification would also be the region most important for selective binding. We tested this hypothesis by comparing salient regions to findings from the structural biological literature, and we found that they exhibited very strong agreement.”
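The idea of asking which region of a field mattered most for a classification can be sketched with occlusion, a generic saliency technique. This is not DeepVASP-E: the article does not say how the algorithm computes salience, and the 3 × 3 × 3 voxel grid, the potentials, and the two-class linear scorer below are all hypothetical stand-ins for a trained neural network.

```python
import math
from itertools import product

def softmax(scores):
    """Convert raw class scores into probabilities."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify(field, weights):
    """Toy linear classifier (a stand-in for a neural network): one weight grid per class."""
    scores = [sum(w[v] * field[v] for v in field) for w in weights]
    return softmax(scores)

def occlusion_saliency(field, weights):
    """Blank out each voxel in turn and record the drop in the predicted class's probability."""
    probs = classify(field, weights)
    predicted = max(range(len(probs)), key=probs.__getitem__)
    saliency = {}
    for voxel in field:
        occluded = dict(field)
        occluded[voxel] = 0.0
        saliency[voxel] = probs[predicted] - classify(occluded, weights)[predicted]
    return predicted, saliency

# Hypothetical electrostatic potentials: strongly negative at the grid center.
voxels = list(product(range(3), repeat=3))
field = {v: 0.1 for v in voxels}
field[(1, 1, 1)] = -2.0

# Hypothetical weights: class 0 responds to negative potential at the center voxel.
class0 = {v: 0.0 for v in voxels}
class0[(1, 1, 1)] = -3.0
class1 = {v: 0.0 for v in voxels}

predicted_class, saliency = occlusion_saliency(field, [class0, class1])
most_salient = max(saliency, key=saliency.get)
print(predicted_class, most_salient)
```

The voxel whose removal most reduces the classifier's confidence is reported as most salient, which mirrors the hypothesis Chen describes: the region that drives the classification may also be the region that drives selective binding.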

At Lehigh University, researchers have developed an algorithm called DeepVASP-E to study how proteins selectively bind ligands. The algorithm was used to generate this image, which classifies the negatively charged salient regions of a bacterial enolase’s binding cavity (transparent yellow region). The most salient regions are identified with red cubes; decreasingly salient regions are shown in solid yellow, green, and then blue. The image also suggests how the binding cavity’s electrostatic gradient is created by the nearby amino acids ASP314, GLU287, and GLU241.

Applications for the method include identifying mutations that alter binding preferences or novel mutations that maintain predicted binding preferences. Thus, the approach could help forecast mutations arising during viral evolution that precipitate vaccine resistance or assist in protein redesign that improves binding specificity.

Chen is seeking collaborators who may be interested in using the method and helping to test it on a larger scale. “We intend to release the code by itself soon,” he declared. He added that the code will eventually be “part of a larger package.”

Mining single-cell data

Courtney Schiebout
Graduate Student, Dartmouth

The development of single-cell methods, especially single-cell transcriptomics, has dramatically improved and accelerated omics research in the past decade. In particular, teasing out cell types in scRNA-seq is critical for identifying mechanisms and phenotypes. Courtney Schiebout, a graduate student in the Quantitative Biomedical Sciences program at Dartmouth’s Geisel School of Medicine, said that cell typing in scRNA-seq is a challenge in and of itself given the noise and sparsity common to scRNA-seq data. She added that a related challenge is “the assumption that all cells fit into discrete categories rather than existing on a continuum of expression.”

To address both of these challenges, Schiebout and colleagues developed a sophisticated statistical scoring method for scRNA-seq data called Cell typing using variance Adjusted Mahalanobis distances with Multi-Labeling (CAMML).

Schiebout elaborated, “CAMML is a cell-type classifier for scRNA-seq data that can either define a single label for each cell or multiple labels based on how each cell scores for each given cell type. These scores are determined based on genes identified to be positively associated with a certain cell type. The higher a cell’s expression of these genes, the higher the score for that cell type.”

Thus, CAMML helps resolve the challenge of noise and sparsity in scRNA-seq data by using sets of genes to determine cell type rather than individual genes. Schiebout further noted, “CAMML also solves the issue of assuming cells are discretely classifiable by providing an option for multilabeling. We found that CAMML performs better or similarly to existing single-label cell typing methods and that it captures phenotypic information that single-label classification cannot.”
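The two ideas in Schiebout’s description, scoring cells with gene sets and allowing more than one label, can be illustrated with a deliberately simplified sketch. It is not the CAMML implementation: as the method’s name indicates, CAMML scores cells with variance-adjusted Mahalanobis distances, whereas the sketch below simply averages the expression of each gene set. The gene names, expression values, and the 50% multilabel cutoff are hypothetical.

```python
def gene_set_scores(cell, gene_sets):
    """Score a cell for each cell type as the mean expression of that type's gene set."""
    return {
        cell_type: sum(cell.get(gene, 0.0) for gene in genes) / len(genes)
        for cell_type, genes in gene_sets.items()
    }

def single_label(scores):
    """Assign the one cell type with the highest score."""
    return max(scores, key=scores.get)

def multi_label(scores, fraction=0.5):
    """Assign every cell type scoring at least `fraction` of the top score."""
    top = max(scores.values())
    if top <= 0:
        return []
    return sorted(ct for ct, s in scores.items() if s >= fraction * top)

# Hypothetical gene sets and expression values, for illustration only.
gene_sets = {
    "T cell": ["t_gene_1", "t_gene_2"],
    "B cell": ["b_gene_1", "b_gene_2"],
}
clear_cell = {"t_gene_1": 5.0, "t_gene_2": 4.0}  # B-cell genes undetected
mixed_cell = {"t_gene_1": 3.0, "t_gene_2": 3.0, "b_gene_1": 2.0, "b_gene_2": 3.0}

clear_scores = gene_set_scores(clear_cell, gene_sets)
mixed_scores = gene_set_scores(mixed_cell, gene_sets)
print(single_label(clear_scores), multi_label(clear_scores))
print(single_label(mixed_scores), multi_label(mixed_scores))
```

Because each score pools several genes, a single undetected transcript does not erase a cell’s score, and the second cell receives both labels rather than being forced into one category.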

Applications for CAMML also include finding cells that appear to be in transition between discrete cell types. Schiebout said that others could certainly utilize CAMML, which can now be accessed on CRAN. CRAN, or the Comprehensive R Archive Network, is the main repository for open-source R packages.

CAMML incorporates multiple data sets and possesses diverse analytic capabilities. These attributes, in Schiebout’s view, position CAMML as a “robust and promising method for analyzing scRNA-seq data, particularly in the complex immune compartment of the tumor microenvironment.”

Melding pharmacogenomics and disease genetics

One of the major challenges facing precision medicine today is the knowledge gap between disease genetics and pharmacogenomics (PGx). Whereas the former interprets pathogenicity resulting from genetic variants, the latter seeks to understand the genetic influences on drug responses.

Teri E. Klein, PhD
Professor, Stanford University

Teri E. Klein, PhD, a professor of biomedical data science and medicine at Stanford University, says both are integral parts of genomic medicine, which encompasses patient care from disease diagnosis through drug prescription. When the two are not considered together, healthcare costs and treatment efficacy suffer. She noted, “For example, even though clinical genetic tests that are ordered for diagnostic purposes can include findings of PGx variants, the PGx results are not usually reported, in part due to a lack of realization of the knowledge overlap and general lack of PGx knowledge by clinicians.”

To help resolve this issue, Klein and colleagues carried out a quantitative, systematic classification of the knowledge overlap between disease genetics and PGx by investigating the genetic annotations from multiple large-scale, highly curated expert data sources.

The team classified genes based on their pathogenic role or PGx effect as well as on genetic actionability. Klein elaborated, “We identified several genes with strong evidence for an association with both disease status and drug response. Twenty-six genes were found in common, and they could be classified into four categories.”

The categories correspond to the following circumstances:

  1. Exposure to a medication triggers an adverse reaction in a patient carrying certain genetic variants (2 genes).
  2. A medication is contraindicated in a patient carrying certain genetic variants or puts them at increased risk of an adverse reaction (11 genes).
  3. A medication is indicated to treat a patient carrying certain genetic variants (15 genes).
  4. The association is currently unclear because there are no clinical guidelines or FDA-approved drug labels with information (2 genes).

According to Klein, achieving true genomic medicine will require efforts of the wider medical community. “Genetic variants and genes need clinical guidelines that interpret genetic information of their clinical relevance for the implementation of scientific discoveries in medical practice,” she explained. “For an accurate use of genetic information in clinical practice, clinical curation activities will need representation of both disease genetics and PGx experts.”

Evaluating sex-specific differences

Although there are known sex differences in tissue-specific gene expression and in the genetic architecture related to gene expression, these differences don’t always receive the attention they deserve. “Research assessing the genetic basis of observed sex differences in disease prevalence and progression is necessary for facilitating precision medicine,” stated Logan Dumitrescu, PhD, an assistant professor of neurology at Vanderbilt University Medical Center. “Sex differences are typically modeled as a nuisance variable in genetic analyses and are rarely integrated into theoretical and analytical models.”

Dumitrescu and colleagues employ a computational genetics tool called PrediXcan. This tool uses information from publicly available expression quantitative trait loci (eQTL) project atlases to impute tissue-specific gene expression in study participants who have genotype data but not transcriptomic data.
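At its simplest, imputing expression from genotype amounts to a weighted sum of a participant’s variant dosages. The sketch below shows that arithmetic only. It does not use the PrediXcan software, and the variant identifiers, weights, and the choice to treat missing variants as zero dosage are assumptions made for illustration.

```python
def impute_expression(dosages, weights, intercept=0.0):
    """Predict a gene's expression as a weighted sum of variant dosages (0, 1, or 2).

    Variants missing from `dosages` are treated as dosage 0, an assumption
    of this sketch.
    """
    return intercept + sum(w * dosages.get(snp, 0.0) for snp, w in weights.items())

# Hypothetical weights, as if learned from a reference eQTL panel for one tissue.
weights = {"snp_a": 0.5, "snp_b": -0.25}

# Hypothetical participants with genotype data but no transcriptomic data.
participant_1 = {"snp_a": 2, "snp_b": 1}
participant_2 = {"snp_a": 0, "snp_b": 2}

predicted_1 = impute_expression(participant_1, weights)
predicted_2 = impute_expression(participant_2, weights)
print(predicted_1)  # 0.75
print(predicted_2)  # -0.5
```

Predicted values of this kind, computed for every participant, are what would then be tested for association with a disease or trait.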

“These predicted gene expression values can then be used to test for associations with disease or disease-related traits,” she explained. “This method overcomes the pitfalls of classic genome-wide association approaches by placing the focus on gene function while increasing statistical power and reducing the total number of comparisons.

“While multiple extensions of PrediXcan have been developed, none have integrated sex into the model building step of PrediXcan, despite evidence of sex differences in tissue-specific gene expression levels in humans. We hypothesized that sex-specific prediction models would facilitate more robust predictions of gene expression.”

Using whole blood transcriptomic data from the Genotype-Tissue Expression (GTEx) project, the scientists built autosomal gene expression prediction models in each sex separately, as well as combined, and evaluated their performance in an independent dataset. “Interestingly, we observed very little evidence of sex-specific genetic regulation,” Dumitrescu reported. “Only 15% of genes showed better prediction in sex-stratified models, and only eight total sex-specific genetic prediction models demonstrated robust evidence of superior performance in stratified compared to joint prediction models.”

“For most genes, we did not see improvement in the prediction quality of sex-aware models,” Dumitrescu continued. “Additional careful evaluation and larger eQTL databases are needed before these methods can be applied with confidence. Moving forward, we are excited to apply these same approaches to other omic datasets, including proteomic and epigenomic databases. We also plan to expand these models to the X-chromosome, which includes many sex-differently expressed genes.”
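The comparison between sex-stratified and joint models can be sketched on simulated data. The example below is a toy, not the group’s analysis: it uses one variant per gene, ordinary least squares rather than the models the team built, and invented effect sizes. It contrasts a gene whose genetic effect is shared between the sexes with a gene whose effect differs, scoring each model by R² on held-out samples from one sex.

```python
import random

def fit_line(x, y):
    """Ordinary least squares for a single predictor; returns (slope, intercept)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def r_squared(model, x, y):
    """Held-out R^2; can be negative when the model predicts worse than the mean."""
    slope, intercept = model
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

def simulate(rng, n, slope, noise=0.5):
    """Simulate dosages (0, 1, 2) and expression with a given genetic effect."""
    x = [rng.choice([0, 1, 2]) for _ in range(n)]
    y = [slope * xi + rng.gauss(0.0, noise) for xi in x]
    return x, y

def compare(rng, slope_f, slope_m, n_train=300, n_test=300):
    """Return held-out R^2 in females for the stratified and the joint model."""
    x_f, y_f = simulate(rng, n_train, slope_f)
    x_m, y_m = simulate(rng, n_train, slope_m)
    stratified = fit_line(x_f, y_f)            # trained on females only
    joint = fit_line(x_f + x_m, y_f + y_m)     # trained on both sexes
    x_test, y_test = simulate(rng, n_test, slope_f)
    return r_squared(stratified, x_test, y_test), r_squared(joint, x_test, y_test)

rng = random.Random(0)
shared_strat, shared_joint = compare(rng, slope_f=0.8, slope_m=0.8)
specific_strat, specific_joint = compare(rng, slope_f=1.0, slope_m=-1.0)
print(f"shared effect:       stratified {shared_strat:.2f}, joint {shared_joint:.2f}")
print(f"sex-specific effect: stratified {specific_strat:.2f}, joint {specific_joint:.2f}")
```

When the effect is shared, the two models perform about the same, which is the pattern Dumitrescu reports for most genes. Stratification helps only in the deliberately exaggerated sex-specific case.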