Title: Deep learning the structural determinants of protein biochemical properties by comparing structural ensembles with DiffNets

Understanding the structural determinants of a protein’s biochemical properties, such as activity and stability, is a major challenge in biology and medicine. Comparing computer simulations of protein variants with different biochemical properties is an increasingly powerful means to drive progress. However, success often hinges on dimensionality reduction algorithms for simplifying the complex ensemble of structures each variant adopts. Unfortunately, common algorithms rely on potentially misleading assumptions about what structural features are important, such as emphasizing larger geometric changes over smaller ones. Here we present DiffNets, self-supervised autoencoders that avoid such assumptions, and automatically identify the relevant features, by requiring that the low-dimensional representations they learn are sufficient to predict the biochemical differences between protein variants. For example, DiffNets automatically identify subtle structural signatures that predict the relative stabilities of β-lactamase variants and duty ratios of myosin isoforms. DiffNets should also be applicable to understanding other perturbations, such as ligand binding.

Publisher / Repository:
Nature Publishing Group
Journal Name:
Nature Communications
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    Although each individual's chromosomes carry a large number of structural variations, accurate methods for identifying clinically pathogenic variants are lacking. Here, we propose SVPath, a machine learning-based method to predict the pathogenicity of exonic deletions, insertions and duplications. We constructed three types of annotation features for each structural variation event in the ClinVar database. First, we treated complex structural variations as multiple consecutive single nucleotide polymorphism events, and annotated them with correlation scores based on single-nucleotide substitutions, such as the impact on protein function. Second, we determined which genes the variation occurred in, and constructed gene-based annotation features for each structural variation. Third, we calculated related features based on the transcriptome, such as histone signal and the overlap ratio of the variation with genomic element definitions. Finally, we employed a gradient boosting decision tree method, and used the deletions, insertions and duplications in the ClinVar database, which are clearly labeled as pathogenic or benign, to train the structural variation pathogenicity prediction model SVPath. Experimental results show that SVPath achieves excellent predictive performance and outperforms existing state-of-the-art tools. SVPath is promising for evaluating the clinical pathogenicity of structural variants, and can be used in clinical research to predict the clinical significance of novel structural variations or those of unknown pathogenicity, so as to explore the relationship between diseases and structural variations computationally.

  2. Abstract

    Structural, regulatory and enzymatic proteins interact with DNA to maintain a healthy and functional genome. Yet, our structural understanding of how proteins interact with DNA is limited. We present MELD-DNA, a novel computational approach to predict the structures of protein–DNA complexes. The method combines molecular dynamics simulations with general knowledge or experimental information through Bayesian inference. The physical model is sensitive to sequence-dependent properties and conformational changes required for binding, while information accelerates sampling of bound conformations. MELD-DNA can: (i) sample multiple binding modes; (ii) identify the preferred binding mode from the ensembles; and (iii) provide qualitative binding preferences between DNA sequences. We first assess performance on a dataset of 15 protein–DNA complexes and compare it with state-of-the-art methodologies. Furthermore, for three selected complexes, we show sequence dependence effects of binding in MELD predictions. We expect that the results presented herein, together with the freely available software, will impact structural biology (by complementing DNA structural databases) and molecular recognition (by bringing new insights into aspects governing protein–DNA interactions).

  3. Abstract

    Motivation

    Genetic variation that disrupts gene function by altering gene splicing between individuals can substantially influence traits and disease. In those cases, accurately predicting the effects of genetic variation on splicing can be highly valuable for investigating the mechanisms underlying those traits and diseases. While methods have been developed to generate high quality computational predictions of gene structures in reference genomes, the same methods perform poorly when used to predict the potentially deleterious effects of genetic changes that alter gene splicing between individuals. Underlying that discrepancy in predictive ability are the common assumptions by reference gene finding algorithms that genes are conserved, well-formed and produce functional proteins.

    Results


    We describe a probabilistic approach for predicting recent changes to gene structure that may or may not conserve function. The model is applicable to both coding and non-coding genes, and can be trained on existing gene annotations without requiring curated examples of aberrant splicing. We apply this model to the problem of predicting altered splicing patterns in the genomes of individual humans, and we demonstrate that performing gene-structure prediction without relying on conserved coding features is feasible. The model predicts an unexpected abundance of variants that create de novo splice sites, an observation supported by both simulations and empirical data from RNA-seq experiments. While these de novo splice variants are commonly misinterpreted by other tools as coding or non-coding variants of little or no effect, we find that in some cases they can have large effects on splicing activity and protein products and we propose that they may commonly act as cryptic factors in disease.

    Availability and implementation

    The software is available from

    Supplementary information

    Supplementary information is available at Bioinformatics online.

  4. Objective

    To define a distinct, dominantly inherited, mild skeletal myopathy associated with prominent and consistent tremor in two unrelated, three‐generation families.

    Methods


    Clinical evaluations as well as exome and panel sequencing analyses were performed in affected and nonaffected members of two families to identify genetic variants segregating with the phenotype. Histological assessment of a muscle biopsy specimen was performed in 1 patient, and quantitative tremor analysis was carried out in 2 patients. Molecular modeling studies and biochemical assays were performed for both mutations.

    Results


    Two novel missense mutations in MYBPC1 (p.E248K in family 1 and p.Y247H in family 2) were identified and shown to segregate perfectly with the myopathy/tremor phenotype in the respective families. MYBPC1 encodes slow myosin binding protein‐C (sMyBP‐C), a modular sarcomeric protein playing structural and regulatory roles through its dynamic interaction with actin and myosin filaments. The Y247H and E248K mutations are located in the NH2‐terminal M‐motif of sMyBP‐C. Both mutations result in markedly increased binding of the NH2 terminus to myosin, possibly interfering with normal cross‐bridge cycling as the first muscle‐based step in tremor genesis. The clinical tremor features observed in all mutation carriers, together with the tremor physiology studies performed in family 2, suggest amplification by an additional central loop modulating the clinical tremor phenomenology.

    Interpretation


    Here, we link two novel missense mutations in MYBPC1 with a dominant, mild skeletal myopathy invariably associated with a distinctive tremor. The molecular, genetic, and clinical studies are consistent with a unique sarcomeric origin of the tremor, which we classify as “myogenic tremor.” ANN NEUROL 2019

  5. DNA damage and repair are widely studied in relation to cancer and therapeutics. Y-family DNA polymerases can bypass DNA lesions, which may result from external or internal DNA damaging agents, including some chemotherapy agents. Overexpression of Y-family polymerase human pol kappa can result in tumorigenesis and drug resistance in cancer. This report describes the use of computational tools to predict the effects of single nucleotide polymorphism variants on pol kappa activity. Partial Order Optimum Likelihood (POOL), a machine learning method that uses input features from Theoretical Microscopic Titration Curve Shapes (THEMATICS), was used to identify amino acid residues most likely involved in catalytic activity. The μ4 value, a metric obtained from POOL and THEMATICS that serves as a measure of the degree of coupling between one ionizable amino acid and its neighbors, was then used to identify which protein mutations are likely to impact the biochemical activity. Bioinformatic tools SIFT, PolyPhen-2, and FATHMM predicted most of these variants to be deleterious to function. Along with computational and bioinformatic predictions, we characterized the catalytic activity and stability of seventeen cancer-associated DNA pol kappa variants. We identified pol kappa variants R48I, H105Y, G147D, G154E, V177L, R298C, E362V, and R470C as having lower activity relative to wild-type pol kappa; the pol kappa variants T102A, H142Y, R175Q, E210K, Y221C, N330D, N338S, K353T, and L383F were identified as similar in catalytic efficiency to WT pol kappa. We observed that POOL predictions can be used to predict which variants have decreased activity. Predictions from bioinformatic tools like SIFT, PolyPhen-2 and FATHMM are based on sequence comparisons and therefore are complementary to POOL but are less capable of predicting biochemical activity. 
    These bioinformatic and computational tools can be used to identify SNP variants with deleterious effects and altered biochemical activity from a large data set.