
This content will become publicly available on September 26, 2024

Title: Prediction and design of protease enzyme specificity using a structure-aware graph convolutional network

Site-specific proteolysis by the enzymatic cleavage of small linear sequence motifs is a key posttranslational modification involved in physiology and disease. The ability to robustly and rapidly predict protease–substrate specificity would also enable targeted proteolytic cleavage by designed proteases. Current methods for predicting protease specificity are limited to sequence pattern recognition in experimentally derived cleavage data obtained for libraries of potential substrates and generated separately for each protease variant. We reasoned that a more semantically rich and robust model of protease specificity could be developed by incorporating the energetics of molecular interactions between protease and substrates into machine learning workflows. We present Protein Graph Convolutional Network (PGCN), which develops a physically grounded, structure-based molecular interaction graph representation that describes molecular topology and interaction energetics to predict enzyme specificity. We show that PGCN accurately predicts the specificity landscapes of several variants of two model proteases. Node and edge ablation tests identified key graph elements for specificity prediction, some of which are consistent with known biochemical constraints for protease:substrate recognition. We used a pretrained PGCN model to guide the design of protease libraries for cleaving two noncanonical substrates, and found good agreement with experimental cleavage results. Importantly, the model can accurately assess designs featuring diversity at positions not present in the training data. The described methodology should enable the structure-based prediction of specificity landscapes of a wide variety of proteases and the construction of tailor-made protease editors for site-selectively and irreversibly modifying chosen target proteins.

Publisher / Repository:
National Academy of Sciences
Journal Name:
Proceedings of the National Academy of Sciences
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Biophysical interactions between proteins and peptides are key determinants of molecular recognition specificity landscapes. However, an understanding of how molecular structure and residue-level energetics at protein−peptide interfaces shape these landscapes remains elusive. We combine information from yeast-based library screening, next-generation sequencing, and structure-based modeling in a supervised machine learning approach to report the comprehensive sequence−energetics−function mapping of the specificity landscape of the hepatitis C virus (HCV) NS3/4A protease, whose function—site-specific cleavages of the viral polyprotein—is a key determinant of viral fitness. We screened a library of substrates in which five residue positions were randomized and measured cleavability of ∼30,000 substrates (∼1% of the library) using yeast display and fluorescence-activated cell sorting followed by deep sequencing. Structure-based models of a subset of experimentally derived sequences were used in a supervised learning procedure to train a support vector machine to predict the cleavability of 3.2 million substrate variants by the HCV protease. The resulting landscape allows identification of previously unidentified HCV protease substrates, and graph-theoretic analyses reveal extensive clustering of cleavable and uncleavable motifs in sequence space. Specificity landscapes of known drug-resistant variants are similarly clustered. The described approach should enable the elucidation and redesign of specificity landscapes of a wide variety of proteases, including human-origin enzymes. Our results also suggest a possible role for residue-level energetics in shaping plateau-like functional landscapes predicted from viral quasispecies theory.

  2. Ubiquitin-like proteins (Ubls) share some features with ubiquitin (Ub) such as their globular 3D structure and the ability to attach covalently to other proteins. Interferon Stimulated Gene 15 (ISG15) is an abundant Ubl that, similar to Ub, marks many hundreds of cellular proteins, altering their fate. In contrast to Ub, ISG15 requires interferon (IFN) induction to conjugate efficiently to other proteins. Moreover, despite the multitude of E3 ligases for Ub-modified targets, a single E3 ligase termed HERC5 (in humans) is responsible for the bulk of ISG15 conjugation. Targets include both viral and cellular proteins spanning an array of cellular compartments and metabolic pathways. So far, no common structural or biochemical feature has been attributed to these diverse substrates, raising questions about how and why they are selected. Conjugation of ISG15 mitigates some viral and bacterial infections and is linked to a lower viral load, pointing to the role of ISG15 in the cellular immune response. In an apparent attempt to evade the immune response, some viruses try to interfere with the ISG15 pathway. For example, deconjugation of ISG15 appears to be an approach taken by coronaviruses to interfere with ISG15 conjugates. Specifically, coronaviruses such as SARS-CoV, MERS-CoV, and SARS-CoV-2 encode papain-like proteases (PL1pro) that bear striking structural and catalytic similarities to the catalytic core domain of eukaryotic deubiquitinating enzymes of the Ubiquitin-Specific Protease (USP) sub-family. The cleavage specificity of these PLpro enzymes is for flexible polypeptides containing a consensus sequence (R/K)LXGG, enabling them to function on two seemingly unrelated categories of substrates: (i) the viral polyprotein 1 (PP1a, PP1ab) and (ii) Ub- or ISG15-conjugates. As a result, PLpro enzymes process the viral polyprotein 1 into an array of functional proteins for viral replication (termed non-structural proteins; NSPs), and they can remove Ub or ISG15 units from conjugates. However, by de-conjugating ISG15, the virus also creates free ISG15, which in turn may affect the immune response in two opposite pathways: free ISG15 negatively regulates IFN signaling in humans by binding non-catalytically to USP18, yet at the same time free ISG15 can be secreted from the cell and induce the IFN pathway of the neighboring cells. A deeper understanding of this protein-modification pathway and the mechanisms of the enzymes that counteract it will bring about effective clinical strategies related to viral and bacterial infections.
  3. Ribozymes are RNA molecules that catalyze biochemical reactions. Self-cleaving ribozymes are a common naturally occurring class of ribozymes that catalyze site-specific cleavage of their own phosphodiester backbone. In addition to their natural functions, self-cleaving ribozymes have been used to engineer control of gene expression because they can be designed to alter RNA processing and stability. However, the rational design of ribozyme activity remains challenging, and many ribozyme-based systems are engineered or improved by random mutagenesis and selection (in vitro evolution). Improving a ribozyme-based system often requires several mutations to achieve the desired function, but extensive pairwise and higher-order epistasis prevents a simple prediction of the effect of multiple mutations that is needed for rational design. Recently, high-throughput sequencing-based approaches have produced data sets on the effects of numerous mutations in different ribozymes (RNA fitness landscapes). Here we used such high-throughput experimental data from variants of the CPEB3 self-cleaving ribozyme to train a predictive model through machine learning approaches. We trained models using either a random forest or long short-term memory (LSTM) recurrent neural network approach. We found that models trained on a comprehensive set of pairwise mutant data could predict active sequences at higher mutational distances, but the correlation between predicted and experimentally observed self-cleavage activity decreased with increasing mutational distance. Adding sequences with increasingly higher numbers of mutations to the training data improved the correlation at increasing mutational distances. Systematically reducing the size of the training data set suggests that a wide distribution of ribozyme activity may be the key to accurate predictions. Because the model predictions are based only on sequence and activity data, the results demonstrate that this machine learning approach allows readily obtainable experimental data to be used for RNA design efforts even for RNA molecules with unknown structures. The accurate prediction of RNA functions will enable a more comprehensive understanding of RNA fitness landscapes for studying evolution and for guiding RNA-based engineering efforts.
  4. Motivation

    Proteases are enzymes that cleave target substrate proteins by catalyzing the hydrolysis of peptide bonds between specific amino acids. While the functional proteolysis regulated by proteases plays a central role in the ‘life and death’ cellular processes, many of the corresponding substrates and their cleavage sites have not yet been found. Availability of accurate predictors of the substrates and cleavage sites would facilitate understanding of proteases’ functions and physiological roles. Deep learning is a promising approach for the development of accurate predictors of substrate cleavage events.

    Results

    We propose DeepCleave, the first deep learning-based predictor of protease-specific substrates and cleavage sites. DeepCleave uses protein substrate sequence data as input and employs convolutional neural networks with transfer learning to train accurate predictive models. The high predictive performance of our models stems from the use of high-quality cleavage site features extracted from the substrate sequences through the deep learning process, and from the application of transfer learning, multiple kernels and an attention layer in the design of the deep network. Empirical tests against several related state-of-the-art methods demonstrate that DeepCleave outperforms these methods in predicting caspase and matrix metalloprotease substrate-cleavage sites.

    Availability and implementation

    The DeepCleave webserver and source code are freely available at

    Supplementary information

    Supplementary data are available at Bioinformatics online.

  5. Enzymes are catalysts in biochemical reactions that, by definition, increase rates of reactions without being altered or destroyed. However, when that enzyme is a protease, a subclass of enzymes that hydrolyze other proteins, and that protease is in a multiprotease system, protease-as-substrate dynamics must be included, challenging assumptions of enzyme inertness, shifting kinetic predictions of that system. Protease-on-protease inactivating hydrolysis can alter predicted protease concentrations used to determine pharmaceutical dosing strategies. Cysteine cathepsins are proteases capable of cathepsin cannibalism, where one cathepsin hydrolyzes another with substrate present, and misunderstanding of these dynamics may cause miscalculations of multiple proteases working in one proteolytic network of interactions occurring in a defined compartment. Once rates for individual protease-on-protease binding and catalysis are determined, proteolytic network dynamics can be explored using computational models of cooperative/competitive degradation by multiple proteases in one system, while simultaneously incorporating substrate cleavage. During parameter optimization, it was revealed that additional distraction reactions, where inactivated proteases become competitive inhibitors to remaining, active proteases, occurred, introducing another network reaction node. Taken together, improved predictions of substrate degradation in a multiple protease network were achieved after including reaction terms of autodigestion, inactivation, cannibalism, and distraction, altering kinetic considerations from other enzymatic systems, since enzyme can be lost to proteolytic degradation. We compiled and encoded these dynamics into an online platform ( for individual users to test hypotheses of specific perturbations to multiple cathepsins, substrates, and inhibitors, and predict shifts in proteolytic network reactions and system dynamics.
