Site-specific proteolysis by the enzymatic cleavage of small linear sequence motifs is a key posttranslational modification involved in physiology and disease. The ability to robustly and rapidly predict protease–substrate specificity would also enable targeted proteolytic cleavage by designed proteases. Current methods for predicting protease specificity are limited to sequence pattern recognition in experimentally derived cleavage data obtained for libraries of potential substrates and generated separately for each protease variant. We reasoned that a more semantically rich and robust model of protease specificity could be developed by incorporating the energetics of molecular interactions between protease and substrates into machine learning workflows. We present Protein Graph Convolutional Network (PGCN), which develops a physically grounded, structure-based molecular interaction graph representation that describes molecular topology and interaction energetics to predict enzyme specificity. We show that PGCN accurately predicts the specificity landscapes of several variants of two model proteases. Node and edge ablation tests identified key graph elements for specificity prediction, some of which are consistent with known biochemical constraints for protease:substrate recognition. We used a pretrained PGCN model to guide the design of protease libraries for cleaving two noncanonical substrates, and found good agreement with experimental cleavage results. Importantly, the model can accurately assess designs featuring diversity at positions not present in the training data. The described methodology should enable the structure-based prediction of specificity landscapes of a wide variety of proteases and the construction of tailor-made protease editors for site-selectively and irreversibly modifying chosen target proteins.
Data-driven supervised learning of a viral protease specificity landscape from deep sequencing and molecular simulations
Biophysical interactions between proteins and peptides are key determinants of molecular recognition specificity landscapes. However, an understanding of how molecular structure and residue-level energetics at protein−peptide interfaces shape these landscapes remains elusive. We combine information from yeast-based library screening, next-generation sequencing, and structure-based modeling in a supervised machine learning approach to report the comprehensive sequence−energetics−function mapping of the specificity landscape of the hepatitis C virus (HCV) NS3/4A protease, whose function—site-specific cleavages of the viral polyprotein—is a key determinant of viral fitness. We screened a library of substrates in which five residue positions were randomized and measured cleavability of ∼30,000 substrates (∼1% of the library) using yeast display and fluorescence-activated cell sorting followed by deep sequencing. Structure-based models of a subset of experimentally derived sequences were used in a supervised learning procedure to train a support vector machine to predict the cleavability of 3.2 million substrate variants by the HCV protease. The resulting landscape allows identification of previously unidentified HCV protease substrates, and graph-theoretic analyses reveal extensive clustering of cleavable and uncleavable motifs in sequence space. Specificity landscapes of known drug-resistant variants are similarly clustered. The described approach should enable the elucidation and redesign of specificity landscapes of a wide variety of proteases, including human-origin enzymes. Our results also suggest a possible role for residue-level energetics in shaping plateau-like functional landscapes predicted from viral quasispecies theory.
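The library arithmetic and the idea of clustering in sequence space can be illustrated with a minimal sketch. It assumes the five randomized positions each sample all 20 canonical amino acids, and that "clustering" can be approximated by linking sequences that differ at exactly one position. The pentamers in `toy_cleavable` are hypothetical placeholders, not measured substrates, and the helper names are illustrative, not taken from the paper.

```python
from itertools import combinations

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 canonical residues (assumed alphabet)
RANDOMIZED_POSITIONS = 5

# Size of the full substrate library and the fraction covered by the screen.
library_size = len(AMINO_ACIDS) ** RANDOMIZED_POSITIONS
measured = 30_000
fraction_measured = measured / library_size


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def connected_components(seqs):
    """Group sequences linked by single-residue substitutions (union-find)."""
    parent = {s: s for s in seqs}

    def find(s):
        while parent[s] != s:
            parent[s] = parent[parent[s]]  # path compression
            s = parent[s]
        return s

    for a, b in combinations(seqs, 2):
        if hamming(a, b) == 1:
            parent[find(a)] = find(b)

    groups = {}
    for s in seqs:
        groups.setdefault(find(s), []).append(s)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical pentamers, for illustration only.
toy_cleavable = ["DEMEE", "DEMEC", "DELEC", "AAAAA", "AAAAG"]
clusters = connected_components(toy_cleavable)

print(library_size)                 # 3200000
print(f"{fraction_measured:.2%}")   # 0.94%
print(clusters)
```

The 20^5 count reproduces the 3.2 million variants quoted in the abstract, and 30,000 measured substrates correspond to roughly 1% of that space. The all-pairs comparison is quadratic, so a sketch like this only suits small sets; a full landscape would need neighbor enumeration instead.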
- Award ID(s): 1716623
- PAR ID: 10082246
- Publisher / Repository: Proceedings of the National Academy of Sciences
- Date Published:
- Journal Name: Proceedings of the National Academy of Sciences
- Volume: 116
- Issue: 1
- ISSN: 0027-8424
- Page Range / eLocation ID: p. 168-176
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
-
Three protein targets from SARS-CoV-2, the viral pathogen that causes COVID-19, are studied: the main protease, the 2′-O-RNA methyltransferase, and the nucleocapsid (N) protein. For the main protease, the nucleophilicity of the catalytic cysteine C145 is enabled by coupling to three histidine residues: H163, H164, and the catalytic dyad partner H41. These electrostatic couplings enable significant population of the deprotonated state of C145. For the RNA methyltransferase, the catalytic lysine K6968 that serves as a Brønsted base has significant population of its deprotonated state via strong coupling with K6844 and Y6845. For the main protease, Partial Order Optimum Likelihood (POOL) predicts two clusters of biochemically active residues; one includes the catalytic H41 and C145 and neighboring residues. The other surrounds a second pocket adjacent to the catalytic site and includes S1 residues F140, L141, H163, E166, and H172 and also S2 residue D187. This secondary recognition site could serve as an alternative target for the design of molecular probes. From in silico screening of library compounds, ligands with predicted affinity for the secondary site are reported. For the NSP16-NSP10 complex that comprises the RNA methyltransferase, three different sites are predicted. One is the catalytic core at the conserved K-D-K-E motif that includes catalytic residues D6928, K6968, and E7001 plus K6844. The second site surrounds the catalytic core and consists of Y6845, C6849, I6866, H6867, F6868, V6894, D6895, D6897, I6926, S6927, Y6930, and K6935. The third is located at the heterodimer interface. Ligands predicted to have high affinity for the first or second sites are reported. Three sites are also predicted for the nucleocapsid protein. This work uncovers key interactions that contribute to the function of the three viral proteins and also suggests alternative sites for ligand design.
-
The eukaryotic genome and methylome encode DNA fragments’ propensity to form nucleosome particles. Although the mechanical properties of DNA possibly orchestrate such encoding, the definite link between ‘omics’ and DNA energetics has remained elusive. Here, we bridge the divide by examining the sequence-dependent energetics of highly bent DNA. Molecular dynamics simulations of 42 intact DNA minicircles reveal that each DNA minicircle undergoes inside-out conformational transitions with the most likely configuration uniquely prescribed by the nucleotide sequence and methylation of DNA. The minicircles’ local geometry consists of straight segments connected by sharp bends compressing the DNA’s inward-facing major groove. Such an uneven distribution of the bending stress favors minimum free energy configurations that avoid stiff base pair sequences at inward-facing major grooves. Analysis of the minicircles’ inside-out free energy landscapes yields a discrete worm-like chain model of bent DNA energetics that accurately accounts for its nucleotide sequence and methylation. Experimentally measuring the dependence of the DNA looping time on the DNA sequence validates the model. When applied to a nucleosome-like DNA configuration, the model quantitatively reproduces yeast and human genomes’ nucleosome occupancy. Further analyses of the genome-wide chromatin structure data suggest that DNA bending energetics is a fundamental determinant of genome architecture.
-
Protein language models trained on evolutionary data have emerged as powerful tools for predictive problems involving protein sequence, structure and function. However, these models overlook decades of research into biophysical factors governing protein function. We propose mutational effect transfer learning (METL), a protein language model framework that unites advanced machine learning and biophysical modeling. Using the METL framework, we pretrain transformer-based neural networks on biophysical simulation data to capture fundamental relationships between protein sequence, structure and energetics. We fine-tune METL on experimental sequence–function data to harness these biophysical signals and apply them when predicting protein properties like thermostability, catalytic activity and fluorescence. METL excels in challenging protein engineering tasks like generalizing from small training sets and position extrapolation, although existing methods that train on evolutionary signals remain powerful for many types of experimental assays. We demonstrate METL’s ability to design functional green fluorescent protein variants when trained on only 64 examples, showcasing the potential of biophysics-based protein language models for protein engineering.
-
Essential cellular processes of microtubule disassembly and protein degradation, which span lengths from tens of μm to nm, are mediated by specialized molecular machines with similar hexameric structure and function. Our molecular simulations at atomistic and coarse-grained scales show that both the microtubule-severing protein spastin and the caseinolytic protease ClpY accomplish spectacular unfolding of their diverse substrates, a microtubule lattice and dihydrofolate reductase (DHFR), by taking advantage of mechanical anisotropy in these proteins. Unfolding of wild-type DHFR requires disruption of mechanically strong β-sheet interfaces near each terminal, which yields branched pathways associated with unzipping along soft directions and shearing along strong directions. By contrast, unfolding of circular permutant DHFR variants involves single pathways due to softer mechanical interfaces near terminals, but translocation hindrance can arise from mechanical resistance of partially unfolded intermediates stabilized by β-sheets. For spastin, optimal severing action initiated by pulling on a tubulin subunit is achieved through specific orientation of the machine versus the substrate (microtubule lattice). Moreover, changes in the strength of the interactions between spastin and a microtubule filament, which can be driven by the tubulin code, lead to drastically different outcomes for the integrity of the hexameric structure of the machine.
