Title: Data-driven supervised learning of a viral protease specificity landscape from deep sequencing and molecular simulations

Biophysical interactions between proteins and peptides are key determinants of molecular recognition specificity landscapes. However, an understanding of how molecular structure and residue-level energetics at protein−peptide interfaces shape these landscapes remains elusive. We combine information from yeast-based library screening, next-generation sequencing, and structure-based modeling in a supervised machine learning approach to report the comprehensive sequence−energetics−function mapping of the specificity landscape of the hepatitis C virus (HCV) NS3/4A protease, whose function—site-specific cleavages of the viral polyprotein—is a key determinant of viral fitness. We screened a library of substrates in which five residue positions were randomized and measured cleavability of ∼30,000 substrates (∼1% of the library) using yeast display and fluorescence-activated cell sorting followed by deep sequencing. Structure-based models of a subset of experimentally derived sequences were used in a supervised learning procedure to train a support vector machine to predict the cleavability of 3.2 million substrate variants by the HCV protease. The resulting landscape allows identification of previously unidentified HCV protease substrates, and graph-theoretic analyses reveal extensive clustering of cleavable and uncleavable motifs in sequence space. Specificity landscapes of known drug-resistant variants are similarly clustered. The described approach should enable the elucidation and redesign of specificity landscapes of a wide variety of proteases, including human-origin enzymes. Our results also suggest a possible role for residue-level energetics in shaping plateau-like functional landscapes predicted from viral quasispecies theory.
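The numbers in the abstract can be checked with a short sketch. Randomizing five positions over the 20 canonical amino acids gives 20^5 = 3.2 million variants, so about 30,000 measured substrates is roughly 1% of the library. The sketch also illustrates one simple reading of "clustering in sequence space": link sequences that differ by a single substitution and take connected components. This is an assumption for illustration only; the abstract does not specify the graph-theoretic analysis used, and the function names and toy sequences are hypothetical.

```python
from itertools import combinations

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 canonical residues
RANDOMIZED_POSITIONS = 5              # five randomized positions, per the abstract


def library_size(positions: int = RANDOMIZED_POSITIONS) -> int:
    """Number of distinct substrates when `positions` sites are fully randomized."""
    return len(AMINO_ACIDS) ** positions


def hamming(a: str, b: str) -> int:
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def single_mutation_clusters(seqs):
    """Group sequences into connected components of the Hamming-distance-1 graph.

    Assumption: an edge joins two sequences that differ by exactly one residue.
    The pairwise loop is quadratic, so this is suitable for toy inputs only.
    """
    seqs = list(dict.fromkeys(seqs))  # drop duplicates, keep order
    parent = {s: s for s in seqs}

    def find(s):
        # Union-find lookup with path halving.
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    for a, b in combinations(seqs, 2):
        if hamming(a, b) == 1:
            parent[find(a)] = find(b)

    clusters = {}
    for s in seqs:
        clusters.setdefault(find(s), []).append(s)
    # Largest clusters first; members and ties sorted alphabetically.
    return sorted((sorted(c) for c in clusters.values()),
                  key=lambda c: (-len(c), c))


if __name__ == "__main__":
    print(library_size())                      # 3200000
    print(f"{30_000 / library_size():.2%}")    # fraction of the library measured
    toy = ["DEMEE", "DEMEC", "DELEC", "AAAAA", "AAAAG", "WWWWW"]  # hypothetical
    for cluster in single_mutation_clusters(toy):
        print(cluster)
```

At the scale of the full 3.2-million-variant landscape, a pairwise comparison would be impractical; enumerating the 95 single-substitution neighbors of each sequence and looking them up in a set would be the more natural design.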

 
Award ID(s):
1716623
PAR ID:
10082246
Author(s) / Creator(s):
; ;
Publisher / Repository:
Proceedings of the National Academy of Sciences
Date Published:
Journal Name:
Proceedings of the National Academy of Sciences
Volume:
116
Issue:
1
ISSN:
0027-8424
Page Range / eLocation ID:
p. 168-176
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Site-specific proteolysis by the enzymatic cleavage of small linear sequence motifs is a key posttranslational modification involved in physiology and disease. The ability to robustly and rapidly predict protease–substrate specificity would also enable targeted proteolytic cleavage by designed proteases. Current methods for predicting protease specificity are limited to sequence pattern recognition in experimentally derived cleavage data obtained for libraries of potential substrates and generated separately for each protease variant. We reasoned that a more semantically rich and robust model of protease specificity could be developed by incorporating the energetics of molecular interactions between protease and substrates into machine learning workflows. We present Protein Graph Convolutional Network (PGCN), which develops a physically grounded, structure-based molecular interaction graph representation that describes molecular topology and interaction energetics to predict enzyme specificity. We show that PGCN accurately predicts the specificity landscapes of several variants of two model proteases. Node and edge ablation tests identified key graph elements for specificity prediction, some of which are consistent with known biochemical constraints for protease:substrate recognition. We used a pretrained PGCN model to guide the design of protease libraries for cleaving two noncanonical substrates, and found good agreement with experimental cleavage results. Importantly, the model can accurately assess designs featuring diversity at positions not present in the training data. The described methodology should enable the structure-based prediction of specificity landscapes of a wide variety of proteases and the construction of tailor-made protease editors for site-selectively and irreversibly modifying chosen target proteins.

     
  2. Eukaryotic genome and methylome encode DNA fragments’ propensity to form nucleosome particles. Although the mechanical properties of DNA possibly orchestrate such encoding, the definite link between ‘omics’ and DNA energetics has remained elusive. Here, we bridge the divide by examining the sequence-dependent energetics of highly bent DNA. Molecular dynamics simulations of 42 intact DNA minicircles reveal that each DNA minicircle undergoes inside-out conformational transitions with the most likely configuration uniquely prescribed by the nucleotide sequence and methylation of DNA. The minicircles’ local geometry consists of straight segments connected by sharp bends compressing the DNA’s inward-facing major groove. Such an uneven distribution of the bending stress favors minimum free energy configurations that avoid stiff base pair sequences at inward-facing major grooves. Analysis of the minicircles’ inside-out free energy landscapes yields a discrete worm-like chain model of bent DNA energetics that accurately accounts for its nucleotide sequence and methylation. Experimentally measuring the dependence of the DNA looping time on the DNA sequence validates the model. When applied to a nucleosome-like DNA configuration, the model quantitatively reproduces yeast and human genomes’ nucleosome occupancy. Further analyses of the genome-wide chromatin structure data suggest that DNA bending energetics is a fundamental determinant of genome architecture.

     
  3. Essential cellular processes of microtubule disassembly and protein degradation, which span lengths from tens of μm to nm, are mediated by specialized molecular machines with similar hexameric structure and function. Our molecular simulations at atomistic and coarse-grained scales show that both the microtubule-severing protein spastin and the caseinolytic protease ClpY accomplish spectacular unfolding of their diverse substrates, a microtubule lattice and dihydrofolate reductase (DHFR), by taking advantage of mechanical anisotropy in these proteins. Unfolding of wild-type DHFR requires disruption of mechanically strong β-sheet interfaces near each terminal, which yields branched pathways associated with unzipping along soft directions and shearing along strong directions. By contrast, unfolding of circular permutant DHFR variants involves single pathways due to softer mechanical interfaces near terminals, but translocation hindrance can arise from mechanical resistance of partially unfolded intermediates stabilized by β-sheets. For spastin, optimal severing action initiated by pulling on a tubulin subunit is achieved through specific orientation of the machine versus the substrate (microtubule lattice). Moreover, changes in the strength of the interactions between spastin and a microtubule filament, which can be driven by the tubulin code, lead to drastically different outcomes for the integrity of the hexameric structure of the machine.
  4. Computational function prediction is one of the most important problems in bioinformatics as elucidating the function of genes is a central task in molecular biology and genomics. Most of the existing function prediction methods use protein sequences as the primary source of input information because the sequence is the most available information for query proteins. There are attempts to consider other attributes of query proteins. Among these attributes, the three-dimensional (3D) structure of proteins is known to be very useful in identifying the evolutionary relationship of proteins, from which functional similarity can be inferred. Here, we report a novel protein function prediction method, ContactPFP, which uses predicted residue-residue contact maps as input structural features of query proteins. Although 3D structure information is known to be useful, it has not been routinely used in function prediction because the 3D structure is not experimentally determined for many proteins. In ContactPFP, we overcome this limitation by using residue-residue contact prediction, which has become increasingly accurate due to rapid development in the protein structure prediction field. ContactPFP takes a query protein sequence as input and uses predicted residue-residue contact as a proxy for the 3D protein structure. To characterize how predicted contacts contribute to function prediction accuracy, we compared the performance of ContactPFP with several well-established sequence-based function prediction methods. The comparative study revealed the advantages and weaknesses of ContactPFP compared to contemporary sequence-based methods. There were many cases where it showed higher prediction accuracy. We examined factors that affected the accuracy of ContactPFP using several illustrative cases that highlight the strength of our method. 
  5. The relationship between genotype and phenotype, or the fitness landscape, is the foundation of genetic engineering and evolution. However, mapping fitness landscapes poses a major technical challenge due to the amount of quantifiable data that is required. Catalytic RNA is a special topic in the study of fitness landscapes due to its relatively small sequence space combined with its importance in synthetic biology. The combination of in vitro selection and high-throughput sequencing has recently provided empirical maps of both complete and local RNA fitness landscapes, but the astronomical size of sequence space limits purely experimental investigations. Next steps are likely to involve data-driven interpolation and extrapolation over sequence space using various machine learning techniques. We discuss recent progress in understanding RNA fitness landscapes, particularly with respect to protocells and machine representations of RNA. The confluence of technical advances may significantly impact synthetic biology in the near future.

     