
Title: Effective prediction of short hydrogen bonds in proteins via machine learning method

Short hydrogen bonds (SHBs), whose donor and acceptor heteroatoms lie within 2.7 Å, exhibit prominent quantum mechanical character and are connected to a wide range of essential biomolecular processes. However, exact determination of the geometry and functional roles of SHBs requires a protein structure at atomic resolution. In this work, we analyze 1260 high-resolution peptide and protein structures from the Protein Data Bank and develop a boosting-based machine learning model to predict the formation of SHBs between amino acids. This model, which we name machine learning assisted prediction of short hydrogen bonds (MAPSHB), takes into account 21 structural, chemical and sequence features and their interaction effects and effectively classifies each hydrogen bond in a protein as a short or normal hydrogen bond. The MAPSHB model reveals that the type of the donor amino acid plays a major role in determining the class of a hydrogen bond and that the side chain Tyr-Asp pair demonstrates a significant probability of forming an SHB. Combining electronic structure calculations and energy decomposition analysis, we elucidate how the interplay of competing intermolecular interactions stabilizes the Tyr-Asp SHBs more than other commonly observed combinations of amino acid side chains. The MAPSHB model, which is freely available on our web server, allows one to accurately and efficiently predict the presence of SHBs given a protein structure with moderate or low resolution and will facilitate the experimental and computational refinement of protein structures.
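The geometric definition used above (donor and acceptor heteroatoms within 2.7 Å) can be illustrated with a minimal Python sketch. This is not the MAPSHB model, which is a boosting-based classifier over 21 features; it only labels a hydrogen bond from the donor-acceptor distance in an atomic-resolution structure. The function names, the inclusive treatment of the 2.7 Å cutoff, and the example coordinates are assumptions made for illustration, and the atom pair is assumed to have already been identified as a hydrogen bond.

```python
import math

# Cutoff taken from the abstract: heteroatoms "within 2.7 Å".
# Treating the boundary as inclusive (<=) is an assumption.
SHB_CUTOFF_ANGSTROM = 2.7


def heavy_atom_distance(donor_xyz, acceptor_xyz):
    """Euclidean distance (in Å) between donor and acceptor heteroatoms."""
    return math.dist(donor_xyz, acceptor_xyz)


def classify_hbond(donor_xyz, acceptor_xyz, cutoff=SHB_CUTOFF_ANGSTROM):
    """Label an already-identified hydrogen bond as 'short' or 'normal'.

    Hypothetical helper: it applies only the distance criterion and
    does not check whether the pair forms a hydrogen bond at all.
    """
    distance = heavy_atom_distance(donor_xyz, acceptor_xyz)
    return "short" if distance <= cutoff else "normal"


if __name__ == "__main__":
    # Made-up coordinates (Å) for a Tyr hydroxyl O and an Asp carboxylate O.
    tyr_oh = (0.0, 0.0, 0.0)
    asp_od1 = (2.5, 0.0, 0.0)
    print(classify_hbond(tyr_oh, asp_od1))
```

A pure distance test like this is only reliable when atomic positions are well determined, which is why the abstract motivates a learned model for structures at moderate or low resolution.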

Journal Name:
Scientific Reports
Nature Publishing Group
Sponsoring Org:
National Science Foundation
More Like this
  1. Hydrogen bonds (HBs) play an essential role in the structure and catalytic action of enzymes, but a complete understanding of HBs in proteins challenges the resolution of modern structural (i.e., X-ray diffraction) techniques and mandates computationally demanding electronic structure methods from correlated wavefunction theory for predictive accuracy. Numerous amino acid sidechains contain functional groups (e.g., hydroxyls in Ser/Thr or Tyr and amides in Asn/Gln) that can act as either HB acceptors or donors (HBA/HBD) and even form simultaneous, ambifunctional HB interactions. To understand the relative energetic benefit of each interaction, we characterize the potential energy surfaces of representative model systems with accurate coupled cluster theory calculations. To reveal the relationship of these energetics to the balance of these interactions in proteins, we curate a set of 4000 HBs, of which >500 are ambifunctional HBs, in high-resolution protein structures. We show that our model systems accurately predict the favored HB structural properties. Differences are apparent in HBA/HBD preference for aromatic Tyr versus aliphatic Ser/Thr hydroxyls because Tyr forms significantly stronger O–H⋯O HBs than N–H⋯O HBs in contrast to comparable strengths of the two for Ser/Thr. Despite this residue-specific distinction, all models of residue pairs indicate an energetic benefit for simultaneous HBA and HBD interactions in an ambifunctional HB. Although the stabilization is less than the additive maximum due both to geometric constraints and many-body electronic effects, a wide range of ambifunctional HB geometries are more favorable than any single HB interaction.
  2. The three-dimensional architecture of biomolecules often creates specialized structural elements, notably short hydrogen bonds that have donor–acceptor separations below 2.7 Å. In this work, we statistically analyze 1663 high-resolution biomolecular structures from the Protein Data Bank and demonstrate that short hydrogen bonds are prevalent in proteins, protein–ligand complexes and nucleic acids. From these biological macromolecules, we characterize the preferred location, connectivity and amino acid composition in short hydrogen bonds and hydrogen bond networks, and assess their possible functional importance. Using electronic structure calculations, we further uncover how the interplay of the structural and chemical features determines the proton potential energy surfaces and proton sharing conditions in biological short hydrogen bonds.
  3. Disordered proline-rich motifs are common across the proteomes of many species and are often involved in protein-protein interactions. Proline is a unique amino acid due to the covalent bond between the backbone nitrogen and the proline side chain. The resulting five-membered ring allows proline to sample the cis state about its peptide bond, which other residues cannot do as readily. Because proline-rich disordered sequences exist as ensembles that likely include structures with the proline peptide bond in cis, a robust methodology to accurately account for these conformations in the overall ensemble is crucial. Observing the cis conformations of proline in a disordered sequence is challenging both experimentally and computationally. Nitrogen-hydrogen NMR spectroscopy cannot directly observe proline residues, which lack an amide bond, and computational methods struggle to overcome the large kinetic barrier between the cis and trans states, since isomerization usually occurs on the order of seconds. In the current work, Gaussian accelerated molecular dynamics was used to overcome this free energy barrier and simulate proline isomerization in a tetrapeptide (KPTP) and in the 12-residue proline-rich SH3 binding peptide, ArkA. We found that Gaussian accelerated molecular dynamics, when combined with a lowered peptide bond dihedral angle potential energy barrier (15 kcal/mol), allowed sufficient sampling of the proline cis and trans states on a microsecond timescale. All ArkA prolines spend a significant fraction of time in cis, leading to a more compact ensemble with less polyproline II helix structure than an ArkA ensemble with all peptide bonds in trans. The ensemble containing cis prolines also matches more closely to in vitro circular dichroism data than the all-trans ensemble. The ability of the ArkA prolines to isomerize likely affects the peptide’s ability to bind its partner SH3 domain, and should be studied further. This is the first molecular dynamics simulation study of proline isomerization in a biologically relevant proline-rich sequence that we know of, and a similar protocol could be applied to study multi-proline isomerization in other proline-containing proteins to improve conformational diversity and agreement with in vitro data.
  4. NMR-assisted crystallography (the integrated application of solid-state NMR, X-ray crystallography, and first-principles computational chemistry) holds significant promise for mechanistic enzymology: by providing atomic-resolution characterization of stable intermediates in enzyme active sites, including hydrogen atom locations and tautomeric equilibria, NMR crystallography offers insight into both structure and chemical dynamics. Here, this integrated approach is used to characterize the tryptophan synthase α-aminoacrylate intermediate, a defining species for pyridoxal-5′-phosphate–dependent enzymes that catalyze β-elimination and replacement reactions. For this intermediate, NMR-assisted crystallography is able to identify the protonation states of the ionizable sites on the cofactor, substrate, and catalytic side chains as well as the location and orientation of crystallographic waters within the active site. Most notable is the water molecule immediately adjacent to the substrate β-carbon, which serves as a hydrogen bond donor to the ε-amino group of the acid–base catalytic residue βLys87. From this analysis, a detailed three-dimensional picture of structure and reactivity emerges, highlighting the fate of the L-serine hydroxyl leaving group and the reaction pathway back to the preceding transition state. Reaction of the α-aminoacrylate intermediate with benzimidazole, an isostere of the natural substrate indole, shows benzimidazole bound in the active site and poised for, but unable to initiate, the subsequent bond formation step. When modeled into the benzimidazole position, indole is positioned with C3 in contact with the α-aminoacrylate Cβ and aligned for nucleophilic attack. Here, the chemically detailed, three-dimensional structure from NMR-assisted crystallography is key to understanding why benzimidazole does not react, while indole does.
  5. EDITOR-IN-CHIEF Dokholyan, Nikolay V. ; ASSOCIATE EDITORS: Bahar, Ivet ; Feig, Michael ; Varadarajan, Raghavan ; Wodak, Shoshana ; Moult, John Center (Ed.)
    Prediction of side chain conformations of amino acids in proteins (also termed “packing”) is an important and challenging part of protein structure prediction with many interesting applications in protein design. A variety of methods for packing have been developed but more accurate ones are still needed. Machine learning (ML) methods have recently become a powerful tool for solving various problems in diverse areas of science, including structural biology. In this study, we evaluate the potential of deep neural networks (DNNs) for prediction of amino acid side chain conformations. We formulate the problem as image-to-image transformation and train a U-net style DNN to solve the problem. We show that our method outperforms physics-based methods by a significant margin: reconstruction RMSDs for most amino acids are about 20% smaller than those of SCWRL4 and Rosetta Packer, with RMSDs for the bulky hydrophobic amino acids Phe, Tyr, and Trp being up to 50% smaller.