Title: Adaptive Stochastic Optimization to Improve Protein Conformation Sampling
We have long known that characterizing protein structures is key to understanding protein function. Computational approaches have largely addressed a narrow formulation of the problem, seeking to compute one native structure from an amino-acid sequence. AlphaFold2 now promises to reveal a high-quality native structure for possibly many proteins. However, researchers have argued over the years for broadening our view to account for the multiplicity of native structures. We now know that many protein molecules switch between different structures to regulate interactions with molecular partners in the cell. Elucidating such structures de novo is exceptionally difficult, as it requires exploring a possibly very large structure space in search of competing, near-optimal structures. Here we report on a novel stochastic optimization method capable of revealing very different structures for a given protein from knowledge of its amino-acid sequence. The method leverages evolutionary search techniques and adapts its search to balance exploration and exploitation under a computational budget. In addition to demonstrating the utility of this method for identifying multiple native structures, we provide a benchmark dataset for researchers to continue work on this problem.
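The general idea of a budgeted evolutionary search that adapts between exploration and exploitation can be illustrated with a minimal sketch. This is not the authors' method: the energy function (a Rastrigin test function standing in for a protein energy model), the 1/5 success rule used to adapt the mutation step, and every name and parameter below are assumptions chosen only for illustration.

```python
import math
import random


def toy_energy(x):
    """Hypothetical stand-in for an energy function: the Rastrigin test
    function, which has many competing local minima."""
    return 10.0 * len(x) + sum(
        xi * xi - 10.0 * math.cos(2.0 * math.pi * xi) for xi in x
    )


def adaptive_search(energy, dim=4, pop_size=20, budget=5000, seed=0):
    """Toy elitist evolutionary search under a fixed evaluation budget.

    The mutation step sigma is adapted with the classic 1/5 success rule
    (an assumption here, not taken from the abstract): widen the step when
    many children improve on their parents, shrink it otherwise.
    """
    rng = random.Random(seed)
    pop = [[rng.uniform(-5.0, 5.0) for _ in range(dim)] for _ in range(pop_size)]
    scores = [energy(p) for p in pop]
    evals = pop_size  # every energy call counts against the budget
    sigma = 1.0

    # Run only as many full generations as the remaining budget allows.
    while evals + pop_size <= budget:
        successes = 0
        children = []
        for parent, parent_score in zip(pop, scores):
            child = [xi + rng.gauss(0.0, sigma) for xi in parent]
            child_score = energy(child)
            evals += 1
            if child_score < parent_score:
                successes += 1
            children.append((child_score, child))

        # 1/5 success rule: explore more when improving often, else exploit.
        if successes / pop_size > 0.2:
            sigma *= 1.22
        else:
            sigma *= 0.82

        # Elitist truncation selection over parents and children together.
        merged = sorted(list(zip(scores, pop)) + children, key=lambda t: t[0])
        merged = merged[:pop_size]
        scores = [s for s, _ in merged]
        pop = [p for _, p in merged]

    return pop, scores, evals


if __name__ == "__main__":
    pop, scores, evals = adaptive_search(toy_energy)
    print("evaluations used:", evals)
    print("best energy found:", min(scores))
```

The sketch returns the whole final population rather than a single best point, loosely mirroring the goal of reporting several low-energy candidates instead of one structure. A real conformation sampler would replace the toy energy, the real-valued encoding, and the selection scheme with domain-specific choices.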
Award ID(s):
1900061
PAR ID:
10343790
Author(s) / Creator(s):
; ; ;
Date Published:
Journal Name:
IEEE/ACM Transactions on Computational Biology and Bioinformatics
ISSN:
1545-5963
Page Range / eLocation ID:
1 to 1
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. The mutation space of spatially conserved (MSSC) amino acid residues is a protein structural quantity developed and described in this work. The MSSC quantifies how many mutations and which different mutations, i.e., the mutation space, occur in each amino acid site in a protein. The MSSC calculates the mutation space of amino acids in a target protein from the spatially conserved residues in a group of multiple protein structures. Spatially conserved amino acid residues are identified based on their relative positions in the protein structure. The MSSC examines each residue in a target protein, compares it to the residues present in the same relative position in other protein structures, and uses physicochemical criteria of mutations found in each conserved spatial site to quantify the mutation space of each amino acid in the target protein. The MSSC is analogous to scoring each site in a multiple sequence alignment but in three-dimensional space considering the spatial location of residues instead of solely the order in which they appear in a protein sequence. MSSC analysis was performed on example cases, and it reproduces the well-known observation that, regardless of secondary structure, solvent-exposed residues are more likely to be mutated than internal ones. The MSSC code is available on GitHub: “https://github.com/Cantu-Research-Group/Mutation_Space”. 
  2. Recent advances in Artificial Intelligence and Machine Learning (e.g., AlphaFold, RosettaFold, and ESMFold) enable prediction of three-dimensional (3D) protein structures from amino acid sequences alone at accuracies comparable to lower-resolution experimental methods. These tools have been employed to predict structures across entire proteomes and the results of large-scale metagenomic sequence studies, yielding an exponential increase in available biomolecular 3D structural information. Given the enormous volume of this newly computed biostructure data, there is an urgent need for robust tools to manage, search, cluster, and visualize large collections of structures. Equally important is the capability to efficiently summarize and visualize metadata, biological/biochemical annotations, and structural features, particularly when working with vast numbers of protein structures of both experimental origin from the Protein Data Bank (PDB) and computationally-predicted models. Moreover, researchers require advanced visualization techniques that support interactive exploration of multiple sequences and structural alignments. This paper introduces a suite of tools provided on the RCSB PDB research-focused web portal RCSB.org, tailor-made for efficient management, search, organization, and visualization of this burgeoning corpus of 3D macromolecular structure data. 
  3. ResNet and, more recently, AlphaFold2 have demonstrated that deep neural networks can now predict a tertiary structure of a given protein amino-acid sequence with high accuracy. This seminal development will allow molecular biology researchers to advance various studies linking sequence, structure, and function. Many studies will undoubtedly focus on the impact of sequence mutations on stability, fold, and function. In this paper, we evaluate the ability of AlphaFold2 to predict accurate tertiary structures of wildtype and mutated sequences of protein molecules. We do so on a benchmark dataset in mutation modeling studies. Our empirical evaluation utilizes global and local structure analyses and yields several interesting observations. It shows, for instance, that AlphaFold2 performs similarly on wildtype and variant sequences. The placement of the main chain of a protein molecule is highly accurate. However, while AlphaFold2 reports similar confidence in its predictions over wildtype and variant sequences, its performance on placements of the side chains suffers in comparison to main-chain predictions. The analysis overall supports the premise that AlphaFold2-predicted structures can be utilized in further downstream tasks, but that further refinement of these structures may be necessary. 
  4. Deep learning research, from ResNet to AlphaFold2, convincingly shows that deep learning can predict the native conformation of a given protein sequence with high accuracy. Accounting for the plasticity of protein molecules remains challenging, and powerful algorithms are needed to sample the conformation space of a given amino-acid sequence. In the complex and high-dimensional energy surface that accompanies this space, it is critical to explore a broad range of areas. In this paper, we present a novel evolutionary algorithm that guides its optimization process with a memory of the explored conformation space, so that it can avoid searching already explored regions and search in the unexplored regions. The algorithm periodically consults an evolving map that stores already sampled non-redundant conformations to enhance exploration during selection. Evaluation on diverse datasets shows superior performance of the algorithm over the state-of-the-art algorithms. 
  5. The relation between amino acid (AA) sequence and biologically active conformation controls the process of polypeptide chains folding into three-dimensional (3d) protein structures. Recent gains in the resolution achieved by cryo-electron microscopy, coupled with improvements in computational methodologies, have accelerated the analysis of structures and properties of proteins. However, the detailed interaction between AAs has not been fully elucidated. Herein, we present a de novo method to evaluate inter-amino acid interactions based on the concept of accurately evaluating the amino acid bond pairs (AABP). The results obtained enabled the identification of a complex 3d long-range interconnected network of interacting AAs in proteins. The method is applied to the receptor binding domain (RBD) of the SARS-CoV-2 spike protein. We show that although nearest-neighbor AAs in the primary sequence have large AABP, other nonlocal AAs make a substantial contribution to AABP, with significant participation of both covalent and hydrogen bonding. Detailed analysis of AABP in RBD reveals the pivotal role they play in sequence conservation, with profound implications for residue mutations and for therapeutic drug design. This approach could be easily applied to many other proteins of biomedical interest in life sciences.