Title: Analysis of AlphaFold2 for Modeling Structures of Wildtype and Variant Protein Sequences

ResNet and, more recently, AlphaFold2 have demonstrated that deep neural networks can now predict the tertiary structure of a protein from its amino-acid sequence with high accuracy. This seminal development will allow molecular biology researchers to advance various studies linking sequence, structure, and function. Many studies will undoubtedly focus on the impact of sequence mutations on stability, fold, and function. In this paper, we evaluate the ability of AlphaFold2 to predict accurate tertiary structures of wildtype and mutated sequences of protein molecules. We do so on a benchmark dataset used in mutation modeling studies. Our empirical evaluation utilizes global and local structure analyses and yields several interesting observations. It shows, for instance, that AlphaFold2 performs similarly on wildtype and variant sequences, and that the placement of the main chain of a protein molecule is highly accurate. However, while AlphaFold2 reports similar confidence in its predictions over wildtype and variant sequences, its placement of side chains is less accurate than its main-chain predictions. Overall, the analysis supports the premise that AlphaFold2-predicted structures can be utilized in downstream tasks, but that further refinement of these structures may be necessary.

 
Award ID(s):
1763233
PAR ID:
10342795
Author(s) / Creator(s):
; ;
Date Published:
Journal Name:
EPiC Series in Computing
Volume:
83
ISSN:
2398-7340
Page Range / eLocation ID:
53 to 39
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Motivation

    Millions of protein sequences have been generated by numerous genome and transcriptome sequencing projects. However, experimentally determining the function of these proteins is still a time-consuming, low-throughput, and expensive process, leading to a large protein sequence-function gap. It is therefore important to develop computational methods that accurately predict protein function to fill the gap. Although many methods have been developed that use protein sequences as input to predict function, far fewer methods leverage protein structures, because accurate structures were lacking for most proteins until recently.

    Results

    We developed TransFun—a method using a transformer-based protein language model and 3D-equivariant graph neural networks to distill information from both protein sequences and structures to predict protein function. It extracts feature embeddings from protein sequences using a pre-trained protein language model (ESM) via transfer learning and combines them with 3D structures of proteins predicted by AlphaFold2 through equivariant graph neural networks. Benchmarked on the CAFA3 test dataset and a new test dataset, TransFun outperforms several state-of-the-art methods, indicating that the language model and 3D-equivariant graph neural networks are effective methods to leverage protein sequences and structures to improve protein function prediction. Combining TransFun predictions and sequence similarity-based predictions can further increase prediction accuracy.

    Availability and implementation

    The source code of TransFun is available at https://github.com/jianlin-cheng/TransFun.

     
  2. Significant research on deep neural networks, culminating in AlphaFold2, convincingly shows that deep learning can predict the native structure of a given protein sequence with high accuracy. In contrast, work on deep learning frameworks that can account for the structural plasticity of protein molecules remains in its infancy. Many researchers are now investigating deep generative models to explore the structure space of a protein. Current models largely use 2D convolution, leveraging representations of protein structures as contact maps or distance matrices. The goal is exclusively to generate protein-like, sequence-agnostic tertiary structures, but no rigorous metrics are utilized to convincingly make this case. This paper makes several contributions. It builds on momentum in graph representation learning and formalizes a protein tertiary structure as a contact graph. It demonstrates that graph representation learning outperforms models based on image convolution. This work also equips graph-based deep latent variable models with the ability to learn from experimentally-available tertiary structures of proteins of varying lengths. The resulting models are shown to outperform state-of-the-art ones on rigorous metrics that quantify both local and distal patterns in physically-realistic protein structures. We hope this work will spur further research in deep generative models for obtaining a broader view of the structure space of a protein molecule.
  3. Predicting protein side-chains is important for both protein structure prediction and protein design. Side-chain modeling tools such as SCWRL4 have become some of the most widely used of their type due to their fast and highly accurate predictions. Motivated by the recent success of AlphaFold2 in CASP14, our group adapted a 3D equivariant neural network architecture to predict protein side-chain conformations, specifically within a protein-protein interface, a problem that has not been fully addressed by AlphaFold2.
  4. Frappier, Lori (Ed.)
    ABSTRACT Two new structures of the N-terminal domain of the main replication protein, NS1, of human parvovirus B19 (B19V) are presented here. This domain (NS1-nuc) plays an important role in the “rolling hairpin” replication of the single-stranded B19V DNA genome, recognizing origin of replication sequences in double-stranded DNA, and cleaving (i.e., nicking) single-stranded DNA at a nearby site known as the terminal resolution site (trs). The three-dimensional structure of NS1-nuc is well conserved between the two forms, as well as with a previously solved structure of a sequence variant of the same domain; however, it is shown here at a significantly higher resolution (2.4 Å). Using structures of NS1-nuc homologues bound to single- and double-stranded DNA, models for DNA recognition and nicking by B19V NS1-nuc are presented that predict residues important for DNA cleavage and for sequence-specific recognition at the viral origin of replication. IMPORTANCE The high-resolution structure of the DNA binding and cleavage domain of the main replicative protein, NS1, from the human-pathogenic virus human parvovirus B19 is presented here. Included also are predictions of how the protein recognizes important sequences in the viral DNA which are required for viral replication. These predictions can be used to further investigate the function of this protein, as well as to predict the effects on viral viability due to mutations in the viral protein and viral DNA sequences. Finally, the high-resolution structure facilitates structure-guided drug design efforts to develop antiviral compounds against this important human pathogen. 
  5. Since the 14th Critical Assessment of Techniques for Protein Structure Prediction (CASP14), AlphaFold2 has become the standard method for protein tertiary structure prediction. One remaining challenge is to further improve its predictions. We developed a new version of the MULTICOM system that samples diverse multiple sequence alignments (MSAs) and structural templates to improve the input from which AlphaFold2 generates structural models. The models are then ranked by both pairwise model similarity and the AlphaFold2 self-reported model quality score. The top-ranked models are refined by a novel structure alignment-based refinement method powered by Foldseek. Moreover, for a monomer target that is a subunit of a protein assembly (complex), MULTICOM integrates tertiary and quaternary structure predictions to account for tertiary structural changes induced by protein-protein interaction. The system participated in the tertiary structure prediction category of the 2022 CASP15 experiment. Our server predictor MULTICOM_refine ranked 3rd among 47 CASP15 server predictors, and our human predictor MULTICOM ranked 7th among all 132 human and server predictors. The average GDT-TS score and TM-score of the first structural models that MULTICOM_refine predicted for 94 CASP15 domains are ~0.80 and ~0.92, which are 9.6% and 8.2% higher than the ~0.73 and 0.85 of the standard AlphaFold2 predictor, respectively.
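
The percentage gains quoted for MULTICOM_refine in item 5 appear to be relative (not absolute) improvements over the standard AlphaFold2 averages. A minimal Python check, assuming the rounded averages from the abstract are used as given; the function name is illustrative and does not come from the MULTICOM code:

```python
def relative_gain_percent(new: float, baseline: float) -> float:
    """Relative improvement of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


# Rounded averages quoted in the abstract (assumed exact for this check):
# MULTICOM_refine vs. the standard AlphaFold2 predictor on 94 CASP15 domains.
gdt_ts_gain = relative_gain_percent(0.80, 0.73)
tm_score_gain = relative_gain_percent(0.92, 0.85)

print(f"GDT-TS relative gain:   {gdt_ts_gain:.1f}%")
print(f"TM-score relative gain: {tm_score_gain:.1f}%")
```

Under this reading, the rounded inputs reproduce the quoted 9.6% and 8.2% to one decimal place; the unrounded averages in the paper may differ slightly.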

     