Title: Independent SE(3)-Equivariant Models for End-to-End Rigid Protein Docking
Protein complex formation is a central problem in biology, being involved in most of the cell's processes, and essential for applications, e.g. drug design or protein engineering. We tackle rigid body protein-protein docking, i.e., computationally predicting the 3D structure of a protein-protein complex from the individual unbound structures, assuming no conformational change within the proteins happens during binding. We design a novel pairwise-independent SE(3)-equivariant graph matching network to predict the rotation and translation to place one of the proteins at the right docked position relative to the second protein. We mathematically guarantee a basic principle: the predicted complex is always identical regardless of the initial locations and orientations of the two structures. Our model, named EquiDock, approximates the binding pockets and predicts the docking poses using keypoint matching and alignment, achieved through optimal transport and a differentiable Kabsch algorithm. Empirically, we achieve significant running time improvements and often outperform existing docking software despite not relying on heavy candidate sampling, structure refinement, or templates.
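The alignment step named in the abstract can be illustrated with the classical Kabsch algorithm, which recovers the rotation and translation that best superimpose one set of matched 3D points onto another. The sketch below is a plain NumPy version written for illustration only; it is not EquiDock's code. The function names (`kabsch`, `rotation_about_axis`) and the synthetic keypoints are assumptions of this example, and the paper's variant is differentiable and operates on keypoints predicted by the network rather than on known correspondences.

```python
import numpy as np


def kabsch(P, Q):
    """Rigid alignment of matched point sets.

    Returns (R, t) minimizing sum_i ||R @ P[i] + t - Q[i]||^2,
    where P and Q are (N, 3) arrays of corresponding points.
    """
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    p_mean = P.mean(axis=0)
    q_mean = Q.mean(axis=0)
    # 3x3 cross-covariance of the centered point sets.
    H = (P - p_mean).T @ (Q - q_mean)
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = q_mean - R @ p_mean
    return R, t


def rotation_about_axis(axis, angle):
    """Rotation matrix from an axis and an angle (Rodrigues' formula)."""
    axis = np.asarray(axis, dtype=float)
    axis = axis / np.linalg.norm(axis)
    K = np.array([[0.0, -axis[2], axis[1]],
                  [axis[2], 0.0, -axis[0]],
                  [-axis[1], axis[0], 0.0]])
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)


# Hypothetical keypoints: move them by a known rigid transform, then recover it.
rng = np.random.default_rng(0)
P = rng.normal(size=(8, 3))
R_true = rotation_about_axis([1.0, 2.0, 3.0], 0.9)
t_true = np.array([4.0, -2.0, 0.5])
Q = P @ R_true.T + t_true

R_est, t_est = kabsch(P, Q)
rmsd = np.sqrt(np.mean(np.sum((P @ R_est.T + t_est - Q) ** 2, axis=1)))
print("RMSD after alignment:", rmsd)
```

With exact correspondences and no noise, the recovered transform should match the one used to generate `Q` up to floating-point error. In a docking setting the two point sets would instead be the matched keypoints of the ligand and receptor, so the quality of the pose depends on how well those keypoints are predicted.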
Award ID(s):
1918839
PAR ID:
10320123
Author(s) / Creator(s):
; ; ; ; ; ;
Date Published:
Journal Name:
International Conference on Learning Representations
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract Comparative docking is based on experimentally determined structures of protein‐protein complexes (templates), following the paradigm that proteins with similar sequences and/or structures form similar complexes. Modeling utilizing structure similarity of target monomers to template complexes significantly expands structural coverage of the interactome. Template‐based docking by structure alignment can be performed for the entire structures or by aligning targets to the bound interfaces of the experimentally determined complexes. Systematic benchmarking of docking protocols based on full and interface structure alignment showed that both protocols perform similarly, with a top 1 docking success rate of 26%. However, in terms of the models' quality, the interface‐based docking performed marginally better. The interface‐based docking is preferable when one would suspect a significant conformational change in the full protein structure upon binding, for example, a rearrangement of the domains in multidomain proteins. Importantly, if the same structure is selected as the top template by both full and interface alignment, the docking success rate increases 2‐fold for both top 1 and top 10 predictions. Matching structural annotations of the target and template proteins for template detection, as a computationally less expensive alternative to structural alignment, did not improve the docking performance. Sophisticated remote sequence homology detection added templates to the pool of those identified by structure‐based alignment, suggesting that for practical docking, the combination of the structure alignment protocols and the remote sequence homology detection may be useful in order to avoid potential flaws in generation of the structural templates library.
  2. ABSTRACT Accurate prediction of protein–peptide complex structures plays a critical role in structure‐based drug design, including antibody design. Most peptide‐docking benchmark studies were conducted using crystal structures of protein–peptide complexes; as such, the performance of the current peptide docking tools in the practical setting is unknown. Here, the practical setting implies there are no crystal or other experimental structures for the complex, nor for the receptor and peptide. In this work, we have developed a practical docking protocol that incorporated two famous machine learning models, AlphaFold 2 for structural prediction and ANI‐2x for ab initio potential prediction, to achieve a high success rate in modeling protein–peptide complex structures. The docking protocol consists of three major stages. In the first stage, the 3D structure of the receptor is predicted by AlphaFold 2 using the monomer mode, and that of the peptide is predicted by AlphaFold 2 using the multimer mode. We found that it is essential to include the receptor information to generate a high‐quality 3D structure of the peptide. In the second stage, rigid protein–peptide docking is performed using ZDOCK software. In the last stage, the top 10 docking poses are relaxed and refined by ANI‐2x in conjunction with our in‐house geometry optimization algorithm—conjugate gradient with backtracking line search (CG‐BS). CG‐BS was developed by us to more efficiently perform geometry optimization, which takes the potential and force directly from ANI‐2x machine learning models. The docking protocol achieved a very encouraging performance for a set of 62 very challenging protein–peptide systems which had an overall success rate of 34% if only the top 1 docking poses were considered. This success rate increased to 45% if the top 3 docking poses were considered. 
It is emphasized that this encouraging protein–peptide docking performance was achieved without using any crystal or experimental structures. 
  3. Abstract Proteins and nucleic acids are key components in many processes in living cells, and interactions between proteins and nucleic acids are often crucial pathway components. In many cases, large flexibility of proteins as they interact with nucleic acids is key to their function. To understand the mechanisms of these processes, it is necessary to consider the 3D atomic structures of such protein–nucleic acid complexes. When such structures are not yet experimentally determined, protein docking can be used to computationally generate useful structure models. However, such docking has long had the limitation that the consideration of flexibility is usually limited to small movements or to small structures. We previously developed a method of flexible protein docking which could model ordered proteins which undergo large‐scale conformational changes, which we also showed was compatible with nucleic acids. Here, we elaborate on the ability of that pipeline, Flex‐LZerD, to model specifically interactions between proteins and nucleic acids, and demonstrate that Flex‐LZerD can model more interactions and types of conformational change than previously shown. 
  4. Abstract Structural information of protein–protein interactions is essential for characterization of life processes at the molecular level. While a small fraction of known protein interactions has experimentally determined structures, computational modeling of protein complexes (protein docking) has to fill the gap. The Dockground resource (http://dockground.compbio.ku.edu) provides a collection of datasets for the development and testing of protein docking techniques. Currently, Dockground contains datasets for the bound and the unbound (experimentally determined and simulated) protein structures, model–model complexes, docking decoys of experimentally determined and modeled proteins, and templates for comparative docking. The Dockground bound proteins dataset is a core set, from which other Dockground datasets are generated. It is devised as a relational PostgreSQL database containing information on experimentally determined protein–protein complexes. This report on the Dockground resource describes current status of the datasets, new automated update procedures and further development of the core datasets. We also present a new Dockground interactive web interface, which allows search by various parameters, such as release date, multimeric state, complex type, structure resolution, and so on, visualization of the search results with a number of customizable parameters, as well as downloadable datasets with predefined levels of sequence and structure redundancy.
  5. Skolnick, Jeffrey (Ed.)
    Systematically discovering protein-ligand interactions across the entire human and pathogen genomes is critical in chemical genomics, protein function prediction, drug discovery, and many other areas. However, more than 90% of gene families remain “dark”—i.e., their small-molecule ligands are undiscovered due to experimental limitations or human/historical biases. Existing computational approaches typically fail when the dark protein differs from those with known ligands. To address this challenge, we have developed a deep learning framework, called PortalCG, which consists of four novel components: (i) a 3-dimensional ligand binding site enhanced sequence pre-training strategy to encode the evolutionary links between ligand-binding sites across gene families; (ii) an end-to-end pretraining-fine-tuning strategy to reduce the impact of inaccuracy of predicted structures on function predictions by recognizing the sequence-structure-function paradigm; (iii) a new out-of-cluster meta-learning algorithm that extracts and accumulates information learned from predicting ligands of distinct gene families (meta-data) and applies the meta-data to a dark gene family; and (iv) a stress model selection step, using different gene families in the test data from those in the training and development data sets to facilitate model deployment in a real-world scenario. In extensive and rigorous benchmark experiments, PortalCG considerably outperformed state-of-the-art techniques of machine learning and protein-ligand docking when applied to dark gene families, and demonstrated its generalization power for target identifications and compound screenings under out-of-distribution (OOD) scenarios. Furthermore, in an external validation for the multi-target compound screening, the performance of PortalCG surpassed the rational design from medicinal chemists. 
Our results also suggest that a differentiable sequence-structure-function deep learning framework, where protein structural information serves as an intermediate layer, could be superior to conventional methodology where predicted protein structures were used for the compound screening. We applied PortalCG to two case studies to exemplify its potential in drug discovery: designing selective dual-antagonists of dopamine receptors for the treatment of opioid use disorder (OUD), and illuminating the understudied human genome for target diseases that do not yet have effective and safe therapeutics. Our results suggested that PortalCG is a viable solution to the OOD problem in exploring understudied regions of protein functional space. 