Title: Accurate multistage prediction of protein crystallization propensity using deep-cascade forest with sequence-based features
Abstract X-ray crystallography is the major approach for determining atomic-level protein structures. Because not all proteins can be easily crystallized, accurate prediction of protein crystallization propensity provides critical help in guiding experimental design and improving the success rate of X-ray crystallography experiments. This study has developed a new machine-learning-based pipeline that uses a newly developed deep-cascade forest (DCF) model with multiple types of sequence-based features to predict protein crystallization propensity. Based on the developed pipeline, two new protein crystallization propensity predictors, denoted as DCFCrystal and MDCFCrystal, have been implemented. DCFCrystal is a multistage predictor that can estimate the success propensities of the three individual steps (production of protein material, purification and production of crystals) in the protein crystallization process. MDCFCrystal is a single-stage predictor that aims to estimate the probability that a protein will pass through the entire crystallization process. Moreover, DCFCrystal is designed for general proteins, whereas MDCFCrystal is specially designed for membrane proteins, which are notoriously difficult to crystallize. DCFCrystal and MDCFCrystal were separately tested on two benchmark datasets consisting of 12 289 and 950 proteins, respectively, with known crystallization results from various experimental records. The experimental results demonstrated that DCFCrystal and MDCFCrystal increased the value of the Matthews correlation coefficient by 199.7% and 77.8%, respectively, compared to the best of other state-of-the-art protein crystallization propensity predictors.
Detailed analyses show that the major advantages of DCFCrystal and MDCFCrystal lie in the efficiency of the DCF model and the sensitivity of the sequence-based features used, especially the newly designed pseudo-predicted hybrid solvent accessibility (PsePHSA) feature, which improves crystallization recognition by incorporating sequence-order information with solvent accessibility of residues. Meanwhile, the new crystal-dataset constructions help to train the models with more comprehensive crystallization knowledge.
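
The abstract reports gains as relative increases in the Matthews correlation coefficient (MCC). As a reading aid, the following minimal sketch shows how MCC is computed from confusion-matrix counts and how a relative percentage increase is derived from two MCC values. The function names and the example numbers are assumptions made for illustration; they are not taken from the paper, and this is not the authors' evaluation code.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion-matrix counts.

    Returns 0.0 when any marginal total is zero (a common convention,
    adopted here as an assumption).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


def relative_increase_percent(new, baseline):
    """Relative increase of `new` over `baseline`, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (new - baseline) / abs(baseline) * 100.0


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    mcc = matthews_corrcoef(tp=40, tn=45, fp=5, fn=10)
    print(f"MCC = {mcc:.3f}")
    # Hypothetical MCC values: tripling a baseline is a 200% relative increase.
    print(f"{relative_increase_percent(0.6, 0.2):.1f}%")
```

A relative increase close to 200%, as quoted for DCFCrystal, therefore means the reported MCC is roughly three times that of the best competing predictor, not that it gained 2.0 in absolute terms.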
Award ID(s):
1901191
NSF-PAR ID:
10167323
Journal Name:
Briefings in Bioinformatics
ISSN:
1467-5463
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    Intrinsically disordered regions lack stable structure in their native conformation but are nevertheless functional and highly abundant, particularly in Eukaryotes. Disordered moonlighting regions (DMRs) are intrinsically disordered regions that carry out multiple functions. DMRs are different from moonlighting proteins that could be structured and that are annotated at the whole‐protein level. DMRs cannot be identified by current predictors of functions of disorder that focus on specific functions rather than multifunctional regions. We conceptualized, designed and empirically assessed a first‐of‐its‐kind sequence‐based predictor of DMRs, DMRpred. This computational tool outputs the propensity for being in a DMR for each residue in an input protein sequence. We developed novel amino acid indices that quantify propensities for functions relevant to DMRs and used evolutionary conservation, putative solvent accessibility and intrinsic disorder derived from the input sequence to build a rich profile that is suitable to accurately predict DMRs. We processed this profile to derive innovative features that we input into a Random Forest model to generate the predictions. Empirical assessment shows that DMRpred generates accurate predictions with area under receiver operating characteristic curve = 0.86 and accuracy = 82%. These results are significantly better than the closest alternative approaches that rely on sequence alignment, evolutionary conservation and putative disorder and disorder functions. Analysis of abundance of putative DMRs in the human proteome reveals that as many as 25% of proteins may have long (>30 residues) DMRs. A webserver implementation of DMRpred is available at http://biomine.cs.vcu.edu/servers/DMRpred/

     
  2. Abstract

    In this article, we report 3D structure prediction results by two of our best server groups (“Zhang‐Server” and “QUARK”) in CASP14. These two servers were built based on the D‐I‐TASSER and D‐QUARK algorithms, which integrated four newly developed components into the classical protein folding pipelines, I‐TASSER and QUARK, respectively. The new components include: (a) a new multiple sequence alignment (MSA) collection tool, DeepMSA2, which is extended from the DeepMSA program; (b) a contact‐based domain boundary prediction algorithm, FUpred, to detect protein domain boundaries; (c) a residual convolutional neural network‐based method, DeepPotential, to predict multiple spatial restraints by co‐evolutionary features derived from the MSA; and (d) optimized spatial restraint energy potentials to guide the structure assembly simulations. For 37 FM targets, the average TM‐scores of the first models produced by D‐I‐TASSER and D‐QUARK were 96% and 112% higher than those constructed by I‐TASSER and QUARK, respectively. The data analysis indicates noticeable improvements produced by each of the four new components, especially for the newly added spatial restraints from DeepPotential and the well‐tuned force field that combines spatial restraints, threading templates, and generic knowledge‐based potentials. However, challenges still exist in the current pipelines. These include difficulties in modeling multi‐domain proteins due to low accuracy in inter‐domain distance prediction and modeling protein domains from oligomer complexes, as the co‐evolutionary analysis cannot distinguish inter‐chain and intra‐chain distances. Specifically tuning the deep learning‐based predictors for multi‐domain targets and protein complexes may be helpful to address these issues.

     
  3. Cryo-electron microscopy (cryo-EM) is a structural technique that has played a significant role in protein structure determination in recent years. Compared to the traditional methods of X-ray crystallography and NMR spectroscopy, cryo-EM is capable of producing images of much larger protein complexes. However, in some cases cryo-EM reconstructions are limited to medium resolution (~4–10 Å). In this resolution range, a cryo-EM density map can hardly be used to determine protein structure directly at atomic resolution, or even to trace the amino acid residue backbone; only the position and orientation of secondary structure elements (SSEs) such as α-helices and β-sheets are observable. Consequently, finding the mapping of the secondary structures of the modeled structure (SSEs-A) to those of the cryo-EM map (SSEs-C) is one of the primary concerns in cryo-EM modeling. To address this issue, this study proposes a novel automatic computational method to identify SSEs correspondence in three-dimensional (3D) space. First, the target sequence is modeled and highly reliable features are extracted from the generated 3D model and the map, so that the SSEs matching problem can be formulated as a 3D vector matching problem. Next, the 3D vector matching problem is transformed into a 3D graph matching problem. Finally, a similarity-based voting algorithm combined with the principle of least conflict (PLC) is developed to obtain the SSEs correspondence. To evaluate the accuracy of the method, a testing set of 25 experimental and simulated maps with a maximum of 65 SSEs is selected. Comparative studies are also conducted to demonstrate the superiority of the proposed method over some state-of-the-art techniques. The results demonstrate that the method is efficient, robust, and works well in the presence of errors in the predicted secondary structures of the cryo-EM images.
  4. Abstract

    In recent years, new protein engineering methods have produced more than a dozen symmetric, self‐assembling protein cages whose structures have been validated to match their design models with near‐atomic accuracy. However, many protein cage designs that are tested in the lab do not form the desired assembly, and improving the success rate of design has been a point of recent emphasis. Here we present two protein structures solved by X‐ray crystallography of designed protein oligomers that form two‐component cages with tetrahedral symmetry. To improve on the past tendency toward poorly soluble protein, we used a computational protocol that favors the formation of hydrogen‐bonding networks over exclusively hydrophobic interactions to stabilize the designed protein–protein interfaces. Preliminary characterization showed highly soluble expression, and solution studies indicated successful cage formation by both designed proteins. For one of the designs, a crystal structure confirmed at high resolution that the intended tetrahedral cage was formed, though several flipped amino acid side chain rotamers resulted in an interface that deviates from the precise hydrogen‐bonding pattern that was intended. A structure of the other designed cage showed that, under the conditions where crystals were obtained, a noncage structure was formed wherein a porous 3D protein network in space group I2₁3 is generated by an off‐target twofold homomeric interface. These results illustrate some of the ongoing challenges of developing computational methods for polar interface design, and add two potentially valuable new entries to the growing list of engineered protein materials for downstream applications.

     
  5. Abstract Over the past two decades, mass spectrometric (MS)-based proteomics technologies have facilitated the study of signaling pathways throughout biology. Nowhere is this needed more than in plants, where an evolutionary history of genome duplications has resulted in large gene families involved in posttranslational modifications and regulatory pathways. For example, at least 5% of the Arabidopsis thaliana genome (ca. 1,200 genes) encodes protein kinases and protein phosphatases that regulate nearly all aspects of plant growth and development. MS-based technologies that quantify covalent changes in the side chains of amino acids are critically important, but they only address one piece of the puzzle. An even more important mechanistic question is how noncovalent interactions, which are more difficult to study, dynamically regulate the proteome’s 3D structure. Improvements in protein 3D technologies such as cryo-electron microscopy, nuclear magnetic resonance, and X-ray crystallography have allowed considerable progress to be made at this level, but these methods are typically limited to analyzing proteins that can be expressed and purified in milligram quantities. Newly emerging MS-based technologies have recently been developed for studying the 3D structure of proteins. Importantly, these methods do not require protein samples to be purified and require smaller amounts of sample, opening the wider proteome for structural analysis in complex mixtures, crude lysates, and even in intact cells. These MS-based methods include covalent labeling, crosslinking, thermal proteome profiling, and limited proteolysis, all of which can be leveraged by established MS workflows, as well as newly emerging methods capable of analyzing intact macromolecules and the complexes they form.
In this review, we discuss these recent innovations in MS-based “structural” proteomics to provide readers with an understanding of the opportunities they offer and the remaining challenges for understanding the molecular underpinnings of plant structure and function. 