Transformer neural networks have revolutionized structural biology with the ability to predict protein structures at unprecedented high accuracy. Here, we report the predictive modeling performance of the state-of-the-art protein structure prediction methods built on transformers for 69 protein targets from the recently concluded 15th Critical Assessment of Structure Prediction (CASP15) challenge. Our study shows the power of transformers in protein structure modeling and highlights future areas of improvement.
Transformers are a powerful subclass of neural networks catalyzing the development of a growing number of computational methods for RNA structure modeling. Here, we conduct an objective and empirical study of the predictive modeling accuracy of the emerging transformer-based methods for RNA structure prediction. Our study reveals multi-faceted complementarity between the methods and underscores some key aspects that affect the prediction accuracy.
more » « less- NSF-PAR ID:
- 10523065
- Publisher / Repository:
- Oxford University Press
- Date Published:
- Journal Name:
- Biology Methods and Protocols
- Volume:
- 9
- Issue:
- 1
- ISSN:
- 2396-8923
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
Abstract For many RNA molecules, the secondary structure is essential for the correct function of the RNA. Predicting RNA secondary structure from nucleotide sequences is a long-standing problem in genomics, but the prediction performance has reached a plateau over time. Traditional RNA secondary structure prediction algorithms are primarily based on thermodynamic models through free energy minimization, which imposes strong prior assumptions and is slow to run. Here, we propose a deep learning-based method, called UFold, for RNA secondary structure prediction, trained directly on annotated data and base-pairing rules. UFold proposes a novel image-like representation of RNA sequences, which can be efficiently processed by Fully Convolutional Networks (FCNs). We benchmark the performance of UFold on both within- and cross-family RNA datasets. It significantly outperforms previous methods on within-family datasets, while achieving a similar performance as the traditional methods when trained and tested on distinct RNA families. UFold is also able to predict pseudoknots accurately. Its prediction is fast with an inference time of about 160 ms per sequence up to 1500 bp in length. An online web server running UFold is available at https://ufold.ics.uci.edu. Code is available at https://github.com/uci-cbcl/UFold.
-
Abstract To use next-generation sequencing technology such as RNA-seq for medical and health applications, choosing proper analysis methods for biomarker identification remains a critical challenge for most users. The US Food and Drug Administration (FDA) has led the Sequencing Quality Control (SEQC) project to conduct a comprehensive investigation of 278 representative RNA-seq data analysis pipelines consisting of 13 sequence mapping, three quantification, and seven normalization methods. In this article, we focused on the impact of the joint effects of RNA-seq pipelines on gene expression estimation as well as the downstream prediction of disease outcomes. First, we developed and applied three metrics (i.e., accuracy, precision, and reliability) to quantitatively evaluate each pipeline’s performance on gene expression estimation. We then investigated the correlation between the proposed metrics and the downstream prediction performance using two real-world cancer datasets (i.e., SEQC neuroblastoma dataset and the NIH/NCI TCGA lung adenocarcinoma dataset). We found that RNA-seq pipeline components jointly and significantly impacted the accuracy of gene expression estimation, and its impact was extended to the downstream prediction of these cancer outcomes. Specifically, RNA-seq pipelines that produced more accurate, precise, and reliable gene expression estimation tended to perform better in the prediction of disease outcome. In the end, we provided scenarios as guidelines for users to use these three metrics to select sensible RNA-seq pipelines for the improved accuracy, precision, and reliability of gene expression estimation, which lead to the improved downstream gene expression-based prediction of disease outcome.
-
Abstract Motivation Protein structure prediction has been greatly improved by deep learning, but the contribution of different information is yet to be fully understood. This article studies the impacts of two kinds of information for structure prediction: template and multiple sequence alignment (MSA) embedding. Templates have been used by some methods before, such as AlphaFold2, RoseTTAFold and RaptorX. AlphaFold2 and RosetTTAFold only used templates detected by HHsearch, which may not perform very well on some targets. In addition, sequence embedding generated by pre-trained protein language models has not been fully explored for structure prediction. In this article, we study the impact of templates (including the number of templates, the template quality and how the templates are generated) on protein structure prediction accuracy, especially when the templates are detected by methods other than HHsearch. We also study the impact of sequence embedding (generated by MSATransformer and ESM-1b) on structure prediction.
Results We have implemented a deep learning method for protein structure prediction that may take templates and MSA embedding as extra inputs. We study the contribution of templates and MSA embedding to structure prediction accuracy. Our experimental results show that templates can improve structure prediction on 71 of 110 CASP13 (13th Critical Assessment of Structure Prediction) targets and 47 of 91 CASP14 targets, and templates are particularly useful for targets with similar templates. MSA embedding can improve structure prediction on 63 of 91 CASP14 (14th Critical Assessment of Structure Prediction) targets and 87 of 183 CAMEO targets and is particularly useful for proteins with shallow MSAs. When both templates and MSA embedding are used, our method can predict correct folds (TMscore > 0.5) for 16 of 23 CASP14 FM targets and 14 of 18 Continuous Automated Model Evaluation (CAMEO) targets, outperforming RoseTTAFold by 5% and 7%, respectively.
Availability and implementation Available at https://github.com/xluo233/RaptorXFold.
Supplementary information Supplementary data are available at Bioinformatics online.
-
Abstract The first RNA category of the Critical Assessment of Techniques for Structure Prediction competition was only made possible because of the scientists who provided experimental structures to challenge the predictors. In this article, these scientists offer a unique and valuable analysis of both the successes and areas for improvement in the predicted models. All 10 RNA‐only targets yielded predictions topologically similar to experimentally determined structures. For one target, experimentalists were able to phase their x‐ray diffraction data by molecular replacement, showing a potential application of structure predictions for RNA structural biologists. Recommended areas for improvement include: enhancing the accuracy in local interaction predictions and increased consideration of the experimental conditions such as multimerization, structure determination method, and time along folding pathways. The prediction of RNA–protein complexes remains the most significant challenge. Finally, given the intrinsic flexibility of many RNAs, we propose the consideration of ensemble models.