NSF PAR Search | NSF Public Access Repository

Detecting and correcting misclassified sequences in the large-scale public databases

https://doi.org/10.1093/bioinformatics/btaa586

Bagheri, Hamid; Severin, Andrew; Rajan, Hridesh; Elofsson, Arne (June 2020, Bioinformatics)

Abstract
Motivation: As the cost of sequencing decreases, the amount of data being deposited into public repositories is increasing rapidly. Public databases rely on the user to provide metadata for each submission, and that metadata is prone to user error. Unfortunately, most public databases, such as the non-redundant (NR) database, do not have methods for identifying errors in the provided metadata, leading to the potential for error propagation. Previous research on a small subset of the NR database analyzed misclassification based on sequence similarity. To the best of our knowledge, the amount of misclassification in the entire database has not been quantified. We propose a heuristic method to detect potentially misclassified taxonomic assignments in the NR database. We applied a curation technique and quality control to find the most probable taxonomic assignment. Our method incorporates provenance and frequency of each annotation from manually and computationally created databases and clustering information at 95% similarity.
Results: We found more than 2 million potentially taxonomically misclassified proteins in the NR database. Using simulated data, we show a high precision of 97% and a recall of 87% for detecting taxonomically misclassified proteins. The proposed approach and findings could also be applied to other databases.
Availability: Source code, dataset, documentation, Jupyter notebooks, and Docker container are available at https://github.com/boalang/nr.
Supplementary information: Supplementary data are available at Bioinformatics online.
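As a rough illustration of the idea in this abstract (not the authors' implementation), the sketch below picks the most probable taxon in one sequence cluster by provenance-weighted frequency, flags members that disagree with it, and computes precision and recall against a known set of misclassified entries. The weights, the function names, and the input layout are all assumptions made for this example.

```python
from collections import Counter

# Hypothetical weights: manually curated annotations count more than computed ones.
PROVENANCE_WEIGHT = {"manual": 2.0, "computational": 1.0}


def consensus_taxon(annotations):
    """Return the taxon with the highest provenance-weighted frequency.

    annotations: iterable of (taxon, provenance) pairs for one cluster.
    """
    scores = Counter()
    for taxon, provenance in annotations:
        scores[taxon] += PROVENANCE_WEIGHT[provenance]
    return scores.most_common(1)[0][0]


def flag_misclassified(cluster):
    """Return sorted ids whose taxon differs from the cluster consensus.

    cluster: dict mapping sequence id -> (taxon, provenance).
    """
    consensus = consensus_taxon(cluster.values())
    return sorted(sid for sid, (taxon, _) in cluster.items() if taxon != consensus)


def precision_recall(flagged, truly_misclassified):
    """Precision and recall of the flagged set against a known truth set."""
    flagged, truth = set(flagged), set(truly_misclassified)
    true_positives = len(flagged & truth)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall
```

With simulated data, where the mislabeled entries are known in advance, `precision_recall` gives the kind of figures the abstract reports; the real method combines more signals than this single vote.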
Full Text Available
FASPR: an open-source tool for fast and accurate protein side-chain packing

https://doi.org/10.1093/bioinformatics/btaa234

Huang, Xiaoqiang; Pearce, Robin; Zhang, Yang; Elofsson, Arne (April 2020, Bioinformatics)

Abstract
Motivation: Protein structure and function are essentially determined by how the side-chain atoms interact with each other. Thus, accurate protein side-chain packing (PSCP) is a critical step toward protein structure prediction and protein design. Despite the importance of the problem, however, the accuracy and speed of current PSCP programs are still not satisfactory.
Results: We present FASPR for fast and accurate PSCP by using an optimized scoring function in combination with a deterministic searching algorithm. The performance of FASPR was compared with four state-of-the-art PSCP methods (CISRR, RASP, SCATD and SCWRL4) on both native and non-native protein backbones. For the assessment on native backbones, FASPR achieved a good performance by correctly predicting 69.1% of all the side-chain dihedral angles using a stringent tolerance criterion of 20°, comparing favorably with SCWRL4, CISRR, RASP and SCATD, which successfully predicted 68.8%, 68.6%, 67.8% and 61.7%, respectively. Additionally, FASPR achieved the highest speed, packing the 379 test protein structures in only 34.3 s, which was significantly faster than the control methods. For the assessment on non-native backbones, FASPR showed an equivalent or better performance on I-TASSER predicted backbones and the backbones perturbed from experimental structures. Detailed analyses showed that the major advantage of FASPR lies in the optimal combination of the dead-end elimination and tree decomposition with a well optimized scoring function, which makes FASPR of practical use for both protein structure modeling and protein design studies.
Availability and implementation: The web server, source code and datasets are freely available at https://zhanglab.ccmb.med.umich.edu/FASPR and https://github.com/tommyhuangthu/FASPR.
Supplementary information: Supplementary data are available at Bioinformatics online.
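The 20° tolerance criterion mentioned above can be illustrated with a minimal sketch that counts a predicted dihedral angle as correct when it lies within the tolerance of the native angle, measured around the circle. This is a per-angle simplification written for this listing; the function names are hypothetical, and the paper's exact evaluation protocol (for example, how symmetric side chains are treated) is not given in the abstract.

```python
def angular_difference(a, b):
    """Smallest absolute difference between two angles, in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def fraction_correct(predicted, native, tolerance=20.0):
    """Fraction of predicted dihedral angles within `tolerance` degrees of native."""
    pairs = list(zip(predicted, native))
    hits = sum(1 for p, n in pairs if angular_difference(p, n) <= tolerance)
    return hits / len(pairs)
```

The wrap-around in `angular_difference` matters because dihedral angles are periodic: 175° and -175° are only 10° apart, not 350°.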
Full Text Available
FUpred: detecting protein domains through deep-learning-based contact map prediction

https://doi.org/10.1093/bioinformatics/btaa217

Zheng, Wei; Zhou, Xiaogen; Wuyun, Qiqige; Pearce, Robin; Li, Yang; Zhang, Yang; Elofsson, Arne (March 2020, Bioinformatics)

Abstract
Motivation: Protein domains are subunits that can fold and function independently. Correct domain boundary assignment is thus a critical step toward accurate protein structure and function analyses. There is, however, no efficient algorithm available for accurate domain prediction from sequence. The problem is particularly challenging for proteins with discontinuous domains, which consist of domain segments that are separated along the sequence.
Results: We developed a new algorithm, FUpred, which predicts protein domain boundaries utilizing contact maps created by deep residual neural networks coupled with coevolutionary precision matrices. The core idea of the algorithm is to retrieve domain boundary locations by maximizing the number of intra-domain contacts, while minimizing the number of inter-domain contacts from the contact maps. FUpred was tested on a large-scale dataset consisting of 2549 proteins and generated correct single- and multi-domain classifications with a Matthews correlation coefficient of 0.799, which was 19.1% (or 5.3%) higher than the best machine learning (or threading)-based method. For proteins with discontinuous domains, the domain boundary detection and normalized domain overlapping scores of FUpred were 0.788 and 0.521, respectively, which were 17.3% and 23.8% higher than the best control method. The results demonstrate a new avenue to accurately detect domain composition from sequence alone, especially for discontinuous, multi-domain proteins.
Availability and implementation: https://zhanglab.ccmb.med.umich.edu/FUpred.
Supplementary information: Supplementary data are available at Bioinformatics online.
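The core idea stated above, keeping contacts inside domains and few contacts between them, can be sketched for the simplest case of one continuous cut into two domains. This is an illustration only, not FUpred's scoring function: the name `best_two_domain_cut`, the `min_len` guard, and the raw contact count used as the score are assumptions, and discontinuous domains are not handled.

```python
def best_two_domain_cut(contacts, length, min_len=2):
    """Find the cut position that leaves the fewest inter-domain contacts.

    contacts: set of (i, j) residue index pairs with i < j (0-based).
    length:   number of residues in the chain.
    Returns (cut, inter_count); the domains are [0, cut) and [cut, length).
    Since the total number of contacts is fixed, minimizing inter-domain
    contacts is the same as maximizing intra-domain contacts.
    """
    best = None
    for cut in range(min_len, length - min_len + 1):
        # A contact crosses the boundary when its residues fall on opposite sides.
        inter = sum(1 for i, j in contacts if i < cut <= j)
        if best is None or inter < best[1]:
            best = (cut, inter)
    return best
```

The `min_len` guard keeps the search away from trivial cuts at the chain ends, where almost no contacts cross the boundary simply because one side is nearly empty.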
Full Text Available
Analysis of several key factors influencing deep learning-based inter-residue contact prediction

https://doi.org/10.1093/bioinformatics/btz679

Wu, Tianqi; Hou, Jie; Adhikari, Badri; Cheng, Jianlin; Elofsson, Arne (August 2019, Bioinformatics)

Abstract
Motivation: Deep learning has become the dominant technology for protein contact prediction. However, the factors that affect the performance of deep learning in contact prediction have not been systematically investigated.
Results: We analyzed the results of our three deep learning-based contact prediction methods (MULTICOM-CLUSTER, MULTICOM-CONSTRUCT and MULTICOM-NOVEL) in the CASP13 experiment and identified several key factors [i.e. deep learning technique, multiple sequence alignment (MSA), distance distribution prediction and domain-based contact integration] that influenced the contact prediction accuracy. We compared our convolutional neural network (CNN)-based contact prediction methods with three coevolution-based methods on 75 CASP13 targets consisting of 108 domains. We demonstrated that the CNN-based multi-distance approach was able to leverage global coevolutionary coupling patterns comprised of multiple correlated contacts for more accurate contact prediction than the local coevolution-based methods, leading to a substantial increase of precision by 19.2 percentage points. We also tested different alignment methods and domain-based contact prediction with the deep learning contact predictors. The comparison of the three methods showed that deeper sequence alignments and the integration of domain-based contact prediction with the full-length contact prediction improved the performance of contact prediction. Moreover, we demonstrated that the domain-based contact prediction based on a novel ab initio approach of parsing domains from MSAs alone without using known protein structures was a simple, fast approach to improve contact prediction. Finally, we showed that predicting the distribution of inter-residue distances in multiple distance intervals could capture more structural information and improve binary contact prediction.
Availability and implementation: https://github.com/multicom-toolbox/DNCON2/.
Supplementary information: Supplementary data are available at Bioinformatics online.
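One way to see how a predicted distance distribution relates to binary contact prediction is to sum the probability mass of the distance bins that fall under the contact threshold. The sketch below assumes the common 8 Å contact cutoff and hypothetical bin edges; it is not taken from the MULTICOM or DNCON2 code, and the function names are made up for this example.

```python
def contact_probability(bin_upper_edges, bin_probs, threshold=8.0):
    """Probability that a residue pair is in contact.

    bin_upper_edges: upper edge (in angstroms) of each predicted distance bin.
    bin_probs:       predicted probability for each bin (should sum to 1).
    Sums the mass of every bin lying entirely at or below `threshold`.
    """
    return sum(p for edge, p in zip(bin_upper_edges, bin_probs) if edge <= threshold)


def precision(predicted_pairs, true_contacts):
    """Fraction of predicted contact pairs that are true contacts."""
    predicted = list(predicted_pairs)
    hits = sum(1 for pair in predicted if pair in true_contacts)
    return hits / len(predicted)
```

Ranking residue pairs by `contact_probability` and scoring the top-ranked pairs with `precision` mirrors the usual way contact predictors are compared, although the exact ranking cutoffs used in the study are not stated in the abstract.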
Full Text Available
PISA-SPARKY: an interactive SPARKY plugin to analyze oriented solid-state NMR spectra of helical membrane proteins

https://doi.org/10.1093/bioinformatics/btaa019

Weber, Daniel K; Wang, Songlin; Markley, John L; Veglia, Gianluigi; Lee, Woonghee; Elofsson, Arne (January 2020, Bioinformatics)

Abstract
Motivation: Two-dimensional [15N-1H] separated local field solid-state nuclear magnetic resonance (NMR) experiments of membrane proteins aligned in lipid bilayers provide tilt and rotation angles for α-helical segments using Polar Index Slant Angle (PISA)-wheel models. No integrated software has been made available for data analysis and visualization.
Results: We have developed the PISA-SPARKY plugin to seamlessly integrate PISA-wheel modeling into the NMRFAM-SPARKY platform. The plugin performs basic simulations, exhaustive fitting against experimental spectra, error analysis and dipolar and chemical shift wave plotting. The plugin also supports PyMOL integration and handling of parameters that describe variable alignment and dynamic scaling encountered with magnetically aligned media, ensuring optimal fitting and generation of restraints for structure calculation.
Availability and implementation: PISA-SPARKY is freely available in the latest version of NMRFAM-SPARKY from the National Magnetic Resonance Facility at Madison (http://pine.nmrfam.wisc.edu/download_packages.html), the NMRbox Project (https://nmrbox.org) and to subscribers of the SBGrid (https://sbgrid.org). The pisa.py script is available and documented on GitHub (https://github.com/weberdak/pisa.py) along with a tutorial video and sample data.
Supplementary information: Supplementary data are available at Bioinformatics online.
Full Text Available
Estimation of model accuracy in CASP13

https://doi.org/10.1002/prot.25767

Cheng, Jianlin; Choe, Myong‐Ho; Elofsson, Arne; Han, Kun‐Sop; Hou, Jie; Maghrabi, Ali H. A.; McGuffin, Liam J.; Menéndez‐Hurtado, David; Olechnovič, Kliment; Schwede, Torsten; et al (July 2019, Proteins: Structure, Function, and Bioinformatics)

Abstract Methods to reliably estimate the accuracy of 3D models of proteins are both a fundamental part of most protein folding pipelines and important for reliable identification of the best models when multiple pipelines are used. Here, we describe the progress made from CASP12 to CASP13 in the field of estimation of model accuracy (EMA) as seen from the progress of the most successful methods in CASP13. We show small but clear progress, that is, several methods perform better than the best methods from CASP12 when tested on CASP13 EMA targets. Some progress is driven by applying deep learning and residue‐residue contacts to model accuracy prediction. We show that the best EMA methods select better models than the best servers in CASP13, but that there exists a great potential to improve this further. Also, according to the evaluation criteria based on local similarities, such as lDDT and CAD, it is now clear that single model accuracy methods perform relatively better than consensus‐based methods.
