skip to main content


Title: DescribePROT: database of amino acid-level protein structure and function predictions
Abstract We present DescribePROT, the database of predicted amino acid-level descriptors of structure and function of proteins. DescribePROT delivers a comprehensive collection of 13 complementary descriptors predicted using 10 popular and accurate algorithms for 83 complete proteomes that cover key model organisms. The current version includes 7.8 billion predictions for close to 600 million amino acids in 1.4 million proteins. The descriptors encompass sequence conservation, position specific scoring matrix, secondary structure, solvent accessibility, intrinsic disorder, disordered linkers, signal peptides, MoRFs and interactions with proteins, DNA and RNAs. Users can search DescribePROT by the amino acid sequence and the UniProt accession number and entry name. The pre-computed results are made available instantaneously. The predictions can be accesses via an interactive graphical interface that allows simultaneous analysis of multiple descriptors and can be also downloaded in structured formats at the protein, proteome and whole database scale. The putative annotations included by DescriPROT are useful for a broad range of studies, including: investigations of protein function, applied projects focusing on therapeutics and diseases, and in the development of predictors for other protein sequence descriptors. Future releases will expand the coverage of DescribePROT. DescribePROT can be accessed at http://biomine.cs.vcu.edu/servers/DESCRIBEPROT/.  more » « less
Award ID(s):
1661391
NSF-PAR ID:
10295754
Author(s) / Creator(s):
; ; ; ; ; ; ; ; ; ; ; ; ;
Date Published:
Journal Name:
Nucleic Acids Research
Volume:
49
Issue:
D1
ISSN:
0305-1048
Page Range / eLocation ID:
D298 to D308
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    The DescribePROT database of amino acid-level descriptors of protein structures and functions was substantially expanded since its release in 2020. This expansion includes substantial increase in the size, scope, and quality of the underlying data, the addition of experimental structural information, the inclusion of new data download options, and an upgraded graphical interface. DescribePROT currently covers 19 structural and functional descriptors for proteins in 273 reference proteomes generated by 11 accurate and complementary predictive tools. Users can search our resource in multiple ways, interact with the data using the graphical interface, and download data at various scales including individual proteins, entire proteomes, and whole database. The annotations in DescribePROT are useful for a broad spectrum of studies that include investigations of protein structure and function, development and validation of predictive tools, and to support efforts in understanding molecular underpinnings of diseases and development of therapeutics. DescribePROT can be freely accessed at http://biomine.cs.vcu.edu/servers/DESCRIBEPROT/.

     
    more » « less
  2. Abstract

    We have developed an algorithm, ParSe, which accurately identifies from the primary sequence those protein regions likely to exhibit physiological phase separation behavior. Originally, ParSe was designed to test the hypothesis that, for flexible proteins, phase separation potential is correlated to hydrodynamic size. While our results were consistent with that idea, we also found that many different descriptors could successfully differentiate between three classes of protein regions: folded, intrinsically disordered, and phase‐separating intrinsically disordered. Consequently, numerous combinations of amino acid property scales can be used to make robust predictions of protein phase separation. Built from that finding, ParSe 2.0 uses an optimal set of property scales to predict domain‐level organization and compute a sequence‐based prediction of phase separation potential. The algorithm is fast enough to scan the whole of the human proteome in minutes on a single computer and is equally or more accurate than other published predictors in identifying proteins and regions within proteins that drive phase separation. Here, we describe a web application for ParSe 2.0 that may be accessed through a browser by visitinghttps://stevewhitten.github.io/Parse_v2_FASTAto quickly identify phase‐separating proteins within large sequence sets, or by visitinghttps://stevewhitten.github.io/Parse_v2_webto evaluate individual protein sequences.

     
    more » « less
  3. Residues in proteins that are in close spatial proximity are more prone to covariate as their interactions are likely to be preserved due to structural and evolutionary constraints. If we can detect and quantify such covariation, physical contacts may then be predicted in the structure of a protein solely from the sequences that decorate it. To carry out such predictions, and following the work of others, we have implemented a multivariate Gaussian model to analyze correlation in multiple sequence alignments. We have explored and tested several numerical encodings of amino acids within this model. We have shown that 1D encodings based on amino acid biochemical and biophysical properties, as well as higher dimensional encodings computed from the principal components of experimentally derived mutation/substitution matrices, do not perform as well as a simple twenty dimensional encoding with each amino acid represented with a vector of one along its own dimension and zero elsewhere. The optimum obtained from representations based on substitution matrices is reached by using 10 to 12 principal components; the corresponding performance is less than the performance obtained with the 20-dimensional binary encoding. We highlight also the importance of the prior when constructing the multivariate Gaussian model of a multiple sequence alignment. 
    more » « less
  4. ResNet and, more recently, AlphaFold2 have demonstrated that deep neural networks can now predict a tertiary structure of a given protein amino-acid sequence with high accuracy. This seminal development will allow molecular biology researchers to advance various studies linking sequence, structure, and function. Many studies will undoubtedly focus on the impact of sequence mutations on stability, fold, and function. In this paper, we evaluate the ability of AlphaFold2 to predict accurate tertiary structures of wildtype and mutated sequences of protein molecules. We do so on a benchmark dataset in mutation modeling studies. Our empirical evaluation utilizes global and local structure analyses and yields several interesting observations. It shows, for instance, that AlphaFold2 performs similarly on wildtype and variant sequences. The placement of the main chain of a protein molecule is highly accurate. However, while AlphaFold2 reports similar confidence in its predictions over wildtype and variant sequences, its performance on placements of the side chains suffers in comparison to main-chain predictions. The analysis overall supports the premise that AlphaFold2-predicted structures can be utilized in further downstream tasks, but that further refinement of these structures may be necessary.

     
    more » « less
  5. Abstract

    AlphaFold2 has revolutionized protein structure prediction from amino‐acid sequence. In addition to protein structures, high‐resolution dynamics information about various protein regions is important for understanding protein function. Although AlphaFold2 has neither been designed nor trained to predict protein dynamics, it is shown here how the information returned by AlphaFold2 can be used to predict dynamic protein regions at the individual residue level. The approach, which is termed cdsAF2, uses the 3D protein structure returned by AlphaFold2 to predict backbone NMR NHS2order parameters using a local contact model that takes into account the contacts made by each peptide plane along the backbone with its environment. By combining for each residue AlphaFold2's pLDDT confidence score for the structure prediction accuracy with the predictedS2value using the local contact model, an estimator is obtained that semi‐quantitatively captures many of the dynamics features observed in experimental backbone NMR NHS2order parameter profiles. The method is demonstrated for a set nine proteins of different sizes and variable amounts of dynamics and disorder.

     
    more » « less