Abstract The aim of the UniProt Knowledgebase is to provide users with a comprehensive, high-quality and freely accessible set of protein sequences annotated with functional information. In this publication we describe enhancements made to our data processing pipeline and to our website to adapt to an ever-increasing information content. The number of sequences in UniProtKB has risen to over 227 million and we are working towards including a reference proteome for each taxonomic group. We continue to extract detailed annotations from the literature to update or create reviewed entries, while unreviewed entries are supplemented with annotations provided by automated systems using a variety of machine-learning techniques. In addition, the scientific community continues their contributions of publications and annotations to UniProt entries of their interest. Finally, we describe our new website (https://www.uniprot.org/), designed to enhance our users’ experience and make our data easily accessible to the research community. This interface includes access to AlphaFold structures for more than 85% of all entries as well as improved visualisations for subcellular localisation of proteins.
more »
« less
UniProt: the universal protein knowledgebase in 2021
Abstract The aim of the UniProt Knowledgebase is to provide users with a comprehensive, high-quality and freely accessible set of protein sequences annotated with functional information. In this article, we describe significant updates that we have made over the last two years to the resource. The number of sequences in UniProtKB has risen to approximately 190 million, despite continued work to reduce sequence redundancy at the proteome level. We have adopted new methods of assessing proteome completeness and quality. We continue to extract detailed annotations from the literature to add to reviewed entries and supplement these in unreviewed entries with annotations provided by automated systems such as the newly implemented Association-Rule-Based Annotator (ARBA). We have developed a credit-based publication submission interface to allow the community to contribute publications and annotations to UniProt entries. We describe how UniProtKB responded to the COVID-19 pandemic through expert curation of relevant entries that were rapidly made available to the research community through a dedicated portal. UniProt resources are available under a CC-BY (4.0) license via the web at https://www.uniprot.org/.
more »
« less
- Award ID(s):
- 1917302
- PAR ID:
- 10406291
- Author(s) / Creator(s):
- ; ; ; ; ; ; ; ; ; ; ; ; ; ; ; ; ; ; ; more »
- Date Published:
- Journal Name:
- Nucleic Acids Research
- Volume:
- 49
- Issue:
- D1
- ISSN:
- 0305-1048
- Page Range / eLocation ID:
- D480 to D489
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
The American Sign Language Linguistic Research Project (ASLLRP) provides Internet access to high-quality ASL video data, generally including front and side views and a close-up of the face. The manual and non-manual components of the signing have been linguistically annotated using SignStream(R). The recently expanded video corpora can be browsed and searched through the Data Access Interface (DAI 2) we have designed; it is possible to carry out complex searches. The data from our corpora can also be downloaded; annotations are available in an XML export format. We have also developed the ASLLRP Sign Bank, which contains almost 6,000 sign entries for lexical signs, with distinct English-based glosses, with a total of 41,830 examples of lexical signs (in addition to about 300 gestures, over 1,000 fingerspelled signs, and 475 classifier examples). The Sign Bank is likewise accessible and searchable on the Internet; it can also be accessed from within SignStream(R) (software to facilitate linguistic annotation and analysis of visual language data) to make annotations more accurate and efficient. Here we describe the available resources. These data have been used for many types of research in linguistics and in computer-based sign language recognition from video; examples of such research are provided in the latter part of this article.more » « less
-
null (Ed.)Abstract We present DescribePROT, the database of predicted amino acid-level descriptors of structure and function of proteins. DescribePROT delivers a comprehensive collection of 13 complementary descriptors predicted using 10 popular and accurate algorithms for 83 complete proteomes that cover key model organisms. The current version includes 7.8 billion predictions for close to 600 million amino acids in 1.4 million proteins. The descriptors encompass sequence conservation, position specific scoring matrix, secondary structure, solvent accessibility, intrinsic disorder, disordered linkers, signal peptides, MoRFs and interactions with proteins, DNA and RNAs. Users can search DescribePROT by the amino acid sequence and the UniProt accession number and entry name. The pre-computed results are made available instantaneously. The predictions can be accesses via an interactive graphical interface that allows simultaneous analysis of multiple descriptors and can be also downloaded in structured formats at the protein, proteome and whole database scale. The putative annotations included by DescriPROT are useful for a broad range of studies, including: investigations of protein function, applied projects focusing on therapeutics and diseases, and in the development of predictors for other protein sequence descriptors. Future releases will expand the coverage of DescribePROT. DescribePROT can be accessed at http://biomine.cs.vcu.edu/servers/DESCRIBEPROT/.more » « less
-
Abstract MotivationAs fewer than 1% of proteins have protein function information determined experimentally, computationally predicting the function of proteins is critical for obtaining functional information for most proteins and has been a major challenge in protein bioinformatics. Despite the significant progress made in protein function prediction by the community in the last decade, the general accuracy of protein function prediction is still not high, particularly for rare function terms associated with few proteins in the protein function annotation database such as the UniProt. ResultsWe introduce TransFew, a new transformer model, to learn the representations of both protein sequences and function labels [Gene Ontology (GO) terms] to predict the function of proteins. TransFew leverages a large pre-trained protein language model (ESM2-t48) to learn function-relevant representations of proteins from raw protein sequences and uses a biological natural language model (BioBert) and a graph convolutional neural network-based autoencoder to generate semantic representations of GO terms from their textual definition and hierarchical relationships, which are combined together to predict protein function via the cross-attention. Integrating the protein sequence and label representations not only enhances overall function prediction accuracy, but delivers a robust performance of predicting rare function terms with limited annotations by facilitating annotation transfer between GO terms. Availability and implementationhttps://github.com/BioinfoMachineLearning/TransFew.more » « less
-
This dataset contains sequence information, three-dimensional structures (from AlphaFold2 model), and substrate classification labels for 358 short-chain dehydrogenase/reductases (SDRs) and 953 S-adenosylmethionine dependent methyltransferases (SAM-MTases).</p> The aminoacid sequences of these enzymes were obtained from the UniProt Knowledgebase (https://www.uniprot.org). The sets of proteins were obtained by querying using InterPro protein family/domain identifiers corresponding to each family: IPR002347 (SDRs) and IPR029063 (SAM-MTases). The query results were filtered by UniProt annotation score, keeping only those with score above 4-out-of-5, and deduplicated by exact sequence matches.</p> The structures were submitted to the publicly available AlphaFold2 protein structure predictor (J. Jumper et al., Nature, 2021, 596, 583) using the ColabFold notebook (https://colab.research.google.com/github/sokrypton/ColabFold/blob/v1.1-premultimer/batch/AlphaFold2_batch.ipynb, M. Mirdita, S. Ovchinnikov, M. Steinegger, Nature Meth., 2022, 19, 679, https://github.com/sokrypton/ColabFold). The model settings used were msa_model = MMSeq2(Uniref+Environmental), num_models = 1, use_amber = False, use_templates = True, do_not_overwrite_results = True. The resulting PDB structures are included as ZIP archives</p> The classification labels were obtained from the substrate and product annotations of the enzyme UniProtKB records. Two approaches were used: substrate clustering based on molecular fingerprints and manual substrate type classification. For the substate clustering, Morgan fingerprints were generated for all enzymatic substrates and products with known structures (excluding cofactors) with radius = 3 using RDKit (https://rdkit.org). The fingerprints were projected onto two-dimensional space using the UMAP algorithm (L. McInnes, J. Healy, 2018, arXiv 1802.03426) and Jaccard metric and clustered using k-means. This procedure generated 9 clusters for SDR substrates and 13 clusters for SAM-MTases. The SMILES representations of the substrates are listed in the SDR_substrates_to_cluster_map_2DIMUMAP.csv and SAM_substrates_to_13clusters_map_2DIMUMAP.csv files.</p> The following manually defined classification tasks are included for SDRs: NADP/NAD cofactor classification; phenol substrate, sterol substrate, coenzyme A (CoA) substrate. For SAM-MTases, the manually defined classification tasks are: biopolymer (protein/RNA/DNA) vs. small molecule substrate, phenol subsrates, sterol substrates, nitrogen heterocycle substrates. The SMARTS strings used to define the substrate classes are listed in substructure_search_SMARTS.docx. </p>more » « less