Abstract: AlphaFold2 has revolutionized protein structure prediction from amino acid sequence. In addition to protein structures, high-resolution dynamics information about various protein regions is important for understanding protein function. Although AlphaFold2 has neither been designed nor trained to predict protein dynamics, it is shown here how the information returned by AlphaFold2 can be used to predict dynamic protein regions at the individual-residue level. The approach, termed cdsAF2, uses the 3D protein structure returned by AlphaFold2 to predict backbone NMR N–H S² order parameters using a local contact model that takes into account the contacts made by each peptide plane along the backbone with its environment. By combining, for each residue, AlphaFold2's pLDDT confidence score for the structure prediction accuracy with the S² value predicted by the local contact model, an estimator is obtained that semi-quantitatively captures many of the dynamics features observed in experimental backbone NMR N–H S² order parameter profiles. The method is demonstrated for a set of nine proteins of different sizes and with variable amounts of dynamics and disorder.
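The abstract does not spell out the contact-model parameters or the pLDDT combination rule, so the following Python sketch is illustrative only, not the published cdsAF2 implementation. It assumes Biopython for PDB parsing, a generic exponential contact model with placeholder parameters (b, r_eff, cutoff), and a simple product of the contact-model S² with the normalized pLDDT (which AlphaFold2 writes into the B-factor column) as one plausible combination.

```python
# pip install biopython numpy -- a minimal sketch, not the published cdsAF2 code
import numpy as np
from Bio.PDB import PDBParser

def plddt_weighted_s2(pdb_file, b=1.3, r_eff=1.0, cutoff=8.0):
    """Per-residue S2 estimate from heavy-atom contacts, damped by pLDDT.

    Generic contact model: S2_i = tanh(b * sum_j exp(-r_ij / r_eff)) over heavy
    atoms j within `cutoff` (Angstrom) of residue i's amide nitrogen, skipping
    residue i's own atoms. b, r_eff, and cutoff are illustrative placeholders.
    """
    model = PDBParser(QUIET=True).get_structure('af2', pdb_file)[0]
    residues = [r for chain in model for r in chain if 'N' in r and 'CA' in r]
    heavy = [(atom.get_parent(), atom.coord)
             for chain in model for r in chain for atom in r
             if atom.element != 'H']
    estimates = []
    for res in residues:
        n_xyz = res['N'].coord
        contacts = 0.0
        for parent, xyz in heavy:
            if parent is res:
                continue
            d = np.linalg.norm(xyz - n_xyz)
            if d < cutoff:
                contacts += np.exp(-d / r_eff)
        s2 = np.tanh(b * contacts)
        plddt = res['CA'].get_bfactor() / 100.0  # AF2 stores pLDDT as B-factor
        estimates.append(s2 * plddt)             # placeholder combination rule
    return np.array(estimates)
```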
Enzyme Substrate Classification Dataset for SDRs and SAM-MTases
This dataset contains sequence information, three-dimensional structures (from the AlphaFold2 model), and substrate classification labels for 358 short-chain dehydrogenase/reductases (SDRs) and 953 S-adenosylmethionine-dependent methyltransferases (SAM-MTases).

The amino acid sequences of these enzymes were obtained from the UniProt Knowledgebase (https://www.uniprot.org). The sets of proteins were obtained by querying with the InterPro protein family/domain identifiers corresponding to each family: IPR002347 (SDRs) and IPR029063 (SAM-MTases). The query results were filtered by UniProt annotation score, keeping only entries with a score above 4 out of 5, and deduplicated by exact sequence matches.

The sequences were submitted to the publicly available AlphaFold2 protein structure predictor (J. Jumper et al., Nature, 2021, 596, 583) using the ColabFold batch notebook (https://colab.research.google.com/github/sokrypton/ColabFold/blob/v1.1-premultimer/batch/AlphaFold2_batch.ipynb; M. Mirdita, S. Ovchinnikov, M. Steinegger, Nature Meth., 2022, 19, 679; https://github.com/sokrypton/ColabFold). The model settings used were msa_mode = MMseqs2 (UniRef+Environmental), num_models = 1, use_amber = False, use_templates = True, do_not_overwrite_results = True. The resulting PDB structures are included as ZIP archives.

The classification labels were obtained from the substrate and product annotations of the enzymes' UniProtKB records. Two approaches were used: substrate clustering based on molecular fingerprints, and manual substrate type classification. For the substrate clustering, Morgan fingerprints were generated for all enzymatic substrates and products with known structures (excluding cofactors) with radius = 3 using RDKit (https://rdkit.org). The fingerprints were projected onto two-dimensional space using the UMAP algorithm (L. McInnes, J. Healy, 2018, arXiv:1802.03426) with the Jaccard metric and clustered using k-means. This procedure generated 9 clusters for SDR substrates and 13 clusters for SAM-MTase substrates. The SMILES representations of the substrates are listed in the SDR_substrates_to_cluster_map_2DIMUMAP.csv and SAM_substrates_to_13clusters_map_2DIMUMAP.csv files. A sketch of this clustering pipeline is shown below.

The following manually defined classification tasks are included for SDRs: NADP/NAD cofactor classification; phenol substrates; sterol substrates; coenzyme A (CoA) substrates. For SAM-MTases, the manually defined classification tasks are: biopolymer (protein/RNA/DNA) vs. small-molecule substrate; phenol substrates; sterol substrates; nitrogen heterocycle substrates. The SMARTS strings used to define the substrate classes are listed in substructure_search_SMARTS.docx.
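As a hedged illustration of the clustering pipeline described above (Morgan fingerprints with radius = 3, a 2D UMAP projection with the Jaccard metric, then k-means), the following sketch uses RDKit, umap-learn, and scikit-learn. The 2048-bit fingerprint length and the k-means settings are assumed defaults, not values stated in the record.

```python
# pip install rdkit umap-learn scikit-learn
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from sklearn.cluster import KMeans
import umap

def cluster_substrates(smiles_list, n_clusters=9):
    """Cluster substrates: Morgan fingerprints (radius=3), 2D UMAP with the
    Jaccard metric, then k-means.

    n_clusters=9 matches the SDR substrate clustering (use 13 for SAM-MTases).
    The 2048-bit fingerprint size is an assumed default.
    """
    fps = []
    for smiles in smiles_list:
        mol = Chem.MolFromSmiles(smiles)
        fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=3, nBits=2048)
        arr = np.zeros(2048, dtype=np.int8)
        DataStructs.ConvertToNumpyArray(fp, arr)
        fps.append(arr)
    embedding = umap.UMAP(n_components=2, metric='jaccard').fit_transform(np.array(fps))
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(embedding)
    return embedding, labels
```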
- Award ID(s):
- 2227112
- PAR ID:
- 10415321
- Publisher / Repository:
- Zenodo
- Date Published:
- Subject(s) / Keyword(s):
- Enzyme substrate; Proteins; Machine learning; Classification
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
This archive contains COAWST model input, grids and initial conditions, and output used to produce the results in a submitted manuscript. The files are:

model_input.zip: input files for the simulations presented in this paper
- ocean_rip_current.in: ROMS ocean model input file
- swan_rip_current.in: SWAN wave model input file (example with Hs=1m)
- coupling_rip_current.in: model coupling file
- rip_current.h: model header file

model_grids_forcing.zip: bathymetry and initial condition files
- hbeach_grid_isbathy_2m.nc: ROMS bathymetry input file
- hbeach_grid_isbathy_2m.bot: SWAN bathymetry input file
- hbeach_grid_isbathy_2m.grd: SWAN grid input file
- hbeach_init_isbathy_14_18_17.nc: initial temperature, cool surf zone dT=-1C case
- hbeach_init_isbathy_14_18_19.nc: initial temperature, warm surf zone dT=+1C case
- hbeach_init_isbathy_14_18_16.nc: initial temperature, cool surf zone dT=-2C case
- hbeach_init_isbathy_14_18_20.nc: initial temperature, warm surf zone dT=+2C case
- hbeach_init_isbathy_14_18_17p5.nc: initial temperature, cool surf zone dT=-0.5C case
- hbeach_init_isbathy_14_18_18p5.nc: initial temperature, warm surf zone dT=+0.5C case

model_output files: model output used to produce the figures (netCDF files, zipped). Variables included: x_rho (cross-shore coordinate, m), y_rho (alongshore coordinate, m), z_rho (vertical coordinate, m), ocean_time (time since initialization, s, output every 5 min), h (bathymetry, m), temp (temperature, Celsius), dye_02 (surfzone-released dye), Hwave (wave height, m), Dissip_break (wave dissipation, W/m2), ubar (cross-shore depth-averaged velocity, m/s, interpolated to rho-points). A minimal read example is shown after this record.
- Case_141817.nc: cool surf zone dT=-1C, Hs=1m
- Case_141819.nc: warm surf zone dT=+1C, Hs=1m
- Case_141816.nc: cool surf zone dT=-2C, Hs=1m
- Case_141820.nc: warm surf zone dT=+2C, Hs=1m
- Case_141817p5.nc: cool surf zone dT=-0.5C, Hs=1m
- Case_141818p5.nc: warm surf zone dT=+0.5C, Hs=1m
- Case_141817_Hp5.nc: cool surf zone dT=-1C, Hs=0.5m
- Case_141819_Hp5.nc: warm surf zone dT=+1C, Hs=0.5m
- Case_141817_Hp75.nc: cool surf zone dT=-1C, Hs=0.75m
- Case_141819_Hp75.nc: warm surf zone dT=+1C, Hs=0.75m

COAWST is an open-source code and can be downloaded at https://coawstmodel-trac.sourcerepo.com/coawstmodel_COAWST/. Descriptions of the input and output files can be found in the manual distributed with the model code and in the glossary at the end of the ocean.in file.

Corresponding author: Melissa Moulton, mmoulton@uw.edu
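A hedged sketch of reading one of these output files with the netCDF4 Python library (an assumed tool choice; any netCDF reader works). The variable names come from the list above; the dimension ordering of temp and dye_02 depends on the ROMS output configuration, so inspect it before indexing.

```python
# pip install netCDF4
from netCDF4 import Dataset

# Case_141817.nc: cool surf zone, dT=-1C, Hs=1m (file name from the list above)
with Dataset('Case_141817.nc') as nc:
    x_rho = nc.variables['x_rho'][:]            # cross-shore coordinate (m)
    y_rho = nc.variables['y_rho'][:]            # alongshore coordinate (m)
    ocean_time = nc.variables['ocean_time'][:]  # s since initialization, every 5 min
    temp = nc.variables['temp'][:]              # temperature (Celsius)
    dye = nc.variables['dye_02'][:]             # surfzone-released dye

    # Dimension ordering is configuration-dependent; check before slicing.
    print(nc.variables['temp'].dimensions, temp.shape)
```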
-
{"Abstract":["This dataset contains machine learning and volunteer classifications from the Gravity Spy project. It includes glitches from observing runs O1, O2, O3a and O3b that received at least one classification from a registered volunteer in the project. It also indicates glitches that are nominally retired from the project using our default set of retirement parameters, which are described below. See more details in the Gravity Spy Methods paper. <\/p>\n\nWhen a particular subject in a citizen science project (in this case, glitches from the LIGO datastream) is deemed to be classified sufficiently it is "retired" from the project. For the Gravity Spy project, retirement depends on a combination of both volunteer and machine learning classifications, and a number of parameterizations affect how quickly glitches get retired. For this dataset, we use a default set of retirement parameters, the most important of which are: <\/p>\n\nA glitches must be classified by at least 2 registered volunteers<\/li>Based on both the initial machine learning classification and volunteer classifications, the glitch has more than a 90% probability of residing in a particular class<\/li>Each volunteer classification (weighted by that volunteer's confusion matrix) contains a weight equal to the initial machine learning score when determining the final probability<\/li><\/ol>\n\nThe choice of these and other parameterization will affect the accuracy of the retired dataset as well as the number of glitches that are retired, and will be explored in detail in an upcoming publication (Zevin et al. in prep). <\/p>\n\nThe dataset can be read in using e.g. Pandas: \n```\nimport pandas as pd\ndataset = pd.read_hdf('retired_fulldata_min2_max50_ret0p9.hdf5', key='image_db')\n```\nEach row in the dataframe contains information about a particular glitch in the Gravity Spy dataset. 
<\/p>\n\nDescription of series in dataframe<\/strong><\/p>\n\n['1080Lines', '1400Ripples', 'Air_Compressor', 'Blip', 'Chirp', 'Extremely_Loud', 'Helix', 'Koi_Fish', 'Light_Modulation', 'Low_Frequency_Burst', 'Low_Frequency_Lines', 'No_Glitch', 'None_of_the_Above', 'Paired_Doves', 'Power_Line', 'Repeating_Blips', 'Scattered_Light', 'Scratchy', 'Tomte', 'Violin_Mode', 'Wandering_Line', 'Whistle']\n\tMachine learning scores for each glitch class in the trained model, which for a particular glitch will sum to unity<\/li><\/ul>\n\t<\/li>['ml_confidence', 'ml_label']\n\tHighest machine learning confidence score across all classes for a particular glitch, and the class associated with this score<\/li><\/ul>\n\t<\/li>['gravityspy_id', 'id']\n\tUnique identified for each glitch on the Zooniverse platform ('gravityspy_id') and in the Gravity Spy project ('id'), which can be used to link a particular glitch to the full Gravity Spy dataset (which contains GPS times among many other descriptors)<\/li><\/ul>\n\t<\/li>['retired']\n\tMarks whether the glitch is retired using our default set of retirement parameters (1=retired, 0=not retired)<\/li><\/ul>\n\t<\/li>['Nclassifications']\n\tThe total number of classifications performed by registered volunteers on this glitch<\/li><\/ul>\n\t<\/li>['final_score', 'final_label']\n\tThe final score (weighted combination of machine learning and volunteer classifications) and the most probable type of glitch<\/li><\/ul>\n\t<\/li>['tracks']\n\tArray of classification weights that were added to each glitch category due to each volunteer's classification<\/li><\/ul>\n\t<\/li><\/ul>\n\n <\/p>\n\n```\nFor machine learning classifications on all glitches in O1, O2, O3a, and O3b, please see Gravity Spy Machine Learning Classifications on Zenodo<\/p>\n\nFor the most recently uploaded training set used in Gravity Spy machine learning algorithms, please see Gravity Spy Training Set on Zenodo.<\/p>\n\nFor detailed information on the training set used for the original Gravity Spy machine learning paper, please see Machine learning for Gravity Spy: Glitch classification and dataset on Zenodo. <\/p>"]}more » « less
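A short usage sketch building on the read snippet above; the column names ('retired', 'final_label', 'ml_label', 'Nclassifications', 'final_score', 'gravityspy_id') are those documented in this record, and the disagreement query is only one example of how the table can be sliced.

```python
import pandas as pd

dataset = pd.read_hdf('retired_fulldata_min2_max50_ret0p9.hdf5', key='image_db')

# Glitches retired under the default retirement parameters.
retired = dataset[dataset['retired'] == 1]
print(retired['final_label'].value_counts())

# Retired glitches where the final (volunteer-weighted) label differs from
# the initial machine learning label.
disagree = retired[retired['final_label'] != retired['ml_label']]
print(len(disagree), 'retired glitches with ML/volunteer disagreement')
print(disagree[['gravityspy_id', 'Nclassifications', 'final_score']].head())
```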
-
{"Abstract":["The intended use of this archive is to facilitate meta-analysis of the Data Observation Network for Earth (DataONE, [1]). <\/p>\n\nDataONE is a distributed infrastructure that provides information about earth observation data. This dataset was derived from the DataONE network using Preston [2] between 17 October 2018 and 6 November 2018, resolving 335,213 urls at an average retrieval rate of about 5 seconds per url, or 720 files per hour, resulting in a data gzip compressed tar archive of 837.3 MB . <\/p>\n\nThe archive associates 325,757 unique metadata urls [3] to 202,063 unique ecological metadata files [4]. Also, the DataONE search index was captured to establish provenance of how the dataset descriptors were found and acquired. During the creation of the snapshot (or crawl), 15,389 urls [5], or 4.7% of urls, did not successfully resolve. <\/p>\n\nTo facilitate discovery, the record of the Preston snapshot crawl is included in the preston-ls-* files . There files are derived from the rdf/nquad file with hash://sha256/8c67e0741d1c90db54740e08d2e39d91dfd73566ea69c1f2da0d9ab9780a9a9f . This file can also be found in the data.tar.gz at data/8c/67/e0/8c67e0741d1c90db54740e08d2e39d91dfd73566ea69c1f2da0d9ab9780a9a9f/data . For more information about concepts and format, please see [2]. <\/p>\n\nTo extract all EML files from the included Preston archive, first extract the hashes assocated with EML files using:<\/p>\n\ncat preston-ls.tsv.gz | gunzip | grep "Version" | grep -v "deeplinker" | grep -v "query/solr" | cut -f1,3 | tr '\\t' '\\n' | grep "hash://" | sort | uniq > eml-hashes.txt<\/p>\n\nextract data.tar.gz using:<\/p>\n\n~/preston-archive$$ tar xzf data.tar.gz <\/p>\n\nthen use Preston to extract each hash using something like:<\/p>\n\n~/preston-archive$$ preston get hash://sha256/00002d0fc9e35a9194da7dd3d8ce25eddee40740533f5af2397d6708542b9baa\n<eml:eml xmlns:eml="eml://ecoinformatics.org/eml-2.1.1" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:stmml="http://www.xml-cml.org/schema/stmml_1.1" packageId="doi:10.18739/A24P9Q" system="https://arcticdata.io" scope="system" xsi:schemaLocation="eml://ecoinformatics.org/eml-2.1.1 ~/development/eml/eml.xsd">\n <dataset>\n <alternateIdentifier>urn:x-wmo:md:org.aoncadis.www::d76bc3b5-7b19-11e4-8526-00c0f03d5b7c</alternateIdentifier>\n <alternateIdentifier>d76bc3b5-7b19-11e4-8526-00c0f03d5b7c</alternateIdentifier>\n <title>Airglow Image Data 2011 4 of 5</title>\n...<\/p>\n\nAlternatively, without using Preston, you can extract the data using the naming convention:<\/p>\n\ndata/[x]/[y]/[z]/[hash]/data<\/p>\n\nwhere x is the first 2 characters of the hash, y the second 2 characters, z the third 2 characters, and hash the full sha256 content hash of the EML file.<\/p>\n\nFor example, the hash hash://sha256/00002d0fc9e35a9194da7dd3d8ce25eddee40740533f5af2397d6708542b9baa can be found in the file: data/00/00/2d/00002d0fc9e35a9194da7dd3d8ce25eddee40740533f5af2397d6708542b9baa/data . For more information, see [2].<\/p>\n\nThe intended use of this archive is to facilitate meta-analysis of the DataONE dataset network. <\/p>\n\n[1] DataONE, https://www.dataone.org\n[2] https://preston.guoda.bio, https://doi.org/10.5281/zenodo.1410543 . 
DataONE was crawled via Preston with "preston update -u https://dataone.org".\n[3] cat preston-ls.tsv.gz | gunzip | grep "Version" | grep -v "deeplinker" | grep -v "query/solr" | cut -f1,3 | tr '\\t' '\\n' | grep -v "hash://" | sort | uniq | wc -l\n[4] cat preston-ls.tsv.gz | gunzip | grep "Version" | grep -v "deeplinker" | grep -v "query/solr" | cut -f1,3 | tr '\\t' '\\n' | grep "hash://" | sort | uniq | wc -l\n[5] cat preston-ls.tsv.gz | gunzip | grep "Version" | grep "deeplinker" | grep -v "query/solr" | cut -f1,3 | tr '\\t' '\\n' | grep -v "hash://" | sort | uniq | wc -l<\/p>\n\nThis work is funded in part by grant NSF OAC 1839201 from the National Science Foundation.<\/p>"]}more » « less
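A hedged Python helper implementing the content-addressed naming convention described above. The function name and the choice of Python are illustrative; the path layout itself follows the record.

```python
from pathlib import Path

def eml_path(hash_uri: str, archive_root: str = '.') -> Path:
    """Map a Preston content hash to its file in the extracted archive.

    Follows the convention from this record, data/[x]/[y]/[z]/[hash]/data,
    where x, y, z are the first three character pairs of the sha256 hash.
    """
    sha = hash_uri.removeprefix('hash://sha256/')  # requires Python 3.9+
    return Path(archive_root, 'data', sha[0:2], sha[2:4], sha[4:6], sha, 'data')

# Example from the record above:
print(eml_path('hash://sha256/00002d0fc9e35a9194da7dd3d8ce25eddee40740533f5af2397d6708542b9baa'))
# -> data/00/00/2d/00002d0f...2b9baa/data
```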
-
Abstract: The aim of the UniProt Knowledgebase (UniProtKB; https://www.uniprot.org/) is to provide users with a comprehensive, high-quality and freely accessible set of protein sequences annotated with functional information. In this publication, we describe ongoing changes to our production pipeline to limit the sequences available in UniProtKB to high-quality, non-redundant reference proteomes. We continue to manually curate the scientific literature to add the latest functional data and to use machine learning techniques. We also encourage community curation to ensure key publications are not missed. We provide an update on the automatic annotation methods used by UniProtKB to predict information for unreviewed entries describing unstudied proteins. Finally, updates to the UniProt website are described, including a new tab linking proteins to genomic information. In recognition of its value to the scientific community, the UniProt database has been awarded Global Core Biodata Resource status.