Abstract ColabFold offers accelerated prediction of protein structures and complexes by combining the fast homology search of MMseqs2 with AlphaFold2 or RoseTTAFold. ColabFold’s 40−60-fold faster search and optimized model utilization enable prediction of close to 1,000 structures per day on a server with one graphics processing unit. Coupled with Google Colaboratory, ColabFold becomes a free and accessible platform for protein folding. ColabFold is open-source software available at https://github.com/sokrypton/ColabFold and its novel environmental databases are available at https://colabfold.mmseqs.com.
AlphaFold Accurately Predicts the Structure of Ribosomally Synthesized and Post-Translationally Modified Peptide Biosynthetic Enzymes
Ribosomally synthesized and post-translationally modified peptides (RiPPs) are a growing class of natural products biosynthesized from a genetically encoded precursor peptide. The enzymes that install the post-translational modifications on these peptides have the potential to be useful catalysts in the production of natural-product-like compounds and can install non-proteinogenic amino acids in peptides and proteins. However, engineering these enzymes has been somewhat limited, due in part to limited structural information on enzymes in the same families that nonetheless exhibit different substrate selectivities. Despite AlphaFold2’s superior performance in single-chain protein structure prediction, its multimer version lacks accuracy and requires high-end GPUs, which are not typically available to most research groups. Additionally, the default parameters of AlphaFold2 may not be optimal for predicting complex structures like RiPP biosynthetic enzymes, due to their dynamic binding and substrate-modifying mechanisms. This study assessed the efficacy of the structure prediction program ColabFold (a variant of AlphaFold2) in modeling RiPP biosynthetic enzymes in both monomeric and dimeric forms. After extensive benchmarking, it was found that there were no statistically significant differences in the accuracy of the predicted structures, regardless of the various possible prediction parameters that were examined, and that with the default parameters, ColabFold was able to produce accurate models. We then generated additional structural predictions for select RiPP biosynthetic enzymes from multiple protein families and biosynthetic pathways. Our findings can serve as a reference for future enzyme engineering complemented by AlphaFold-related tools.
- Award ID(s): 2216836
- PAR ID: 10508038
- Publisher / Repository: MDPI
- Date Published:
- Journal Name: Biomolecules
- Volume: 13
- Issue: 8
- ISSN: 2218-273X
- Page Range / eLocation ID: 1243
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
-
This dataset contains sequence information, three-dimensional structures (from the AlphaFold2 model), and substrate classification labels for 358 short-chain dehydrogenase/reductases (SDRs) and 953 S-adenosylmethionine dependent methyltransferases (SAM-MTases).
The amino acid sequences of these enzymes were obtained from the UniProt Knowledgebase (https://www.uniprot.org). The sets of proteins were obtained by querying using InterPro protein family/domain identifiers corresponding to each family: IPR002347 (SDRs) and IPR029063 (SAM-MTases). The query results were filtered by UniProt annotation score, keeping only those with score above 4-out-of-5, and deduplicated by exact sequence matches.
The structures were submitted to the publicly available AlphaFold2 protein structure predictor (J. Jumper et al., Nature, 2021, 596, 583) using the ColabFold notebook (https://colab.research.google.com/github/sokrypton/ColabFold/blob/v1.1-premultimer/batch/AlphaFold2_batch.ipynb, M. Mirdita, S. Ovchinnikov, M. Steinegger, Nature Meth., 2022, 19, 679, https://github.com/sokrypton/ColabFold). The model settings used were msa_model = MMSeq2(Uniref+Environmental), num_models = 1, use_amber = False, use_templates = True, do_not_overwrite_results = True. The resulting PDB structures are included as ZIP archives.
The classification labels were obtained from the substrate and product annotations of the enzyme UniProtKB records. Two approaches were used: substrate clustering based on molecular fingerprints and manual substrate type classification. For the substrate clustering, Morgan fingerprints were generated for all enzymatic substrates and products with known structures (excluding cofactors) with radius = 3 using RDKit (https://rdkit.org). The fingerprints were projected onto two-dimensional space using the UMAP algorithm (L. McInnes, J. Healy, 2018, arXiv 1802.03426) with the Jaccard metric and clustered using k-means. This procedure generated 9 clusters for SDR substrates and 13 clusters for SAM-MTases. The SMILES representations of the substrates are listed in the SDR_substrates_to_cluster_map_2DIMUMAP.csv and SAM_substrates_to_13clusters_map_2DIMUMAP.csv files.
The following manually defined classification tasks are included for SDRs: NADP/NAD cofactor classification; phenol substrate, sterol substrate, coenzyme A (CoA) substrate. For SAM-MTases, the manually defined classification tasks are: biopolymer (protein/RNA/DNA) vs. small molecule substrate, phenol substrates, sterol substrates, nitrogen heterocycle substrates. The SMARTS strings used to define the substrate classes are listed in substructure_search_SMARTS.docx.
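The two data-preparation steps described in this dataset entry (filtering by annotation score with exact-sequence deduplication, and comparing fingerprints with the Jaccard metric) can be sketched in plain Python. This is only an illustration under stated assumptions, not the authors' code: the Entry record and its field names are hypothetical, "score above 4-out-of-5" is read here as strictly greater than 4, and fingerprints are represented as sets of on-bit indices rather than RDKit objects.

```python
from __future__ import annotations

from typing import Iterable, NamedTuple


class Entry(NamedTuple):
    """Hypothetical record; field names are assumptions, not UniProt's schema."""
    accession: str
    annotation_score: int  # UniProt annotation score, 1 to 5
    sequence: str


def filter_and_dedup(entries: Iterable[Entry], score_above: int = 4) -> list[Entry]:
    """Keep entries scoring strictly above `score_above`, one per distinct sequence.

    The first entry seen for each exact sequence is kept; later copies are dropped.
    """
    seen: set[str] = set()
    kept: list[Entry] = []
    for entry in entries:
        if entry.annotation_score <= score_above:
            continue  # fails the annotation-score filter
        if entry.sequence in seen:
            continue  # exact duplicate of a sequence already kept
        seen.add(entry.sequence)
        kept.append(entry)
    return kept


def jaccard_distance(bits_a: set[int], bits_b: set[int]) -> float:
    """Jaccard distance between two fingerprints given as sets of on-bit indices."""
    union = bits_a | bits_b
    if not union:
        return 0.0  # convention chosen here: two empty fingerprints are identical
    return 1.0 - len(bits_a & bits_b) / len(union)
```

In this sketch the threshold is a parameter, so a different reading of the score cutoff only requires changing `score_above`. The UMAP projection and k-means steps are not reproduced here.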
-
Abstract Since its public release in 2021, AlphaFold2 (AF2) has made investigating biological questions, using predicted protein structures of single monomers or full complexes, a common practice. ColabFold-AF2 is an open-source Jupyter Notebook inside Google Colaboratory and a command-line tool, which makes it easy to use AF2, while exposing its advanced options. ColabFold-AF2 shortens turn-around times of experiments due to its optimized usage of AF2’s models. In this protocol, we guide the reader through ColabFold best practices using three scenarios: (1) monomer prediction, (2) complex prediction, and (3) conformation sampling. The first two scenarios cover classic static structure prediction and are demonstrated on the human glycosylphosphatidylinositol transamidase (GPIT) protein. The third scenario demonstrates an alternative use-case of the AF2 models by predicting two conformations of the human Alanine Serine Transporter 2 (ASCT2). Users can run the protocol without command-line knowledge via Google Colaboratory or in a command-line environment. The protocol is available at https://protocol.colabfold.com.
-
Cowen, Lenore (Ed.) Motivation: Quality assessment (QA) of predicted protein tertiary structure models plays an important role in ranking and using them. With the recent development of deep learning end-to-end protein structure prediction techniques for generating highly confident tertiary structures for most proteins, it is important to explore corresponding QA strategies to evaluate and select the structural models predicted by them since these models have better quality and different properties than the models predicted by traditional tertiary structure prediction methods. Results: We develop EnQA, a novel graph-based 3D-equivariant neural network method that is equivariant to rotation and translation of 3D objects to estimate the accuracy of protein structural models by leveraging the structural features acquired from the state-of-the-art tertiary structure prediction method, AlphaFold2. We train and test the method on both traditional model datasets (e.g. the datasets of the Critical Assessment of Techniques for Protein Structure Prediction) and a new dataset of high-quality structural models predicted only by AlphaFold2 for the proteins whose experimental structures were released recently. Our approach achieves state-of-the-art performance on protein structural models predicted by both traditional protein structure prediction methods and the latest end-to-end deep learning method, AlphaFold2. It performs even better than the model QA scores provided by AlphaFold2 itself. The results illustrate that the 3D-equivariant graph neural network is a promising approach to the evaluation of protein structural models. Integrating AlphaFold2 features with other complementary sequence and structural features is important for improving protein model QA. Availability and implementation: The source code is available at https://github.com/BioinfoMachineLearning/EnQA. Supplementary information: Supplementary data are available at Bioinformatics online.
-
Abstract Background Halogenation is a recurring feature in natural products, especially those from marine organisms. The selectivity with which halogenating enzymes act on their substrates renders halogenases interesting targets for biocatalyst development. Recently, CylC – the first predicted dimetal-carboxylate halogenase to be characterized – was shown to regio- and stereoselectively install a chlorine atom onto an unactivated carbon center during cylindrocyclophane biosynthesis. Homologs of CylC are also found in other characterized cyanobacterial secondary metabolite biosynthetic gene clusters. Due to its novelty in biological catalysis, selectivity and ability to perform C-H activation, this halogenase class is of considerable fundamental and applied interest. The study of CylC-like enzymes will provide insights into substrate scope, mechanism and catalytic partners, and will also enable engineering these biocatalysts for similar or additional C-H activating functions. Still, little is known regarding the diversity and distribution of these enzymes. Results In this study, we used both genome mining and PCR-based screening to explore the genetic diversity of CylC homologs and their distribution in bacteria. While we found non-cyanobacterial homologs of these enzymes to be rare, we identified a large number of genes encoding CylC-like enzymes in publicly available cyanobacterial genomes and in our in-house culture collection of cyanobacteria. Genes encoding CylC homologs are widely distributed throughout the cyanobacterial tree of life, within biosynthetic gene clusters of distinct architectures (combination of unique gene groups). These enzymes are found in a variety of biosynthetic contexts, which include fatty-acid activating enzymes, type I or type III polyketide synthases, dialkylresorcinol-generating enzymes, monooxygenases or Rieske proteins. 
Our study also reveals that dimetal-carboxylate halogenases are among the most abundant types of halogenating enzymes in the phylum Cyanobacteria. Conclusions Our data show that dimetal-carboxylate halogenases are widely distributed throughout the Cyanobacteria phylum and that BGCs encoding CylC homologs are diverse and mostly uncharacterized. This work will help guide the search for new halogenating biocatalysts and natural product scaffolds.