skip to main content

Title: Coarse-graining protein structures into their dynamic communities with DCI, a dynamic community identifier
Abstract Summary

A new dynamic community identifier (DCI) is presented that relies upon protein residue dynamic cross-correlations generated by Gaussian elastic network models to identify those residue clusters exhibiting motions within a protein. A number of examples of communities are shown for diverse proteins, including GPCRs. It is a tool that can immediately simplify and clarify the most essential functional moving parts of any given protein. Proteins usually can be subdivided into groups of residues that move as communities. These are usually densely packed local sub-structures, but in some cases can be physically distant residues identified to be within the same community. The set of these communities for each protein are the moving parts. The ways in which these are organized overall can aid in understanding many aspects of functional dynamics and allostery. DCI enables a more direct understanding of functions including enzyme activity, action across membranes and changes in the community structure from mutations or ligand binding. The DCI server is freely available on a web site (https://dci.bb.iastate.edu/).

Supplementary information

Supplementary data are available at Bioinformatics online.

Authors:
; ; ; ;
Publication Date:
NSF-PAR ID:
10367383
Journal Name:
Bioinformatics
Volume:
38
Issue:
10
Page Range or eLocation-ID:
p. 2727-2733
ISSN:
1367-4803
Publisher:
Oxford University Press
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract Motivation

    The analysis of sequence conservation patterns has been widely utilized to identify functionally important (catalytic and ligand-binding) protein residues for over a half-century. Despite decades of development, on average state-of-the-art non-template-based functional residue prediction methods must predict ∼25% of a protein’s total residues to correctly identify half of the protein’s functional site residues. The overwhelming proportion of false positives results in reported ‘F-Scores’ of ∼0.3. We investigated the limits of current approaches, focusing on the so-far neglected impact of the specific choice of homologs included in multiple sequence alignments (MSAs).

    Results

    The limits of conservation-based functional residue prediction were explored by surveying the binding sites of 1023 proteins. A straightforward conservation analysis of MSAs composed of randomly selected homologs sampled from a PSI-BLAST search achieves average F-Scores of ∼0.3, a performance matching that reported by state-of-the-art methods, which often consider additional features for the prediction in a machine learning setting. Interestingly, we found that a simple combinatorial MSA sampling algorithm will in almost every case produce an MSA with an optimal set of homologs whose conservation analysis reaches average F-Scores of ∼0.6, doubling state-of-the-art performance. We also show that this is nearly at the theoretical limit of possible performance givenmore »the agreement between different binding site definitions. Additionally, we showcase the progress in this direction made by Selection of Alignment by Maximal Mutual Information (SAMMI), an information-theory-based approach to identifying biologically informative MSAs. This work highlights the importance and the unused potential of optimally composed MSAs for conservation analysis.

    Supplementary information

    Supplementary data are available at Bioinformatics online.

    « less
  2. Abstract Background

    Advances in microbiome science are being driven in large part due to our ability to study and infer microbial ecology from genomes reconstructed from mixed microbial communities using metagenomics and single-cell genomics. Such omics-based techniques allow us to read genomic blueprints of microorganisms, decipher their functional capacities and activities, and reconstruct their roles in biogeochemical processes. Currently available tools for analyses of genomic data can annotate and depict metabolic functions to some extent; however, no standardized approaches are currently available for the comprehensive characterization of metabolic predictions, metabolite exchanges, microbial interactions, and microbial contributions to biogeochemical cycling.

    Results

    We present METABOLIC (METabolic And BiogeOchemistry anaLyses In miCrobes), a scalable software to advance microbial ecology and biogeochemistry studies using genomes at the resolution of individual organisms and/or microbial communities. The genome-scale workflow includes annotation of microbial genomes, motif validation of biochemically validated conserved protein residues, metabolic pathway analyses, and calculation of contributions to individual biogeochemical transformations and cycles. The community-scale workflow supplements genome-scale analyses with determination of genome abundance in the microbiome, potential microbial metabolic handoffs and metabolite exchange, reconstruction of functional networks, and determination of microbial contributions to biogeochemical cycles. METABOLIC can take input genomes from isolates, metagenome-assembled genomes, ormore »single-cell genomes. Results are presented in the form of tables for metabolism and a variety of visualizations including biogeochemical cycling potential, representation of sequential metabolic transformations, community-scale microbial functional networks using a newly defined metric “MW-score” (metabolic weight score), and metabolic Sankey diagrams. METABOLIC takes ~ 3 h with 40 CPU threads to process ~ 100 genomes and corresponding metagenomic reads within which the most compute-demanding part of hmmsearch takes ~ 45 min, while it takes ~ 5 h to complete hmmsearch for ~ 3600 genomes. Tests of accuracy, robustness, and consistency suggest METABOLIC provides better performance compared to other software and online servers. To highlight the utility and versatility of METABOLIC, we demonstrate its capabilities on diverse metagenomic datasets from the marine subsurface, terrestrial subsurface, meadow soil, deep sea, freshwater lakes, wastewater, and the human gut.

    Conclusion

    METABOLIC enables the consistent and reproducible study of microbial community ecology and biogeochemistry using a foundation of genome-informed microbial metabolism, and will advance the integration of uncultivated organisms into metabolic and biogeochemical models. METABOLIC is written in Perl and R and is freely available under GPLv3 athttps://github.com/AnantharamanLab/METABOLIC.

    « less
  3. Abstract Motivation

    Due to the nature of experimental annotation, most protein function prediction methods operate at the protein-level, where functions are assigned to full-length proteins based on overall similarities. However, most proteins function by interacting with other proteins or molecules, and many functional associations should be limited to specific regions rather than the entire protein length. Most domain-centric function prediction methods depend on accurate domain family assignments to infer relationships between domains and functions, with regions that are unassigned to a known domain-family left out of functional evaluation. Given the abundance of residue-level annotations currently available, we present a function prediction methodology that automatically infers function labels of specific protein regions using protein-level annotations and multiple types of region-specific features.

    Results

    We apply this method to local features obtained from InterPro, UniProtKB and amino acid sequences and show that this method improves both the accuracy and region-specificity of protein function transfer and prediction. We compare region-level predictive performance of our method against that of a whole-protein baseline method using proteins with structurally verified binding sites and also compare protein-level temporal holdout predictive performances to expand the variety and specificity of GO terms we could evaluate. Our results can also serve as amore »starting point to categorize GO terms into region-specific and whole-protein terms and select prediction methods for different classes of GO terms.

    Availability and implementation

    The code and features are freely available at: https://github.com/ek1203/rsfp.

    Supplementary information

    Supplementary data are available at Bioinformatics online.

    « less
  4. Understanding the underlying mechanisms behind protein allostery and non-additivity of substitution outcomes (i.e., epistasis) is critical when attempting to predict the functional impact of mutations, particularly at non-conserved sites. In an effort to model these two biological properties, we extend the framework of our metric to calculate dynamic coupling between residues, the Dynamic Coupling Index (DCI) to two new metrics: (i) EpiScore, which quantifies the difference between the residue fluctuation response of a functional site when two other positions are perturbed with random Brownian kicks simultaneously versus individually to capture the degree of cooperativity of these two other positions in modulating the dynamics of the functional site and (ii) DCIasym, which measures the degree of asymmetry between the residue fluctuation response of two sites when one or the other is perturbed with a random force. Applied to four independent systems, we successfully show that EpiScore and DCIasym can capture important biophysical properties in dual mutant substitution outcomes. We propose that allosteric regulation and the mechanisms underlying non-additive amino acid substitution outcomes (i.e., epistasis) can be understood as emergent properties of an anisotropic network of interactions where the inclusion of the full network of interactions is critical for accurate modeling. Consequently,more »mutations which drive towards a new function may require a fine balance between functional site asymmetry and strength of dynamic coupling with the functional sites. These two tools will provide mechanistic insight into both understanding and predicting the outcome of dual mutations.« less
  5. Abstract Motivation

    Glycan microarrays are capable of illuminating the interactions of glycan-binding proteins (GBPs) against hundreds of defined glycan structures, and have revolutionized the investigations of protein–carbohydrate interactions underlying numerous critical biological activities. However, it is difficult to interpret microarray data and identify structural determinants promoting glycan binding to glycan-binding proteins due to the ambiguity in microarray fluorescence intensity and complexity in branched glycan structures. To facilitate analysis of glycan microarray data alongside protein structure, we have built the Glycan Microarray Database (GlyMDB), a web-based resource including a searchable database of glycan microarray samples and a toolset for data/structure analysis.

    Results

    The current GlyMDB provides data visualization and glycan-binding motif discovery for 5203 glycan microarray samples collected from the Consortium for Functional Glycomics. The unique feature of GlyMDB is to link microarray data to PDB structures. The GlyMDB provides different options for database query, and allows users to upload their microarray data for analysis. After search or upload is complete, users can choose the criterion for binder versus non-binder classification. They can view the signal intensity graph including the binder/non-binder threshold followed by a list of glycan-binding motifs. One can also compare the fluorescence intensity data from two different microarray samples. Amore »protein sequence-based search is performed using BLAST to match microarray data with all available PDB structures containing glycans. The glycan ligand information is displayed, and links are provided for structural visualization and redirection to other modules in GlycanStructure.ORG for further investigation of glycan-binding sites and glycan structures.

    Availability and implementation

    http://www.glycanstructure.org/glymdb.

    Contact

    wonpil@lehigh.edu

    Supplementary information

    Supplementary data are available at Bioinformatics online.

    « less