Abstract: Protein flexibility is a key feature in communicating changes in cell signaling instigated by binding with secondary messengers, such as calcium ions, which are associated with the coordination of muscle contraction, neurotransmitter release, and gene expression. When binding to the disordered parts of a protein, calcium ions must balance their charge states with the shape of calcium‐binding proteins and their versatile pool of partners, depending on the circumstances they transmit. Accurately determining the ionic charges of those ions is essential for understanding their role in such processes. However, it is unclear whether the limited experimental data available can be used effectively to train models that accurately predict the charges of calcium‐binding protein variants. Here, we developed a chemistry‐informed machine‐learning algorithm that implements a game‐theoretic approach to explain the output of a machine‐learning model, without requiring an excessively large database for high‐performance prediction of atomic charges. We used ab initio electronic structure data representing calcium ions and the structures of the disordered segments of calcium‐binding peptides with surrounding water molecules to train several explainable models. Network theory was used to extract the topological features of atomic interactions in the structurally complex data dictated by the coordination chemistry of a calcium ion, a potent indicator of its charge state in a protein. Our design produced a computational tool, CaXML, which provides an explainable machine‐learning framework for annotating the ionic charges of calcium ions in calcium‐binding proteins in response to chemical changes in the environment. Our framework will provide new insights into protein design for engineering functionality based on the limited size of scientific data in a genome space.
Explainable Machine Learning Model to Accurately Predict Protein-Binding Peptides
Enzymes play key roles in the biological functions of living organisms, serving as catalysts for, and regulators of, biochemical reaction pathways. Recent studies suggest that peptides are promising molecules for modulating enzyme function because of their large chemical diversity and the well-established methods for library synthesis. Experimental approaches to identifying protein-binding peptides are time-consuming and costly. Hence, there is a demand for a fast and accurate computational approach to this problem. A further challenge in developing such an approach is the lack of a large and reliable dataset. In this study, we develop a new machine learning approach called PepBind-SVM to predict protein-binding peptides. To build this model, we extract different sequential and physicochemical features from peptides and use a Support Vector Machine (SVM) as the classification technique. We train this model on a dataset that we also introduce in this study. PepBind-SVM achieves 92.1% prediction accuracy, outperforming other classifiers at predicting protein-binding peptides.
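As a rough illustration of the feature-extraction step described in the abstract, the sketch below turns a peptide sequence into a fixed-length numeric vector. The abstract does not list the features used by PepBind-SVM, so everything here is an assumption made for illustration: the function name `peptide_features` is hypothetical, and the chosen features (amino-acid composition, mean Kyte-Doolittle hydropathy, a crude net-charge count, and length) are common choices, not the authors' method.

```python
from collections import Counter

# The 20 standard amino acids, in a fixed order so every vector is aligned.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy index (assumed here as one physicochemical scale).
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def peptide_features(sequence):
    """Return a 23-element feature vector for a peptide (hypothetical helper).

    Layout: 20 amino-acid fractions, mean hydropathy, crude net charge, length.
    """
    seq = sequence.strip().upper()
    if not seq:
        raise ValueError("empty peptide sequence")
    unknown = set(seq) - set(AMINO_ACIDS)
    if unknown:
        raise ValueError(f"non-standard residues: {sorted(unknown)}")

    counts = Counter(seq)
    n = len(seq)

    # Sequence-derived features: fraction of each residue type.
    composition = [counts[aa] / n for aa in AMINO_ACIDS]

    # Physicochemical features: average hydropathy and a simple charge count
    # (+1 for K/R, -1 for D/E; histidine and termini are ignored).
    mean_hydropathy = sum(HYDROPATHY[aa] for aa in seq) / n
    net_charge = (counts["K"] + counts["R"]) - (counts["D"] + counts["E"])

    return composition + [mean_hydropathy, float(net_charge), float(n)]


if __name__ == "__main__":
    for peptide in ["PYYYWKDPNGS", "ACDK"]:
        print(peptide, peptide_features(peptide))
```

Vectors of this kind could then be passed to an SVM classifier, for example scikit-learn's `sklearn.svm.SVC`, together with binding/non-binding labels. Whether that matches the published pipeline is not stated in the abstract.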
- Award ID(s): 2152059
- PAR ID: 10620702
- Publisher / Repository: MDPI
- Date Published:
- Journal Name: Algorithms
- Volume: 17
- Issue: 9
- ISSN: 1999-4893
- Page Range / eLocation ID: 409
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
Cardiovascular diseases (CVDs) are the leading cause of death worldwide and are heavily influenced by genetic factors. Genome-wide association studies have mapped >90% of CVD-associated variants within the noncoding genome, which can alter the function of regulatory proteins, such as transcription factors (TFs). However, due to the overwhelming number of single-nucleotide polymorphisms (SNPs) (>500,000) in genome-wide association studies, prioritizing variants for in vitro analysis remains challenging. In this work, we implemented a computational approach that considers support vector machine (SVM)-based TF binding site classification and cardiac expression quantitative trait loci (eQTL) analysis to identify and prioritize potential CVD-causing SNPs. We identified 1535 CVD-associated SNPs within TF footprints and putative cardiac enhancers plus 14,218 variants in linkage disequilibrium with genotype-dependent gene expression in cardiac tissues. Using ChIP-seq data from two cardiac TFs (NKX2-5 and TBX5) in human-induced pluripotent stem cell-derived cardiomyocytes, we trained a large-scale gapped k-mer SVM model to identify CVD-associated SNPs that altered NKX2-5 and TBX5 binding. The model was tested by scoring human heart TF genomic footprints within putative enhancers and measuring in vitro binding through electrophoretic mobility shift assay. Five variants predicted to alter NKX2-5 (rs59310144, rs6715570, and rs61872084) and TBX5 (rs7612445 and rs7790964) binding were prioritized for in vitro validation based on the magnitude of the predicted change in binding and are in cardiac tissue eQTLs. All five variants altered NKX2-5 and TBX5 DNA binding. We present a bioinformatic approach that considers tissue-specific eQTL analysis and SVM-based TF binding site classification to prioritize CVD-associated variants for in vitro analysis.
Elofsson, Arne (Ed.)
Motivation: Procedures for structural modeling of protein–protein complexes (protein docking) produce a number of models which need to be further analyzed and scored. Scoring can be based on independently determined constraints on the structure of the complex, such as knowledge of amino acids essential for the protein interaction. Previously, we showed that text mining of residues in freely available PubMed abstracts of papers on studies of protein–protein interactions may generate such constraints. However, the absence of post-processing of the spotted residues reduced the usability of the constraints, as a significant number of the residues were not relevant for the binding of the specific proteins. Results: We explored filtering of the irrelevant residues by two machine learning approaches, Deep Recursive Neural Network (DRNN) and Support Vector Machine (SVM) models, with different training/testing schemes. The results showed that the DRNN model is superior to the SVM model when training is performed on the PMC-OA full-text articles and applied to classification (interface or non-interface) of the residues spotted in the PubMed abstracts. When both training and testing are performed on full-text articles or on abstracts, the performance of these models is similar. Thus, in such cases, there is no need to use the DRNN approach, which is computationally expensive, especially at the training stage. The reason is that SVM success is often determined by the similarity in data/text patterns in the training and the testing sets, whereas the sentence structures in the abstracts are, in general, different from those in the full-text articles. Availability and implementation: The code and the datasets generated in this study are available at https://gitlab.ku.edu/vakser-lab-public/text-mining/-/tree/2020-09-04. Supplementary information: Supplementary data are available at Bioinformatics online.
The controlled formation of nanoparticles with optimum characteristics and functional aspects has proven successful via peptide-mediated nanoparticle synthesis. However, the effects of the peptide sequence and binding motif on surface features and physicochemical properties of nanoparticles are not well-understood. In this study, we investigate in a comparative manner how a specific peptide known as Pd4 and its two known variants may form nanoparticles both in an isolated state and when attached to a green fluorescent protein (GFPuv). More importantly, we introduce a novel computational approach to predict the trend of the size and activity of the peptide-directed nanoparticles by estimating the binding affinity of the peptide to a single ion. We used molecular dynamics (MD) simulations to explore the differential behavior of the isolated and GFP-fused peptides and their mutants. Our computed palladium (Pd) binding free energies match the typical nanoparticle sizes reported from transmission electron microscope pictures. Stille coupling and Suzuki–Miyaura reaction turnover frequencies (TOFs) also correspond with computationally predicted Pd binding affinities. The results show that while using Pd4 and its two known variants (A6 and A11) in isolation produces nanoparticles of varying sizes, fusing these peptides to the GFPuv protein produces nanoparticles of similar sizes and activity. In other words, GFPuv reduces the sensitivity of the nanoparticles to the peptide sequence. This study provides a computational framework for designing free and protein-attached peptides that helps in the synthesis of nanoparticles with well-regulated properties.
Alpha-synuclein (ASyn) is a protein that is known to play a critical role in Parkinson’s disease (PD) due to its propensity for misfolding and aggregation. Furthermore, this process leads to oxidative stress and the formation of free radicals that cause neuronal damage. In this study, we have utilized a biomimetic approach to design new peptides derived from marine natural resources. The peptides were designed using a peptide scrambling approach where antioxidant moieties were combined with fibrillary inhibition motifs in order to design peptides that would have a dual targeting effect on ASyn misfolding. Of the 20 designed peptides, 12 were selected for examining binding interactions through molecular docking and molecular dynamics approaches, which revealed that the peptides were binding to the pre-NAC and NAC (non-amyloid component) domain residues such as Tyr39, Asn65, Gly86, and Ala85, among others. Because ASyn filaments derived from Lewy body dementia (LBD) have a different secondary structure compared to pathogenic ASyn fibrils, both forms were tested computationally. Five of those peptides were utilized for laboratory validation based on those results. The binding interactions with fibrils were confirmed using surface plasmon resonance studies, where EQALMPWIWYWKDPNGS, PYYYWKDPNGS, and PYYYWKELAQM showed higher binding. Secondary structural analyses revealed their ability to induce conformational changes in ASyn fibrils. Additionally, PYYYWKDPNGS and PYYYWKELAQM also demonstrated antioxidant properties. This study provides insight into the binding interactions of varying forms of ASyn implicated in PD. The peptides may be further investigated for mitigating fibrillation at the cellular level and may have the potential to target ASyn.