


Search for: All records

Creators/Authors contains: "Martelli, ed., Pier Luigi"


  1. Abstract Motivation

    Developing biochemical models in systems biology is a complex, knowledge-intensive activity. Some modelers (especially novices) benefit from model development tools with a graphical user interface. However, as with the development of complex software, text-based representations of models provide many benefits for advanced model development. At present, the tools for text-based model development are limited, typically just a textual editor that provides features such as copy, paste, find, and replace. Since these tools are not “model aware,” they do not provide features for: (i) model building such as autocompletion of species names; (ii) model analysis such as hover messages that provide information about chemical species; and (iii) model translation to convert between model representations. We refer to these as BAT features.

    Results

    We present VSCode-Antimony, a tool for building, analyzing, and translating models written in the Antimony modeling language, a human readable representation of Systems Biology Markup Language (SBML) models. VSCode-Antimony is a source editor, a tool with language-aware features. For example, there is autocompletion of variable names to assist with model building, hover messages that aid in model analysis, and translation between XML and Antimony representations of SBML models. These features result from making VSCode-Antimony model-aware by incorporating several sophisticated capabilities: analysis of the Antimony grammar (e.g. to identify model symbols and their types); a query system for accessing knowledge sources for chemical species and reactions; and automatic conversion between different model representations (e.g. between Antimony and SBML).

    Availability and implementation

    VSCode-Antimony is available as an open source extension in the VSCode Marketplace https://marketplace.visualstudio.com/items?itemName=stevem.vscode-antimony. Source code can be found at https://github.com/sys-bio/vscode-antimony.
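
The "model-aware" features in item 1 (symbol identification and autocompletion of species names) can be illustrated with a toy sketch. It assumes only the basic Antimony reaction form `name: reactants -> products; rate;` and uses a regular expression rather than a real grammar. The model text and the helper names `build_symbol_table` and `complete` are hypothetical and are not taken from the VSCode-Antimony source.

```python
import re

# A small Antimony-style model (illustrative example, not from the paper).
MODEL = """
J1: S1 -> S2; k1*S1;
J2: S2 -> S3 + S4; k2*S2;
k1 = 0.1; k2 = 0.05;
S1 = 10;
"""

# Matches "name: reactants -> products;" at the start of a line.
REACTION = re.compile(r"^\s*(\w+)\s*:\s*(.*?)\s*->\s*(.*?)\s*;")


def build_symbol_table(text):
    """Map each symbol found in reaction lines to 'reaction' or 'species'."""
    table = {}
    for line in text.splitlines():
        match = REACTION.match(line)
        if not match:
            continue  # assignments and blank lines are ignored in this toy version
        name, reactants, products = match.groups()
        table[name] = "reaction"
        for side in (reactants, products):
            for species in side.split("+"):
                species = species.strip()
                if species:
                    table[species] = "species"
    return table


def complete(prefix, table):
    """Return the sorted species names that start with the given prefix."""
    return sorted(
        name for name, kind in table.items()
        if kind == "species" and name.startswith(prefix)
    )


symbols = build_symbol_table(MODEL)
suggestions = complete("S", symbols)
```

A real source editor would parse the full grammar (stoichiometries, compartments, modifiers), which this sketch does not attempt.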

     
  2. Abstract Motivation

    Accurately predicting the likelihood of interaction between two objects (compound–protein sequence, user–item, author–paper, etc.) is a fundamental problem in Computer Science. Current deep-learning models rely on learning accurate representations of the interacting objects. Importantly, relationships between the interacting objects, or features of the interaction, offer an opportunity to partition the data to create multi-views of the interacting objects. The resulting congruent and non-congruent views can then be exploited via contrastive learning techniques to learn enhanced representations of the objects.

    Results

    We present a novel method, Contrastive Stratification for Interaction Prediction (CSI), to stratify (partition) a dataset in a manner that can be exploited via Contrastive Multiview Coding to learn embeddings that maximize the mutual information across congruent data views. CSI assigns a key and multiple views to each data point, where data partitions under a particular key form congruent views of the data. We showcase the effectiveness of CSI by applying it to the compound–protein sequence interaction prediction problem, a pressing problem whose solution promises to expedite drug delivery (drug–protein interaction prediction), metabolic engineering, and synthetic biology (compound–enzyme interaction prediction) applications. We compare CSI with a baseline model that does not utilize data stratification and contrastive learning, and show gains in average precision ranging from 13.7% to 39% using compounds and sequences as keys across multiple drug–target and enzymatic datasets, and gains ranging from 16.9% to 63% using reaction features as keys across enzymatic datasets.

    Availability and implementation

    Code and dataset available at https://github.com/HassounLab/CSI.
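
The idea in item 2 of pulling congruent views together and pushing non-congruent views apart can be sketched with an InfoNCE-style loss, a common form of contrastive objective. This is an assumption made for illustration: the abstract does not state the exact loss used by CSI, and the function names, the temperature value, and the toy embeddings below are hypothetical.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss for one anchor: negative log-softmax score of the positive view."""
    logits = [dot(anchor, positive) / temperature]
    logits += [dot(anchor, neg) / temperature for neg in negatives]
    peak = max(logits)  # subtract the maximum for numerical stability
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - logits[0]


# Toy 2-D embeddings: a congruent view aligned with the anchor gives a low loss,
# while treating an orthogonal view as the positive gives a high loss.
aligned = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])
misaligned = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
```

Minimizing such a loss over many anchors is one way to raise a lower bound on the mutual information between views, which is the quantity the abstract says the embeddings are trained to maximize.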

     
  3. Abstract Motivation

    Cell function is regulated by gene regulatory networks (GRNs) defined by protein-mediated interaction between constituent genes. Despite advances in experimental techniques, we can still measure only a fraction of the processes that govern GRN dynamics. To infer the properties of GRNs using partial observation, unobserved sequential processes can be replaced with distributed time delays, yielding non-Markovian models. Inference methods based on the resulting model suffer from the curse of dimensionality.

    Results

    We develop a simulation-based Bayesian MCMC method employing an approximate likelihood for the efficient and accurate inference of GRN parameters when only some of their products are observed. We illustrate our approach using a two-step activation model: an activation signal leads to the accumulation of an unobserved regulatory protein, which triggers the expression of observed fluorescent proteins. With prior information about observed fluorescent protein synthesis, our method successfully infers the dynamics of the unobserved regulatory protein. We can estimate the delay and kinetic parameters characterizing target regulation including transcription, translation, and target searching of an unobserved protein from experimental measurements of the products of its target gene. Our method is scalable and can be used to analyze non-Markovian models with hidden components.

    Availability and implementation

    Our code is implemented in R and is freely available with a simple example dataset at https://github.com/Mathbiomed/SimMCMC.
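
The replacement of unobserved sequential processes by a distributed time delay (item 3) can be illustrated with a short simulation. The sketch assumes, purely for illustration, that each hidden step has an exponentially distributed waiting time with the same rate, so the total delay follows an Erlang (gamma) distribution with mean `n_steps / rate` and variance `n_steps / rate**2`. The step count, rate, and function names are hypothetical and do not come from the paper or its R code.

```python
import random


def chain_delay(n_steps, rate, rng):
    """Total waiting time of n sequential steps, each exponential with the given rate."""
    return sum(rng.expovariate(rate) for _ in range(n_steps))


def sample_delays(n_steps, rate, n_samples, seed=0):
    """Draw many total delays using a seeded generator for reproducibility."""
    rng = random.Random(seed)
    return [chain_delay(n_steps, rate, rng) for _ in range(n_samples)]


def mean(values):
    return sum(values) / len(values)


def variance(values):
    centre = mean(values)
    return sum((v - centre) ** 2 for v in values) / (len(values) - 1)


# Five hidden steps at rate 2 per unit time: theory gives mean 2.5 and variance 1.25.
delays = sample_delays(n_steps=5, rate=2.0, n_samples=20000)
delay_mean = mean(delays)
delay_variance = variance(delays)
```

Modeling the hidden chain as a single random delay removes the intermediate species from the state, at the cost of making the observed process non-Markovian, which is the trade-off the abstract describes.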

     
  4. Abstract Motivation

    The tertiary structures of an increasing number of biological macromolecules have been determined using cryo-electron microscopy (cryo-EM). However, there are still many cases where the resolution is not high enough to model the molecular structures with standard computational tools. If the resolution obtained is near the empirical borderline (3–4.5 Å), improvement in the map quality facilitates structure modeling.

    Results

    We report EM-GAN, a novel approach that modifies an input cryo-EM map to assist protein structure modeling. The method uses a 3D generative adversarial network (GAN) that has been trained on high- and low-resolution density maps to learn the density patterns, and modifies the input map to enhance its suitability for modeling. The method was tested extensively on a dataset of 65 EM maps in the resolution range of 3–6 Å and showed substantial improvements in structure modeling using popular protein structure modeling tools.

    Availability and implementation

    https://github.com/kiharalab/EM-GAN, Google Colab: https://tinyurl.com/3ccxpttx.

     
  5. Abstract Motivation

    While traditionally utilized for identifying site-specific metabolic activity within a compound to alter its interaction with a metabolizing enzyme, predicting the site-of-metabolism (SOM) is essential in analyzing the promiscuity of enzymes on substrates. The successful prediction of SOMs and the relevant promiscuous products has a wide range of applications that include creating extended metabolic models (EMMs) that account for enzyme promiscuity and the construction of novel heterologous synthesis pathways. There is therefore a need to develop generalized methods that can predict molecular SOMs for a wide range of metabolizing enzymes.

    Results

    This article develops a Graph Neural Network (GNN) model for the classification of an atom (or a bond) being an SOM. Our model, GNN-SOM, is trained on enzymatic interactions, available in the KEGG database, that span all enzyme commission numbers. We demonstrate that GNN-SOM consistently outperforms baseline machine learning models, when trained on all enzymes, on Cytochrome P450 (CYP) enzymes, or on non-CYP enzymes. We showcase the utility of GNN-SOM in prioritizing predicted enzymatic products due to enzyme promiscuity for two biological applications: the construction of EMMs and the construction of synthesis pathways.

    Availability and implementation

    A Python implementation of the trained SOM predictor model can be found at https://github.com/HassounLab/GNN-SOM.

    Supplementary information

    Supplementary data are available at Bioinformatics online.

     
  6. Abstract Motivation

    Accurately representing biological networks in a low-dimensional space, also known as network embedding, is a critical step in network-based machine learning and is carried out widely using node2vec, an unsupervised method based on biased random walks. However, while many networks, including functional gene interaction networks, are dense, weighted graphs, node2vec is fundamentally limited in its ability to use edge weights during the biased random walk generation process, thus under-using the information in the network.

    Results

    Here, we present node2vec+, a natural extension of node2vec that accounts for edge weights when calculating walk biases and reduces to node2vec in the cases of unweighted graphs or unbiased walks. Using two synthetic datasets, we empirically show that node2vec+ is more robust to additive noise than node2vec in weighted graphs. Then, using genome-scale functional gene networks to solve a wide range of gene function and disease prediction tasks, we demonstrate the superior performance of node2vec+ over node2vec in the case of weighted graphs. Notably, due to the limited amount of training data in the gene classification tasks, graph neural networks such as GCN and GraphSAGE are outperformed by both node2vec and node2vec+.

    Availability and implementation

    The data and code are available on GitHub at https://github.com/krishnanlab/node2vecplus_benchmarks. All additional data underlying this article are available on Zenodo at https://doi.org/10.5281/zenodo.7007164.

    Supplementary information

    Supplementary data are available at Bioinformatics online.
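
The limitation described in item 6 can be seen in a sketch of the original node2vec second-order walk bias, written here from the standard formulation with return parameter `p` and in-out parameter `q`. The in/out decision depends only on whether an edge to the previous node exists, not on how strong it is. This is the baseline behavior, not the node2vec+ correction, whose formula the abstract does not give. The graph and function name are hypothetical.

```python
def node2vec_step_probs(graph, prev, cur, p=1.0, q=1.0):
    """Transition probabilities out of `cur` for a walk that arrived from `prev`.

    graph: dict mapping node -> dict of neighbor -> edge weight (undirected).
    """
    scores = {}
    for nxt, weight in graph[cur].items():
        if nxt == prev:
            bias = 1.0 / p  # step back to the previous node
        elif nxt in graph[prev]:
            bias = 1.0      # any edge to prev counts as "in", however weak
        else:
            bias = 1.0 / q  # no edge to prev: an outward step
        scores[nxt] = bias * weight
    total = sum(scores.values())
    return {node: score / total for node, score in scores.items()}


# t and x share only a very weak edge (0.01), yet x is treated as fully "in".
graph = {
    "t": {"v": 1.0, "x": 0.01},
    "v": {"t": 1.0, "x": 1.0, "y": 1.0},
    "x": {"t": 0.01, "v": 1.0},
    "y": {"v": 1.0},
}
probs = node2vec_step_probs(graph, prev="t", cur="v", p=1.0, q=0.5)
```

With `q = 0.5` the walk favors outward moves, so `y` receives twice the score of `x` even though `x` is almost disconnected from `t`. In a dense weighted network nearly every candidate has some edge to the previous node, so the in-out bias rarely applies.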

     
  7. Abstract Summary

    Low-complexity domains (LCDs) in proteins are regions enriched in a small subset of amino acids. LCDs exist in all domains of life, often have unusual biophysical behavior, and function in both normal and pathological processes. We recently developed an algorithm to identify LCDs based predominantly on amino acid composition thresholds. Here, we have integrated this algorithm with a webserver and augmented it with additional analysis options. Specifically, users can (i) search for LCDs in whole proteomes by setting minimum composition thresholds for individual or grouped amino acids, (ii) submit a known LCD sequence to search for similar LCDs, (iii) search for and plot LCDs within a single protein, (iv) statistically test for enrichment of LCDs within a user-provided protein set and (v) specifically identify proteins with multiple types of LCDs.

    Availability and implementation

    The LCD-Composer server can be accessed at http://lcd-composer.bmb.colostate.edu. The corresponding command-line scripts can be accessed at https://github.com/RossLabCSU/LCD-Composer/tree/master/WebserverScripts.
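
A composition-threshold search of the kind described in item 7 can be sketched as a sliding-window scan. This is a minimal illustration and not the LCD-Composer algorithm, which the abstract says is based only predominantly on composition thresholds. The window size, threshold, function names, and test sequence are assumptions.

```python
def composition_windows(sequence, residues, window=20, threshold=0.4):
    """Return (start, end, fraction) for windows whose combined fraction of
    the given residues meets the threshold."""
    targets = set(residues)
    hits = []
    for start in range(0, len(sequence) - window + 1):
        segment = sequence[start:start + window]
        fraction = sum(ch in targets for ch in segment) / window
        if fraction >= threshold:
            hits.append((start, start + window, fraction))
    return hits


def merge_windows(hits):
    """Merge overlapping or adjacent windows into (start, end) regions."""
    regions = []
    for start, end, _ in hits:
        if regions and start <= regions[-1][1]:
            regions[-1] = (regions[-1][0], max(regions[-1][1], end))
        else:
            regions.append((start, end))
    return regions


# A synthetic sequence with a 12-residue glutamine run flanked by alanines.
sequence = "A" * 10 + "Q" * 12 + "A" * 10
hits = composition_windows(sequence, "Q", window=10, threshold=0.8)
regions = merge_windows(hits)
```

Passing several residues at once (for example `"QN"`) corresponds to the grouped amino acid thresholds mentioned in the abstract.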

     
  8. Abstract Motivation

    The multispecies coalescent model is now widely accepted as an effective model for incorporating variation in the evolutionary histories of individual genes into methods for phylogenetic inference from genome-scale data. However, because model-based analysis under the coalescent can be computationally expensive for large datasets, a variety of inferential frameworks and corresponding algorithms have been proposed for estimation of species-level phylogenies and associated parameters, including speciation times and effective population sizes.

    Results

    We consider the problem of estimating the timing of speciation events along a phylogeny in a coalescent framework. We propose a maximum a posteriori estimator based on composite likelihood (MAPCL) for inferring these speciation times under a model of DNA sequence evolution for which exact site-pattern probabilities can be computed under the assumption of a constant θ throughout the species tree. We demonstrate that the MAPCL estimates are statistically consistent and asymptotically normally distributed, and we show how this result can be used to estimate their asymptotic variance. We also provide a more computationally efficient estimator of the asymptotic variance based on the non-parametric bootstrap. We evaluate the performance of our method using simulation and by application to an empirical dataset for gibbons.

    Availability and implementation

    The method has been implemented in the PAUP* program, freely available at https://paup.phylosolutions.com for Macintosh, Windows and Linux operating systems.

    Supplementary information

    Supplementary data are available at Bioinformatics online.

     
  9. Abstract Motivation

    Protein structure prediction has been greatly improved by deep learning, but the contribution of different information is yet to be fully understood. This article studies the impacts of two kinds of information for structure prediction: template and multiple sequence alignment (MSA) embedding. Templates have been used by some methods before, such as AlphaFold2, RoseTTAFold and RaptorX. AlphaFold2 and RoseTTAFold only used templates detected by HHsearch, which may not perform very well on some targets. In addition, sequence embedding generated by pre-trained protein language models has not been fully explored for structure prediction. In this article, we study the impact of templates (including the number of templates, the template quality and how the templates are generated) on protein structure prediction accuracy, especially when the templates are detected by methods other than HHsearch. We also study the impact of sequence embedding (generated by MSATransformer and ESM-1b) on structure prediction.

    Results

    We have implemented a deep learning method for protein structure prediction that may take templates and MSA embedding as extra inputs. We study the contribution of templates and MSA embedding to structure prediction accuracy. Our experimental results show that templates can improve structure prediction on 71 of 110 CASP13 (13th Critical Assessment of Structure Prediction) targets and 47 of 91 CASP14 (14th Critical Assessment of Structure Prediction) targets, and templates are particularly useful for targets with similar templates. MSA embedding can improve structure prediction on 63 of 91 CASP14 targets and 87 of 183 Continuous Automated Model Evaluation (CAMEO) targets and is particularly useful for proteins with shallow MSAs. When both templates and MSA embedding are used, our method can predict correct folds (TMscore > 0.5) for 16 of 23 CASP14 FM targets and 14 of 18 CAMEO targets, outperforming RoseTTAFold by 5% and 7%, respectively.

    Availability and implementation

    Available at https://github.com/xluo233/RaptorXFold.

    Supplementary information

    Supplementary data are available at Bioinformatics online.
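
The target counts reported in item 9 are easier to compare as percentages. The short calculation below uses only the counts stated in the abstract; the dictionary labels and the `percent` helper are introduced here for illustration.

```python
# (improved or correct targets, total targets), as stated in the abstract.
RESULTS = {
    "templates, CASP13": (71, 110),
    "templates, CASP14": (47, 91),
    "MSA embedding, CASP14": (63, 91),
    "MSA embedding, CAMEO": (87, 183),
    "both, correct fold, CASP14 FM": (16, 23),
    "both, correct fold, CAMEO": (14, 18),
}


def percent(hits, total):
    """Share of targets as a percentage, rounded to one decimal place."""
    return round(100.0 * hits / total, 1)


percentages = {label: percent(*counts) for label, counts in RESULTS.items()}
```

This gives roughly 64.5% and 51.6% of targets improved by templates, 69.2% and 47.5% improved by MSA embedding, and correct folds for 69.6% of the CASP14 FM targets and 77.8% of the CAMEO targets.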

     
  10. Abstract Motivation

    Genome-wide maps of epigenetic modifications are powerful resources for non-coding genome annotation. Maps of multiple epigenetic marks have been integrated into cell or tissue type-specific chromatin state annotations for many cell or tissue types. With the increasing availability of multiple chromatin state maps for biologically similar samples, there is a need for methods that can effectively summarize the information about chromatin state annotations within groups of samples and identify differences across groups of samples at a high resolution.

    Results

    We developed CSREP, which takes as input chromatin state annotations for a group of samples. CSREP then probabilistically estimates the state at each genomic position and derives a representative chromatin state map for the group. CSREP uses an ensemble of multi-class logistic regression classifiers that predict the chromatin state assignment of each sample given the state maps from all other samples. The difference in CSREP’s probability assignments for the two groups can be used to identify genomic locations with differential chromatin state assignments. Using groups of chromatin state maps of a diverse set of cell and tissue types, we demonstrate the advantages of using CSREP to summarize chromatin state maps and identify biologically relevant differences between groups at a high resolution.

    Availability and implementation

    The CSREP source code and generated data are available at http://github.com/ernstlab/csrep.

    Supplementary information

    Supplementary data are available at Bioinformatics online.

     