skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Accurate and efficient gene function prediction using a multi-bacterial network
Abstract Motivation Nearly 40% of the genes in sequenced genomes have no experimentally or computationally derived functional annotations. To fill this gap, we seek to develop methods for network-based gene function prediction that can integrate heterogeneous data for multiple species with experimentally based functional annotations and systematically transfer them to newly sequenced organisms on a genome-wide scale. However, the large sizes of such networks pose a challenge for the scalability of current methods. Results We develop a label propagation algorithm called FastSinkSource. By formally bounding its rate of progress, we decrease the running time by a factor of 100 without sacrificing accuracy. We systematically evaluate many approaches to construct multi-species bacterial networks and apply FastSinkSource and other state-of-the-art methods to these networks. We find that the most accurate and efficient approach is to pre-compute annotation scores for species with experimental annotations, and then to transfer them to other organisms. In this manner, FastSinkSource runs in under 3 min for 200 bacterial species. Availability and implementation An implementation of our framework and all data used in this research are available at https://github.com/Murali-group/multi-species-GOA-prediction. Supplementary information Supplementary data are available at Bioinformatics online.  more » « less
Award ID(s):
1759858 1817736 1617678
PAR ID:
10279602
Author(s) / Creator(s):
; ;
Editor(s):
Lenore, Cowen
Date Published:
Journal Name:
Bioinformatics
Volume:
37
Issue:
6
ISSN:
1367-4803
Page Range / eLocation ID:
800 to 806
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Mulder, Nicola (Ed.)
    Abstract Motivation Leveraging cross-species information in protein function prediction can add significant power to network-based protein function prediction methods, because so much functional information is conserved across at least close scales of evolution. We introduce MUNDO, a new cross-species co-embedding method that combines a single-network embedding method with a co-embedding method to predict functional annotations in a target species, leveraging also functional annotations in a model species network. Results Across a wide range of parameter choices, MUNDO performs best at predicting annotations in the mouse network, when trained on mouse and human protein–protein interaction (PPI) networks, in the human network, when trained on human and mouse PPIs, and in Baker’s yeast, when trained on Fission and Baker’s yeast, as compared to competitor methods. MUNDO also outperforms all the cross-species methods when predicting in Fission yeast when trained on Fission and Baker’s yeast; however, in this single case, discarding the information from the other species and using annotations from the Fission yeast network alone usually performs best. Availability and implementation All code is available and can be accessed here: github.com/v0rtex20k/MUNDO. Supplementary information Supplementary data are available at Bioinformatics Advances online. Additional experimental results are on our github site. 
    more » « less
  2. Martelli, Pier Luigi (Ed.)
    Abstract Motivation Transferring knowledge between species is challenging: different species contain distinct proteomes and cellular architectures, which cause their proteins to carry out different functions via different interaction networks. Many approaches to protein functional annotation use sequence similarity to transfer knowledge between species. These approaches cannot produce accurate predictions for proteins without homologues of known function, as many functions require cellular context for meaningful prediction. To supply this context, network-based methods use protein-protein interaction (PPI) networks as a source of information for inferring protein function and have demonstrated promising results in function prediction. However, most of these methods are tied to a network for a single species, and many species lack biological networks. Results In this work, we integrate sequence and network information across multiple species by computing IsoRank similarity scores to create a meta-network profile of the proteins of multiple species. We use this integrated multispecies meta-network as input to train a maxout neural network with Gene Ontology terms as target labels. Our multispecies approach takes advantage of more training examples, and consequently leads to significant improvements in function prediction performance compared to two network-based methods, a deep learning sequence-based method and the BLAST annotation method used in the Critial Assessment of Functional Annotation. We are able to demonstrate that our approach performs well even in cases where a species has no network information available: when an organism’s PPI network is left out we can use our multi-species method to make predictions for the left-out organism with good performance. Availability and implementation The code is freely available at https://github.com/nowittynamesleft/NetQuilt. The data, including sequences, PPI networks and GO annotations are available at https://string-db.org/. Supplementary information Supplementary data are available at Bioinformatics online. 
    more » « less
  3. Abstract SummaryComputational methods to predict protein–protein interaction (PPI) typically segregate into sequence-based ‘bottom-up’ methods that infer properties from the characteristics of the individual protein sequences, or global ‘top-down’ methods that infer properties from the pattern of already known PPIs in the species of interest. However, a way to incorporate top-down insights into sequence-based bottom-up PPI prediction methods has been elusive. We thus introduce Topsy-Turvy, a method that newly synthesizes both views in a sequence-based, multi-scale, deep-learning model for PPI prediction. While Topsy-Turvy makes predictions using only sequence data, during the training phase it takes a transfer-learning approach by incorporating patterns from both global and molecular-level views of protein interaction. In a cross-species context, we show it achieves state-of-the-art performance, offering the ability to perform genome-scale, interpretable PPI prediction for non-model organisms with no existing experimental PPI data. In species with available experimental PPI data, we further present a Topsy-Turvy hybrid (TT-Hybrid) model which integrates Topsy-Turvy with a purely network-based model for link prediction that provides information about species-specific network rewiring. TT-Hybrid makes accurate predictions for both well- and sparsely-characterized proteins, outperforming both its constituent components as well as other state-of-the-art PPI prediction methods. Furthermore, running Topsy-Turvy and TT-Hybrid screens is feasible for whole genomes, and thus these methods scale to settings where other methods (e.g. AlphaFold-Multimer) might be infeasible. The generalizability, accuracy and genome-level scalability of Topsy-Turvy and TT-Hybrid unlocks a more comprehensive map of protein interaction and organization in both model and non-model organisms. Availability and implementationhttps://topsyturvy.csail.mit.edu. Supplementary informationSupplementary data are available at Bioinformatics online. 
    more » « less
  4. Schwartz, Russell (Ed.)
    Abstract Motivation Identification and interpretation of non-coding variations that affect disease risk remain a paramount challenge in genome-wide association studies (GWAS) of complex diseases. Experimental efforts have provided comprehensive annotations of functional elements in the human genome. On the other hand, advances in computational biology, especially machine learning approaches, have facilitated accurate predictions of cell-type-specific functional annotations. Integrating functional annotations with GWAS signals has advanced the understanding of disease mechanisms. In previous studies, functional annotations were treated as static of a genomic region, ignoring potential functional differences imposed by different genotypes across individuals. Results We develop a computational approach, Openness Weighted Association Studies (OWAS), to leverage and aggregate predictions of chromosome accessibility in personal genomes for prioritizing GWAS signals. The approach relies on an analytical expression we derived for identifying disease associated genomic segments whose effects in the etiology of complex diseases are evaluated. In extensive simulations and real data analysis, OWAS identifies genes/segments that explain more heritability than existing methods, and has a better replication rate in independent cohorts than GWAS. Moreover, the identified genes/segments show tissue-specific patterns and are enriched in disease relevant pathways. We use rheumatic arthritis and asthma as examples to demonstrate how OWAS can be exploited to provide novel insights on complex diseases. Availability and implementation The R package OWAS that implements our method is available at https://github.com/shuangsong0110/OWAS. Supplementary information Supplementary data are available at Bioinformatics online. 
    more » « less
  5. Abstract Motivation Graph embedding learning that aims to automatically learn low-dimensional node representations, has drawn increasing attention in recent years. To date, most recent graph embedding methods are evaluated on social and information networks and are not comprehensively studied on biomedical networks under systematic experiments and analyses. On the other hand, for a variety of biomedical network analysis tasks, traditional techniques such as matrix factorization (which can be seen as a type of graph embedding methods) have shown promising results, and hence there is a need to systematically evaluate the more recent graph embedding methods (e.g. random walk-based and neural network-based) in terms of their usability and potential to further the state-of-the-art. Results We select 11 representative graph embedding methods and conduct a systematic comparison on 3 important biomedical link prediction tasks: drug-disease association (DDA) prediction, drug–drug interaction (DDI) prediction, protein–protein interaction (PPI) prediction; and 2 node classification tasks: medical term semantic type classification, protein function prediction. Our experimental results demonstrate that the recent graph embedding methods achieve promising results and deserve more attention in the future biomedical graph analysis. Compared with three state-of-the-art methods for DDAs, DDIs and protein function predictions, the recent graph embedding methods achieve competitive performance without using any biological features and the learned embeddings can be treated as complementary representations for the biological features. By summarizing the experimental results, we provide general guidelines for properly selecting graph embedding methods and setting their hyper-parameters for different biomedical tasks. Availability and implementation As part of our contributions in the paper, we develop an easy-to-use Python package with detailed instructions, BioNEV, available at: https://github.com/xiangyue9607/BioNEV, including all source code and datasets, to facilitate studying various graph embedding methods on biomedical tasks. Supplementary information Supplementary data are available at Bioinformatics online. 
    more » « less