Topsy-Turvy: integrating a global view into sequence-based PPI prediction
Abstract Summary

Computational methods to predict protein–protein interaction (PPI) typically segregate into sequence-based ‘bottom-up’ methods that infer properties from the characteristics of the individual protein sequences, or global ‘top-down’ methods that infer properties from the pattern of already known PPIs in the species of interest. However, a way to incorporate top-down insights into sequence-based bottom-up PPI prediction methods has been elusive. We thus introduce Topsy-Turvy, a method that newly synthesizes both views in a sequence-based, multi-scale, deep-learning model for PPI prediction. While Topsy-Turvy makes predictions using only sequence data, during the training phase it takes a transfer-learning approach by incorporating patterns from both global and molecular-level views of protein interaction. In a cross-species context, we show it achieves state-of-the-art performance, offering the ability to perform genome-scale, interpretable PPI prediction for non-model organisms with no existing experimental PPI data. In species with available experimental PPI data, we further present a Topsy-Turvy hybrid (TT-Hybrid) model which integrates Topsy-Turvy with a purely network-based model for link prediction that provides information about species-specific network rewiring. TT-Hybrid makes accurate predictions for both well- and sparsely-characterized proteins, outperforming both its constituent components as well as other state-of-the-art PPI prediction methods. Furthermore, running Topsy-Turvy and TT-Hybrid screens is more »

Availability and implementation

https://topsyturvy.csail.mit.edu.

Supplementary information

Supplementary data are available at Bioinformatics online.

Authors:
; ; ; ;
Award ID(s):
Publication Date:
NSF-PAR ID:
10368340
Journal Name:
Bioinformatics
Volume:
38
Issue:
Supplement_1
Page Range or eLocation-ID:
p. i264-i272
ISSN:
1367-4803
Publisher:
Oxford University Press
National Science Foundation
##### More Like this
1. Abstract Background

Protein–protein interaction (PPI) is vital for life processes, disease treatment, and drug discovery. The computational prediction of PPI is relatively inexpensive and efficient when compared to traditional wet-lab experiments. Given a new protein, one may wish to find whether the protein has any PPI relationship with other existing proteins. Current computational PPI prediction methods usually compare the new protein to existing proteins one by one in a pairwise manner. This is time consuming.

Results

In this work, we propose a more efficient model, called deep hash learning protein-and-protein interaction (DHL-PPI), to predict all-against-all PPI relationships in a database of proteins. First, DHL-PPI encodes a protein sequence into a binary hash code based on deep features extracted from the protein sequences using deep learning techniques. This encoding scheme enables us to turn the PPI discrimination problem into a much simpler searching problem. The binary hash code for a protein sequence can be regarded as a number. Thus, in the pre-screening stage of DHL-PPI, the string matching problem of comparing a protein sequence against a database withMproteins can be transformed into a much more simpler problem: to find a number inside a sorted array of lengthM. This pre-screening process narrows down themore »

Conclusions

The experimental results confirmed that DHL-PPI is feasible and effective. Using a dataset with strictly negative PPI examples of four species, DHL-PPI is shown to be superior or competitive when compared to the other state-of-the-art methods in terms of precision, recall or F1 score. Furthermore, in the prediction stage, the proposed DHL-PPI reduced the time complexity from$$O(M^2)$$$O\left({M}^{2}\right)$to$$O(M\log M)$$$O\left(MlogM\right)$for performing an all-against-all PPI prediction for a database withMproteins. With the proposed approach, a protein database can be preprocessed and stored for later search using the proposed encoding scheme. This can provide a more efficient way to cope with the rapidly increasing volume of protein datasets.

2. Abstract Motivation

As an increasing amount of protein–protein interaction (PPI) data becomes available, their computational interpretation has become an important problem in bioinformatics. The alignment of PPI networks from different species provides valuable information about conserved subnetworks, evolutionary pathways and functional orthologs. Although several methods have been proposed for global network alignment, there is a pressing need for methods that produce more accurate alignments in terms of both topological and functional consistency.

Results

In this work, we present a novel global network alignment algorithm, named ModuleAlign, which makes use of local topology information to define a module-based homology score. Based on a hierarchical clustering of functionally coherent proteins involved in the same module, ModuleAlign employs a novel iterative scheme to find the alignment between two networks. Evaluated on a diverse set of benchmarks, ModuleAlign outperforms state-of-the-art methods in producing functionally consistent alignments. By aligning Pathogen–Human PPI networks, ModuleAlign also detects a novel set of conserved human genes that pathogens preferentially target to cause pathogenesis.

Availability

http://ttic.uchicago.edu/∼hashemifar/ModuleAlign.html

Contact

canzar@ttic.edu or j3xu.ttic.edu

Supplementary information

Supplementary data are available at Bioinformatics online.

3. Abstract Motivation

Computational methods for compound–protein affinity and contact (CPAC) prediction aim at facilitating rational drug discovery by simultaneous prediction of the strength and the pattern of compound–protein interactions. Although the desired outputs are highly structure-dependent, the lack of protein structures often makes structure-free methods rely on protein sequence inputs alone. The scarcity of compound–protein pairs with affinity and contact labels further limits the accuracy and the generalizability of CPAC models.

Results

To overcome the aforementioned challenges of structure naivety and labeled-data scarcity, we introduce cross-modality and self-supervised learning, respectively, for structure-aware and task-relevant protein embedding. Specifically, protein data are available in both modalities of 1D amino-acid sequences and predicted 2D contact maps that are separately embedded with recurrent and graph neural networks, respectively, as well as jointly embedded with two cross-modality schemes. Furthermore, both protein modalities are pre-trained under various self-supervised learning strategies, by leveraging massive amount of unlabeled protein data. Our results indicate that individual protein modalities differ in their strengths of predicting affinities or contacts. Proper cross-modality protein embedding combined with self-supervised learning improves model generalizability when predicting both affinities and contacts for unseen proteins.

Availability and implementation

Data and source codes are available at https://github.com/Shen-Lab/CPAC.

Supplementary information

Supplementary data aremore »

4. (Ed.)
Abstract Motivation Transferring knowledge between species is challenging: different species contain distinct proteomes and cellular architectures, which cause their proteins to carry out different functions via different interaction networks. Many approaches to protein functional annotation use sequence similarity to transfer knowledge between species. These approaches cannot produce accurate predictions for proteins without homologues of known function, as many functions require cellular context for meaningful prediction. To supply this context, network-based methods use protein-protein interaction (PPI) networks as a source of information for inferring protein function and have demonstrated promising results in function prediction. However, most of these methods are tied to a network for a single species, and many species lack biological networks. Results In this work, we integrate sequence and network information across multiple species by computing IsoRank similarity scores to create a meta-network profile of the proteins of multiple species. We use this integrated multispecies meta-network as input to train a maxout neural network with Gene Ontology terms as target labels. Our multispecies approach takes advantage of more training examples, and consequently leads to significant improvements in function prediction performance compared to two network-based methods, a deep learning sequence-based method and the BLAST annotation method used in themore »
5. Abstract Motivation

Most proteins perform their biological functions through interactions with other proteins in cells. Amino acid mutations, especially those occurring at protein interfaces, can change the stability of protein–protein interactions (PPIs) and impact their functions, which may cause various human diseases. Quantitative estimation of the binding affinity changes (ΔΔGbind) caused by mutations can provide critical information for protein function annotation and genetic disease diagnoses.

Results

We present SSIPe, which combines protein interface profiles, collected from structural and sequence homology searches, with a physics-based energy function for accurate ΔΔGbind estimation. To offset the statistical limits of the PPI structure and sequence databases, amino acid-specific pseudocounts were introduced to enhance the profile accuracy. SSIPe was evaluated on large-scale experimental data containing 2204 mutations from 177 proteins, where training and test datasets were stringently separated with the sequence identity between proteins from the two datasets below 30%. The Pearson correlation coefficient between estimated and experimental ΔΔGbind was 0.61 with a root-mean-square-error of 1.93 kcal/mol, which was significantly better than the other methods. Detailed data analyses revealed that the major advantage of SSIPe over other traditional approaches lies in the novel combination of the physical energy function with the new knowledge-based interface profile. SSIPe also considerablymore »

Availability and implementation

Web-server/standalone program, source code and datasets are freely available at https://zhanglab.ccmb.med.umich.edu/SSIPe and https://github.com/tommyhuangthu/SSIPe.

Supplementary information

Supplementary data are available at Bioinformatics online.