
Title: Interpretable network propagation with application to expanding the repertoire of human proteins that interact with SARS-CoV-2
Abstract

Background

Network propagation has been widely used for nearly 20 years to predict gene functions and phenotypes. Despite the popularity of this approach, little attention has been paid to the question of provenance tracing in this context, e.g., determining how much any experimental observation in the input contributes to the score of every prediction.


Results

We design a network propagation framework with 2 novel components and apply it to predict human proteins that directly or indirectly interact with SARS-CoV-2 proteins. First, we trace the provenance of each prediction to its experimentally validated sources, which in our case are human proteins experimentally determined to interact with viral proteins. Second, we design a technique that helps to reduce the manual adjustment of parameters by users. We find that for every top-ranking prediction, the highest contribution to its score arises from a direct neighbor in a human protein-protein interaction network. We further analyze these results to develop functional insights on SARS-CoV-2 that expand on known biology such as the connection between endoplasmic reticulum stress, HSPA5, and anti-clotting agents.
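To illustrate what tracing the provenance of a prediction can mean in practice, here is a minimal sketch that assumes a random walk with restart on a small toy graph. The abstract does not specify the propagation algorithm, so the function name `rwr_with_provenance`, the restart parameter `alpha`, the column normalization, and the toy network are all illustrative assumptions rather than the authors' implementation. The idea shown is that the score vector is a linear function of the seed vector, so each score splits exactly into per-source contributions.

```python
import numpy as np


def rwr_with_provenance(adj, seeds, alpha=0.5):
    """Random walk with restart plus a per-source breakdown of every score.

    adj   : (n, n) symmetric adjacency matrix (hypothetical toy input)
    seeds : (n,) non-negative weights; non-zero entries mark the sources
    alpha : probability of continuing the walk (assumed value, not from the paper)

    Returns (scores, contributions) where contributions[i, j] is the part
    of node i's score that originates from source j.
    """
    adj = np.asarray(adj, dtype=float)
    seeds = np.asarray(seeds, dtype=float)
    n = adj.shape[0]

    # Column-normalize so that each column of W sums to 1.
    col_sums = adj.sum(axis=0)
    col_sums[col_sums == 0] = 1.0  # leave isolated nodes as zero columns
    W = adj / col_sums

    # Closed form: scores = (1 - alpha) * (I - alpha * W)^{-1} @ seeds
    M = (1 - alpha) * np.linalg.inv(np.eye(n) - alpha * W)

    # Scale column j of M by seeds[j]; row i then lists where score i came from.
    contributions = M * seeds
    scores = contributions.sum(axis=1)
    return scores, contributions


# Toy path graph 0 - 1 - 2 - 3 with sources at both ends.
adj = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
])
seeds = np.array([1.0, 0.0, 0.0, 1.0])

scores, contributions = rwr_with_provenance(adj, seeds)
top_source = contributions.argmax(axis=1)  # dominant source for each node
```

In this toy example, the dominant source for each interior node is the seed it is directly connected to, which mirrors the kind of question the abstract asks about top-ranking predictions. For a large network one would solve the linear system per source instead of forming the dense inverse.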


Conclusions

We examine how our provenance-tracing method can be generalized to a broad class of network-based algorithms. We provide a useful resource for the SARS-CoV-2 community that implicates many previously undocumented proteins with putative functional relationships to viral infection. This resource includes potential drugs that can be opportunistically repositioned to target these proteins. We also discuss how our overall framework can be extended to other, newly emerging viruses.

Award ID(s):
1759858 1817736 2029543
Publisher / Repository:
Oxford University Press
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    Establishing the host range for novel viruses remains a challenge. Here, we address the challenge of identifying non-human animal coronaviruses that may infect humans by creating an artificial neural network model that learns from spike protein sequences of alpha and beta coronaviruses and their binding annotation to their host receptor. The proposed method produces a human-Binding Potential (h-BiP) score that distinguishes, with high accuracy, the binding potential among coronaviruses. Three viruses, previously unknown to bind human receptors, were identified: Bat coronavirus BtCoV/133/2005 and Pipistrellus abramus bat coronavirus HKU5-related (both MERS-related viruses), and Rhinolophus affinis coronavirus isolate LYRa3 (a SARS-related virus). We further analyze the binding properties of BtCoV/133/2005 and LYRa3 using molecular dynamics. To test whether this model can be used for surveillance of novel coronaviruses, we re-trained the model on a set that excludes SARS-CoV-2 and all viral sequences released after SARS-CoV-2 was published. The results predict the binding of SARS-CoV-2 with a human receptor, indicating that machine learning methods are an excellent tool for the prediction of host expansion events.

  2. Abstract

    The COVID-19 pandemic, caused by the coronavirus SARS-CoV-2, has resulted in the loss of millions of lives and severe global economic consequences. Every time SARS-CoV-2 replicates, the viruses acquire new mutations in their genomes. Mutations in SARS-CoV-2 genomes have led to increased transmissibility, severe disease outcomes, evasion of the immune response, changes in clinical manifestations and reduced efficacy of vaccines or treatments. To date, multiple resources provide lists of detected mutations without key functional annotations. There is a lack of research examining the relationship between mutations and various factors such as disease severity, pathogenicity, patient age, patient gender, cross-species transmission, viral immune escape, immune response level, viral transmission capability, viral evolution, host adaptability, viral protein structure, viral protein function, viral protein stability and concurrent mutations. A deep understanding of the relationship between mutation sites and these factors is crucial for advancing our knowledge of SARS-CoV-2 and for developing effective responses. To fill this gap, we built COV2Var, a function annotation database of SARS-CoV-2 genetic variation, available online. COV2Var aims to identify common mutations in SARS-CoV-2 variants and assess their effects, providing a valuable resource for intensive functional annotations of common mutations among SARS-CoV-2 variants.

  3. Abstract

    Predicting protein properties from amino acid sequences is an important problem in biology and pharmacology. Protein–protein interactions among SARS-CoV-2 spike protein, human receptors and antibodies are key determinants of the potency of this virus and its ability to evade the human immune response. As a rapidly evolving virus, SARS-CoV-2 has already developed into many variants with considerable variation in virulence among these variants. Utilizing the proteomic data of SARS-CoV-2 to predict its viral characteristics will, therefore, greatly aid in disease control and prevention. In this paper, we review and compare recent successful prediction methods based on long short-term memory (LSTM), transformer, convolutional neural network (CNN) and a similarity-based topological regression (TR) model and offer recommendations about appropriate predictive methodology depending on the similarity between training and test datasets. We compare the effectiveness of these models in predicting the binding affinity and expression of SARS-CoV-2 spike protein sequences. We also explore how effective these predictive methods are when trained on laboratory-created data and tasked with predicting the binding affinity of the in-the-wild SARS-CoV-2 spike protein sequences obtained from the GISAID datasets. We observe that TR is a better method when the sample size is small and test protein sequences are sufficiently similar to the training sequence. However, when the training sample size is sufficiently large and prediction requires extrapolation, LSTM embedding and CNN-based predictive models show superior performance.
  4. Abstract

    The rampant spread of COVID-19, an infectious disease caused by SARS-CoV-2, all over the world has led to millions of deaths and devastated social, financial and political entities around the world. Without an existing effective medical therapy, vaccines are urgently needed to avoid the spread of this disease. In this study, we propose an in silico deep learning approach for prediction and design of a multi-epitope vaccine (DeepVacPred). By combining in silico immunoinformatics and deep neural network strategies, the DeepVacPred computational framework directly predicts 26 potential vaccine subunits from the available SARS-CoV-2 spike protein sequence. We further use in silico methods to investigate the linear B-cell epitopes, Cytotoxic T Lymphocytes (CTL) epitopes and Helper T Lymphocytes (HTL) epitopes in the 26 subunit candidates and identify the best 11 of them to construct a multi-epitope vaccine for the SARS-CoV-2 virus. The human population coverage, antigenicity, allergenicity, toxicity, physicochemical properties and secondary structure of the designed vaccine are evaluated via state-of-the-art bioinformatic approaches, showing good quality of the designed vaccine. The 3D structure of the designed vaccine is predicted, refined and validated by in silico tools. Finally, we optimize and insert the codon sequence into a plasmid to ensure the cloning and expression efficiency. In conclusion, this proposed artificial intelligence (AI) based vaccine discovery framework accelerates the vaccine design process and constructs a 694aa multi-epitope vaccine containing 16 B-cell epitopes, 82 CTL epitopes and 89 HTL epitopes, which is promising to fight the SARS-CoV-2 viral infection and can be further evaluated in clinical studies. Moreover, we trace the RNA mutations of SARS-CoV-2 and ensure that the designed vaccine can tackle the recent RNA mutations of the virus.

  5. Abstract

    The Papain-like protease (PLpro) is a domain of a multi-functional, non-structural protein 3 of coronaviruses. PLpro cleaves viral polyproteins and posttranslational conjugates with poly-ubiquitin and protective ISG15, composed of two ubiquitin-like (UBL) domains. Across coronaviruses, PLpro showed divergent selectivity for recognition and cleavage of posttranslational conjugates despite sequence conservation. We show that SARS-CoV-2 PLpro binds human ISG15 and K48-linked di-ubiquitin (K48-Ub2) with nanomolar affinity and detect alternate weaker-binding modes. Crystal structures of untethered PLpro complexes with ISG15 and K48-Ub2 combined with solution NMR and cross-linking mass spectrometry revealed how the two domains of ISG15 or K48-Ub2 are differently utilized in interactions with PLpro. Analysis of protein interface energetics predicted differential binding stabilities of the two UBL/Ub domains that were validated experimentally. We emphasize how substrate recognition can be tuned to cleave specifically ISG15 or K48-Ub2 modifications while retaining capacity to cleave mono-Ub conjugates. These results highlight alternative druggable surfaces that would inhibit PLpro function.
