Title: Machine learning empowers phosphoproteome prediction in cancers
Abstract Motivation

Reversible protein phosphorylation is an essential post-translational modification regulating protein functions and signaling pathways in many cellular processes. Aberrant activation of signaling pathways often contributes to cancer development and progression. The mass spectrometry-based phosphoproteomics technique is a powerful tool to investigate the site-level phosphorylation of the proteome in a global fashion, paving the way for understanding the regulatory mechanisms underlying cancers. However, this approach is time-consuming and requires expensive instruments, specialized expertise and a large amount of starting material. An alternative in silico approach is predicting the phosphoproteomic profiles of cancer patients from the available proteomic, transcriptomic and genomic data.

Results

Here, we present a winning algorithm in the 2017 NCI-CPTAC DREAM Proteogenomics Challenge for predicting phosphorylation levels of the proteome across cancer patients. We integrate four components into our algorithm, including (i) baseline correlations between protein and phosphoprotein abundances, (ii) universal protein–protein interactions, (iii) shareable regulatory information across cancer tissues and (iv) associations among multi-phosphorylation sites of the same protein. When tested on a large held-out testing dataset of 108 breast and 62 ovarian cancer samples, our method ranked first in both cancer tissues, demonstrating its robustness and generalization ability.
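
As a rough illustration of component (i), the baseline relationship between a protein's abundance and the abundance of one of its phosphosites can be sketched as a per-site linear fit across patients. The snippet below is a minimal sketch on synthetic data, not the authors' implementation: the function names, the simulated values, the handling of missing measurements and the use of Pearson correlation as the score are all assumptions made for illustration.

```python
import math
import random


def fit_line(x, y):
    """Ordinary least squares for y ~ slope * x + intercept.

    Pairs in which either value is missing (None) are skipped.
    """
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    sxx = sum((a - mean_x) ** 2 for a, _ in pairs)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def pearson(u, v):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    suv = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    suu = sum((a - mean_u) ** 2 for a in u)
    svv = sum((b - mean_v) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)


# Synthetic data (hypothetical): one protein and one of its phosphosites
# measured across 120 simulated patients.
rng = random.Random(0)
protein = [rng.gauss(0.0, 1.0) for _ in range(120)]
phospho = [0.8 * p + rng.gauss(0.0, 0.3) for p in protein]

train_x, test_x = protein[:80], protein[80:]
train_y, test_y = phospho[:80], phospho[80:]
train_y[:5] = [None] * 5  # simulate missing phosphosite measurements

# Fit on the training patients, then predict the held-out patients.
slope, intercept = fit_line(train_x, train_y)
predicted = [slope * p + intercept for p in test_x]
r = pearson(predicted, test_y)
print(f"slope={slope:.3f} intercept={intercept:.3f} held-out Pearson r={r:.3f}")
```

A single-feature fit like this only captures the protein-to-phosphosite baseline; the other three components listed above would add further features or models on top of it.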

Availability and implementation

Our code and reproducible results are freely available on GitHub: https://github.com/GuanLab/phosphoproteome_prediction.

Supplementary information

Supplementary data are available at Bioinformatics online.

 
NSF-PAR ID:
10114674
Author(s) / Creator(s):
 ;  ;
Publisher / Repository:
Oxford University Press
Date Published:
Journal Name:
Bioinformatics
ISSN:
1367-4803
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract Background

    Alternative RNA splicing is widely dysregulated in cancers including lung adenocarcinoma, where aberrant splicing events are frequently caused by somatic splice site mutations or somatic mutations of splicing factor genes. However, the majority of mis-splicing in cancers is unexplained by these known mechanisms. We hypothesize that the aberrant Ras signaling characteristic of lung cancers plays a role in promoting the alternative splicing observed in tumors.

    Methods

    We recently performed transcriptome and proteome profiling of human lung epithelial cells ectopically expressing oncogenic KRAS and another cancer-associated Ras GTPase, RIT1. Unbiased analysis of phosphoproteome data identified altered splicing factor phosphorylation in KRAS-mutant cells, so we performed differential alternative splicing analysis using rMATS to identify significantly altered isoforms in lung epithelial cells. To determine whether these isoforms were uniquely regulated by KRAS, we performed a large-scale splicing screen in which we generated over 300 unique RNA sequencing profiles of isogenic A549 lung adenocarcinoma cells ectopically expressing 75 different wild-type or variant alleles across 28 genes implicated in lung cancer.

    Results

    Mass spectrometry data showed widespread downregulation of splicing factor phosphorylation in lung epithelial cells expressing mutant KRAS compared to cells expressing wild-type KRAS. We observed alternative splicing in the same cells, with 2196 and 2416 skipped exon events in KRASG12V and KRASQ61H cells, respectively, 997 of which were shared (p < 0.001 by hypergeometric test). In the high-throughput splicing screen, mutant KRAS induced the second-greatest number of differential alternative splicing events, behind only the RNA binding protein RBM45 and its variant RBM45M126I. We identified ten high-confidence cassette exon events across multiple KRAS variants and cell lines. These included differential splicing of the Myc Associated Zinc Finger (MAZ). As MAZ regulates expression of KRAS, this splice variant may be a mechanism for the cell to modulate wild-type KRAS levels in the presence of oncogenic KRAS.

    Conclusion

    Proteomic and transcriptomic profiling of lung epithelial cells uncovered splicing factor phosphorylation and mRNA splicing events regulated by oncogenic KRAS. These data suggest that in addition to widespread transcriptional changes, the Ras signaling pathway can promote post-transcriptional splicing changes that may contribute to oncogenic processes.

     
  2. Abstract Motivation

    Gene regulatory networks (GRNs) of the same organism can be different under different conditions, although the overall network structure may be similar. Understanding the difference in GRNs under different conditions is important to understand condition-specific gene regulation. When gene expression and other relevant data under two different conditions are available, they can be used by an existing network inference algorithm to estimate two GRNs separately, and then to identify the difference between the two GRNs. However, such an approach does not exploit the similarity in two GRNs, and may sacrifice inference accuracy.

    Results

    In this paper, we model GRNs with the structural equation model (SEM), which can integrate gene expression and genetic perturbation data, and develop an algorithm named fused sparse SEM (FSSEM) to jointly infer GRNs under two conditions and then identify the differences between the two GRNs. Computer simulations demonstrate that the FSSEM algorithm outperforms approaches that estimate the two GRNs separately. Analysis of a lung cancer dataset and a gastric cancer dataset with FSSEM inferred differential GRNs in cancer versus normal tissues, whose genes with the largest network degrees have been reported to be implicated in tumorigenesis. The FSSEM algorithm provides a valuable tool for joint inference of two GRNs and identification of the differential GRN under two conditions.

    Availability and implementation

    The R package fssemR implementing the FSSEM algorithm is available at https://github.com/Ivis4ml/fssemR.git. It is also available on CRAN.

    Supplementary information

    Supplementary data are available at Bioinformatics online.

     
  3. Abstract Motivation

    Kinase-regulated phosphorylation is a ubiquitous type of post-translational modification (PTM) in both eukaryotic and prokaryotic cells. Phosphorylation plays fundamental roles in many signalling pathways and biological processes, such as protein degradation and protein-protein interactions. Experimental studies have revealed that signalling defects caused by aberrant phosphorylation are highly associated with a variety of human diseases, especially cancers. In light of this, a number of computational methods aiming to accurately predict protein kinase family-specific or kinase-specific phosphorylation sites have been established, thereby facilitating phosphoproteomic data analysis.

    Results

    In this work, we present Quokka, a novel bioinformatics tool that allows users to rapidly and accurately identify human kinase family-regulated phosphorylation sites. Quokka was developed by using a variety of sequence scoring functions combined with an optimized logistic regression algorithm. We evaluated Quokka based on well-prepared up-to-date benchmark and independent test datasets, curated from the Phospho.ELM and UniProt databases, respectively. The independent test demonstrates that Quokka improves the prediction performance compared with state-of-the-art computational tools for phosphorylation prediction. In summary, our tool provides users with high-quality predicted human phosphorylation sites for hypothesis generation and biological validation.

    Availability and implementation

    The Quokka webserver and datasets are freely available at http://quokka.erc.monash.edu/.

    Supplementary information

    Supplementary data are available at Bioinformatics online.

     
  4. Abstract Motivation

    The analysis of high-dimensional ‘omics data is often informed by the use of biological interaction networks. For example, protein–protein interaction networks have been used to analyze gene expression data, to prioritize germline variants, and to identify somatic driver mutations in cancer. In these and other applications, the underlying computational problem is to identify altered subnetworks containing genes that are both highly altered in an ‘omics dataset and are topologically close (e.g. connected) on an interaction network.

    Results

    We introduce Hierarchical HotNet, an algorithm that finds a hierarchy of altered subnetworks. Hierarchical HotNet assesses the statistical significance of the resulting subnetworks over a range of biological scales and explicitly controls for ascertainment bias in the network. We evaluate the performance of Hierarchical HotNet and several other algorithms that identify altered subnetworks on the problem of predicting cancer genes and significantly mutated subnetworks. On somatic mutation data from The Cancer Genome Atlas, Hierarchical HotNet outperforms other methods and identifies significantly mutated subnetworks containing both well-known cancer genes and candidate cancer genes that are rarely mutated in the cohort. Hierarchical HotNet is a robust algorithm for identifying altered subnetworks across different ‘omics datasets.

    Availability and implementation

    http://github.com/raphael-group/hierarchical-hotnet.

    Supplementary information

    Supplementary materials are available at Bioinformatics online.

     
  5. Abstract Motivation

    There is recent interest in using gene expression data to contextualize findings from traditional genome-wide association studies (GWAS). Conditioned on a tissue, expression quantitative trait loci (eQTLs) are genetic variants associated with gene expression, and eGenes are genes whose expression levels are associated with genetic variants. eQTLs and eGenes provide strong supporting evidence for GWAS hits and important insights into the regulatory pathways involved in many diseases. When a significant variant or a candidate gene identified by GWAS is also an eQTL or eGene, there is strong evidence to further study this variant or gene. Multi-tissue gene expression datasets like the Genotype-Tissue Expression (GTEx) data are used to find eQTLs and eGenes. Unfortunately, these datasets often have small sample sizes in some tissues. For this reason, many meta-analysis methods have been designed to combine gene expression data across tissues to increase power for finding eQTLs and eGenes. However, these existing techniques do not scale to datasets containing many tissues, like the GTEx data. Furthermore, these methods ignore the biological insight that the same variant may be associated with the same gene across similar tissues.

    Results

    We introduce a meta-analysis model that addresses these problems in existing methods. We focus on the problem of finding eGenes in gene expression data from many tissues, and show that our model is better than other types of meta-analyses.

    Availability and Implementation

    Source code is at https://github.com/datduong/RECOV.

    Supplementary information

    Supplementary data are available at Bioinformatics online.

     
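
The hypergeometric overlap test mentioned in item 1 of the list above (997 skipped exon events shared between sets of 2196 and 2416) can be worked through as follows. This is a minimal sketch, not the analysis from that study: the abstract does not state how many exon events were assessed in total, so the `UNIVERSE` value below is a hypothetical placeholder, and the helper names are illustrative.

```python
import math


def log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def hypergeom_tail(overlap, universe, size_a, size_b):
    """P(X >= overlap) for the overlap X between a fixed set of size_a events
    and size_b events drawn without replacement from `universe` events."""
    log_denominator = log_comb(universe, size_b)
    lowest = max(overlap, size_a + size_b - universe, 0)
    highest = min(size_a, size_b)
    total = 0.0
    for k in range(lowest, highest + 1):
        log_pmf = (
            log_comb(size_a, k)
            + log_comb(universe - size_a, size_b - k)
            - log_denominator
        )
        # Extremely small terms underflow to 0.0, which is harmless here.
        total += math.exp(log_pmf)
    return total


# ASSUMPTION: total number of exon events tested; not given in the abstract.
UNIVERSE = 40_000
SIZE_A, SIZE_B, SHARED = 2196, 2416, 997

expected_overlap = SIZE_A * SIZE_B / UNIVERSE
p_value = hypergeom_tail(SHARED, UNIVERSE, SIZE_A, SIZE_B)
print(f"expected overlap by chance: {expected_overlap:.1f}")
print(f"p-value below 0.001: {p_value < 0.001}")
```

Under this assumed universe, the overlap expected by chance is far smaller than the 997 shared events, which is the kind of situation in which the test reports a very small p-value. A different universe size would change the exact value.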