
Title: Machine learning empowers phosphoproteome prediction in cancers
Abstract Motivation

Reversible protein phosphorylation is an essential post-translational modification regulating protein functions and signaling pathways in many cellular processes. Aberrant activation of signaling pathways often contributes to cancer development and progression. The mass spectrometry-based phosphoproteomics technique is a powerful tool to investigate the site-level phosphorylation of the proteome in a global fashion, paving the way for understanding the regulatory mechanisms underlying cancers. However, this approach is time-consuming and requires expensive instruments, specialized expertise and a large amount of starting material. An alternative in silico approach is predicting the phosphoproteomic profiles of cancer patients from the available proteomic, transcriptomic and genomic data.

Results

Here, we present a winning algorithm in the 2017 NCI-CPTAC DREAM Proteogenomics Challenge for predicting phosphorylation levels of the proteome across cancer patients. We integrate four components into our algorithm, including (i) baseline correlations between protein and phosphoprotein abundances, (ii) universal protein–protein interactions, (iii) shareable regulatory information across cancer tissues and (iv) associations among multi-phosphorylation sites of the same protein. When tested on a large held-out testing dataset of 108 breast and 62 ovarian cancer samples, our method ranked first in both cancer tissues, demonstrating its robustness and generalization ability.
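Component (i), the baseline correlation between protein and phosphoprotein abundances, can be illustrated with a minimal sketch. It assumes a simple per-site least-squares fit of phosphosite abundance on parent-protein abundance, scored by Pearson correlation on held-out samples. The function names and toy values below are hypothetical and are not taken from the authors' repository; the published method combines this kind of baseline with the other three components and may differ in detail.

```python
from math import sqrt


def fit_baseline(protein, phospho):
    """Least-squares fit phospho ~ slope * protein + intercept for one site."""
    n = len(protein)
    mean_x = sum(protein) / n
    mean_y = sum(phospho) / n
    sxx = sum((x - mean_x) ** 2 for x in protein)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(protein, phospho))
    # Fall back to a flat line if the protein abundance does not vary.
    slope = sxy / sxx if sxx > 0 else 0.0
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, protein):
    """Apply a fitted (slope, intercept) pair to new protein abundances."""
    slope, intercept = model
    return [slope * x + intercept for x in protein]


def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / sqrt(var_a * var_b)


# Toy, made-up abundances for a single phosphosite (illustration only).
train_protein = [0.1, 0.5, 0.9, 1.4]
train_phospho = [0.3, 1.1, 1.9, 2.9]
test_protein = [0.2, 0.7, 1.2]
test_phospho = [0.5, 1.4, 2.6]

model = fit_baseline(train_protein, train_phospho)
score = pearson(predict(model, test_protein), test_phospho)
print(f"slope={model[0]:.2f} intercept={model[1]:.2f} held-out r={score:.3f}")
```

In this sketch each phosphosite would get its own fitted pair, and the held-out correlation gives a per-site reference score against which richer models can be compared.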

Availability and implementation

Our code and reproducible results are freely available on GitHub: https://github.com/GuanLab/phosphoproteome_prediction.

Supplementary information

Supplementary data are available at Bioinformatics online.

NSF-PAR ID:
10114674
Journal Name:
Bioinformatics
ISSN:
1367-4803
Publisher:
Oxford University Press
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract Motivation

    Gene regulatory networks (GRNs) of the same organism can be different under different conditions, although the overall network structure may be similar. Understanding the difference in GRNs under different conditions is important to understand condition-specific gene regulation. When gene expression and other relevant data under two different conditions are available, they can be used by an existing network inference algorithm to estimate two GRNs separately, and then to identify the difference between the two GRNs. However, such an approach does not exploit the similarity in two GRNs, and may sacrifice inference accuracy.

    Results

    In this paper, we model GRNs with the structural equation model (SEM), which can integrate gene expression and genetic perturbation data, and develop an algorithm named fused sparse SEM (FSSEM) to jointly infer GRNs under two conditions and then identify the differences between the two GRNs. Computer simulations demonstrate that the FSSEM algorithm outperforms approaches that estimate the two GRNs separately. Analysis of a lung cancer dataset and a gastric cancer dataset with FSSEM inferred differential GRNs in cancer versus normal tissues, whose genes with the largest network degrees have been reported to be implicated in tumorigenesis. The FSSEM algorithm provides a valuable tool for joint inference of two GRNs and identification of the differential GRN under two conditions.

    Availability and implementation

    The R package fssemR implementing the FSSEM algorithm is available at https://github.com/Ivis4ml/fssemR.git. It is also available on CRAN.

    Supplementary information

    Supplementary data are available at Bioinformatics online.

  2. Abstract Motivation

    Breast cancer is a type of cancer that develops in breast tissues, and, after skin cancer, it is the most commonly diagnosed cancer in women in the United States. Given that an early diagnosis is imperative to prevent breast cancer progression, many machine learning models have been developed in recent years to automate the histopathological classification of the different types of carcinomas. However, many of them are not scalable to large-scale datasets.

    Results

    In this study, we propose the novel Primal-Dual Multi-Instance Support Vector Machine to determine which tissue segments in an image exhibit an indication of an abnormality. We derive an efficient optimization algorithm for the proposed objective by bypassing the quadratic programming and least-squares problems that are commonly employed to optimize Support Vector Machine models. The proposed method is computationally efficient and is therefore scalable to large-scale datasets. We applied our method to the public BreaKHis dataset and achieved promising prediction performance and scalability for histopathological classification.

    Availability and implementation

    Software is publicly available at: https://1drv.ms/u/s!AiFpD21bgf2wgRLbQq08ixD0SgRD?e=OpqEmY.

    Supplementary information

    Supplementary data are available at Bioinformatics online.

  3. Abstract Motivation

    Predicting pathological complete response (pCR) to neoadjuvant chemotherapy (NAC) in triple-negative breast cancer (TNBC) patients accurately is direly needed for clinical decision making. pCR is also regarded as a strong predictor of overall survival. In this work, we propose a deep learning system to predict pCR to NAC based on serial pathology images stained with hematoxylin and eosin and two immunohistochemical biomarkers (Ki67 and PHH3). To support human prior domain knowledge-based guidance and enhance interpretability of the deep learning system, we introduce a human knowledge-derived spatial attention mechanism to inform deep learning models of informative tissue areas of interest. For each patient, three serial breast tumor tissue sections from biopsy blocks were sectioned, stained in three different stains and integrated. The resulting comprehensive attention information from the image triplets is used to guide our prediction system for prognostic tissue regions.

    Results

    The experimental dataset consists of 26 419 pathology image patches of 1000×1000 pixels from 73 TNBC patients treated with NAC. Image patches from 43 randomly selected patients are used as a training dataset, and image patches from the remaining 30 are used as a testing dataset. By maximum voting over patch-level results, our proposed model achieves a 93% patient-level accuracy, outperforming baselines and other state-of-the-art systems, suggesting its high potential for clinical decision making.

    Availability and implementation

    The code, documentation and example data are openly available at: https://github.com/jkonglab/PCR_Prediction_Serial_WSIs_biomarkers

    Supplementary information

    Supplementary data are available at Bioinformatics online.

  4. Abstract Motivation

    Protein function prediction, based on the patterns of connection in a protein–protein interaction (or association) network, is perhaps the most studied of the classical, fundamental inference problems for biological networks. A highly successful set of recent approaches use random walk-based low-dimensional embeddings that tend to place functionally similar proteins into coherent spatial regions. However, these approaches lose valuable local graph structure from the network when considering only the embedding. We introduce GLIDER, a method that replaces a protein–protein interaction or association network with a new graph-based similarity network. GLIDER is based on a variant of our previous GLIDE method, which was designed to predict missing links in protein–protein association networks, capturing implicit local and global (i.e. embedding-based) graph properties.

    Results

    GLIDER outperforms competing methods on the task of predicting GO functional labels in cross-validation on a heterogeneous collection of four human protein–protein association networks derived from the 2016 DREAM Disease Module Identification Challenge, and also on three different protein–protein association networks built from the STRING database. We show that this is due to the strong functional enrichment that is present in the local GLIDER neighborhood in multiple different types of protein–protein association networks. Furthermore, we introduce the GLIDER graph neighborhood as a way for biologists to visualize the local neighborhood of a disease gene. As an application, we look at the local GLIDER neighborhoods of a set of known Parkinson’s Disease GWAS genes, rediscover many genes with known involvement in Parkinson’s disease pathways and suggest some new genes to study.

    Availability and implementation

    All code is publicly available and can be accessed here: https://github.com/kap-devkota/GLIDER.

    Supplementary information

    Supplementary data are available at Bioinformatics online.

  5. Abstract Motivation

    The analysis of spatially resolved transcriptome enables the understanding of the spatial interactions between the cellular environment and transcriptional regulation. In particular, the characterization of the gene–gene co-expression at distinct spatial locations or cell types in the tissue enables delineation of spatial co-regulatory patterns as opposed to standard differential single gene analyses. To enhance the ability and potential of spatial transcriptomics technologies to drive biological discovery, we develop a statistical framework to detect gene co-expression patterns in a spatially structured tissue consisting of different clusters in the form of cell classes or tissue domains.

    Results

    We develop SpaceX (spatially dependent gene co-expression network), a Bayesian methodology to identify both shared and cluster-specific co-expression networks across genes. SpaceX uses an over-dispersed spatial Poisson model coupled with a high-dimensional factor model, which is based on a dimension reduction technique for computational efficiency. We show via simulations that accounting for (increasing) spatial correlation and appropriate noise distributions yields accuracy gains in co-expression network estimation and structure. In-depth analysis of two spatial transcriptomics datasets, in mouse hypothalamus and human breast cancer, using SpaceX detected multiple hub genes that are related to cognitive abilities for the hypothalamus data and multiple cancer genes (e.g. collagen family) from the tumor region for the breast cancer data.

    Availability and implementation

    The SpaceX R-package is available at github.com/bayesrx/SpaceX.

    Supplementary information

    Supplementary data are available at Bioinformatics online.
