
Title: Personalized Integrated Network Modeling of the Cancer Proteome Atlas

Personalized (patient-specific) approaches have recently emerged with a precision medicine paradigm that acknowledges the fact that molecular pathway structures and activity might be considerably different within and across tumors. The functional cancer genome and proteome provide rich sources of information to identify patient-specific variations in signaling pathways and activities within and across tumors; however, current analytic methods lack the ability to exploit the diverse and multi-layered architecture of these complex biological networks. We assessed pan-cancer pathway activities for >7700 patients across 32 tumor types from The Cancer Proteome Atlas by developing a personalized cancer-specific integrated network estimation (PRECISE) model. PRECISE is a general Bayesian framework for integrating existing interaction databases, data-driven de novo causal structures, and upstream molecular profiling data to estimate cancer-specific integrated networks, infer patient-specific networks, and elicit interpretable pathway-level signatures. PRECISE-based pathway signatures can delineate pan-cancer commonalities and differences in proteomic network biology within and across tumors, and demonstrate robust tumor stratification that is both biologically and clinically informative, with superior prognostic power compared to existing approaches. Towards establishing the translational relevance of the functional proteome in research and clinical settings, we provide an online, publicly available, comprehensive database and visualization repository of our findings (

Journal Name: Scientific Reports
Publisher: Nature Publishing Group
Sponsoring Org: National Science Foundation
More Like this
  1. Abstract. Motivation: Somatic mutations result from processes related to DNA replication or environmental/lifestyle exposures. Knowing the activity of mutational processes in a tumor can inform personalized therapies, early detection, and understanding of tumorigenesis. Computational methods have revealed 30 validated signatures of mutational processes active in human cancers, where each signature is a pattern of single base substitutions. However, half of these signatures have no known etiology, and some similar signatures have distinct etiologies, making patterns of mutation signature activity hard to interpret. Existing mutation signature detection methods do not consider tumor-level clinical/demographic (e.g. smoking history) or molecular features (e.g. inactivations to DNA damage repair genes). Results: To begin to address these challenges, we present the Tumor Covariate Signature Model (TCSM), the first method to directly model the effect of observed tumor-level covariates on mutation signatures. To this end, our model uses methods from Bayesian topic modeling to change the prior distribution on signature exposure conditioned on a tumor’s observed covariates. We also introduce methods for imputing covariates in held-out data and for evaluating the statistical significance of signature-covariate associations. On simulated and real data, we find that TCSM outperforms both non-negative matrix factorization and topic modeling-based approaches, particularly in recovering the ground truth exposure to similar signatures. We then use TCSM to discover five mutation signatures in breast cancer and predict homologous recombination repair deficiency in held-out tumors. We also discover four signatures in a combined melanoma and lung cancer cohort—using cancer type as a covariate—and provide statistical evidence to support earlier claims that three lung cancers from The Cancer Genome Atlas are misdiagnosed metastatic melanomas.
Availability and implementation: TCSM is implemented in Python 3 and available at, along with a data workflow for reproducing the experiments in the paper. Supplementary information: Supplementary data are available at Bioinformatics online.
  2. Abstract Background

    Advances in microbiome science are being driven in large part due to our ability to study and infer microbial ecology from genomes reconstructed from mixed microbial communities using metagenomics and single-cell genomics. Such omics-based techniques allow us to read genomic blueprints of microorganisms, decipher their functional capacities and activities, and reconstruct their roles in biogeochemical processes. Currently available tools for analyses of genomic data can annotate and depict metabolic functions to some extent; however, no standardized approaches are currently available for the comprehensive characterization of metabolic predictions, metabolite exchanges, microbial interactions, and microbial contributions to biogeochemical cycling.


    We present METABOLIC (METabolic And BiogeOchemistry anaLyses In miCrobes), a scalable software to advance microbial ecology and biogeochemistry studies using genomes at the resolution of individual organisms and/or microbial communities. The genome-scale workflow includes annotation of microbial genomes, motif validation of biochemically validated conserved protein residues, metabolic pathway analyses, and calculation of contributions to individual biogeochemical transformations and cycles. The community-scale workflow supplements genome-scale analyses with determination of genome abundance in the microbiome, potential microbial metabolic handoffs and metabolite exchange, reconstruction of functional networks, and determination of microbial contributions to biogeochemical cycles. METABOLIC can take input genomes from isolates, metagenome-assembled genomes, or single-cell genomes. Results are presented in the form of tables for metabolism and a variety of visualizations including biogeochemical cycling potential, representation of sequential metabolic transformations, community-scale microbial functional networks using a newly defined metric “MW-score” (metabolic weight score), and metabolic Sankey diagrams. METABOLIC takes ~ 3 h with 40 CPU threads to process ~ 100 genomes and corresponding metagenomic reads within which the most compute-demanding part of hmmsearch takes ~ 45 min, while it takes ~ 5 h to complete hmmsearch for ~ 3600 genomes. Tests of accuracy, robustness, and consistency suggest METABOLIC provides better performance compared to other software and online servers. To highlight the utility and versatility of METABOLIC, we demonstrate its capabilities on diverse metagenomic datasets from the marine subsurface, terrestrial subsurface, meadow soil, deep sea, freshwater lakes, wastewater, and the human gut.


    METABOLIC enables the consistent and reproducible study of microbial community ecology and biogeochemistry using a foundation of genome-informed microbial metabolism, and will advance the integration of uncultivated organisms into metabolic and biogeochemical models. METABOLIC is written in Perl and R and is freely available under GPLv3 at

  3. Abstract Background

    Identifying splice site regions is an important step in the genomic DNA sequencing pipelines of biomedical and pharmaceutical research. Within this research purview, efficient and accurate splice site detection is highly desirable, and a variety of computational models have been developed toward this end. Neural network architectures have recently been shown to outperform classical machine learning approaches for the task of splice site prediction. Despite these advances, there is still considerable potential for improvement, especially regarding model prediction accuracy and error rate.


    Given these deficits, we propose EnsembleSplice, an ensemble learning architecture made up of four (4) distinct convolutional neural network (CNN) model architectures that outperforms existing splice site detection methods on the experimental evaluation metrics considered, including accuracy and error rate. We trained and tested a variety of ensembles made up of CNNs and DNNs using the five-fold cross-validation method to identify the model that performed the best across the evaluation and diversity metrics. As a result, we developed our diverse and highly effective splice site (SS) detection model, which we evaluated using two (2) genomic Homo sapiens datasets and the Arabidopsis thaliana dataset. The results showed that, for one of the Homo sapiens datasets, EnsembleSplice achieved accuracies of 94.16% for acceptor splice sites and 95.97% for donor splice sites, with error rates for the same Homo sapiens dataset of 4.03% for the donor splice sites and 5.84% for the acceptor splice sites.


    Our five-fold cross-validation ensured that the prediction accuracy of our models is consistent. For reproducibility, all the datasets used, models generated, and results in our work are publicly available in our GitHub repository here:

  4. Abstract

    Diverse processes in cancer are mediated by enzymes, which most proximally exert their function through their activity. High-fidelity methods to profile enzyme activity are therefore critical to understanding and targeting the pathological roles of enzymes in cancer. Here, we present an integrated set of methods for measuring specific protease activities across scales, and deploy these methods to study treatment response in an autochthonous model of Alk-mutant lung cancer. We leverage multiplexed nanosensors and machine learning to analyze in vivo protease activity dynamics in lung cancer, identifying significant dysregulation that includes enhanced cleavage of a peptide, S1, which rapidly returns to healthy levels with targeted therapy. Through direct on-tissue localization of protease activity, we pinpoint S1 cleavage to the tumor vasculature. To link protease activity to cellular function, we design a high-throughput method to isolate and characterize proteolytically active cells, uncovering a pro-angiogenic phenotype in S1-cleaving cells. These methods provide a framework for functional, multiscale characterization of protease dysregulation in cancer.

  5. There are currently no effective biomarkers for prognosis and optimal treatment selection to improve non-small cell lung cancer (NSCLC) survival outcomes. This study further validated a seven-gene panel for diagnosis and prognosis of NSCLC using RNA sequencing and proteomic profiles of patient tumors. Within the seven-gene panel, ZNF71 expression combined with dendritic cell activities defined NSCLC patient subgroups (n = 966) with distinct survival outcomes (p = 0.04, Kaplan–Meier analysis). ZNF71 expression was significantly associated with the activities of natural killer cells (p = 0.014) and natural killer T cells (p = 0.003) in NSCLC patient tumors (n = 1016) using Chi-squared tests. Overexpression of ZNF71 resulted in decreased expression of multiple components of the intracellular intrinsic and innate immune systems, including dsRNA and dsDNA sensors. Multi-omics networks of ZNF71 and the intracellular intrinsic and innate immune systems were computed as relevant to NSCLC tumorigenesis, proliferation, and survival using patient clinical information and in-vitro CRISPR-Cas9/RNAi screening data. From these networks, pan-sensitive and pan-resistant genes to 21 NCCN-recommended drugs for treating NSCLC were selected. Based on the gene associations with patient survival and in-vitro CRISPR-Cas9, RNAi, and drug screening data, MEK1/2 inhibitors PD-198306 and U-0126, VEGFR inhibitor ZM-306416, and IGF-1R inhibitor PQ-401 were discovered as potential targeted therapy that may also induce an immune response for treating NSCLC.