
Title: Personalized Integrated Network Modeling of the Cancer Proteome Atlas

Personalized (patient-specific) approaches have recently emerged with a precision medicine paradigm that acknowledges that molecular pathway structures and activity might differ considerably within and across tumors. The functional cancer genome and proteome provide rich sources of information to identify patient-specific variations in signaling pathways and activities within and across tumors; however, current analytic methods lack the ability to exploit the diverse and multi-layered architecture of these complex biological networks. We assessed pan-cancer pathway activities for >7700 patients across 32 tumor types from The Cancer Proteome Atlas by developing a personalized cancer-specific integrated network estimation (PRECISE) model. PRECISE is a general Bayesian framework for integrating existing interaction databases, data-driven de novo causal structures, and upstream molecular profiling data to estimate cancer-specific integrated networks, infer patient-specific networks, and elicit interpretable pathway-level signatures. PRECISE-based pathway signatures can delineate pan-cancer commonalities and differences in proteomic network biology within and across tumors, and they demonstrate robust tumor stratification that is both biologically and clinically informative, with superior prognostic power compared to existing approaches. Towards establishing the translational relevance of the functional proteome in research and clinical settings, we provide an online, publicly available, comprehensive database and visualization repository of our findings (

Publisher / Repository:
Nature Publishing Group
Journal Name:
Scientific Reports
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    Motivation: Somatic mutations result from processes related to DNA replication or environmental/lifestyle exposures. Knowing the activity of mutational processes in a tumor can inform personalized therapies, early detection, and understanding of tumorigenesis. Computational methods have revealed 30 validated signatures of mutational processes active in human cancers, where each signature is a pattern of single base substitutions. However, half of these signatures have no known etiology, and some similar signatures have distinct etiologies, making patterns of mutation signature activity hard to interpret. Existing mutation signature detection methods do not consider tumor-level clinical/demographic (e.g. smoking history) or molecular features (e.g. inactivations to DNA damage repair genes).

    Results: To begin to address these challenges, we present the Tumor Covariate Signature Model (TCSM), the first method to directly model the effect of observed tumor-level covariates on mutation signatures. To this end, our model uses methods from Bayesian topic modeling to change the prior distribution on signature exposure conditioned on a tumor’s observed covariates. We also introduce methods for imputing covariates in held-out data and for evaluating the statistical significance of signature-covariate associations. On simulated and real data, we find that TCSM outperforms both non-negative matrix factorization and topic modeling-based approaches, particularly in recovering the ground truth exposure to similar signatures. We then use TCSM to discover five mutation signatures in breast cancer and predict homologous recombination repair deficiency in held-out tumors. We also discover four signatures in a combined melanoma and lung cancer cohort—using cancer type as a covariate—and provide statistical evidence to support earlier claims that three lung cancers from The Cancer Genome Atlas are misdiagnosed metastatic melanomas.

    Availability and implementation: TCSM is implemented in Python 3 and available at, along with a data workflow for reproducing the experiments in the paper.

    Supplementary information: Supplementary data are available at Bioinformatics online.
  2. Abstract

    Motivation: Cancer is the process of accumulating genetic alterations that confer selective advantages to tumor cells. The order in which aberrations occur is not arbitrary, and inferring the order of events is challenging due to the lack of longitudinal samples from tumors. Moreover, a network model of oncogenesis should capture biological facts such as distinct progression trajectories of cancer subtypes and patterns of mutual exclusivity of alterations in the same pathways.

    In this paper, we present the disjunctive Bayesian network (DBN), a novel oncogenetic model with a phylogenetic interpretation. DBN is expressive enough to capture cancer subtypes' trajectories and mutually exclusive relations between alterations from unstratified data.

    Results: In cases where the number of studied alterations is small (), we provide an efficient dynamic programming implementation of an exact structure learning method that finds a best DBN in the superexponential search space of networks. In the rare cases where the number of alterations is large, we provide an efficient genetic algorithm in our software package, OncoBN. Through numerous experiments on synthetic and real data, we show OncoBN's ability to infer ground truth networks and recover biologically meaningful progression networks.

    Availability: OncoBN is implemented in R and is available at

  3. Abstract Background

    Advances in microbiome science are being driven in large part by our ability to study and infer microbial ecology from genomes reconstructed from mixed microbial communities using metagenomics and single-cell genomics. Such omics-based techniques allow us to read genomic blueprints of microorganisms, decipher their functional capacities and activities, and reconstruct their roles in biogeochemical processes. Currently available tools for analyses of genomic data can annotate and depict metabolic functions to some extent; however, no standardized approaches are currently available for the comprehensive characterization of metabolic predictions, metabolite exchanges, microbial interactions, and microbial contributions to biogeochemical cycling.


    We present METABOLIC (METabolic And BiogeOchemistry anaLyses In miCrobes), scalable software to advance microbial ecology and biogeochemistry studies using genomes at the resolution of individual organisms and/or microbial communities. The genome-scale workflow includes annotation of microbial genomes, motif validation of biochemically validated conserved protein residues, metabolic pathway analyses, and calculation of contributions to individual biogeochemical transformations and cycles. The community-scale workflow supplements genome-scale analyses with determination of genome abundance in the microbiome, potential microbial metabolic handoffs and metabolite exchange, reconstruction of functional networks, and determination of microbial contributions to biogeochemical cycles. METABOLIC can take input genomes from isolates, metagenome-assembled genomes, or single-cell genomes. Results are presented in the form of tables for metabolism and a variety of visualizations including biogeochemical cycling potential, representation of sequential metabolic transformations, community-scale microbial functional networks using a newly defined metric “MW-score” (metabolic weight score), and metabolic Sankey diagrams. METABOLIC takes ~3 h with 40 CPU threads to process ~100 genomes and corresponding metagenomic reads, of which the most compute-demanding step, hmmsearch, takes ~45 min; completing hmmsearch for ~3600 genomes takes ~5 h. Tests of accuracy, robustness, and consistency suggest METABOLIC performs better than other software and online servers. To highlight the utility and versatility of METABOLIC, we demonstrate its capabilities on diverse metagenomic datasets from the marine subsurface, terrestrial subsurface, meadow soil, deep sea, freshwater lakes, wastewater, and the human gut.


    METABOLIC enables the consistent and reproducible study of microbial community ecology and biogeochemistry using a foundation of genome-informed microbial metabolism, and will advance the integration of uncultivated organisms into metabolic and biogeochemical models. METABOLIC is written in Perl and R and is freely available under GPLv3 at

  4. Colorectal cancer (CRC) is the third leading cause of cancer-related deaths in the United States. To advance the understanding of CRC tumor progression, models that mimic the tumor microenvironment (TME) and have translatable study outcomes are urgently needed. CRC patient-derived xenografts (PDXs) are promising tools for their ability to recapitulate tumor heterogeneity and key patient tumor characteristics, such as molecular characteristics. However, as in vivo models, CRC PDXs are costly and low-throughput, which leads to a need for equivalent in vitro models. To address this need, we previously established an in vitro model using a tissue engineering toolset with CRC PDX cells. However, it is unclear whether tissue engineering has the capacity to maintain patient- and/or cancer stage-specific tumor heterogeneity. To address this gap, we employed three PDX tumor lines, originating from stage II, III-B, and IV CRC tumors, in the formation of 3D engineered CRC PDX (3D-eCRC-PDX) tissues and performed an in-depth comparison between the 3D-eCRC-PDX tissues and the original CRC-PDX tumors. To form the tissues, CRC-PDX tumors were expanded in vivo and dissociated. The isolated cells were encapsulated within poly(ethylene glycol)-fibrinogen hydrogels and remained viable and proliferative post-encapsulation over the course of 29 days in culture. To gain molecular insight into the maintenance of PDX line stage heterogeneity, we performed a transcriptomic analysis using RNA-seq to determine the extent of similarities and differences between the CRC-PDX tumors and the 3D-eCRC-PDX tissues. We observed the greatest correspondence in overlapping differentially expressed human genes, gene ontology, and Hallmark gene set enrichment between the 3D-eCRC-PDX tissues and CRC-PDX tumors in the stage II PDX line, while the least correspondence was observed in the stage IV PDX line. The Hallmark gene set enrichment from murine-mapped RNA-seq transcripts was PDX line-specific, which suggested that the stromal component of the 3D-eCRC-PDX tissues was maintained in a PDX line-dependent manner. Consistent with our transcriptomic analysis, we observed that tumor cell subpopulations, including human proliferative (B2M+Ki67+) and CK20+ cells, remained constant for up to 15 days in culture even though the number of cells in the 3D-eCRC-PDX tissues from all three CRC stages increased over time. Yet, tumor cell subpopulation differences in the stage IV 3D-eCRC-PDX tissues were observed starting at 22 days in culture. Overall, our results demonstrate a strong correlation between our in vitro 3D-eCRC-PDX models and the originating in vivo CRC-PDX tumors, providing evidence that these engineered tissues may be capable of mimicking patient- and/or cancer stage-specific heterogeneity.
  5. Abstract

    Mutual exclusivity of cancer driving mutations is a frequently observed phenomenon in the mutational landscape of cancer. The long tail of rare mutations complicates the discovery of mutually exclusive driver modules. Existing methods often identify modules in which only a few genes cover most of the cancer samples. To overcome this hurdle, an efficient method, UniCovEx, is presented that identifies mutually exclusive driver modules with balanced exclusive coverage. UniCovEx first searches for candidate driver modules with a strong topological relationship in signaling networks using a greedy strategy. It then evaluates the candidate modules by considering their coverage, exclusivity, and balance of coverage, using a novel metric termed exclusive entropy of modules, which measures how balanced the modules are. Finally, UniCovEx predicts sample‐specific driver modules by solving a minimum set cover problem using a greedy strategy. When tested on 12 datasets of different cancer types from The Cancer Genome Atlas, UniCovEx shows significant superiority over previous methods. The software is available at:‐pathway/files/.
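The last step named in the abstract above, solving a minimum set cover problem greedily, can be illustrated with a generic sketch. This is the textbook greedy heuristic and not the UniCovEx implementation; the function name, the module names and the sample IDs are all hypothetical, and the abstract does not say how the method scores or breaks ties between modules.

```python
def greedy_set_cover(universe, candidate_sets):
    """Greedily choose sets until every element of `universe` is covered.

    candidate_sets maps a (hypothetical) module name to the set of samples
    it covers. Returns the chosen names in selection order.
    """
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # Pick the set that covers the most still-uncovered elements.
        # Iterating over sorted names makes ties resolve alphabetically.
        best = max(sorted(candidate_sets),
                   key=lambda name: len(candidate_sets[name] & uncovered))
        gain = candidate_sets[best] & uncovered
        if not gain:
            break  # the remaining elements cannot be covered by any set
        chosen.append(best)
        uncovered -= gain
    return chosen


# Toy input: five samples and four candidate modules (all hypothetical).
samples = {"s1", "s2", "s3", "s4", "s5"}
modules = {
    "A": {"s1", "s2", "s3"},
    "B": {"s3", "s4"},
    "C": {"s4", "s5"},
    "D": {"s1"},
}
cover = greedy_set_cover(samples, modules)
print(cover)
```

The greedy heuristic does not guarantee a minimum cover in general, but it is simple and gives a logarithmic approximation guarantee, which is why it is a common choice for this problem.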
