
Title: Modeling clinical and molecular covariates of mutational process activity in cancer
Abstract
Motivation: Somatic mutations result from processes related to DNA replication or environmental/lifestyle exposures. Knowing the activity of mutational processes in a tumor can inform personalized therapies, early detection, and understanding of tumorigenesis. Computational methods have revealed 30 validated signatures of mutational processes active in human cancers, where each signature is a pattern of single base substitutions. However, half of these signatures have no known etiology, and some similar signatures have distinct etiologies, making patterns of mutation signature activity hard to interpret. Existing mutation signature detection methods do not consider tumor-level clinical/demographic (e.g. smoking history) or molecular features (e.g. inactivations to DNA damage repair genes).
Results: To begin to address these challenges, we present the Tumor Covariate Signature Model (TCSM), the first method to directly model the effect of observed tumor-level covariates on mutation signatures. To this end, our model uses methods from Bayesian topic modeling to change the prior distribution on signature exposure conditioned on a tumor’s observed covariates. We also introduce methods for imputing covariates in held-out data and for evaluating the statistical significance of signature-covariate associations. On simulated and real data, we find that TCSM outperforms both non-negative matrix factorization and topic modeling-based approaches, particularly in recovering the ground truth exposure to similar signatures. We then use TCSM to discover five mutation signatures in breast cancer and predict homologous recombination repair deficiency in held-out tumors. We also discover four signatures in a combined melanoma and lung cancer cohort, using cancer type as a covariate, and provide statistical evidence to support earlier claims that three lung cancers from The Cancer Genome Atlas are misdiagnosed metastatic melanomas.
Availability and implementation: TCSM is implemented in Python 3 and available at, along with a data workflow for reproducing the experiments in the paper.
Supplementary information: Supplementary data are available at Bioinformatics online.
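The idea of conditioning the prior on signature exposure on a tumor's covariates can be illustrated with a small generative sketch. This is not the TCSM implementation: it assumes a logistic-normal style prior whose mean is a linear function of the covariates, which is one common choice in covariate-aware topic models and may differ from the paper's exact formulation. The names `WEIGHTS`, `SIGNATURES`, `prior_mean_exposure`, `sample_exposure`, and `sample_mutations` are hypothetical, as are the toy values (two signatures, four mutation categories, an intercept plus a binary smoking covariate).

```python
import math
import random

# Hypothetical toy setup: 2 signatures over 4 mutation categories
# (real single-base-substitution signatures typically use 96 categories).
SIGNATURES = [
    [0.70, 0.10, 0.10, 0.10],  # signature 0
    [0.10, 0.10, 0.10, 0.70],  # signature 1
]

# WEIGHTS[k][j]: assumed effect of covariate j on signature k.
# Covariates here are [intercept, smoker]; the values are illustrative only.
WEIGHTS = [
    [0.0, 2.0],
    [0.0, 0.0],
]


def softmax(values):
    """Map real-valued scores to a probability vector."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def prior_mean_exposure(covariates, weights):
    """Covariate-dependent prior mean of the exposure vector (no noise)."""
    scores = [sum(w * x for w, x in zip(row, covariates)) for row in weights]
    return softmax(scores)


def sample_exposure(covariates, weights, sigma, rng):
    """Draw one tumor's exposures: linear scores plus Gaussian noise, then softmax."""
    scores = [
        sum(w * x for w, x in zip(row, covariates)) + rng.gauss(0.0, sigma)
        for row in weights
    ]
    return softmax(scores)


def sample_mutations(exposure, signatures, n_mutations, rng):
    """Generate category counts: pick a signature per mutation, then a category."""
    n_categories = len(signatures[0])
    counts = [0] * n_categories
    for _ in range(n_mutations):
        k = rng.choices(range(len(exposure)), weights=exposure)[0]
        c = rng.choices(range(n_categories), weights=signatures[k])[0]
        counts[c] += 1
    return counts


if __name__ == "__main__":
    rng = random.Random(0)
    for label, covariates in [("non-smoker", [1.0, 0.0]), ("smoker", [1.0, 1.0])]:
        theta = sample_exposure(covariates, WEIGHTS, sigma=0.5, rng=rng)
        print(label, theta, sample_mutations(theta, SIGNATURES, 100, rng))
```

In this sketch, a positive weight on the smoking covariate shifts the prior toward signature 0 for smokers, so two tumors with similar mutation counts can receive different expected exposures depending on their covariates. Inference (learning the weights and signatures from data) is deliberately left out.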
Page Range or eLocation-ID:
i492 to i500
Sponsoring Org:
National Science Foundation
More Like this
  1. RRM2B plays a crucial role in DNA replication, repair and oxidative stress. While germline RRM2B mutations have been implicated in mitochondrial disorders, its relevance to cancer has not been established. Here, using TCGA studies, we investigated RRM2B alterations in cancer. We found that RRM2B is highly amplified in multiple tumor types, particularly in MYC-amplified tumors, and is associated with increased RRM2B mRNA expression. We also observed that the chromosomal region 8q22.3–8q24 is amplified in multiple tumors and includes RRM2B and MYC along with several other cancer-associated genes. An analysis of genes within this 8q-amplicon showed that cancers in which both RRM2B and MYC are amplified have a distinct pattern of amplification compared to cancers that are unaltered or those that have amplifications in RRM2B or MYC only. Investigation of curated biological interactions revealed that gene products of the amplified 8q22.3–8q24 region have important roles in DNA repair, DNA damage response, oxygen sensing, and apoptosis pathways and interact functionally. Notably, RRM2B-amplified cancers are characterized by mutation signatures of defective DNA repair and oxidative stress, and at least RRM2B-amplified breast cancers are associated with poor clinical outcome. These data suggest that alterations in RRM2B and possibly the interacting 8q proteins could have a profound effect on regulatory pathways such as DNA repair and cellular survival, highlighting therapeutic opportunities in these cancers.
  2. Abstract BACKGROUND

    Despite widespread interest in next-generation sequencing (NGS), the adoption of personalized clinical genomics and mutation profiling of cancer specimens is lagging, in part because of technical limitations. Tumors are genetically heterogeneous and often contain normal/stromal cells, features that lead to low-abundance somatic mutations that generate ambiguous results or reside below NGS detection limits, thus hindering the clinical sensitivity/specificity standards of mutation calling. We applied COLD-PCR (coamplification at lower denaturation temperature PCR), a PCR methodology that selectively enriches variants, to improve the detection of unknown mutations before NGS-based amplicon resequencing.


    We used both COLD-PCR and conventional PCR (for comparison) to amplify serially diluted mutation-containing cell-line DNA diluted into wild-type DNA, as well as DNA from lung adenocarcinoma and colorectal cancer samples. After amplification of TP53 (tumor protein p53), KRAS (v-Ki-ras2 Kirsten rat sarcoma viral oncogene homolog), IDH1 [isocitrate dehydrogenase 1 (NADP+), soluble], and EGFR (epidermal growth factor receptor) gene regions, PCR products were pooled for library preparation, bar-coded, and sequenced on the Illumina HiSeq 2000.


    In agreement with recent findings, sequencing errors by conventional targeted-amplicon approaches dictated a mutation-detection limit of approximately 1%–2%. Conversely, COLD-PCR amplicons enriched mutations above the error-related noise, enabling reliable identification of mutation abundances of approximately 0.04%. Sequencing depth was not a large factor in the identification of COLD-PCR–enriched mutations. For the clinical samples, several missense mutations were not called with conventional amplicons, yet they were clearly detectable with COLD-PCR amplicons. Tumor heterogeneity for the TP53 gene was apparent.


    As cancer care shifts toward personalized intervention based on each patient's unique genetic abnormalities and tumor genome, we anticipate that COLD-PCR combined with NGS will elucidate the role of mutations in tumor progression, enabling NGS-based analysis of diverse clinical specimens within clinical practice.

  3. Cancer genomics has enabled the exhaustive molecular characterization of tumors and exposed hepatocellular carcinoma (HCC) as among the most complex cancers. This complexity is paralleled by dozens of mouse models that generate histologically similar tumors but have not been systematically validated at the molecular level. Accurate models of the molecular pathogenesis of HCC are essential for biomedical progress; therefore we compared genomic and transcriptomic profiles of four separate mouse models [MUP transgenic, TAK1-knockout, carcinogen-driven diethylnitrosamine (DEN), and Stelic Animal Model (STAM)] with those of 987 HCC patients with distinct etiologies. These four models differed substantially in their mutational load, mutational signatures, affected genes and pathways, and transcriptomes. STAM tumors were most molecularly similar to human HCC, with frequent mutations in Ctnnb1, similar pathway alterations, and high transcriptomic similarity to high-grade, proliferative human tumors with poor prognosis. In contrast, TAK1 tumors better reflected the mutational signature of human HCC and were transcriptionally similar to low-grade human tumors. DEN tumors were least similar to human disease and almost universally carried the Braf V637E mutation, which is rarely found in human HCC. Immune analysis revealed that strain-specific MHC-I genotype can influence the molecular makeup of murine tumors. Thus, different mouse models of HCC recapitulate distinct aspects of HCC biology, and their use should be adapted to specific questions based on the molecular features provided here.

  4. Abstract Background DNA methylation is an epigenetic event involving the addition of a methyl-group to a cytosine-guanine base pair (i.e., CpG site). It is associated with different cancers. Our research focuses on studying non-small cell lung cancer hemimethylation, which refers to methylation occurring on only one of the two DNA strands. Many studies often assume that methylation occurs on both DNA strands at a CpG site. However, recent publications show the existence of hemimethylation and its significant impact. Therefore, it is important to identify cancer hemimethylation patterns. Methods In this paper, we use the Wilcoxon signed rank test to identify hemimethylated CpG sites based on publicly available non-small cell lung cancer methylation sequencing data. We then identify two types of hemimethylated CpG clusters, regular and polarity clusters, and genes with large numbers of hemimethylated sites. Highly hemimethylated genes are then studied for their biological interactions using available bioinformatics tools. Results In this paper, we have conducted the first-ever investigation of hemimethylation in lung cancer. Our results show that hemimethylation does exist in lung cells either as singletons or clusters. Most clusters contain only two or three CpG sites. Polarity clusters are much shorter than regular clusters and appear less frequently. The majority of clusters found in tumor samples have no overlap with clusters found in normal samples, and vice versa. Several genes that are known to be associated with cancer are hemimethylated differently between the cancerous and normal samples. Furthermore, highly hemimethylated genes exhibit many different interactions with other genes that may be associated with cancer. Hemimethylation has diverse patterns and frequencies that are comparable between normal and tumorous cells. Therefore, hemimethylation may be related to both normal and tumor cell development.
Conclusions Our research has identified CpG clusters and genes that are hemimethylated in normal and lung tumor samples. Due to the potential impact of hemimethylation on gene expression and cell function, these clusters and genes may be important to advance our understanding of the development and progression of non-small cell lung cancer.
  5. Abstract

    Protein binding microarrays provide comprehensive information about the DNA binding specificities of transcription factors (TFs), and can be used to quantitatively predict the effects of DNA sequence variation on TF binding. There has also been substantial progress in dissecting the patterns of mutations, i.e., the mutational signatures, generated by different mutational processes. By combining these two layers of information we can investigate whether certain mutational processes tend to preferentially affect binding of particular classes of TFs. Such preferential alterations of binding might predispose to particular oncogenic pathways. We developed and implemented a method, termed Signature-QBiC, that integrates protein binding microarray data with the signatures of mutational processes, with the aim of predicting which TFs’ binding profiles are preferentially perturbed by particular mutational processes. We used Signature-QBiC to predict the effects of 47 signatures of mutational processes on 582 human TFs. Pathway analysis showed that binding of TFs involved in NOTCH1 signaling is strongly affected by the signatures of several mutational processes, including exposure to ultraviolet radiation. Additionally, toll-like-receptor signaling pathways are also vulnerable to disruption by this exposure. This study provides a novel overview of the effects of mutational processes on TF binding and the potential of these processes to activate oncogenic pathways through mutating TF binding sites.
