Title: Modeling clinical and molecular covariates of mutational process activity in cancer
Abstract

Motivation

Somatic mutations result from processes related to DNA replication or environmental/lifestyle exposures. Knowing the activity of mutational processes in a tumor can inform personalized therapies, early detection, and understanding of tumorigenesis. Computational methods have revealed 30 validated signatures of mutational processes active in human cancers, where each signature is a pattern of single base substitutions. However, half of these signatures have no known etiology, and some similar signatures have distinct etiologies, making patterns of mutation signature activity hard to interpret. Existing mutation signature detection methods do not consider tumor-level clinical/demographic (e.g. smoking history) or molecular features (e.g. inactivations to DNA damage repair genes).

Results

To begin to address these challenges, we present the Tumor Covariate Signature Model (TCSM), the first method to directly model the effect of observed tumor-level covariates on mutation signatures. To this end, our model uses methods from Bayesian topic modeling to change the prior distribution on signature exposure conditioned on a tumor’s observed covariates. We also introduce methods for imputing covariates in held-out data and for evaluating the statistical significance of signature-covariate associations. On simulated and real data, we find that TCSM outperforms both non-negative matrix factorization and topic modeling-based approaches, particularly in recovering the ground truth exposure to similar signatures. We then use TCSM to discover five mutation signatures in breast cancer and predict homologous recombination repair deficiency in held-out tumors. We also discover four signatures in a combined melanoma and lung cancer cohort, using cancer type as a covariate, and provide statistical evidence to support earlier claims that three lung cancers from The Cancer Genome Atlas are misdiagnosed metastatic melanomas.

Availability and implementation

TCSM is implemented in Python 3 and available at https://github.com/lrgr/tcsm, along with a data workflow for reproducing the experiments in the paper.

Supplementary information

Supplementary data are available at Bioinformatics online.
Award ID(s):
1632976
NSF-PAR ID:
10111170
Journal Name:
Bioinformatics
Volume:
35
Issue:
14
ISSN:
1367-4803
Page Range / eLocation ID:
i492 to i500
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    RRM2B plays a crucial role in DNA replication, repair and oxidative stress. While germline RRM2B mutations have been implicated in mitochondrial disorders, its relevance to cancer has not been established. Here, using TCGA studies, we investigated RRM2B alterations in cancer. We found that RRM2B is highly amplified in multiple tumor types, particularly in MYC-amplified tumors, and is associated with increased RRM2B mRNA expression. We also observed that the chromosomal region 8q22.3–8q24 is amplified in multiple tumors and includes RRM2B and MYC along with several other cancer-associated genes. An analysis of genes within this 8q-amplicon showed that cancers in which both RRM2B and MYC are amplified have a distinct pattern of amplification compared to cancers that are unaltered or those that have amplifications in RRM2B or MYC only. Investigation of curated biological interactions revealed that gene products of the amplified 8q22.3–8q24 region have important roles in DNA repair, DNA damage response, oxygen sensing, and apoptosis pathways and interact functionally. Notably, RRM2B-amplified cancers are characterized by mutation signatures of defective DNA repair and oxidative stress, and RRM2B-amplified breast cancers, at least, are associated with poor clinical outcome. These data suggest that alterations in RRM2B and possibly the interacting 8q-proteins could have a profound effect on regulatory pathways such as DNA repair and cellular survival, highlighting therapeutic opportunities in these cancers.
  2. Abstract

    BACKGROUND

    Despite widespread interest in next-generation sequencing (NGS), the adoption of personalized clinical genomics and mutation profiling of cancer specimens is lagging, in part because of technical limitations. Tumors are genetically heterogeneous and often contain normal/stromal cells, features that lead to low-abundance somatic mutations that generate ambiguous results or reside below NGS detection limits, thus hindering the clinical sensitivity/specificity standards of mutation calling. We applied COLD-PCR (coamplification at lower denaturation temperature PCR), a PCR methodology that selectively enriches variants, to improve the detection of unknown mutations before NGS-based amplicon resequencing.

    METHODS

    We used both COLD-PCR and conventional PCR (for comparison) to amplify serially diluted mutation-containing cell-line DNA diluted into wild-type DNA, as well as DNA from lung adenocarcinoma and colorectal cancer samples. After amplification of TP53 (tumor protein p53), KRAS (v-Ki-ras2 Kirsten rat sarcoma viral oncogene homolog), IDH1 [isocitrate dehydrogenase 1 (NADP+), soluble], and EGFR (epidermal growth factor receptor) gene regions, PCR products were pooled for library preparation, bar-coded, and sequenced on the Illumina HiSeq 2000.

    RESULTS

    In agreement with recent findings, sequencing errors by conventional targeted-amplicon approaches dictated a mutation-detection limit of approximately 1%–2%. Conversely, COLD-PCR amplicons enriched mutations above the error-related noise, enabling reliable identification of mutation abundances of approximately 0.04%. Sequencing depth was not a large factor in the identification of COLD-PCR–enriched mutations. For the clinical samples, several missense mutations were not called with conventional amplicons, yet they were clearly detectable with COLD-PCR amplicons. Tumor heterogeneity for the TP53 gene was apparent.

    CONCLUSIONS

    As cancer care shifts toward personalized intervention based on each patient's unique genetic abnormalities and tumor genome, we anticipate that COLD-PCR combined with NGS will elucidate the role of mutations in tumor progression, enabling NGS-based analysis of diverse clinical specimens within clinical practice.

     
  3. Abstract

    The successful development and implementation of precision immuno-oncology therapies requires a deeper understanding of the immune architecture at a patient level. T-cell receptor (TCR) repertoire sequencing is a relatively new technology that enables monitoring of T-cells, a subset of immune cells that play a central role in modulating immune response. These immunologic relationships are complex and are governed by various distributional aspects of an individual patient's tumor profile. We propose Bayesian QUANTIle regression for hierarchical COvariates (QUANTICO), which simultaneously models hierarchical relationships between multilevel covariates, conducts explicit variable selection, and estimates quantile- and patient-specific coefficient effects to induce individualized inference. We show that QUANTICO outperforms existing approaches in multiple simulation scenarios. We demonstrate the utility of QUANTICO by investigating the effect of TCR variables on immune response in a cohort of lung cancer patients. At the population level, our analyses reveal the mechanistic role of T-cell proportion on the immune cell abundance, with tumor mutation burden as an important factor modulating this relationship. At the patient level, we find several outlier patients, based on their quantile-specific coefficient functions, who have higher mutational rates and different smoking history.

     
  4. Abstract

    Background

    DNA methylation is an epigenetic event involving the addition of a methyl-group to a cytosine-guanine base pair (i.e., CpG site). It is associated with different cancers. Our research focuses on studying non-small cell lung cancer hemimethylation, which refers to methylation occurring on only one of the two DNA strands. Many studies often assume that methylation occurs on both DNA strands at a CpG site. However, recent publications show the existence of hemimethylation and its significant impact. Therefore, it is important to identify cancer hemimethylation patterns.

    Methods

    In this paper, we use the Wilcoxon signed rank test to identify hemimethylated CpG sites based on publicly available non-small cell lung cancer methylation sequencing data. We then identify two types of hemimethylated CpG clusters, regular and polarity clusters, and genes with large numbers of hemimethylated sites. Highly hemimethylated genes are then studied for their biological interactions using available bioinformatics tools.

    Results

    In this paper, we have conducted the first-ever investigation of hemimethylation in lung cancer. Our results show that hemimethylation does exist in lung cells either as singletons or clusters. Most clusters contain only two or three CpG sites. Polarity clusters are much shorter than regular clusters and appear less frequently. The majority of clusters found in tumor samples have no overlap with clusters found in normal samples, and vice versa. Several genes that are known to be associated with cancer are hemimethylated differently between the cancerous and normal samples. Furthermore, highly hemimethylated genes exhibit many different interactions with other genes that may be associated with cancer. Hemimethylation has diverse patterns and frequencies that are comparable between normal and tumorous cells. Therefore, hemimethylation may be related to both normal and tumor cell development.

    Conclusions

    Our research has identified CpG clusters and genes that are hemimethylated in normal and lung tumor samples. Due to the potential impact of hemimethylation on gene expression and cell function, these clusters and genes may be important to advance our understanding of the development and progression of non-small cell lung cancer.
  5. Abstract

    Identification and characterization of somatic mutations in cancer have important prognostication and treatment implications. Genes encoding the Nuclear factor (erythroid-derived 2)-like 2 (NRF2) transcription factor and its negative regulator, Kelch-like ECH-associated protein 1 (KEAP1), are frequently mutated in cancer. These mutations drive constitutive NRF2 activation and correlate with poor prognosis. Despite its apparent significance, a comprehensive catalogue of somatic NRF2 mutations across different tumor types is still lacking. Here, we catalogue NRF2 mutations in The Cancer Genome Atlas (TCGA) database. 226 unique NRF2-mutant tumors were identified from 10,364 cases. NRF2 mutations were found in 21 out of the 33 tumor types. A total of 11 hotspots were identified. Of these, mutation to the R34 position was most frequent. Notably, R34 and D29 mutations were overrepresented in bladder, lung, and uterine cancers. Analyses of corresponding RNA sequencing data using a de novo derived gene expression classifier showed that the R34 mutations drive constitutive NRF2 activation with a selection pressure biased against the formation of R34L. Of all R34 mutants, R34L conferred the least degree of protein stabilization, suggesting a pro-tumor NRF2 half-life threshold. Our findings offer a comprehensive catalogue of NRF2 mutations in cancer that can help prognostication and NRF2 research.

     