

Title: MaxMIF: A New Method for Identifying Cancer Driver Genes through Effective Data Integration
Abstract

Identification of a few cancer driver mutation genes from a much larger number of passenger mutation genes in cancer samples remains a highly challenging task. Here, a novel method for distinguishing the driver genes from the passenger genes by effective integration of somatic mutation data and molecular interaction data using a maximal mutational impact function (MaxMIF) is presented. When evaluated on six somatic mutation datasets of Pan‐Cancer and 19 datasets of different cancer types from TCGA, MaxMIF almost always significantly outperforms all the existing state‐of‐the‐art methods in terms of predictive accuracy, sensitivity, and specificity. It recovers about 30% more known cancer genes in 500 top‐ranked candidate genes than the best among the other tools evaluated. MaxMIF is also highly robust to data perturbation. Intriguingly, MaxMIF is able to identify potential cancer driver genes, with strong experimental data support. Therefore, MaxMIF can be very useful for identifying or prioritizing cancer driver genes in the increasing number of available cancer genomic data.
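The "known cancer genes among the 500 top-ranked candidates" comparison in the abstract can be illustrated with a small sketch. This is not MaxMIF itself, only one plausible way to compute that kind of top-k evaluation. The function name, the toy rankings, and the reference set below are hypothetical and are not taken from the paper.

```python
def known_genes_in_top_k(ranked_genes, known_genes, k=500):
    """Count how many of the top-k ranked genes appear in a reference set.

    ranked_genes: gene symbols ordered from most to least likely driver.
    known_genes:  iterable of reference cancer genes (hypothetical here).
    """
    known = set(known_genes)
    return sum(1 for gene in ranked_genes[:k] if gene in known)


# Toy data (assumed for illustration only): two rankings from two
# hypothetical tools, and a tiny reference set of known cancer genes.
ranking_a = ["TP53", "KRAS", "GENE_X", "PIK3CA", "GENE_Y"]
ranking_b = ["GENE_X", "TP53", "GENE_Y", "GENE_Z", "KRAS"]
reference = {"TP53", "KRAS", "PIK3CA"}

hits_a = known_genes_in_top_k(ranking_a, reference, k=4)
hits_b = known_genes_in_top_k(ranking_b, reference, k=4)

# Relative gain of ranking A over ranking B at the same cutoff.
relative_gain = (hits_a - hits_b) / hits_b
print(hits_a, hits_b, relative_gain)
```

With real data, `k` would be set to 500 and the reference set would come from a curated list of known cancer genes. A statement such as "about 30% more" corresponds to a relative gain of roughly 0.3 at that cutoff.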

 
NSF-PAR ID:
10065268
Publisher / Repository:
Wiley Blackwell (John Wiley & Sons)
Journal Name:
Advanced Science
Volume:
5
Issue:
9
ISSN:
2198-3844
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    Motivation

    The somatic mutations in the pathways that drive cancer development tend to be mutually exclusive across tumors, providing a signal for distinguishing driver mutations from a larger number of random passenger mutations. This mutual exclusivity signal can be confounded by high and highly variable mutation rates across a cohort of samples. Current statistical tests for exclusivity that incorporate both per-gene and per-sample mutational frequencies are computationally expensive and have limited precision.

    Results

    We formulate a weighted exact test for assessing the significance of mutual exclusivity in an arbitrary number of mutational events. Our test conditions on the number of samples with a mutation as well as per-event, per-sample mutation probabilities. We provide a recursive formula to compute P-values for the weighted test exactly as well as a highly accurate and efficient saddlepoint approximation of the test. We use our test to approximate a commonly used permutation test for exclusivity that conditions on per-event, per-sample mutation frequencies. However, our test is more efficient and it recovers more significant results than the permutation test. We use our Weighted Exclusivity Test (WExT) software to analyze hundreds of colorectal and endometrial samples from The Cancer Genome Atlas, which are two cancer types that often have extremely high mutation rates. On both cancer types, the weighted test identifies sets of mutually exclusive mutations in cancer genes with fewer false positives than earlier approaches.

    Availability and Implementation

    See http://compbio.cs.brown.edu/projects/wext for software.

    Contact

    braphael@cs.brown.edu

    Supplementary information

    Supplementary data are available at Bioinformatics online.

     
  2. Abstract

    Motivation

    The analysis of high-dimensional ‘omics data is often informed by the use of biological interaction networks. For example, protein–protein interaction networks have been used to analyze gene expression data, to prioritize germline variants, and to identify somatic driver mutations in cancer. In these and other applications, the underlying computational problem is to identify altered subnetworks containing genes that are both highly altered in an ‘omics dataset and are topologically close (e.g. connected) on an interaction network.

    Results

    We introduce Hierarchical HotNet, an algorithm that finds a hierarchy of altered subnetworks. Hierarchical HotNet assesses the statistical significance of the resulting subnetworks over a range of biological scales and explicitly controls for ascertainment bias in the network. We evaluate the performance of Hierarchical HotNet and several other algorithms that identify altered subnetworks on the problem of predicting cancer genes and significantly mutated subnetworks. On somatic mutation data from The Cancer Genome Atlas, Hierarchical HotNet outperforms other methods and identifies significantly mutated subnetworks containing both well-known cancer genes and candidate cancer genes that are rarely mutated in the cohort. Hierarchical HotNet is a robust algorithm for identifying altered subnetworks across different ‘omics datasets.

    Availability and implementation

    http://github.com/raphael-group/hierarchical-hotnet.

    Supplementary information

    Supplementary materials are available at Bioinformatics online.

     
  3. Cancer is a complex disease associated with abnormal DNA mutations. Not all tumors are cancerous and not all cancers are the same. Correct diagnosis of cancer type can indicate the most effective drug therapy and increase the survival rate. At the molecular level, it has been shown that cancer type classification can be carried out by analyzing somatic point mutations. However, the high dimensionality and sparsity of genomic mutation data, coupled with small sample sizes, have hindered accurate classification of cancer. We address these problems by introducing a novel classification method called mClass that accounts for the sparsity of the data. mClass is a feature selection method that ranks genes based on their similarity across samples and uses their normalized mutual information to determine the set of genes that provides optimal classification accuracy. Experimental results on TCGA datasets show that mClass significantly improves testing accuracy compared to DeepGene, the state of the art in cancer type classification based on somatic mutation data. In addition, when compared with other cancer gene prediction tools, the set of genes selected by mClass contains the highest number of genes among the top 100 genes listed in the Cancer Gene Census. mClass is available at https://github.com/mdahasan/mClass.
  4. Sequencing of melanomas has identified hundreds of recurrent mutations in both coding and non-coding DNA. These include a number of well-characterized oncogenic driver mutations, such as coding mutations in the BRAF and NRAS oncogenes, and non-coding mutations in the promoter of telomerase reverse transcriptase (TERT). However, the molecular etiology and significance of most of these mutations are unknown. Here, we use a new method known as CPD-capture-seq to map UV-induced cyclobutane pyrimidine dimers (CPDs) with high sequencing depth and single-nucleotide resolution at sites of recurrent mutations in melanoma. Our data reveal that many previously identified drivers and other recurrent mutations in melanoma occur at CPD hotspots in UV-irradiated melanocytes, often associated with an overlapping binding site of an E26 transformation-specific (ETS) transcription factor. In contrast, recurrent mutations in the promoters of a number of known or suspected cancer genes are not associated with elevated CPD levels. Our data indicate that a subset of recurrent protein-coding mutations are also likely caused by ETS-induced CPD hotspots. This analysis indicates that ETS proteins profoundly shape the mutation landscape of melanoma and reveals a method for distinguishing potential driver mutations from passenger mutations whose recurrence is due to elevated UV damage.
  5. Somatic mutations acquired in healthy tissues as we age are major determinants of cancer risk. Whether variants confer a fitness advantage or rise to detectable frequencies by chance remains largely unknown. Blood sequencing data from ~50,000 individuals reveal how mutation, genetic drift, and fitness shape the genetic diversity of healthy blood (clonal hematopoiesis). We show that positive selection, not drift, is the major force shaping clonal hematopoiesis, provide bounds on the number of hematopoietic stem cells, and quantify the fitness advantages of key pathogenic variants, at single-nucleotide resolution, as well as the distribution of fitness effects (fitness landscape) within commonly mutated driver genes. These data are consistent with clonal hematopoiesis being driven by a continuing risk of mutations and clonal expansions that become increasingly detectable with age.

     