Title: Application of Feature Selection and Deep Learning for Cancer Prediction Using DNA Methylation Markers

DNA methylation is a process that can affect gene accessibility and therefore gene expression. In this study, we propose a machine learning pipeline for predicting breast cancer and identifying the genes that contribute most to the prediction. The study used breast cancer methylation data from The Cancer Genome Atlas (TCGA), specifically the TCGA-BRCA dataset. Feature engineering techniques were used to reduce data volume and make deep learning scalable. A comparative analysis of the proposed approach on Illumina 27K and 450K methylation data shows that deep learning methodologies for cancer prediction can be coupled with feature selection models to enhance prediction accuracy. Prediction using 450K methylation markers can be accomplished in less than 13 s with an accuracy of 98.75%. Of the 685 genes in the feature-selected 27K dataset, 578 were mapped to Ensembl Gene IDs. This reduced set was significantly (FDR < 0.05) enriched in five biological processes and one molecular function. Of the 1572 genes in the feature-selected 450K dataset, 1290 were mapped to Ensembl Gene IDs. This reduced set was significantly (FDR < 0.05) enriched in 95 biological processes and 17 molecular functions. Seven oncogenes/tumor suppressor genes were common to the 27K and 450K feature-selected gene sets: RTN4IP1, MYO18B, ANP32A, BRF1, SETBP1, NTRK1, and IGF2R. Our bioinformatics deep learning workflow, which incorporates imputation and data balancing methods, identifies important methylation markers related to functionally important genes in breast cancer with high accuracy compared to deep learning or statistical models alone.
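The preprocessing stages named in the abstract (imputation, data balancing, feature selection) can be illustrated with a minimal sketch. The abstract does not say which specific methods were used, so everything below is an assumption chosen for simplicity: column-mean imputation, random oversampling of the minority class, and ranking CpG columns by the absolute difference in class means. The function names and the toy beta-value matrix are hypothetical, and the deep learning classifier itself is omitted.

```python
import random
from statistics import mean

def impute_column_means(rows):
    """Replace None entries with the mean of the observed values in that column."""
    n_cols = len(rows[0])
    col_means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        col_means.append(mean(observed) if observed else 0.0)
    return [[col_means[j] if v is None else v for j, v in enumerate(r)] for r in rows]

def oversample_minority(rows, labels, seed=0):
    """Randomly duplicate samples of smaller classes until all classes are equal in size."""
    rng = random.Random(seed)
    by_class = {}
    for r, y in zip(rows, labels):
        by_class.setdefault(y, []).append(r)
    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for y, members in by_class.items():
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for r in members + extra:
            out_rows.append(r)
            out_labels.append(y)
    return out_rows, out_labels

def select_top_k(rows, labels, k):
    """Keep the k columns with the largest absolute difference in class means."""
    n_cols = len(rows[0])
    scores = []
    for j in range(n_cols):
        tumor = [r[j] for r, y in zip(rows, labels) if y == 1]
        normal = [r[j] for r, y in zip(rows, labels) if y == 0]
        scores.append((abs(mean(tumor) - mean(normal)), j))
    keep = sorted(j for _, j in sorted(scores, reverse=True)[:k])
    return keep, [[r[j] for j in keep] for r in rows]

# Hypothetical toy data: rows are samples, columns are CpG beta values (None = missing).
beta = [
    [0.90, 0.10, None, 0.50],
    [0.85, 0.20, 0.40, 0.52],
    [0.80, None, 0.45, 0.49],
    [0.10, 0.15, 0.42, 0.51],
]
labels = [1, 1, 1, 0]  # 1 = tumor, 0 = normal

imputed = impute_column_means(beta)
bal_rows, bal_labels = oversample_minority(imputed, labels)
keep, reduced = select_top_k(bal_rows, bal_labels, k=2)
print("selected CpG columns:", keep)
```

In a real workflow the reduced matrix would be passed to a neural network, and balancing would normally be applied to the training split only so that duplicated samples do not leak into evaluation.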

 
Award ID(s):
1920220
PAR ID:
10549439
Author(s) / Creator(s):
; ; ; ;
Publisher / Repository:
MDPI
Date Published:
Journal Name:
Genes
Volume:
13
Issue:
9
ISSN:
2073-4425
Page Range / eLocation ID:
1557
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Cancer is a complex disease associated with abnormal DNA mutations. Not all tumors are cancerous and not all cancers are the same. Correct diagnosis of cancer type can indicate the most effective drug therapy and increase the survival rate. At the molecular level, it has been shown that cancer type classification can be carried out by analyzing somatic point mutations. However, the high dimensionality and sparsity of genomic mutation data, coupled with small sample sizes, have hindered accurate classification of cancer. We address these problems by introducing a novel classification method called mClass that accounts for the sparsity of the data. mClass is a feature selection method that ranks genes based on their similarity across samples and employs their normalized mutual information to determine the set of genes that provides optimal classification accuracy. Experimental results on TCGA datasets show that mClass significantly improves testing accuracy compared to DeepGene, which is the state of the art in cancer-type classification based on somatic mutation data. In addition, when compared with other cancer gene prediction tools, the set of genes selected by mClass contains the highest number of genes among the top 100 genes listed in the Cancer Gene Census. mClass is available at https://github.com/mdahasan/mClass.
  2. Motivation:

    Detecting cancer gene expression and transcriptome changes with mRNA-sequencing (RNA-Seq) or array-based data is important for understanding the molecular mechanisms underlying carcinogenesis and cellular events during cancer progression. In previous studies, differentially expressed genes were detected across patients in one cancer type. These studies ignored the role of mRNA expression changes in driving tumorigenic mechanisms that are either universal or specific to different tumor types. To address the problem, we introduce two network-based multi-task learning frameworks, NetML and NetSML, to discover common differentially expressed genes shared across different cancer types as well as differentially expressed genes specific to each cancer type. The proposed frameworks consider the common latent gene co-expression modules and gene-sample biclusters underlying the multiple cancer datasets to learn knowledge shared across different tumor types.

    Results:

    Large-scale experiments on simulations and real cancer high-throughput datasets validate that the proposed network-based multi-task learning frameworks achieve better sample classification than models without knowledge sharing across different cancer types. The common and cancer-specific molecular signatures detected by the multi-task learning frameworks on TCGA ovarian cancer, breast cancer, and prostate cancer datasets are correlated with known marker genes and enriched in cancer-relevant KEGG pathways and Gene Ontology terms.

    Availability and Implementation:

    Source code is available at: https://github.com/compbiolabucf/NetML

    Supplementary information:

    Supplementary data are available at Bioinformatics
  3. Background:

    DNA methylation is a form of epigenetic modification that has been shown to play a significant role in gene regulation. In cancer, DNA methylation plays an important role by regulating the expression of oncogenes. The role of DNA methylation in the onset and progression of various cancer types is now being elucidated as more large-scale data become available. The Cancer Genome Atlas (TCGA) provides a wealth of information for the analysis of various molecular aspects of cancer genetics. Gene expression data and DNA methylation data from TCGA have been used for a variety of studies. A traditional understanding of the effects of DNA methylation on gene expression has linked methylation of CpG sites in the gene promoter region with the decrease in gene expression. Recent studies have begun to expand this traditional role of DNA methylation.

    Results:

    Here we present a pan-cancer analysis of correlation patterns between CpG methylation and gene expression. Using matching patient data from TCGA, 33 cancer-specific correlations were calculated for each CpG site and the expression level of its corresponding gene. These correlations were used to identify patterns on a per-site basis as well as patterns of methylation across the gene body. Using these patterns, we found genes that contain conflicting methylation signals beyond the commonly accepted association between promoter region methylation and silencing of gene expression. Beyond gene body methylation as a whole, we examined individual CpG sites and show that, even in the same gene body, some sites can have contradictory effects on gene expression in cancers.

    Conclusions:

    We observed that within promoter regions there was a substantial amount of positive correlation between methylation and gene expression, which contradicts the commonly accepted association. We also observed that the correlation between CpG methylation and gene expression is not tissue specific, suggesting that the effects of methylation on gene expression are largely tissue independent. Analyzing correlation by the location of the CpG site in the gene body led to the identification of several different methylation patterns that affect gene expression, and several examples of methylation activating gene expression were observed. Distinctly opposing or conflicting effects were seen in close proximity on the gene body, where negative and positive correlations were seen at neighboring CpG sites.

     
  4. Biomarkers predictive of drug-specific outcomes are important tools for personalized medicine. In this study, we present an integrative analysis to identify miRNAs that are predictive of drug-specific survival outcomes in cancer. Using the clinical data from TCGA, we defined subsets of cancer patients who had the same cancer and received the same drug treatment, which we call cancer-drug groups. We then used the miRNA expression data in TCGA to evaluate each miRNA's ability to predict the survival outcome of patients in each cancer-drug group. As a result, the identified miRNAs are predictive of survival outcomes in a cancer-specific and drug-specific manner. Notably, most of the drug-specific miRNA survival markers and their target genes were consistent in terms of the correlations in their expression and their correlations with survival. Some of the identified miRNAs were supported by published literature in the context of various cancers. We explored several additional breast cancer datasets that provided miRNA expression and survival data, and showed that our drug-specific miRNA survival markers for breast cancer were able to effectively stratify the prognosis of patients in those additional datasets. Together, this analysis revealed drug-specific miRNA markers for cancer survival, which can be promising tools toward personalized medicine.
  5. Public genomic repositories are notoriously lacking in racially and ethnically diverse samples. This limits the scope of exploration and has in fact been one of the driving factors for the initiation of the All of Us project. Our particular focus here is to provide a model-based framework for accurately predicting DNA methylation from genetic data using racially sparse public repository data. Epigenetic alterations are of great interest in cancer research, but public repository data are limited in the information they provide. Genetic data, however, are more plentiful. Our phenotype of interest is cervical cancer in The Cancer Genome Atlas (TCGA) repository. Being able to generate such predictions would complement other work that has generated gene-level predictions of gene expression for normal samples. We develop a new prediction approach that uses shared random effects from a nested error mixed effects regression model. The sharing of random effects allows borrowing of strength across racial groups, greatly improving predictive accuracy. Additionally, we show how to borrow further strength by combining data from different cancers in TCGA, even though the focus of our predictions is DNA methylation in cervical cancer. We compare our methodology against other popular approaches, including the elastic net shrinkage estimator and random forest prediction. Results are very encouraging, with the shared classified random effects approach uniformly producing more accurate predictions, both overall and for each racial group.
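The per-site correlation analysis described in the third related record above (one correlation between each CpG site's methylation and the expression of its gene) can be sketched as follows. This is only an illustration under stated assumptions: the record does not say which correlation coefficient was used, so Pearson correlation is assumed here, and the site names and values are hypothetical toy data rather than TCGA measurements.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences.

    Returns 0.0 when either input is constant; the coefficient is
    undefined in that case, so this is just a convention for the sketch.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)

# Hypothetical matched data for one gene across four patients.
expression = [8.0, 6.0, 4.0, 2.0]
methylation = {
    "cpg_promoter_site": [0.1, 0.3, 0.5, 0.7],  # rises as expression falls
    "cpg_body_site": [0.8, 0.6, 0.4, 0.2],      # rises and falls with expression
}

# One correlation per CpG site, plus the sign of the association.
correlations = {site: pearson(beta, expression) for site, beta in methylation.items()}
direction = {
    site: "positive" if r > 0 else "negative" if r < 0 else "none"
    for site, r in correlations.items()
}

for site in methylation:
    print(site, round(correlations[site], 3), direction[site])
```

Two sites in the same gene with opposite signs, as in this toy example, correspond to the kind of conflicting signal the record describes.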