Title: Uncovering biomarker genes with enriched classification potential from Hallmark gene sets
Given the complex relationship between gene expression and phenotypic outcomes, computationally efficient approaches are needed to sift through large high-dimensional datasets in order to identify biologically relevant biomarkers. In this report, we describe a method of identifying the most salient biomarker genes in a dataset, which we call "candidate genes", by evaluating the ability of gene combinations to classify samples from a dataset, which we call "classification potential". Our algorithm, Gene Oracle, uses a neural network to test user-defined gene sets for polygenic classification potential and then uses a combinatorial approach to further decompose selected gene sets into candidate and non-candidate biomarker genes. We tested this algorithm on curated gene sets from the Molecular Signatures Database (MSigDB) quantified in RNAseq gene expression matrices obtained from The Cancer Genome Atlas (TCGA) and Genotype-Tissue Expression (GTEx) data repositories. First, we identified which MSigDB Hallmark subsets have significant classification potential for both the TCGA and GTEx datasets. Then, we identified the most discriminatory candidate biomarker genes in each Hallmark gene set and provide evidence that the improved biomarker potential of these genes may be due to reduced functional complexity.
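The idea of "classification potential" described in the abstract can be illustrated with a small sketch: score a gene set by how well it classifies held-out samples, then compare that score with random gene sets of the same size. This is not the Gene Oracle implementation. It assumes a nearest-centroid classifier as a simple stand-in for the neural network, synthetic data in place of TCGA or GTEx matrices, and hypothetical names throughout (`nearest_centroid_accuracy`, `classification_potential`, `make_samples`, `gene0` to `gene199`).

```python
import random
import statistics


def nearest_centroid_accuracy(train, test, genes):
    """Accuracy of a nearest-centroid classifier restricted to `genes`.

    `train` and `test` are lists of (expression_dict, label) pairs.
    The classifier is an assumed stand-in for the paper's neural network.
    """
    # Group training vectors by label, keeping only the selected genes.
    by_label = {}
    for expr, label in train:
        by_label.setdefault(label, []).append([expr[g] for g in genes])
    # One centroid (per-gene mean) for each class.
    centroids = {
        label: [statistics.fmean(col) for col in zip(*rows)]
        for label, rows in by_label.items()
    }
    correct = 0
    for expr, label in test:
        vec = [expr[g] for g in genes]
        # Predict the class whose centroid is closest in squared distance.
        pred = min(
            centroids,
            key=lambda c: sum((a - b) ** 2 for a, b in zip(vec, centroids[c])),
        )
        correct += pred == label
    return correct / len(test)


def classification_potential(train, test, gene_set, all_genes,
                             n_random=50, seed=0):
    """Return (accuracy of gene_set, mean accuracy of random same-size sets)."""
    rng = random.Random(seed)
    observed = nearest_centroid_accuracy(train, test, gene_set)
    random_scores = [
        nearest_centroid_accuracy(
            train, test, rng.sample(all_genes, len(gene_set))
        )
        for _ in range(n_random)
    ]
    return observed, statistics.fmean(random_scores)


def make_samples(n, all_genes, informative, rng):
    """Synthetic data: informative genes are shifted by +2 in class 'B'."""
    samples = []
    for i in range(n):
        label = "A" if i % 2 == 0 else "B"
        expr = {g: rng.gauss(0.0, 1.0) for g in all_genes}
        if label == "B":
            for g in informative:
                expr[g] += 2.0
        samples.append((expr, label))
    return samples


rng = random.Random(42)
all_genes = [f"gene{i}" for i in range(200)]   # hypothetical gene names
informative = all_genes[:5]                    # hypothetical "gene set"
train = make_samples(100, all_genes, informative, rng)
test = make_samples(100, all_genes, informative, rng)

observed, baseline = classification_potential(
    train, test, informative, all_genes
)
print(f"gene set accuracy: {observed:.2f}, random baseline: {baseline:.2f}")
```

In this toy setup, a gene set is interesting only when its accuracy clearly exceeds the random baseline. The combinatorial decomposition into candidate and non-candidate genes mentioned in the abstract is not shown here.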
Award ID(s):
1659300
PAR ID:
10132578
Author(s) / Creator(s):
; ; ; ;
Date Published:
Journal Name:
Scientific reports
Volume:
9
Issue:
9747
ISSN:
2045-2322
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract Given the complex relationship between gene expression and phenotypic outcomes, computationally efficient approaches are needed to sift through large high-dimensional datasets in order to identify biologically relevant biomarkers. In this report, we describe a method of identifying the most salient biomarker genes in a dataset, which we call “candidate genes”, by evaluating the ability of gene combinations to classify samples from a dataset, which we call “classification potential”. Our algorithm, Gene Oracle, uses a neural network to test user-defined gene sets for polygenic classification potential and then uses a combinatorial approach to further decompose selected gene sets into candidate and non-candidate biomarker genes. We tested this algorithm on curated gene sets from the Molecular Signatures Database (MSigDB) quantified in RNAseq gene expression matrices obtained from The Cancer Genome Atlas (TCGA) and Genotype-Tissue Expression (GTEx) data repositories. First, we identified which MSigDB Hallmark subsets have significant classification potential for both the TCGA and GTEx datasets. Then, we identified the most discriminatory candidate biomarker genes in each Hallmark gene set and provide evidence that the improved biomarker potential of these genes may be due to reduced functional complexity.
  2. Abstract The human brain is a complex organ that consists of several regions each with a unique gene expression pattern. Our intent in this study was to construct a gene co-expression network (GCN) for the normal brain using RNA expression profiles from the Genotype-Tissue Expression (GTEx) project. The brain GCN contains gene correlation relationships that are broadly present in the brain or specific to thirteen brain regions, which we later combined into six overarching brain mini-GCNs based on the brain’s structure. Using the expression profiles of brain region-specific GCN edges, we determined how well the brain region samples could be discriminated from each other, visually with t-SNE plots or quantitatively with the Gene Oracle deep learning classifier. Next, we tested these gene sets on their relevance to human tumors of brain and non-brain origin. Interestingly, we found that genes in the six brain mini-GCNs showed markedly higher mutation rates in tumors relative to matched sets of random genes. Further, we found that cortex genes subdivided Head and Neck Squamous Cell Carcinoma (HNSC) tumors and Pheochromocytoma and Paraganglioma (PCPG) tumors into distinct groups. The brain GCN and mini-GCNs are useful resources for the classification of brain regions and identification of biomarker genes for brain related phenotypes. 
  3. Abstract Histopathological images are used to characterize complex phenotypes such as tumor stage. Our goal is to associate features of stained tissue images with high-dimensional genomic markers. We use convolutional autoencoders and sparse canonical correlation analysis (CCA) on paired histological images and bulk gene expression to identify subsets of genes whose expression levels in a tissue sample correlate with subsets of morphological features from the corresponding sample image. We apply our approach, ImageCCA, to two TCGA data sets, and find gene sets associated with the structure of the extracellular matrix and cell wall infrastructure, implicating uncharacterized genes in extracellular processes. We find sets of genes associated with specific cell types, including neuronal cells and cells of the immune system. We apply ImageCCA to the GTEx v6 data, and find image features that capture population variation in thyroid and in colon tissues associated with genetic variants (image morphology QTLs, or imQTLs), suggesting that genetic variation regulates population variation in tissue morphological traits. 
  4. Abstract Biomarkers predictive of drug-specific outcomes are important tools for personalized medicine. In this study, we present an integrative analysis to identify miRNAs that are predictive of drug-specific survival outcome in cancer. Using the clinical data from TCGA, we defined subsets of cancer patients who suffered from the same cancer and received the same drug treatment, which we call cancer-drug groups. We then used the miRNA expression data in TCGA to evaluate each miRNA’s ability to predict the survival outcome of patients in each cancer-drug group. As a result, the identified miRNAs are predictive of survival outcomes in a cancer-specific and drug-specific manner. Notably, most of the drug-specific miRNA survival markers and their target genes showed consistency in terms of correlations in their expression and their correlations with survival. Some of the identified miRNAs were supported by published literature in contexts of various cancers. We explored several additional breast cancer datasets that provided miRNA expression and survival data, and showed that our drug-specific miRNA survival markers for breast cancer were able to effectively stratify the prognosis of patients in those additional datasets. Together, this analysis revealed drug-specific miRNA markers for cancer survival, which can be promising tools toward personalized medicine. 
  5. Abstract Although multiple high-performing epigenetic aging clocks exist, few are based directly on gene expression. Such transcriptomic aging clocks allow us to extract age-associated genes directly. However, most existing transcriptomic clocks model a subset of genes and are limited in their ability to predict novel biomarkers. With the growing popularity of single-cell sequencing, there is a need for robust single-cell transcriptomic aging clocks. Moreover, clocks have yet to be applied to investigate the elusive phenomenon of sex differences in aging. We introduce TimeFlies, a pan-cell-type scRNA-seq aging clock for the Drosophila melanogaster head. TimeFlies uses deep learning to classify the donor age of cells based on genome-wide gene expression profiles. Using explainability methods, we identified key marker genes contributing to the classification, with lncRNAs showing up as highly enriched among predicted biomarkers. The top biomarker gene across cell types is lncRNA:roX1, a regulator of X chromosome dosage compensation, a pathway previously identified as a top biomarker of aging in the mouse brain. We validated this finding experimentally, showing a decrease in survival probability in the absence of roX1 in vivo. Furthermore, we trained sex-specific TimeFlies clocks and noted significant differences in model predictions and explanations between male and female clocks, suggesting that different pathways drive aging in males and females.