Title: Outcome-guided disease subtyping by generative model and weighted joint likelihood in transcriptomic applications
With advances in high-throughput technology, molecular disease subtyping by high-dimensional omics data has been recognized as an effective approach for identifying subtypes of complex diseases with distinct disease mechanisms and prognoses. Conventional cluster analysis takes omics data as input and generates patient clusters with similar gene expression patterns. The omics data, however, usually contain multifaceted cluster structures that can be defined by different sets of genes. If the gene set associated with irrelevant clinical variables (e.g., sex or age) dominates the clustering process, the resulting clusters may not capture clinically meaningful disease subtypes. This motivates the development, in this paper, of a clustering framework guided by a prespecified disease outcome, such as lung function measurement or survival. We propose two outcome-guided disease subtyping methods for omics data, one using a generative model and the other a weighted joint likelihood. Both methods connect an outcome association model and a disease subtyping model by a latent variable of cluster labels. Compared to the generative model, the weighted joint likelihood contains a data-driven weight parameter to balance the likelihood contributions from outcome association and gene cluster separation, which improves generalizability in independent validation but requires heavier computing. Extensive simulations and two real applications in lung disease and triple-negative breast cancer demonstrate superior performance of the outcome-guided clustering methods in terms of disease subtyping accuracy, gene selection and outcome association. Unlike existing clustering methods, the outcome-guided disease subtyping framework creates a new precision medicine paradigm to directly identify patient subgroups with clinical association.
Award ID(s):
2113568
PAR ID:
10608002
Author(s) / Creator(s):
Publisher / Repository:
Institute of Mathematical Statistics
Date Published:
Journal Name:
The Annals of Applied Statistics
Volume:
18
Issue:
3
ISSN:
1932-6157
Page Range / eLocation ID:
1947–1964
Subject(s) / Keyword(s):
Disease subtyping; omics data; high-dimensional; cluster analysis; generative model; weighted joint likelihood
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Cancer is an umbrella term that includes a range of disorders, from those that are fast-growing and lethal to indolent lesions with low or delayed potential for progression to death. The treatment options, as well as treatment success, are highly dependent on the correct subtyping of individual patients. With the advancement of high-throughput platforms, we have the opportunity to differentiate among cancer subtypes from a holistic perspective that takes into consideration phenomena at different molecular levels (mRNA, methylation, etc.). This demands powerful integrative methods to leverage large multi-omics datasets for a better subtyping. Here we introduce Subtyping Multi-omics using a Randomized Transformation (SMRT), a new method for multi-omics integration and cancer subtyping. SMRT offers the following advantages over existing approaches: (i) the scalable analysis pipeline allows researchers to integrate multi-omics data and analyze hundreds of thousands of samples in minutes, (ii) the ability to integrate data types with different numbers of patients, (iii) the ability to analyze un-matched data of different types, and (iv) the ability to offer users a convenient data analysis pipeline through a web application. We also improve the efficiency of our ensemble-based, perturbation clustering to support analysis on machines with memory constraints. In an extensive analysis, we compare SMRT with eight state-of-the-art subtyping methods using 37 TCGA and two METABRIC datasets comprising a total of almost 12,000 patient samples from 28 different types of cancer. We also performed a number of simulation studies. We demonstrate that SMRT outperforms other methods in identifying subtypes with significantly different survival profiles. In addition, SMRT is extremely fast, being able to analyze hundreds of thousands of samples in minutes. The web application is available at http://SMRT.tinnguyen-lab.com. The R package will be deposited to CRAN as part of our PINSPlus software suite.
  2. The rapid growth of diverse -omics datasets has made multiomics data integration crucial in cancer research. This study adapts the expectation–maximization routine for the joint latent variable modeling of multiomics patient profiles. By combining this approach with traditional biological feature selection methods, this study optimizes latent distribution, enabling efficient patient clustering from well-studied cancer types with reduced computational expense. The proposed optimization subroutines enhance survival analysis and improve runtime performance. This article presents a framework for distinguishing cancer subtypes and identifying potential biomarkers for breast cancer. Key insights into individual subtype expression and function were obtained through differentially expressed gene analysis and pathway enrichment for BRCA patients. The analysis compared 302 tumor samples to 113 normal samples across 60,660 genes. The highly upregulated gene COL10A1, which promotes breast cancer progression and poor prognosis, and the consistently downregulated gene CD300LG, linked to brain metastatic cancer, were identified. Pathway enrichment analysis revealed similarities in cellular matrix organization pathways across subtypes, with notable differences in functions like cell proliferation regulation and endocytosis by host cells. GO Semantic Similarity analysis quantified gene relationships in each subtype, identifying potential biomarkers like MATN2, similar to COL10A1. These insights suggest deeper relationships within clusters and highlight personalized treatment potential based on subtypes.
  3. Heterogeneity among Alzheimer’s disease (AD) patients confounds clinical trial patient selection and therapeutic efficacy evaluation. This work defines separable AD clinical sub-populations using unsupervised machine learning. Clustering (t-SNE followed by k-means) of patient features and association rule mining (ARM) were performed on the ADNIMERGE dataset from the Alzheimer’s Disease Neuroimaging Initiative (ADNI). Patient sociodemographics, brain imaging, biomarkers, cognitive tests, and medication usage were included for analysis. Four AD clinical sub-populations were identified using between-cluster mean fold changes [cognitive performance, brain volume]: cluster-1 represented least severe disease [+17.3, +13.3]; cluster-0 [−4.6, +3.8] and cluster-3 [+10.8, −4.9] represented mid-severity sub-populations; cluster-2 represented most severe disease [−18.4, −8.4]. ARM assessed frequently occurring pharmacologic substances within the 4 sub-populations. No drug class was associated with the least severe AD (cluster-1), likely due to lesser antecedent disease. Anti-hyperlipidemia drugs were associated with cluster-0 (mid-severity, higher volume). Interestingly, the antioxidants vitamin C and E were associated with cluster-3 (mid-severity, higher cognition). Anti-depressants like Zoloft were associated with most severe disease (cluster-2). Vitamin D is protective for AD, but ARM identified significant underutilization across all AD sub-populations. Identification and feature characterization of four distinct AD sub-population “clusters” using standard clinical features enhances future clinical trial selection criteria and cross-study comparative analysis.
  4. Proteins are direct products of the genome and metabolites are functional products of interactions between the host and other factors such as environment, disease state, clinical information, etc. Omics data, including proteins and metabolites, are useful in characterizing biological processes underlying COVID-19 along with patient data and clinical information, yet few methods are available to effectively analyze such diverse and unstructured data. Using an integrated approach that combines proteomics and metabolomics data, we investigated the changes in metabolites and proteins in relation to patient characteristics (e.g., age, gender, and health outcome) and clinical information (e.g., metabolic panel and complete blood count test results). We found significant enrichment of biological indicators of lung, liver, and gastrointestinal dysfunction associated with disease severity using publicly available metabolite and protein profiles. Our analyses specifically identified enriched proteins that play a critical role in responses to injury or infection within these anatomical sites, but may contribute to excessive systemic inflammation within the context of COVID-19. Furthermore, we have used this information in conjunction with machine learning algorithms to predict the health status of patients presenting symptoms of COVID-19. This work provides a roadmap for understanding the biochemical pathways and molecular mechanisms that drive disease severity, progression, and treatment of COVID-19.
  5. Motivation: While each cancer is the result of an isolated evolutionary process, there are repeated patterns in tumorigenesis defined by recurrent driver mutations and their temporal ordering. Such repeated evolutionary trajectories hold the potential to improve stratification of cancer patients into subtypes with distinct survival and therapy response profiles. However, current cancer phylogeny methods infer large solution spaces of plausible evolutionary histories from the same sequencing data, obfuscating repeated evolutionary patterns. Results: To simultaneously resolve ambiguities in sequencing data and identify cancer subtypes, we propose to leverage common patterns of evolution found in patient cohorts. We first formulate the Multiple Choice Consensus Tree problem, which seeks to select a tumor tree for each patient and assign patients into clusters in such a way that maximizes consistency within each cluster of patient trees. We prove that this problem is NP-hard and develop a heuristic algorithm, Revealing Evolutionary Consensus Across Patients (RECAP), to solve this problem in practice. Finally, on simulated data, we show RECAP outperforms existing methods that do not account for patient subtypes. We then use RECAP to resolve ambiguities in patient trees and find repeated evolutionary trajectories in lung and breast cancer cohorts. Availability and implementation: https://github.com/elkebir-group/RECAP. Supplementary information: Supplementary data are available at Bioinformatics online.