

Title: Bayesian variable selection on structured logistic-normal mixture models for subgroup analysis
Subgroup analysis has emerged as an important tool for identifying unknown subgroup memberships in the presence of heterogeneity. However, much of the existing work has focused on the low-dimensional scenario in which only a few candidate variables are considered for modeling subgroup membership. In this paper, we propose a two-component structured mixture model with a Bayesian variable selection approach for identifying predictive and prognostic variables separately in the high-dimensional setting. By employing spike-and-slab priors, we simultaneously select the predictive and prognostic variables and estimate the treatment effect in the selected subgroup. We establish theoretical properties by showing strong variable selection consistency and posterior contraction behavior of our method, and we demonstrate its performance in simulation studies. Finally, we apply the proposed method to data from the National Supported Work program and the AIDS Clinical Trials Group 320 study, identifying predictive and prognostic variables associated with subgroups exhibiting differential treatment effects.
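The spike-and-slab mechanism the abstract relies on can be illustrated in a toy normal-means setting. This is not the paper's logistic-normal mixture model or its sampler, which are considerably more involved; it is a minimal sketch of how a spike-and-slab prior turns an observation into a posterior inclusion probability. All hyperparameter values here are illustrative assumptions.

```python
import math

def normal_pdf(x, var):
    """Density of N(0, var) evaluated at x."""
    return math.exp(-x * x / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def inclusion_probability(y, sigma2=1.0, tau2=10.0, w=0.5):
    """Posterior probability that beta != 0 in the toy model
    y ~ N(beta, sigma2), beta ~ w * N(0, tau2) + (1 - w) * delta_0.
    The marginal of y is N(0, sigma2 + tau2) under the slab and
    N(0, sigma2) under the spike, so Bayes' rule is explicit."""
    slab = w * normal_pdf(y, sigma2 + tau2)    # evidence for "variable is in"
    spike = (1.0 - w) * normal_pdf(y, sigma2)  # evidence for "variable is out"
    return slab / (slab + spike)

# A small observation leaves the variable likely excluded;
# a large one makes inclusion a near certainty.
print(inclusion_probability(0.1))
print(inclusion_probability(5.0))
```

In the paper's high-dimensional setting the same calculation is embedded in a full posterior over all candidate predictive and prognostic variables, but the spike/slab trade-off per coefficient is the same idea.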
Award ID(s):
1943500
PAR ID:
10648663
Publisher / Repository:
Electronic Journal of Statistics
Journal Name:
Electronic Journal of Statistics
Volume:
19
Issue:
1
ISSN:
1935-7524
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1.
    Detection of prognostic factors associated with patients' survival outcomes helps gain insights into a disease and guide treatment decisions. The rapid advancement of high-throughput technologies has yielded plentiful genomic biomarkers as candidate prognostic factors, but most are of limited use in clinical application. As the price of the technology drops over time, many genomic studies are conducted to explore a common scientific question in different cohorts to identify more reproducible and credible biomarkers. However, new challenges arise from heterogeneity in study populations and designs when jointly analyzing the multiple studies; for example, patients from different cohorts show different demographic characteristics and risk profiles. Existing high-dimensional variable selection methods for survival analysis, however, are restricted to single-study analysis. We propose a novel Cox-model-based two-stage variable selection method, called "Cox-TOTEM", to detect survival-associated biomarkers common to multiple genomic studies. Simulations showed that our method greatly improves the sensitivity of variable selection compared to separate applications of existing methods to each study, especially when the signals are weak or the studies are heterogeneous. An application of our method to TCGA transcriptomic data identified essential survival-associated genes related to the common disease mechanism of five Pan-Gynecologic cancers.
  2. Categorical data analysis becomes challenging when high-dimensional sparse covariates are involved, which is often the case for omics data. We introduce a statistical procedure based on multinomial logistic regression for such scenarios, covering variable screening, model selection, order selection for response categories, and variable selection. We apply our procedure to high-dimensional gene expression data with 801 patients, 2426 genes, and five types of cancerous tumors. As a result, we recommend three finalized models: one with 74 genes achieves extremely low cross-entropy loss and a zero predictive error rate under five-fold cross-validation, while two other models, with 31 and 4 genes respectively, are recommended as prognostic multi-gene signatures.
  3.
    Predictive models play a central role in decision making. Penalized regression approaches, such as the least absolute shrinkage and selection operator (LASSO), have been widely used to construct predictive models and explain the impacts of the selected predictors, but the estimates are typically biased. Moreover, when data are ultrahigh-dimensional, penalized regression is usable only after applying variable screening methods to reduce the number of variables. We propose a stepwise procedure for fitting generalized linear models with ultrahigh-dimensional predictors. Our procedure can provide a final model; control both false negatives and false positives; and yield consistent estimates, which are useful for gauging the actual effect sizes of risk factors. Simulations and applications to two clinical studies verify the utility of the method.
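The shrinkage bias of the LASSO mentioned above has a closed form in the orthonormal-design case: the LASSO solution is the soft-thresholded least-squares estimate, so every retained coefficient is pulled toward zero by the penalty level. A minimal sketch (the coefficients and penalty below are made up for illustration):

```python
import numpy as np

def soft_threshold(z, lam):
    """The LASSO proximal operator: zero out values smaller than lam in
    magnitude and shrink the rest toward zero by lam. This built-in
    shrinkage is the source of the bias the passage mentions."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

# With an orthonormal design, LASSO = soft-thresholding of the OLS fit:
# weak effects (0.2, -0.5) are dropped, and the strong effect (3.0) is
# kept but shrunk to 2.4 rather than estimated at its OLS value.
ols = np.array([0.2, -0.5, 3.0])
print(soft_threshold(ols, 0.6))
```

Debiasing or refitting the selected model, as stepwise procedures like the one in this abstract do implicitly, avoids exactly this systematic underestimation of retained effects.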
  4.
    In the last few decades, various spectroscopic soft sensors that predict sample properties from spectroscopic readings have been reported. To improve prediction performance, variable selection aimed at eliminating irrelevant wavelengths is often performed prior to building the soft sensor model. However, due to the data-driven nature of many variable selection methods, they can be sensitive to the choice of training data, and oftentimes the selected wavelengths show little connection to the underlying chemical bonds or functional groups that determine the property of the sample. To address these limitations, we propose a new variable selection method, namely consistency-enhanced evolution for variable selection (CEEVS), which focuses on identifying variables that are consistently selected across different training datasets. To demonstrate the effectiveness and robustness of CEEVS, we compared it with three representative variable selection methods using two published NIR datasets. We show that by identifying variables with high selection consistency, CEEVS not only achieves improved soft sensor performance but also identifies key chemical information from spectroscopic data.
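CEEVS's evolutionary search is not reproduced here, but the notion of selection consistency it builds on is easy to demonstrate: run a base selector on many bootstrap resamples of the training set and keep only variables with a high selection frequency. The base selector, thresholds, and data below are simplified stand-ins, not the CEEVS algorithm.

```python
import numpy as np

rng = np.random.default_rng(2)

def base_select(X, y, lam=0.3):
    """Toy base selector: keep variables whose absolute sample
    cross-moment with y exceeds lam (a cheap stand-in for whatever
    wavelength selector one would actually wrap)."""
    r = X.T @ y / len(y)
    return np.abs(r) > lam

# Selection frequency across 50 bootstrap resamples of the training set:
# a variable chosen nearly every time is more trustworthy than a
# one-shot pick from a single train/test split.
n, p = 100, 10
beta = np.zeros(p)
beta[0] = 1.0                            # only variable 0 carries signal
X = rng.normal(size=(n, p))
y = X @ beta + rng.normal(size=n)
freq = np.mean(
    [base_select(X[idx], y[idx])
     for idx in rng.integers(0, n, size=(50, n))], axis=0)
print(freq)
```

The signal variable is selected in essentially every resample, while noise variables come and go with the resampling, which is the instability CEEVS is designed to filter out.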
  5. In some randomized clinical trials, patients may die before the measurement time point of their outcomes. Even though randomization generates comparable treatment and control groups, the remaining survivors often differ significantly in background variables that are prognostic for the outcomes. This is called the truncation-by-death problem. Under the potential outcomes framework, the only well-defined causal effect on the outcome is within the subgroup of patients who would always survive under both treatment and control. Because the definition of the subgroup depends on potential values of the survival status that cannot be observed jointly, without making strong parametric assumptions we cannot identify the causal effect of interest and can only obtain bounds on it. Unfortunately, many such bounds are too wide to be useful. We propose to use detailed survival information before and after the measurement time point of the outcomes to sharpen the bounds on the subgroup causal effect. Because survival times contain useful information about the final outcome, carefully utilizing them can improve statistical inference without imposing strong parametric assumptions. Moreover, we propose to use a copula model to relax the commonly invoked but often doubtful monotonicity assumption that the treatment extends the survival time for all patients.