

Title: Variable Selection via Additive Conditional Independence
Summary

We propose a non-parametric variable selection method that does not rely on any regression model or predictor distribution. The method is based on a new statistical relationship, called additive conditional independence, that was recently introduced for graphical models. Unlike most existing variable selection methods, which target the mean of the response, the proposed method targets a set of attributes of the response, such as its mean, variance or entire distribution. In addition, the additive nature of this approach offers non-parametric flexibility without employing multi-dimensional kernels. As a result, it retains high accuracy for high dimensional predictors. We establish estimation consistency, convergence rate and variable selection consistency of the proposed method. Through simulation comparisons we demonstrate that the proposed method performs better than existing methods when the predictor affects several attributes of the response, and that it performs competently in the classical setting where the predictors affect the mean only. We apply the new method to a data set concerning how gene expression levels affect the weight of mice.

 
PAR ID:
10397457
Publisher / Repository:
Oxford University Press
Journal Name:
Journal of the Royal Statistical Society Series B: Statistical Methodology
Volume:
78
Issue:
5
ISSN:
1369-7412
Size(s):
p. 1037-1055
Sponsoring Org:
National Science Foundation
More Like this
  1. Motivated by mobile devices that record data at a high frequency, we propose a new methodological framework for analyzing a semi-parametric regression model that allows us to study a nonlinear relationship between a scalar response and multiple functional predictors in the presence of scalar covariates. Utilizing functional principal component analysis (FPCA) and the least-squares kernel machine method (LSKM), we are able to substantially extend the framework of semi-parametric regression models of scalar responses on scalar predictors by allowing multiple functional predictors to enter the nonlinear model. Regularization is established for feature selection in the setting of reproducing kernel Hilbert spaces. Our method simultaneously performs model fitting and variable selection on functional features. For the implementation, we propose an effective algorithm to solve the related optimization problems, in which iterations alternate between linear mixed-effects models and a variable selection method (e.g., sparse group lasso). We show algorithmic convergence results and theoretical guarantees for the proposed methodology. We illustrate its performance through simulation experiments and an analysis of accelerometer data.
  2. Maternal exposure to environmental chemicals during pregnancy can alter birth and children's health outcomes. Research seeks to identify critical windows, time periods when exposures can change future health outcomes, and estimate the exposure–response relationship. Existing statistical approaches focus on estimation of the association between maternal exposure to a single environmental chemical observed at high temporal resolution (e.g., weekly throughout pregnancy) and children's health outcomes. Extending to multiple chemicals observed at high temporal resolution poses a dimensionality problem and statistical methods are lacking. We propose a regression tree–based model for mixtures of exposures observed at high temporal resolution. The proposed approach uses an additive ensemble of tree pairs that defines structured main effects and interactions between time‐resolved predictors and performs variable selection to remove predictors that are not correlated with the outcome from the model. In simulation, we show that the tree‐based approach performs better than existing methods for a single exposure and can accurately estimate critical windows in the exposure–response relation for mixtures. We apply our method to estimate the relationship between five exposures measured weekly throughout pregnancy and birth weight in a Denver, Colorado, birth cohort. We identified critical windows during which fine particulate matter, sulfur dioxide, and temperature are negatively associated with birth weight and an interaction between fine particulate matter and temperature. Software is made available in the R package dlmtree.

     
  3. In this paper, we propose Varying Effects Regression with Graph Estimation (VERGE), a novel Bayesian method for feature selection in regression. Our model has key aspects that allow it to leverage the complex structure of data sets arising from genomics or imaging studies. We distinguish between the predictors, which are the features utilized in the outcome prediction model, and the subject-level covariates, which modulate the effects of the predictors on the outcome. We construct a varying coefficients modeling framework where we infer a network among the predictor variables and utilize this network information to encourage the selection of related predictors. We employ variable selection spike-and-slab priors that enable the selection of both network-linked predictor variables and covariates that modify the predictor effects. We demonstrate through simulation studies that our method outperforms existing alternative methods in terms of both feature selection and predictive accuracy. We illustrate VERGE with an application to characterizing the influence of gut microbiome features on obesity, where we identify a set of microbial taxa and their ecological dependence relations. We allow subject-level covariates, including sex and dietary intake variables, to modify the coefficients of the microbiome predictors, providing additional insight into the interplay between these factors.

     
  4. We address the challenge of estimating regression coefficients and selecting relevant predictors in the context of mixed linear regression in high dimensions, where the number of predictors greatly exceeds the sample size. Recent advancements in this field have centered on incorporating sparsity-inducing penalties into the expectation-maximization (EM) algorithm, which seeks to maximize the conditional likelihood of the response given the predictors. However, existing procedures often treat predictors as fixed or overlook their inherent variability. In this paper, we leverage the independence between the predictor and the latent indicator variable of mixtures to facilitate efficient computation and also achieve synergistic variable selection across all mixture components. We establish the non-asymptotic convergence rate of the proposed fast group-penalized EM estimator to the true regression parameters. The effectiveness of our method is demonstrated through extensive simulations and an application to the Cancer Cell Line Encyclopedia dataset for the prediction of anticancer drug sensitivity.

     
  5. The analysis of time series data with detection limits is challenging due to the high‐dimensional integral involved in the likelihood. Existing methods are either computationally demanding or rely on restrictive parametric distributional assumptions. We propose a semiparametric approach, where the temporal dependence is captured by a parametric copula, while the marginal distribution is estimated non‐parametrically. Utilizing the properties of copulas, we develop a new copula‐based sequential sampling algorithm, which provides a convenient way to calculate the censored likelihood. Even without full parametric distributional assumptions, the proposed method still allows us to efficiently compute the conditional quantiles of the censored response at a future time point, and thus construct both point and interval predictions. We establish the asymptotic properties of the proposed pseudo maximum likelihood estimator, and demonstrate through simulation and the analysis of water quality data that the proposed method is more flexible and leads to more accurate predictions than Gaussian‐based methods for non‐normal data. The Canadian Journal of Statistics 47: 438–454; 2019 © 2019 Statistical Society of Canada

     