skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Approximate Post-Selective Inference for Regression with the Group LASSO
After selection with the Group LASSO (or generalized variants such as the overlapping, sparse, or standardized Group LASSO), inference for the selected parameters is unreliable in the absence of adjustments for selection bias. In the penalized Gaussian regression setup, existing approaches provide adjustments for selection events that can be expressed as linear inequalities in the data variables. Such a representation, however, fails to hold for selection with the Group LASSO and substantially obstructs the scope of subsequent post-selective inference. Key questions of inferential interest—for example, inference for the effects of selected variables on the outcome—remain unanswered. In the present paper, we develop a consistent, post-selective, Bayesian method to address the existing gaps by deriving a likelihood adjustment factor and an approximation thereof that eliminates bias from the selection of groups. Experiments on simulated data and data from the Human Connectome Project demonstrate that our method recovers the effects of parameters within the selected groups while paying only a small price for bias adjustment.  more » « less
Award ID(s):
1951980 2113342
PAR ID:
10543397
Author(s) / Creator(s):
; ;
Publisher / Repository:
https://www.jmlr.org/papers/v24/
Date Published:
Journal Name:
Journal of machine learning research
ISSN:
1533-7928
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Many causal and structural effects depend on regressions. Examples include policy effects, average derivatives, regression decompositions, average treatment effects, causal mediation, and parameters of economic structural models. The regressions may be high‐dimensional, making machine learning useful. Plugging machine learners into identifying equations can lead to poor inference due to bias from regularization and/or model selection. This paper gives automatic debiasing for linear and nonlinear functions of regressions. The debiasing is automatic in using Lasso and the function of interest without the full form of the bias correction. The debiasing can be applied to any regression learner, including neural nets, random forests, Lasso, boosting, and other high‐dimensional methods. In addition to providing the bias correction, we give standard errors that are robust to misspecification, convergence rates for the bias correction, and primitive conditions for asymptotic inference for estimators of a variety of estimators of structural and causal effects. The automatic debiased machine learning is used to estimate the average treatment effect on the treated for the NSW job training data and to estimate demand elasticities from Nielsen scanner data while allowing preferences to be correlated with prices and income. 
    more » « less
  2. Martelli, Pier Luigi (Ed.)
    Abstract MotivationThere is a growing interest in longitudinal omics data paired with some longitudinal clinical outcome. Given a large set of continuous omics variables and some continuous clinical outcome, each measured for a few subjects at only a few time points, we seek to identify those variables that co-vary over time with the outcome. To motivate this problem we study a dataset with hundreds of urinary metabolites along with Tuberculosis mycobacterial load as our clinical outcome, with the objective of identifying potential biomarkers for disease progression. For such data clinicians usually apply simple linear mixed effects models which often lack power given the low number of replicates and time points. We propose a penalized regression approach on the first differences of the data that extends the lasso + Laplacian method [Li and Li (Network-constrained regularization and variable selection for analysis of genomic data. Bioinformatics 2008;24:1175–82.)] to a longitudinal group lasso + Laplacian approach. Our method, PROLONG, leverages the first differences of the data to increase power by pairing the consecutive time points. The Laplacian penalty incorporates the dependence structure of the variables, and the group lasso penalty induces sparsity while grouping together all contemporaneous and lag terms for each omic variable in the model. ResultsWith an automated selection of model hyper-parameters, PROLONG correctly selects target metabolites with high specificity and sensitivity across a wide range of scenarios. PROLONG selects a set of metabolites from the real data that includes interesting targets identified during EDA. Availability and implementationAn R package implementing described methods called “prolong” is available at https://github.com/stevebroll/prolong. Code snapshot available at 10.5281/zenodo.14804245. 
    more » « less
  3. Causal inference from observational data has attracted considerable attention among researchers. One main obstacle is the handling of confounders. As direct measurement of confounders may not be feasible, recent methods seek to address the confounding bias via proxy variables, i.e., covariates postulated to be conducive to the inference of latent confounders. However, the selected proxies may scramble both confounders and post-treatment variables in practice, which risks biasing the estimation by controlling for variables affected by the treatment. In this paper, we systematically investigate the bias due to latent post-treatment variables, i.e., latent post-treatment bias, in causal effect estimation. Specifically, we first derive the bias when selected proxies scramble both latent confounders and post-treatment variables, which we demonstrate can be arbitrarily bad. We then propose a Confounder-identifiable VAE (CiVAE) to address the bias. Based on a mild assumption that the prior of latent variables that generate the proxy belongs to a general exponential family with at least one invertible sufficient statistic in the factorized part, CiVAE individually identifies latent confounders and latent post-treatment variables up to bijective transformations. We then prove that with individual identification, the intractable disentanglement problem of latent confounders and post-treatment variables can be transformed into a tractable independence test problem despite arbitrary dependence may exist among them. Finally, we prove that the true causal effects can be unbiasedly estimated with transformed confounders inferred by CiVAE. Experiments on both simulated and real-world datasets demonstrate significantly improved robustness of CiVAE. 
    more » « less
  4. Abstract Standard estimators of the global average treatment effect can be biased in the presence of interference. This paper proposes regression adjustment estimators for removing bias due to interference in Bernoulli randomized experiments. We use a fitted model to predict the counterfactual outcomes of global control and global treatment. Our work differs from standard regression adjustments in that the adjustment variables are constructed from functions of the treatment assignment vector, and that we allow the researcher to use a collection of any functions correlated with the response, turning the problem of detecting interference into a feature engineering problem. We characterize the distribution of the proposed estimator in a linear model setting and connect the results to the standard theory of regression adjustments under SUTVA. We then propose an estimator that allows for flexible machine learning estimators to be used for fitting a nonlinear interference functional form. We propose conducting statistical inference via bootstrap and resampling methods, which allow us to sidestep the complicated dependences implied by interference and instead rely on empirical covariance structures. Such variance estimation relies on an exogeneity assumption akin to the standard unconfoundedness assumption invoked in observational studies. In simulation experiments, our methods are better at debiasing estimates than existing inverse propensity weighted estimators based on neighborhood exposure modeling. We use our method to reanalyze an experiment concerning weather insurance adoption conducted on a collection of villages in rural China. 
    more » « less
  5. Loh, Po-Ling (Ed.)
    When data are distributed across multiple sites or machines rather than centralized in one location, researchers face the challenge of extracting meaningful information without directly sharing individual data points. While there are many distributed methods for point estimation using sparse regression, few options are available for estimating uncertainties or conducting hypothesis tests based on the estimated sparsity. In this paper, we introduce a procedure for performing selective inference with distributed data. We consider a scenario where each local machine solves a lasso problem and communicates the selected predictors to a central machine. The central machine then aggregates these selected predictors to form a generalized linear model (GLM). Our goal is to provide valid inference for the selected GLM while reusing data that have been used in the model selection process. Our proposed procedure only requires low-dimensional summary statistics from local machines, thus keeping communication costs low and preserving the privacy of individual data sets. Furthermore, this procedure can be applied in scenarios where model selection is repeatedly conducted on randomly subsampled data sets, addressing the p-value lottery problem linked with model selection. We demonstrate the effectiveness of our approach through simulations and an analysis of a medical data set on ICU admissions. 
    more » « less