Estimating causal effects under exogeneity hinges on two key assumptions: unconfoundedness and overlap. Researchers often argue that unconfoundedness is more plausible when more covariates are included in the analysis. Less discussed is the fact that covariate overlap is more difficult to satisfy in this setting. In this paper, we explore the implications of overlap in observational studies with high-dimensional covariates and formalize a curse-of-dimensionality argument, suggesting that these assumptions are stronger than investigators likely realize. Our key innovation is to explore how strict overlap restricts global discrepancies between the covariate distributions in the treated and control populations. Exploiting results from information theory, we derive explicit bounds on the average imbalance in covariate means under strict overlap and show that these bounds become more restrictive as the dimension grows large. We discuss how these implications interact with assumptions and procedures commonly deployed in observational causal inference, including sparsity and trimming.
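The curse-of-dimensionality point in the abstract can be illustrated with a small simulation. The sketch below is not the paper's method or its bounds. It assumes a toy model chosen only for illustration: covariates are N(0, I) in the control group and N(delta·1, I) in the treated group, with P(T = 1) = 0.5, so the true propensity score has a closed form. The function names, `delta = 0.3`, and the overlap threshold `eta = 0.05` are all hypothetical choices.

```python
import math
import random


def true_propensity(x, delta):
    """P(T=1 | x) in the toy model: X|T=1 ~ N(delta*1, I), X|T=0 ~ N(0, I), P(T=1)=0.5."""
    p = len(x)
    # Log density ratio of the two Gaussians; with equal priors it equals the log odds.
    log_odds = delta * sum(x) - p * delta ** 2 / 2.0
    # Numerically stable logistic function.
    if log_odds >= 0:
        return 1.0 / (1.0 + math.exp(-log_odds))
    z = math.exp(log_odds)
    return z / (1.0 + z)


def share_within_overlap(p, delta=0.3, eta=0.05, n=2000, seed=0):
    """Fraction of simulated units whose true propensity lies in [eta, 1 - eta]."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n):
        treated = rng.random() < 0.5
        shift = delta if treated else 0.0
        x = [rng.gauss(shift, 1.0) for _ in range(p)]
        if eta <= true_propensity(x, delta) <= 1.0 - eta:
            inside += 1
    return inside / n


if __name__ == "__main__":
    # The same per-covariate mean shift is applied in every dimension.
    for p in (1, 10, 50, 200):
        print(p, share_within_overlap(p))
```

In this toy model the per-covariate imbalance is held fixed while the dimension grows, and the share of units with propensity scores inside [eta, 1 − eta] shrinks. Read in the other direction, keeping that share near one as the dimension grows would require the per-covariate mean shift to shrink, which is the qualitative behavior the abstract describes.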
                            A Gaussian Process Framework for Overlap and Causal Effect Estimation with High-Dimensional Covariates
                        
                    
    
            A powerful tool for the analysis of nonrandomized observational studies has been the potential outcomes model. This framework allows analysts to estimate average treatment effects. This article considers the situation in which high-dimensional covariates are present and revisits the standard assumptions made in causal inference. We show that, under a flexible Gaussian process framework, the assumption of strict overlap leads to very restrictive assumptions about the distribution of covariates; these restrictions can be characterized using classical results from Gaussian random measures as well as reproducing kernel Hilbert space theory. In addition, we propose a strategy for data-adaptive causal effect estimation that does not rely on the strict overlap assumption. These findings reveal, within a focused framework, the stringency that accompanies the use of the treatment positivity assumption in high-dimensional settings.
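For reference, strict overlap and its immediate consequence for the covariate distributions can be written out as follows. This is the standard textbook formulation, not notation taken from the article: the symbols $e(x)$, $\eta$, $\pi$, $p_1$, and $p_0$ are assumptions introduced here.

```latex
% Strict overlap: the propensity score is bounded away from 0 and 1.
\eta \le e(x) \le 1 - \eta, \qquad e(x) = P(T = 1 \mid X = x), \quad 0 < \eta \le \tfrac{1}{2}.

% By Bayes' rule, with \pi = P(T = 1) and p_t the covariate density given T = t:
\frac{p_1(x)}{p_0(x)} = \frac{e(x)}{1 - e(x)} \cdot \frac{1 - \pi}{\pi}.

% Hence strict overlap bounds the density ratio uniformly in x:
\frac{\eta}{1 - \eta} \cdot \frac{1 - \pi}{\pi}
  \;\le\; \frac{p_1(x)}{p_0(x)} \;\le\;
\frac{1 - \eta}{\eta} \cdot \frac{1 - \pi}{\pi}.
```

A uniform bound on the density ratio is a constraint on the whole covariate distribution, which is why the assumption becomes demanding when the covariates are high-dimensional.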
- Award ID(s): 1914937
- PAR ID: 10198388
- Date Published:
- Journal Name: Journal of causal inference
- ISSN: 2193-3677
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
- Abstract: The regression discontinuity (RD) design is a commonly used non-experimental approach for evaluating policy or program effects. However, this approach relies heavily on the untestable assumption that the distribution of confounders or the average potential outcomes near or at the cutoff are comparable. When multiple cutoffs create several discontinuities in the treatment assignments, factors that can cause this assumption to fail at one cutoff may overlap with those at other cutoffs, invalidating the causal effects from each cutoff. In this study, we propose a novel approach for testing the causal hypothesis of no RD treatment effect that can remain valid even when the assumption commonly considered in the RD design does not hold. We propose leveraging variations in multiple available cutoffs and constructing a set of instrumental variables (IVs). We then combine the evidence from multiple IVs with a direct comparison under the local randomization framework. This reinforced design, which combines multiple factors from a single data set, can produce several nearly independent inferential results that depend on very different assumptions from one another. Our proposed approach can be extended to a fuzzy RD design. We apply our method to evaluate the effect of having access to higher-achievement schools on students' academic performance in Romania.
- Matching is one of the simplest approaches for estimating causal effects from observational data. Matching techniques compare the observed outcomes across pairs of individuals with similar covariate values but different treatment statuses in order to estimate causal effects. However, traditional matching techniques are unreliable given high-dimensional covariates due to the infamous curse of dimensionality. To overcome this challenge, we propose a simple, fast, yet highly effective approach to matching using Random Hyperplane Tessellations (RHPT). First, we prove that the RHPT representation is an approximate balancing score – thus maintaining the strong ignorability assumption – and provide empirical evidence for this claim. Second, we report results of extensive experiments showing that matching using RHPT outperforms traditional matching techniques and is competitive with state-of-the-art deep learning methods for causal effect estimation. In addition, RHPT avoids the need for computationally expensive training of deep neural networks.
- We consider causal inference for observational studies with data spread over two files. One file includes the treatment, outcome, and some covariates measured on a set of individuals, and the other file includes additional causally relevant covariates measured on a partially overlapping set of individuals. By linking records in the two databases, the analyst can control for more covariates, thereby reducing the risk of bias compared to using one file alone. When analysts do not have access to a unique identifier that enables perfect, error-free linkages, they typically rely on probabilistic record linkage to construct a single linked data set, and estimate causal effects using these linked data. This typical practice does not propagate uncertainty from imperfect linkages to the causal inferences. Further, it does not take advantage of relationships among the variables to improve the linkage quality. We address these shortcomings by fusing regression-assisted, Bayesian probabilistic record linkage with causal inference. The Markov chain Monte Carlo sampler generates multiple plausible linked data files as byproducts that analysts can use for multiple imputation inferences. Here, we show results for two causal estimators based on propensity score overlap weights. Using simulations and data from the Italy Survey on Household Income and Wealth, we show that our approach can improve the accuracy of estimated treatment effects.
- It is common to quantify causal effects with mean values, which, however, may fail to capture significant distributional differences in the outcome under different treatments. We study the problem of estimating the density of the causal effect of a binary treatment on a continuous outcome given a binary instrumental variable in the presence of covariates. Specifically, we consider the local treatment effect, which measures the effect of treatment among those who comply with the assignment under the assumption of monotonicity (only the ones who were offered the treatment take it). We develop two families of methods for this task, kernel-smoothing and model-based approximations: the former smooths the density by convolving it with a smooth kernel function; the latter projects the density onto a finite-dimensional density class. For both approaches, we derive double/debiased machine learning (DML) based estimators. We study the asymptotic convergence rates of the estimators and show that they are robust to the biases in nuisance function estimation. We illustrate the proposed methods on synthetic data and a real dataset called 401(k).
 An official website of the United States government