Many causal and structural effects depend on regressions. Examples include policy effects, average derivatives, regression decompositions, average treatment effects, causal mediation, and parameters of economic structural models. The regressions may be high‐dimensional, making machine learning useful. Plugging machine learners into identifying equations can lead to poor inference due to bias from regularization and/or model selection. This paper gives automatic debiasing for linear and nonlinear functions of regressions. The debiasing is automatic in that it uses only Lasso and the function of interest, without requiring the full form of the bias correction. The debiasing can be applied to any regression learner, including neural nets, random forests, Lasso, boosting, and other high‐dimensional methods. In addition to providing the bias correction, we give standard errors that are robust to misspecification, convergence rates for the bias correction, and primitive conditions for asymptotic inference for a variety of estimators of structural and causal effects. The automatic debiased machine learning is used to estimate the average treatment effect on the treated for the NSW job training data and to estimate demand elasticities from Nielsen scanner data while allowing preferences to be correlated with prices and income.
Regression-based causal inference with factorial experiments: estimands, model specifications and design-based properties
Summary Factorial designs are widely used because of their ability to accommodate multiple factors simultaneously. Factor-based regression with main effects and some interactions is the dominant strategy for downstream analysis, delivering point estimators and standard errors simultaneously via one least-squares fit. Justification of these convenient estimators from the design-based perspective requires quantifying their sampling properties under the assignment mechanism while conditioning on the potential outcomes. To this end, we derive the sampling properties of the regression estimators under a wide range of specifications, and establish the appropriateness of the corresponding robust standard errors for Wald-type inference. The results help to clarify the causal interpretation of the coefficients in these factor-based regressions, and motivate the definition of general factorial effects to unify the definitions of factorial effects in various fields. We also quantify the bias-variance trade-off between the saturated and unsaturated regressions from the design-based perspective.
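As an illustration only (not taken from the paper), the sketch below shows what "one least-squares fit" delivers in the simplest case, a 2×2 factorial with both factors coded ±1. It relies on two standard facts about a saturated regression: the coefficients are linear contrasts of the four cell means, and the HC2 robust variance of each cell mean reduces to the within-cell sample variance divided by the cell size. The choice of HC2, the function name `saturated_2x2`, and the toy data are assumptions made for this example. The code uses the closed form rather than calling a regression routine.

```python
from statistics import mean, variance


def saturated_2x2(data):
    """Closed-form saturated fit of y ~ 1 + a + b + a*b for a 2x2 factorial.

    data: iterable of (a, b, y) with a, b in {-1, +1}.
    Returns {term: (coefficient, HC2-type standard error)}.
    Hypothetical helper written for illustration.
    """
    # Group outcomes by treatment combination (cell).
    cells = {}
    for a, b, y in data:
        cells.setdefault((a, b), []).append(y)
    if len(cells) != 4 or any(len(ys) < 2 for ys in cells.values()):
        raise ValueError("need all four cells, each with at least two units")

    # Under +/-1 coding, each coefficient is a signed average of cell means.
    contrasts = {
        "intercept": lambda a, b: 1,
        "A": lambda a, b: a,
        "B": lambda a, b: b,
        "A:B": lambda a, b: a * b,
    }

    # Every contrast weight is +/-1/4, so all four terms share one variance:
    # (1/16) * sum over cells of (sample variance / cell size).
    var = sum(variance(ys) / len(ys) for ys in cells.values()) / 16

    result = {}
    for term, sign in contrasts.items():
        coef = sum(sign(a, b) * mean(ys) for (a, b), ys in cells.items()) / 4
        result[term] = (coef, var ** 0.5)
    return result


# Toy data (made up): two units per cell.
toy = [
    (-1, -1, 0), (-1, -1, 2),
    (-1, +1, 1), (-1, +1, 3),
    (+1, -1, 4), (+1, -1, 6),
    (+1, +1, 9), (+1, +1, 11),
]
for term, (coef, se) in saturated_2x2(toy).items():
    print(term, coef, se)
# For this toy data the "A" coefficient is 3.0 with standard error 0.5.
```

With ±1 coding, the usual factorial main effect (the average outcome at the high level minus that at the low level) is twice the corresponding coefficient. Unsaturated specifications, which drop some interactions, do not reduce to cell means in this way and would need an actual least-squares fit.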
- Award ID(s): 1945136
- PAR ID: 10337012
- Date Published:
- Journal Name: Biometrika
- ISSN: 0006-3444
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
Abstract Randomized experiments are the gold standard for causal inference and enable unbiased estimation of treatment effects. Regression adjustment provides a convenient way to incorporate covariate information for additional efficiency. This article provides a unified account of its utility for improving estimation efficiency in multiarmed experiments. We start with the commonly used additive and fully interacted models for regression adjustment in estimating average treatment effects (ATE), and clarify the trade-offs between the resulting ordinary least squares (OLS) estimators in terms of finite sample performance and asymptotic efficiency. We then move on to regression adjustment based on restricted least squares (RLS), and establish for the first time its properties for inferring ATE from the design-based perspective. The resulting inference has multiple guarantees. First, it is asymptotically efficient when the restriction is correctly specified. Second, it remains consistent as long as the restriction on the coefficients of the treatment indicators, if any, is correctly specified and separate from that on the coefficients of the treatment-covariate interactions. Third, it can have better finite sample performance than the unrestricted counterpart even when the restriction is moderately misspecified. It is thus our recommendation when the OLS fit of the fully interacted regression risks large finite sample variability in case of many covariates, many treatments, yet a moderate sample size. In addition, the newly established theory of RLS also provides a unified way of studying OLS-based inference from general regression specifications. As an illustration, we demonstrate its value for studying OLS-based regression adjustment in factorial experiments. Importantly, although we analyse inferential procedures that are motivated by OLS, we do not invoke any assumptions required by the underlying linear models.
Summary Relational arrays represent measures of association between pairs of actors, often in varied contexts or over time. Trade flows between countries, financial transactions between individuals, contact frequencies between school children in classrooms and dynamic protein-protein interactions are all examples of relational arrays. Elements of a relational array are often modelled as a linear function of observable covariates. Uncertainty estimates for regression coefficient estimators, and ideally the coefficient estimators themselves, must account for dependence between elements of the array, e.g., relations involving the same actor. Existing estimators of standard errors that recognize such relational dependence rely on estimating extremely complex, heterogeneous structure across actors. This paper develops a new class of parsimonious coefficient and standard error estimators for regressions of relational arrays. We leverage an exchangeability assumption to derive standard error estimators that pool information across actors, and are substantially more accurate than existing estimators in a variety of settings. This exchangeability assumption is pervasive in network and array models in the statistics literature, but not previously considered when adjusting for dependence in a regression setting with relational data. We demonstrate improvements in inference theoretically, via a simulation study, and by analysis of a dataset involving international trade.
Abstract Rejective sampling improves design and estimation efficiency of single-phase sampling when auxiliary information in a finite population is available. When such auxiliary information is unavailable, we propose to use two-phase rejective sampling (TPRS), which involves measuring auxiliary variables for the sample of units in the first phase, followed by the implementation of rejective sampling for the outcome in the second phase. We explore the asymptotic design properties of double expansion and regression estimators under TPRS. We show that TPRS enhances the efficiency of the double-expansion estimator, rendering it comparable to a regression estimator. We further refine the design to accommodate varying importance of covariates and extend it to multi-phase sampling. We start with the theory for the population mean and then extend the theory to parameters defined by general estimating equations. Our asymptotic results for TPRS immediately cover the existing single-phase rejective sampling, under which the asymptotic theory has not been fully established.
The Household Pulse Survey, recently released by the U.S. Census Bureau, gathers information about the respondents’ experiences regarding employment status, food security, housing, physical and mental health, access to health care, and education disruption. Design-based estimates are produced for all 50 states and the District of Columbia (DC), as well as 15 Metropolitan Statistical Areas (MSAs). Using public-use microdata, this paper explores the effectiveness of using unit-level model-based estimators that incorporate spatial dependence for the Household Pulse Survey. In particular, we consider Bayesian hierarchical model-based spatial estimates for both a binomial and a multinomial response under informative sampling. Importantly, we demonstrate that these models can be easily estimated using Hamiltonian Monte Carlo through the Stan software package. As a result, these models can readily be implemented in a production environment. For both the binomial and multinomial responses, an empirical simulation study is conducted, which compares spatial and non-spatial models. Finally, using public-use Household Pulse Survey microdata, we provide an analysis that compares design-based and model-based estimators and demonstrates a reduction in standard errors for the model-based approaches.

