skip to main content

Title: Removing Hidden Confounding by Experimental Grounding
Observational data is increasingly used as a means for making individual-level causal predictions and intervention recommendations. The foremost challenge of causal inference from observational data is hidden confounding, whose presence cannot be tested in data and can invalidate any causal conclusion. Experimental data does not suffer from confounding but is usually limited in both scope and scale. We introduce a novel method of using limited experimental data to correct the hidden confounding in causal effect models trained on larger observational data, even if the observational data does not fully overlap with the experimental data. Our method makes strictly weaker assumptions than existing approaches, and we prove conditions under which it yields a consistent estimator. We demonstrate our method's efficacy using real-world data from a large educational experiment.  more » « less
Award ID(s):
Author(s) / Creator(s):
; ;
Date Published:
Journal Name:
Advances in neural information processing systems
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    With climate change threatening agricultural productivity and global food demand increasing, it is important to better understand which farm management practices will maximize crop yields in various climatic conditions. To assess the effectiveness of agricultural practices, researchers often turn to randomized field experiments, which are reliable for identifying causal effects but are often limited in scope and therefore lack external validity. Recently, researchers have also leveraged large observational datasets from satellites and other sources, which can lead to conclusions biased by confounding variables or systematic measurement errors. Because experimental and observational datasets have complementary strengths, in this paper we propose a method that uses a combination of experimental and observational data in the same analysis. As a case study, we focus on the causal effect of crop rotation on corn (maize) and soybean yields in the Midwestern United States. We find that, in terms of root mean squared error, our hybrid method performs 13% better than using experimental data alone and 26% better than using the observational data alone in the task of predicting the effect of rotation on corn yield at held-out experimental sites. Further, the causal estimates based on our method suggest that benefits of crop rotations on corn yield are lower in years and locations with high temperatures whereas the benefits of crop rotations on soybean yield are higher in years and locations with high temperatures. In particular, we estimated that the benefit of rotation on corn yields (and soybean yields) was 0.85 t ha−1(0.24 t ha−1) on average for the top quintile of temperatures, 1.03 t ha−1(0.21 t ha−1) on average for the whole dataset, and 1.19 t ha−1(0.16 t ha−1) on average for the bottom quintile of temperatures. This association between temperatures and rotation benefits is consistent with the hypothesis that the benefit of the corn-soybean rotation on soybean yield is largely driven by pest pressure reductions while the benefit of the corn-soybean rotation on corn yields is largely driven by nitrogen availability.

    more » « less
  2. null (Ed.)
    One fundamental problem in causal inference is to learn the individual treatment effects (ITE) -- assessing the causal effects of a certain treatment (e.g., prescription of medicine) on an important outcome (e.g., cure of a disease) for each data instance, but the effectiveness of most existing methods is often limited due to the existence of hidden confounders. Recent studies have shown that the auxiliary relational information among data can be utilized to mitigate the confounding bias. However, these works assume that the observational data and the relations among them are static, while in reality, both of them will continuously evolve over time and we refer such data as time-evolving networked observational data. In this paper, we make an initial investigation of ITE estimation on such data. The problem remains difficult due to the following challenges: (1) modeling the evolution patterns of time-evolving networked observational data; (2) controlling the hidden confounders with current data and historical information; (3) alleviating the discrepancy between the control group and the treated group. To tackle these challenges, we propose a novel ITE estimation framework Dynamic Networked Observational Data Deconfounder (\mymodel) which aims to learn representations of hidden confounders over time by leveraging both current networked observational data and historical information. Additionally, a novel adversarial learning based representation balancing method is incorporated toward unbiased ITE estimation. Extensive experiments validate the superiority of our framework when measured against state-of-the-art baselines. The implementation can be accessed in 
    more » « less
  3. Methicillin-resistant Staphylococcus aureus (MRSA) is a type of bacteria resistant to certain antibiotics, making it difficult to prevent MRSA infections. Among decades of efforts to conquer infectious diseases caused by MRSA, many studies have been proposed to estimate the causal effects of close contact (treatment) on MRSA infection (outcome) from observational data. In this problem, the treatment assignment mechanism plays a key role as it determines the patterns of missing counterfactuals -- the fundamental challenge of causal effect estimation. Most existing observational studies for causal effect learning assume that the treatment is assigned individually for each unit. However, on many occasions, the treatments are pairwisely assigned for units that are connected in graphs, i.e., the treatments of different units are entangled. Neglecting the entangled treatments can impede the causal effect estimation. In this paper, we study the problem of causal effect estimation with treatment entangled in a graph. Despite a few explorations for entangled treatments, this problem still remains challenging due to the following challenges: (1) the entanglement brings difficulties in modeling and leveraging the unknown treatment assignment mechanism; (2) there may exist hidden confounders which lead to confounding biases in causal effect estimation; (3) the observational data is often time-varying. To tackle these challenges, we propose a novel method NEAT, which explicitly leverages the graph structure to model the treatment assignment mechanism, and mitigates confounding biases based on the treatment assignment modeling. We also extend our method into a dynamic setting to handle time-varying observational data. Experiments on both synthetic datasets and a real-world MRSA dataset validate the effectiveness of the proposed method, and provide insights for future applications. 
    more » « less
  4. Abstract Causal effects of biodiversity on ecosystem functions can be estimated using experimental or observational designs — designs that pose a tradeoff between drawing credible causal inferences from correlations and drawing generalizable inferences. Here, we develop a design that reduces this tradeoff and revisits the question of how plant species diversity affects productivity. Our design leverages longitudinal data from 43 grasslands in 11 countries and approaches borrowed from fields outside of ecology to draw causal inferences from observational data. Contrary to many prior studies, we estimate that increases in plot-level species richness caused productivity to decline: a 10% increase in richness decreased productivity by 2.4%, 95% CI [−4.1, −0.74]. This contradiction stems from two sources. First, prior observational studies incompletely control for confounding factors. Second, most experiments plant fewer rare and non-native species than exist in nature. Although increases in native, dominant species increased productivity, increases in rare and non-native species decreased productivity, making the average effect negative in our study. By reducing the tradeoff between experimental and observational designs, our study demonstrates how observational studies can complement prior ecological experiments and inform future ones. 
    more » « less
  5. null (Ed.)
    Abstract The increasing availability of passively observed data has yielded a growing interest in “data fusion” methods, which involve merging data from observational and experimental sources to draw causal conclusions. Such methods often require a precarious tradeoff between the unknown bias in the observational dataset and the often-large variance in the experimental dataset. We propose an alternative approach, which avoids this tradeoff: rather than using observational data for inference, we use it to design a more efficient experiment. We consider the case of a stratified experiment with a binary outcome and suppose pilot estimates for the stratum potential outcome variances can be obtained from the observational study. We extend existing results to generate confidence sets for these variances, while accounting for the possibility of unmeasured confounding. Then, we pose the experimental design problem as a regret minimization problem subject to the constraints imposed by our confidence sets. We show that this problem can be converted into a concave maximization and solved using conventional methods. Finally, we demonstrate the practical utility of our methods using data from the Women’s Health Initiative. 
    more » « less