
Title: A Survey on Causal Inference
Causal inference has been a critical research topic across many domains, such as statistics, computer science, education, public policy, and economics, for decades. Nowadays, estimating causal effects from observational data has become an appealing research direction owing to the large amount of available data and the low budget requirement compared with randomized controlled trials. Spurred by the rapid development of machine learning, various causal effect estimation methods for observational data have sprung up. In this survey, we provide a comprehensive review of causal inference methods under the potential outcome framework, one of the well-known causal inference frameworks. The methods are divided into two categories depending on whether they require all three assumptions of the potential outcome framework. For each category, both traditional statistical methods and recent machine-learning-enhanced methods are discussed and compared. Plausible applications of these methods are also presented, including applications in advertising, recommendation, medicine, and so on. Moreover, the commonly used benchmark datasets as well as the open-source code are summarized, which facilitates researchers and practitioners in exploring, evaluating, and applying the causal inference methods.
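As a minimal, self-contained illustration of the weighting-based estimators reviewed in surveys like this one, the inverse probability weighting (IPW) estimator of the average treatment effect can be sketched as follows. This is a hedged sketch, not the survey's own code: the function name and simulated data are illustrative, and the true propensity score is assumed known.

```python
import numpy as np

def ipw_ate(y, t, propensity):
    """Inverse-probability-weighted estimate of the average treatment effect.

    y: observed outcomes; t: binary treatment indicator;
    propensity: P(T=1 | X) for each unit (assumed known here).
    """
    y = np.asarray(y, dtype=float)
    t = np.asarray(t, dtype=float)
    e = np.asarray(propensity, dtype=float)
    # Horvitz-Thompson style weighting: treated units are up-weighted by
    # 1/e(x), control units by 1/(1 - e(x)), to mimic a randomized trial.
    return np.mean(t * y / e - (1 - t) * y / (1 - e))
```

Under the framework's three assumptions (unconfoundedness, positivity, and SUTVA) this estimator is consistent; in practice the propensity score would itself be estimated from covariates, which is where the machine-learning-enhanced methods discussed in the survey come in.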
Journal Name:
ACM Transactions on Knowledge Discovery from Data
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    Recent years have witnessed rocketing growth of machine learning methods on graph data, especially those powered by effective neural networks. Despite their success in different real-world scenarios, the majority of these methods on graphs focus only on predictive or descriptive tasks and lack consideration of causality. Causal inference can reveal the causality inside data, promote human understanding of the learning process and model prediction, and serve as a significant component of artificial intelligence (AI). An important problem in causal inference is causal effect estimation, which aims to estimate the causal effects of a certain treatment (e.g., prescription of medicine) on an outcome (e.g., cure of disease) at an individual level (e.g., each patient) or a population level (e.g., a group of patients). In this paper, we introduce the background of causal effect estimation from observational data, envision the challenges of causal effect estimation with graphs, and then summarize representative approaches to causal effect estimation with graphs in recent years. Furthermore, we provide some insights into future research directions in related areas.

  2. Abstract

    Evaluating treatment effect heterogeneity widely informs treatment decision making. At the moment, much emphasis is placed on the estimation of the conditional average treatment effect via flexible machine learning algorithms. While these methods enjoy some theoretical appeal in terms of consistency and convergence rates, they generally perform poorly in terms of uncertainty quantification. This is troubling since assessing risk is crucial for reliable decision-making in sensitive and uncertain environments. In this work, we propose a conformal inference-based approach that can produce reliable interval estimates for counterfactuals and individual treatment effects under the potential outcome framework. For completely randomized or stratified randomized experiments with perfect compliance, the intervals have guaranteed average coverage in finite samples regardless of the unknown data generating mechanism. For randomized experiments with ignorable compliance and general observational studies obeying the strong ignorability assumption, the intervals satisfy a doubly robust property which states the following: the average coverage is approximately controlled if either the propensity score or the conditional quantiles of potential outcomes can be estimated accurately. Numerical studies on both synthetic and real data sets empirically demonstrate that existing methods suffer from a significant coverage deficit even in simple models. In contrast, our methods achieve the desired coverage with reasonably short intervals.

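The split-conformal construction that underlies interval estimates of this kind can be sketched in a few lines. This is a simplified single-covariate sketch with a linear base model, purely illustrative and not the authors' exact procedure, which handles counterfactuals and weighting for observational data.

```python
import numpy as np

def split_conformal_interval(x_train, y_train, x_cal, y_cal, x_new, alpha=0.1):
    """Split-conformal predictive interval from a simple linear fit.

    Fit on (x_train, y_train), calibrate residuals on (x_cal, y_cal),
    and return an interval at x_new with approximately (1-alpha) coverage.
    """
    # Least-squares fit with intercept.
    A = np.column_stack([np.ones_like(x_train), x_train])
    beta, *_ = np.linalg.lstsq(A, y_train, rcond=None)

    def predict(x):
        return beta[0] + beta[1] * x

    # Conformity scores: absolute residuals on the held-out calibration set.
    scores = np.abs(y_cal - predict(x_cal))
    n = len(scores)
    # Finite-sample-valid quantile level ceil((n+1)(1-alpha))/n.
    q = np.quantile(scores, min(1.0, np.ceil((n + 1) * (1 - alpha)) / n))
    mu = predict(x_new)
    return mu - q, mu + q
```

The finite-sample coverage guarantee holds regardless of how badly the base model is misspecified, which is the property the abstract emphasizes; extending it to counterfactuals requires the weighting arguments developed in the paper.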
  3. Identifying cause-effect relations among variables is a key step in the decision-making process. Whereas causal inference requires randomized experiments, researchers and policy makers are increasingly using observational studies to test causal hypotheses due to the wide availability of data and the infeasibility of experiments. The matching method is the most used technique to make causal inferences from observational data. However, the pair-assignment process in one-to-one matching creates uncertainty in the inference because of different choices made by the experimenter. Recently, discrete optimization models have been proposed to tackle such uncertainty; however, they produce 0-1 nonlinear problems and lack scalability. In this work, we investigate this emerging data science problem and develop a unique computational framework to solve the robust causal inference test instances from observational data with continuous outcomes. In the proposed framework, we first reformulate the nonlinear binary optimization problems as feasibility problems. By leveraging the structure of the feasibility formulation, we develop greedy schemes that are efficient in solving robust test problems. In many cases, the proposed algorithms achieve a globally optimal solution. We perform experiments on real-world data sets to demonstrate the effectiveness of the proposed algorithms and compare our results with the state-of-the-art solver. Our experiments show that the proposed algorithms significantly outperform the exact method in terms of computation time while achieving the same conclusion for causal tests. Both numerical experiments and complexity analysis demonstrate that the proposed algorithms ensure the scalability required for harnessing the power of big data in the decision-making process.
Finally, the proposed framework not only facilitates robust decision making through big-data causal inference, but it can also be utilized in developing efficient algorithms for other nonlinear optimization problems such as quadratic assignment problems. History: Accepted by Ram Ramesh, Area Editor for Data Science and Machine Learning. Funding: This work was supported by the Division of Civil, Mechanical and Manufacturing Innovation of the National Science Foundation [Grant 2047094]. Supplemental Material: The online supplements are available at .
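The pair-assignment ambiguity that motivates the robust test above can be seen even in a greedy nearest-neighbor sketch: different orderings or tie-breaking rules yield different pairings. This is an illustrative toy (scalar covariate, greedy order), not the paper's optimization-based procedure, which reasons over all such choices.

```python
import numpy as np

def greedy_one_to_one_match(x_treated, x_control):
    """Greedy one-to-one nearest-neighbor matching on a scalar covariate.

    Returns, for each treated unit, the index of its matched control.
    The result depends on the order in which treated units are processed,
    which is exactly the experimenter choice that creates inference
    uncertainty in one-to-one matching.
    """
    available = list(range(len(x_control)))
    pairs = []
    for i, xt in enumerate(x_treated):
        # Pick the closest still-unmatched control unit.
        j = min(available, key=lambda k: abs(x_control[k] - xt))
        available.remove(j)
        pairs.append((i, j))
    return pairs
```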
  4. Abstract

    Structural nested mean models (SNMMs) are useful for causal inference of treatment effects in longitudinal observational studies. Most existing works assume that the data are collected at prefixed time points for all subjects, which, however, may be restrictive in practice. To deal with irregularly spaced observations, we assume a class of continuous-time SNMMs and a martingale condition of no unmeasured confounding (NUC) to identify the causal parameters. We develop the semiparametric efficiency theory and locally efficient estimators for continuous-time SNMMs. This task is nontrivial due to the restrictions from the NUC assumption imposed on the SNMM parameter. In the presence of ignorable censoring, we show that the complete-case estimator is optimal among a class of weighting estimators including the inverse probability of censoring weighting estimator, and it achieves a double robustness feature in that it is consistent if at least one of the models for the potential outcome mean function and the treatment process is correctly specified. The new framework allows us to conduct causal analysis respecting the underlying continuous-time nature of data processes. The simulation study shows that the proposed estimator outperforms existing approaches. We estimate the effect of time to initiate highly active antiretroviral therapy on the CD4 count at year 2 from the observational Acute Infection and Early Disease Research Program database.

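The inverse probability of censoring weighting mentioned in the class of estimators above can be illustrated in a toy cross-sectional setting: when the chance of observing the outcome depends on a covariate, the naive complete-case mean is biased, and weighting by the inverse observation probability removes that bias. This sketch is illustrative NumPy code, unrelated to the specific continuous-time SNMM estimators of the paper.

```python
import numpy as np

def ipcw_mean(y, observed, p_observed):
    """Inverse-probability-of-censoring-weighted (Hajek) mean of an outcome.

    y: outcome values (arbitrary where censored); observed: 1 if the
    outcome was seen; p_observed: probability of being uncensored.
    """
    y = np.asarray(y, dtype=float)
    obs = np.asarray(observed, dtype=float)
    # Each observed unit stands in for 1/p of the population.
    w = obs / np.asarray(p_observed, dtype=float)
    return np.sum(w * y) / np.sum(w)
```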
  5. Two important considerations in clinical research studies are proper evaluations of internal and external validity. While randomized clinical trials can overcome several threats to internal validity, they may be prone to poor external validity. Conversely, large prospective observational studies sampled from a broadly generalizable population may be externally valid, yet susceptible to threats to internal validity, particularly confounding. Thus, methods that address confounding and enhance transportability of study results across populations are essential for internally and externally valid causal inference, respectively. These issues persist for another problem closely related to transportability known as data-fusion. We develop a calibration method to generate balancing weights that address confounding and sampling bias, thereby enabling valid estimation of the target population average treatment effect. We compare the calibration approach to two additional doubly robust methods that estimate the effect of an intervention on an outcome within a second, possibly unrelated target population. The proposed methodologies can be extended to resolve data-fusion problems that seek to evaluate the effects of an intervention using data from two related studies sampled from different populations. A simulation study is conducted to demonstrate the advantages and similarities of the different techniques. We also test the performance of the calibration approach in a motivating real data example comparing whether the effect of biguanides vs sulfonylureas (the two most common oral diabetes medication classes for initial treatment) on all-cause mortality described in a historical cohort applies to a contemporary cohort of US Veterans with diabetes.

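The core idea of calibration (balancing) weights can be sketched for a single covariate via entropy balancing: reweight the study sample so its weighted covariate mean matches the target-population mean. This is a deliberately minimal sketch (one covariate, one moment, bisection on the tilting parameter); the paper's method calibrates multiple moments and addresses confounding and sampling bias jointly.

```python
import numpy as np

def entropy_balance_weights(x_source, target_mean, tol=1e-8):
    """Entropy-balancing weights for a single covariate.

    Finds weights w_i proportional to exp(lam * x_i) such that the
    weighted mean of x_source equals target_mean. The weighted mean is
    monotonically increasing in lam, so bisection suffices.
    """
    x = np.asarray(x_source, dtype=float)

    def weighted_mean(lam):
        w = np.exp(lam * x)
        return np.sum(w * x) / np.sum(w)

    lo, hi = -50.0, 50.0  # target_mean must be attainable in this range
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if weighted_mean(mid) < target_mean:
            lo = mid
        else:
            hi = mid
    w = np.exp(0.5 * (lo + hi) * x)
    return w / np.sum(w)  # normalized to sum to one
```

Applied to outcomes, these weights yield a reweighted source-sample mean that targets the other population, the same role the calibration weights play in the transportability and data-fusion problems above.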