
Title: Causal discovery in the presence of missing data
Missing data are ubiquitous in many domains, such as healthcare. When data entries are not missing completely at random, the (conditional) independence relations in the observed data may differ from those in the complete data generated by the underlying causal process. Consequently, simply applying existing causal discovery methods to the observed data may lead to wrong conclusions. In this paper, we aim to develop a causal discovery method that recovers the underlying causal structure from observed data that are missing under different mechanisms, including missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). With missingness mechanisms represented by missingness graphs (m-graphs), we analyze the conditions under which additional correction is needed to derive conditional independence/dependence relations in the complete data. Based on our analysis, we propose Missing Value PC (MVPC), which extends the PC algorithm to incorporate additional corrections. Our proposed MVPC is shown in theory to give asymptotically correct results even on data that are MAR or MNAR. Experimental results on both synthetic data and real healthcare applications illustrate that the proposed algorithm is able to find correct causal relations even in the general case of MNAR.
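The abstract's central claim, that independence relations in the observed data can differ from those in the complete data, can be illustrated with a small simulation. The sketch below is not the MVPC algorithm and is not taken from the paper. It assumes a hypothetical three-variable linear Gaussian model in which X and Y are independent, Z is a common effect of both, and Y is missing whenever the fully observed Z exceeds a threshold (a MAR mechanism). All names, coefficients, and the threshold are illustrative assumptions.

```python
import numpy as np

def simulate(n=200_000, seed=0):
    """Draw X and Y independently, with Z as their common effect.

    Hypothetical setup: Y is recorded only when Z <= 0.5, so the
    missingness of Y depends on the fully observed Z (MAR).
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    y = rng.normal(size=n)
    z = x + y + 0.5 * rng.normal(size=n)
    y_missing = z > 0.5
    return x, y, z, y_missing

x, y, z, y_missing = simulate()

# Correlation in the complete data: X and Y are independent by construction.
corr_full = np.corrcoef(x, y)[0, 1]

# Correlation after deleting records where Y is missing. Keeping only
# those records selects on Z, a common effect of X and Y, which
# induces a spurious dependence between them.
observed = ~y_missing
corr_obs = np.corrcoef(x[observed], y[observed])[0, 1]

print(f"complete data:        corr(X, Y) = {corr_full:+.3f}")
print(f"after deleting cases: corr(X, Y) = {corr_obs:+.3f}")
```

Under these assumptions, the first correlation should be close to zero and the second clearly negative, so an independence test run only on the retained records would wrongly report X and Y as dependent. This is the kind of case in which the abstract says additional correction is needed.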
Journal Name:
Proceedings of Machine Learning Research
Page Range or eLocation-ID:
1762 - 1770
Sponsoring Org:
National Science Foundation
More Like this
  1. This study compares two missing data procedures in the context of ordinal factor analysis models: pairwise deletion (PD; the default setting in Mplus) and multiple imputation (MI). We examine which procedure yields parameter estimates and model fit indices closer to those of complete data. The performance of PD and MI is compared under a wide range of conditions, including number of response categories, sample size, percent of missingness, and degree of model misfit. Results indicate that both PD and MI yield parameter estimates similar to those from analysis of complete data under conditions where the data are missing completely at random (MCAR). When the data are missing at random (MAR), PD parameter estimates are shown to be severely biased across parameter combinations in the study. When the percentage of missingness is less than 50%, MI yields parameter estimates that are similar to results from complete data. However, the fit indices (i.e., χ², RMSEA, and WRMR) yield estimates that suggest a worse fit than results observed in complete data. We recommend that applied researchers use MI when fitting ordinal factor models with missing data. We further recommend interpreting model fit based on the TLI and CFI incremental fit indices.
  2. A probabilistic query may not be estimable from observed data corrupted by missing values if the data are not missing at random (MAR). It is therefore of theoretical interest and practical importance to determine in principle whether a probabilistic query is estimable from missing data when the data are not MAR. We present algorithms that systematically determine whether the joint probability distribution or a target marginal distribution is estimable from observed data with missing values, assuming that the data-generation model is represented as a Bayesian network, known as an m-graph, that not only encodes the dependencies among the variables but also explicitly portrays the mechanisms responsible for the missingness process. The results significantly advance the existing work.
  3. We consider the problem of learning causal relationships from relational data. Existing approaches rely on queries to a relational conditional independence (RCI) oracle to establish and orient causal relations in such a setting. In practice, queries to an RCI oracle have to be replaced by reliable tests for RCI against available data. Relational data present several unique challenges in testing for RCI. We study the conditions under which traditional iid-based CI tests yield reliable answers to RCI queries against relational data. We show how to conduct CI tests against relational data to robustly recover the underlying relational causal structure. Results of our experiments demonstrate the effectiveness of our proposed approach.
  4. Recently, many regression-based conditional independence (CI) test methods have been proposed to solve the problem of causal discovery. These methods provide alternatives to test CI by first removing the information of the controlling set from the two target variables, and then testing the independence between the corresponding residuals Res1 and Res2. When the residuals are linearly uncorrelated, the independence test between them is nontrivial. With the ability to calculate inner products in high-dimensional space, kernel-based methods are usually used to achieve this goal, but still consume considerable time. In this paper, we investigate the independence between two linear combinations under a linear non-Gaussian structural equation model. We show that the dependence between the two residuals can be captured by the difference between the similarity of (Res1, Res2) and that of (Res1, Res3) (Res3 is generated by random permutation) in high-dimensional space. With this result, we design a new method called SCIT for CI testing, where a permutation test is performed to control the Type I error rate. The proposed method is simpler yet more efficient and effective than the existing ones. When applied to causal discovery, the proposed method outperforms its counterparts in terms of both speed and Type II error rate, especially in the case of small sample size, which is validated by our extensive experiments on various datasets.
  5. Arindam Banerjee; Kenji Fukumizu (Ed.)
    The data drawn from biological, economic, and social systems are often confounded due to the presence of unmeasured variables. Prior work in causal discovery has focused on discrete search procedures for selecting acyclic directed mixed graphs (ADMGs), specifically ancestral ADMGs, that encode ordinary conditional independence constraints among the observed variables of the system. However, confounded systems also exhibit more general equality restrictions that cannot be represented via these graphs, placing a limit on the kinds of structures that can be learned using ancestral ADMGs. In this work, we derive differentiable algebraic constraints that fully characterize the space of ancestral ADMGs, as well as more general classes of ADMGs, arid ADMGs and bow-free ADMGs, that capture all equality restrictions on the observed variables. We use these constraints to cast causal discovery as a continuous optimization problem and design differentiable procedures to find the best fitting ADMG when the data comes from a confounded linear system of equations with correlated errors. We demonstrate the efficacy of our method through simulations and application to a protein expression dataset. Code implementing our methods is open-source and publicly available at https: // and will be incorporated into the Ananke package.