
Title: Causal Relational Learning
Causal inference is at the heart of empirical research in natural and social sciences and is critical for scientific discovery and informed decision making. The gold standard in causal inference is performing randomized controlled trials; unfortunately, these are not always feasible due to ethical, legal, or cost constraints. As an alternative, methodologies for causal inference from observational data have been developed in statistical studies and social sciences. However, existing methods critically rely on restrictive assumptions such as the study population consisting of homogeneous elements that can be represented in a single flat table, where each row is referred to as a unit. In contrast, in many real-world settings, the study domain naturally consists of heterogeneous elements with complex relational structure, where the data is naturally represented in multiple related tables. In this paper, we present a formal framework for causal inference from such relational data. We propose a declarative language called CaRL for capturing causal background knowledge and assumptions, and specifying causal queries using simple Datalog-like rules. CaRL provides a foundation for inferring causality and reasoning about the effect of complex interventions in relational domains. We present an extensive experimental evaluation on real relational data to illustrate the applicability of CaRL in social sciences and healthcare.
Award ID(s):
1703281 1907997 1703431 1740850 1703331
Publication Date:
NSF-PAR ID:
10163826
Journal Name:
SIGMOD
Page Range or eLocation-ID:
241 to 256
Sponsoring Org:
National Science Foundation
More Like this
  1. A significant body of research in the data sciences considers unfair discrimination against social categories such as race or gender that could occur or be amplified as a result of algorithmic decisions. Simultaneously, real-world disparities continue to exist, even before algorithmic decisions are made. In this work, we draw on insights from the social sciences brought into the realm of causal modeling and constrained optimization, and develop a novel algorithmic framework for tackling pre-existing real-world disparities. The purpose of our framework, which we call the “impact remediation framework,” is to measure real-world disparities and discover the optimal intervention policies that could help improve equity or access to opportunity for those who are underserved with respect to an outcome of interest. We develop a disaggregated approach to tackling pre-existing disparities that relaxes the typical set of assumptions required for the use of social categories in structural causal models. Our approach flexibly incorporates counterfactuals and is compatible with various ontological assumptions about the nature of social categories. We demonstrate impact remediation with a hypothetical case study and compare our disaggregated approach to an existing state-of-the-art approach, comparing its structure and resulting policy recommendations. In contrast to most work on optimal policy learning, we explore disparity reduction itself as an objective, explicitly focusing the power of algorithms on reducing inequality.
  2. Coupled human and natural systems (CHANS) are complex, dynamic, interconnected systems with feedback across social and environmental dimensions. This feedback leads to formidable challenges for causal inference. Two significant challenges involve assumptions about excludability and the absence of interference. These two assumptions have been largely unexplored in the CHANS literature, but when either is violated, causal inferences from observable data are difficult to interpret. To explore their plausibility, structural knowledge of the system is requisite, as is an explicit recognition that most causal variables in CHANS affect a coupled pairing of environmental and human elements. In a large CHANS literature that evaluates marine protected areas, nearly 200 studies attempt to make causal claims, but few address the excludability assumption. To examine the relevance of interference in CHANS, we develop a stylized simulation of a marine CHANS with shocks that can represent policy interventions, ecological disturbances, and technological disasters. Human and capital mobility in CHANS is both a cause of interference, which biases inferences about causal effects, and a moderator of the causal effects themselves. No perfect solutions exist for satisfying excludability and interference assumptions in CHANS. To elucidate causal relationships in CHANS, multiple approaches will be needed for a given causal question, with the aim of identifying sources of bias in each approach and then triangulating on credible inferences. Within CHANS research, and sustainability science more generally, the path to accumulating an evidence base on causal relationships requires skills and knowledge from many disciplines and effective academic-practitioner collaborations.
  3. General methods have been developed for estimating causal effects from observational data under causal assumptions encoded in the form of a causal graph. Most of this literature assumes that the underlying causal graph is completely specified. However, only observational data is available in most practical settings, which means that one can learn at most a Markov equivalence class (MEC) of the underlying causal graph. In this paper, we study the problem of causal estimation from an MEC represented by a partial ancestral graph (PAG), which is learnable from observational data. We develop a general estimator for any identifiable causal effects in a PAG. The result fills a gap for an end-to-end solution to causal inference from observational data to effects estimation. Specifically, we develop a complete identification algorithm that derives an influence function for any identifiable causal effects from PAGs. We then construct a double/debiased machine learning (DML) estimator that is robust to model misspecification and biases in nuisance function estimation, permitting the use of modern machine learning techniques. Simulation results corroborate the theory.
  4. Conditional independence (CI) tests play a central role in statistical inference, machine learning, and causal discovery. Most existing CI tests assume that the samples are independently and identically distributed (i.i.d.). However, this assumption often does not hold in the case of relational data. We define Relational Conditional Independence (RCI), a generalization of CI to the relational setting. We show how, under a set of structural assumptions, we can test for RCI by reducing the task of testing for RCI on non-i.i.d. data to the problem of testing for CI on several data sets each of which consists of i.i.d. samples. We develop Kernel Relational CI test (KRCIT), a nonparametric test as a practical approach to testing for RCI by relaxing the structural assumptions used in our analysis of RCI. We describe results of experiments with synthetic relational data that show the benefits of KRCIT relative to traditional CI tests that don’t account for the non-i.i.d. nature of relational data.
  5. Causal discovery has witnessed significant progress over the past decades. In particular, many recent causal discovery methods make use of independent, non-Gaussian noise to achieve identifiability of the causal models. Existence of hidden direct common causes, or confounders, generally makes causal discovery more difficult; whenever they are present, the corresponding causal discovery algorithms can be seen as extensions of overcomplete independent component analysis (OICA). However, existing OICA algorithms usually make strong parametric assumptions on the distribution of independent components, which may be violated on real data, leading to sub-optimal or even wrong solutions. In addition, existing OICA algorithms rely on the Expectation Maximization (EM) procedure that requires computationally expensive inference of the posterior distribution of independent components. To tackle these problems, we present a Likelihood-Free Overcomplete ICA algorithm (LFOICA) that estimates the mixing matrix directly by back-propagation without any explicit assumptions on the density function of independent components. Thanks to its computational efficiency, the proposed method makes a number of causal discovery procedures much more practically feasible. For illustrative purposes, we demonstrate the computational efficiency and efficacy of our method in two causal discovery tasks on both synthetic and real data.