skip to main content

Attention:

The NSF Public Access Repository (PAR) system and access will be unavailable from 11:00 PM ET on Friday, December 13 until 2:00 AM ET on Saturday, December 14 due to maintenance. We apologize for the inconvenience.


This content will become publicly available on August 24, 2025

Title: Causal Inference with Latent Variables: Recent Advances and Future Prospectives
Causality lays the foundation for the trajectory of our world. Causal inference (CI), which aims to infer intrinsic causal relations among variables of interest, has emerged as a crucial research topic. Nevertheless, the lack of observation of important variables (e.g., confounders, mediators, exogenous variables, etc.) severely compromises the reliability of CI methods. The issue may arise from the inherent difficulty in measuring the variables. Additionally, in observational studies where variables are passively recorded, certain covariates might be inadvertently omitted by the experimenter. Depending on the type of unobserved variables and the specific CI task, various consequences can be incurred if these latent variables are carelessly handled, such as biased estimation of causal effects, incomplete understanding of causal mechanisms, lack of individual-level causal consideration, etc. In this survey, we provide a comprehensive review of recent developments in CI with latent variables. We start by discussing traditional CI techniques when variables of interest are assumed to be fully observed. Afterward, under the taxonomy of circumvention and inference-based methods, we provide an in-depth discussion of various CI strategies to handle latent variables, covering the tasks of causal effect estimation, mediation analysis, counterfactual reasoning, and causal discovery. Furthermore, we generalize the discussion to graph data where interference among units may exist. Finally, we offer fresh aspects for further advancement of CI with latent variables, especially new opportunities in the era of large language models (LLMs).  more » « less
Award ID(s):
2223769 2228534 2154962 2144209 2006844 2316306
PAR ID:
10538553
Author(s) / Creator(s):
; ; ; ; ;
Publisher / Repository:
Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining
Date Published:
ISBN:
9798400704901
Page Range / eLocation ID:
6677 to 6687
Format(s):
Medium: X
Location:
Barcelona Spain
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract Motivation Estimating causal queries, such as changes in protein abundance in response to a perturbation, is a fundamental task in the analysis of biomolecular pathways. The estimation requires experimental measurements on the pathway components. However, in practice many pathway components are left unobserved (latent) because they are either unknown, or difficult to measure. Latent variable models (LVMs) are well-suited for such estimation. Unfortunately, LVM-based estimation of causal queries can be inaccurate when parameters of the latent variables are not uniquely identified, or when the number of latent variables is misspecified. This has limited the use of LVMs for causal inference in biomolecular pathways. Results In this article, we propose a general and practical approach for LVM-based estimation of causal queries. We prove that, despite the challenges above, LVM-based estimators of causal queries are accurate if the queries are identifiable according to Pearl’s do-calculus and describe an algorithm for its estimation. We illustrate the breadth and the practical utility of this approach for estimating causal queries in four synthetic and two experimental case studies, where structures of biomolecular pathways challenge the existing methods for causal query estimation. Availability and implementation The code and the data documenting all the case studies are available at https://github.com/srtaheri/LVMwithDoCalculus. Supplementary information Supplementary data are available at Bioinformatics online. 
    more » « less
  2. Answering counterfactual queries has important applications such as explainability, robustness, and fairness but is challenging when the causal variables are unobserved and the observations are non-linear mixtures of these latent variables, such as pixels in images. One approach is to recover the latent Structural Causal Model (SCM), which may be infeasible in practice due to requiring strong assumptions, e.g., linearity of the causal mechanisms or perfect atomic interventions. Meanwhile, more practical ML-based approaches using naive domain translation models to generate counterfactual samples lack theoretical grounding and may construct invalid counterfactuals. In this work, we strive to strike a balance between practicality and theoretical guarantees by analyzing a specific type of causal query called domain counterfactuals, which hypothesizes what a sample would have looked like if it had been generated in a different domain (or environment). We show that recovering the latent SCM is unnecessary for estimating domain counterfactuals, thereby sidestepping some of the theoretic challenges. By assuming invertibility and sparsity of intervention, we prove domain counterfactual estimation error can be bounded by a data fit term and intervention sparsity term. Building upon our theoretical results, we develop a theoretically grounded practical algorithm that simplifies the modeling process to generative model estimation under autoregressive and shared parameter constraints that enforce intervention sparsity. Finally, we show an improvement in counterfactual estimation over baseline methods through extensive simulated and image-based experiments. 
    more » « less
  3. Chaudhuri, Kamalika ; Jegelka, Stefanie ; Song, Le ; Szepesvari, Csaba ; Niu, Gang ; Sabato, Sivan (Ed.)
    Traditional causal discovery methods mainly focus on estimating causal relations among measured variables, but in many real-world problems, such as questionnaire-based psychometric studies, measured variables are generated by latent variables that are causally related. Accordingly, this paper investigates the problem of discovering the hidden causal variables and estimating the causal structure, including both the causal relations among latent variables and those between latent and measured variables. We relax the frequently-used measurement assumption and allow the children of latent variables to be latent as well, and hence deal with a specific type of latent hierarchical causal structure. In particular, we define a minimal latent hierarchical structure and show that for linear non-Gaussian models with the minimal latent hierarchical structure, the whole structure is identifiable from only the measured variables. Moreover, we develop a principled method to identify the structure by testing for Generalized Independent Noise (GIN) conditions in specific ways. Experimental results on both synthetic and real-world data show the effectiveness of the proposed approach. 
    more » « less
  4. Tests of conditional independence (CI) of ran- dom variables play an important role in ma- chine learning and causal inference. Of partic- ular interest are kernel-based CI tests which allow us to test for independence among ran- dom variables with complex distribution func- tions. The efficacy of a CI test is measured in terms of its power and its calibratedness. We show that the Kernel CI Permutation Test (KCIPT) suffers from a loss of calibratedness as its power is increased by increasing the number of bootstraps. To address this limita- tion, we propose a novel CI test, called Self- Discrepancy Conditional Independence Test (SDCIT). SDCIT uses a test statistic that is a modified unbiased estimate of maximum mean discrepancy (MMD), the largest difference in the means of features of the given sample and its permuted counterpart in the kernel-induced Hilbert space. We present results of experi- ments that demonstrate SDCIT is, relative to the other methods: (i) competitive in terms of its power and calibratedness, outperforming other methods when the number of condition- ing variables is large; (ii) more robust with re- spect to the choice of the kernel function; and (iii) competitive in run time. 
    more » « less
  5. Identifying latent variables and causal structures from observational data is essential to many real-world applications involving biological data, medical data, and unstructured data such as images and languages. However, this task can be highly challenging, especially when observed variables are generated by causally related latent variables and the relationships are nonlinear. In this work, we investigate the identification problem for nonlinear latent hierarchical causal models in which observed variables are generated by a set of causally related latent variables, and some latent variables may not have observed children. We show that the identifiability of causal structures and latent variables (up to invertible transformations) can be achieved under mild assumptions: on causal structures, we allow for multiple paths between any pair of variables in the graph, which relaxes latent tree assumptions in prior work; on structural functions, we permit general nonlinearity and multi-dimensional continuous variables, alleviating existing work's parametric assumptions. Specifically, we first develop an identification criterion in the form of novel identifiability guarantees for an elementary latent variable model. Leveraging this criterion, we show that both causal structures and latent variables of the hierarchical model can be identified asymptotically by explicitly constructing an estimation procedure. To the best of our knowledge, our work is the first to establish identifiability guarantees for both causal structures and latent variables in nonlinear latent hierarchical models. 
    more » « less