This content will become publicly available on March 1, 2026

Title: Causal Structural Modeling of Survey Questionnaires via a Bootstrapped Ordinal Bayesian Network Approach
Abstract: Survey questionnaires are commonly used by psychologists and social scientists to measure latent traits of study subjects. Various causal inference methods, such as the potential outcome framework and structural equation models, have been used to infer causal effects. However, the majority of these methods assume knowledge of the true causal structure, which is unknown for many applications in the psychological and social sciences. This calls for alternative causal approaches for analyzing such questionnaire data. Bayesian networks are a promising option as they do not require the causal structure to be known a priori but learn it objectively from data. Although there have been some recent successes in using Bayesian networks to discover causality in psychological questionnaire data, these techniques tend to suffer from causal non-identifiability with observational data. In this paper, we propose the use of a state-of-the-art Bayesian network that is proven to be fully identifiable for observational ordinal data. We develop a causal structure learning algorithm based on an asymptotically justified BIC score function, a hill-climbing search strategy, and the bootstrapping technique, which is able to not only identify a unique causal structure but also quantify the associated uncertainty. Using simulation studies, we demonstrate the power of the proposed learning algorithm by comparing it with alternative Bayesian network methods. For illustration, we consider a dataset from a psychological study of the functional relationships among the symptoms of obsessive-compulsive disorder and depression. Without any prior knowledge, the proposed algorithm reveals some plausible causal relationships. This paper is accompanied by a user-friendly open-source R package OrdCD on CRAN.
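The three ingredients named in the abstract (a BIC score, a hill-climbing search, and bootstrap resampling) can be sketched roughly as follows. This is a minimal illustration only, not the OrdCD implementation: it swaps in a plain multinomial BIC score for discrete data, which scores both orientations of an edge equally and therefore cannot identify edge directions the way the paper's ordinal model is stated to. All function names (`family_bic`, `creates_cycle`, `hill_climb`, `bootstrap_edges`) are hypothetical, and the search uses only edge additions and deletions.

```python
import math
import random
from collections import Counter
from itertools import permutations


def family_bic(data, child, parents, levels):
    """BIC contribution of one node given its parents (multinomial table)."""
    n = len(data)
    joint = Counter((tuple(row[p] for p in parents), row[child]) for row in data)
    marg = Counter(tuple(row[p] for p in parents) for row in data)
    loglik = sum(c * math.log(c / marg[cfg]) for (cfg, _), c in joint.items())
    n_params = (levels[child] - 1) * math.prod(levels[p] for p in parents)
    return loglik - 0.5 * math.log(n) * n_params


def creates_cycle(parents, src, dst):
    """True if adding the edge src -> dst would close a directed cycle."""
    stack, seen = [src], set()
    while stack:  # walk the ancestors of src, looking for dst
        node = stack.pop()
        if node == dst:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(parents[node])
    return False


def hill_climb(data, levels):
    """Greedy search over DAGs; returns a set of (parent, child) edges."""
    d = len(levels)
    parents = {j: set() for j in range(d)}
    score = {j: family_bic(data, j, [], levels) for j in range(d)}
    while True:
        best_gain, best_move = 1e-9, None
        for src, dst in permutations(range(d), 2):
            if src in parents[dst]:
                trial = parents[dst] - {src}   # candidate move: delete edge
            elif not creates_cycle(parents, src, dst):
                trial = parents[dst] | {src}   # candidate move: add edge
            else:
                continue
            gain = family_bic(data, dst, sorted(trial), levels) - score[dst]
            if gain > best_gain:
                best_gain, best_move = gain, (dst, trial)
        if best_move is None:                  # no move improves the score
            return {(p, j) for j in range(d) for p in parents[j]}
        dst, trial = best_move
        parents[dst] = trial
        score[dst] += best_gain


def bootstrap_edges(data, levels, n_boot=50, seed=0):
    """Fraction of bootstrap resamples in which each directed edge is learned."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_boot):
        sample = rng.choices(data, k=len(data))  # resample rows with replacement
        counts.update(hill_climb(sample, levels))
    return {edge: c / n_boot for edge, c in counts.items()}
```

Here `data` is assumed to be a list of tuples of integer category codes and `levels` the number of categories per variable. The returned frequencies serve as a simple uncertainty summary: an edge that appears in nearly every resample is better supported than one that appears only occasionally.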
Award ID(s):
2112943
PAR ID:
10611439
Author(s) / Creator(s):
; ;
Publisher / Repository:
Cambridge University Press
Date Published:
Journal Name:
Psychometrika
Volume:
90
Issue:
1
ISSN:
0033-3123
Page Range / eLocation ID:
229 to 250
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Bayesian networks have been widely used to generate causal hypotheses from multivariate data. Despite their popularity, the vast majority of existing causal discovery approaches make the strong assumption of a (partially) homogeneous sampling scheme. However, such an assumption can be seriously violated, causing significant biases when the underlying population is inherently heterogeneous. To this end, we propose a novel causal Bayesian network model, termed BN-LTE, that embeds heterogeneous samples onto a low-dimensional manifold and builds Bayesian networks conditional on the embedding. This new framework allows for more precise network inference by improving the estimation resolution from the population level to the observation level. Moreover, while causal Bayesian networks are in general not identifiable with purely observational, cross-sectional data due to Markov equivalence, with the blessing of causal effect heterogeneity, we prove that the proposed BN-LTE is uniquely identifiable under relatively mild assumptions. Through extensive experiments, we demonstrate the superior performance of BN-LTE in causal structure learning as well as inferring observation-specific gene regulatory networks from observational data.
  2. The standard approach to answering an identifiable causal-effect query (e.g., P(Y | do(X))) given a causal diagram and observational data is to first generate an estimand, or probabilistic expression over the observable variables, which is then evaluated using the observational data. In this paper, we propose an alternative paradigm for answering causal-effect queries over discrete observable variables. We propose to instead learn the causal Bayesian network and its confounding latent variables directly from the observational data. Then, efficient probabilistic graphical model (PGM) algorithms can be applied to the learned model to answer queries. Perhaps surprisingly, we show that this model completion learning approach can be more effective than estimand approaches, particularly for larger models in which the estimand expressions become computationally difficult. We illustrate our method’s potential using a benchmark collection of Bayesian networks and synthetically generated causal models.
  3. Experiments have long been the gold standard for causal inference in Ecology. As Ecology tackles progressively larger problems, however, we are moving beyond the scales at which randomised controlled experiments are feasible. To answer causal questions at scale, we need to also use observational data, something Ecologists tend to view with great scepticism. The major challenge in using observational data for causal inference is confounding variables: variables affecting both a causal variable and the response of interest. Unmeasured confounders, known or unknown, lead to statistical bias, creating spurious correlations and masking true causal relationships. To combat this omitted variable bias, other disciplines have developed rigorous approaches for causal inference from observational data that flexibly control for broad suites of confounding variables. We show how ecologists can harness some of these methods (causal diagrams to identify confounders, coupled with nested sampling and statistical designs) to reduce risks of omitted variable bias. Using an example of estimating warming effects on snails, we show how current methods in Ecology (e.g., mixed models) produce incorrect inferences due to omitted variable bias and how alternative methods can eliminate it, improving causal inferences with weaker assumptions. Our goal is to expand tools for causal inference using observational and imperfect experimental data in Ecology.
  4. Multivariate functional data arise in a wide range of applications. One fundamental task is to understand the causal relationships among these functional objects of interest. In this paper, we develop a novel Bayesian network (BN) model for multivariate functional data where conditional independencies and causal structure are encoded by a directed acyclic graph. Specifically, we allow the functional objects to deviate from Gaussian processes, which is the key to unique causal structure identification even when the functions are measured with noise. A fully Bayesian framework is designed to infer the functional BN model with natural uncertainty quantification through posterior summaries. Simulation studies and real data examples demonstrate the practical utility of the proposed model.
  5. Complex causal networks underlie many real-world problems, from the regulatory interactions between genes to the environmental patterns used to understand climate change. Computational methods seek to infer these causal networks using observational data and domain knowledge. In this paper, we identify three key requirements for inferring the structure of causal networks for scientific discovery: (1) robustness to noise in observed measurements; (2) scalability to handle hundreds of variables; and (3) flexibility to encode domain knowledge and other structural constraints. We first formalize the problem of joint probabilistic causal structure discovery. We develop an approach using probabilistic soft logic (PSL) that exploits multiple statistical tests, supports efficient optimization over hundreds of variables, and can easily incorporate structural constraints, including imperfect domain knowledge. We compare our method against multiple well-studied approaches on biological and synthetic datasets, showing improvements of up to 20% in F1-score over the best performing baseline in realistic settings.