Title: Text and Causal Inference: A Review of Using Text to Remove Confounding from Causal Estimates
Many applications of computational social science aim to infer causal conclusions from non-experimental data. Such observational data often contains confounders, variables that influence both potential causes and potential effects. Unmeasured or latent confounders can bias causal estimates, and this has motivated interest in measuring potential confounders from observed text. For example, an individual’s entire history of social media posts or the content of a news article could provide a rich measurement of multiple confounders. Yet, methods and applications for this problem are scattered across different communities and evaluation practices are inconsistent. This review is the first to gather and categorize these examples and provide a guide to data-processing and evaluation decisions. Despite increased attention on adjusting for confounding using text, there are still many open problems, which we highlight in this paper.
Award ID(s):
1845576
NSF-PAR ID:
10187048
Journal Name:
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
Page Range / eLocation ID:
5332 to 5344
Sponsoring Org:
National Science Foundation
More Like this
  1. Methicillin-resistant Staphylococcus aureus (MRSA) is a type of bacteria resistant to certain antibiotics, making it difficult to prevent MRSA infections. Among decades of efforts to conquer infectious diseases caused by MRSA, many studies have sought to estimate the causal effects of close contact (treatment) on MRSA infection (outcome) from observational data. In this problem, the treatment assignment mechanism plays a key role, as it determines the patterns of missing counterfactuals, the fundamental challenge of causal effect estimation. Most existing observational studies for causal effect learning assume that the treatment is assigned individually for each unit. However, on many occasions, treatments are assigned pairwise to units that are connected in graphs, i.e., the treatments of different units are entangled. Neglecting the entangled treatments can impede causal effect estimation. In this paper, we study the problem of causal effect estimation with treatments entangled in a graph. Despite a few explorations of entangled treatments, this problem remains challenging for the following reasons: (1) the entanglement makes it difficult to model and leverage the unknown treatment assignment mechanism; (2) there may exist hidden confounders, which lead to confounding biases in causal effect estimation; (3) the observational data is often time-varying. To tackle these challenges, we propose a novel method, NEAT, which explicitly leverages the graph structure to model the treatment assignment mechanism and mitigates confounding biases based on the treatment assignment modeling. We also extend our method to a dynamic setting to handle time-varying observational data. Experiments on both synthetic datasets and a real-world MRSA dataset validate the effectiveness of the proposed method and provide insights for future applications.
  2. Discovery of causal relations from observational data is essential for many disciplines of science and real-world applications. However, unlike other machine learning algorithms, whose development has been greatly fostered by a large number of available benchmark datasets, causal discovery algorithms are notoriously difficult to evaluate systematically because few datasets with known ground-truth causal relations are available. In this work, we handle the problem of evaluating causal discovery algorithms by building a flexible simulator in the medical setting. We develop a neuropathic pain diagnosis simulator, inspired by the fact that the biological processes of neuropathic pathophysiology are well studied with well-understood causal influences. Our simulator exploits the causal graph of the neuropathic pain pathology, and the parameters in its generator are estimated from real-life patient cases. We show that the data generated from our simulator have statistics similar to those of real-world data. As a clear advantage, the simulator can produce infinite samples without jeopardizing the privacy of real-world patients. Our simulator provides a natural tool for evaluating various types of causal discovery algorithms, including those designed to deal with practical issues in causal discovery, such as unknown confounders, selection bias, and missing data. Using our simulator, we have extensively evaluated causal discovery algorithms under various settings.
  3. Analyzing the causal impact of different policies in reducing the spread of COVID-19 is of critical importance. The main challenge here is the existence of unobserved confounders (e.g., vigilance of residents) which influence both the presence of policies and the spread of COVID-19. Moreover, because the confounders may be time-varying, they are even more difficult to capture. Fortunately, the increasing prevalence of web data from various online applications provides an important source of time-varying observational data and enhances the opportunity to capture the confounders from it; e.g., the vigilance of residents over time can be reflected by the popularity of Google searches about COVID-19 in different time periods. In this paper, we study the problem of assessing the causal effects of different COVID-19-related policies on the outbreak dynamics in different counties at any given time period. To this end, we integrate COVID-19-related observational data covering different U.S. counties over time, and then develop a neural-network-based causal effect estimation framework which learns the representations of time-varying (unobserved) confounders from the observational data. Experimental results indicate the effectiveness of our proposed framework in quantifying the causal impact of policies at different granularities, ranging from a category of policies with a certain goal to a specific policy type. Compared with baseline methods, our assessment of policies is more consistent with existing epidemiological studies of COVID-19. Our assessment also provides insights for future policy-making.
  4. Arai, Kohei (Ed.)
    Discovering causal knowledge is an important aspect of much scientific research, and such findings are often recorded in scholarly articles. Automatically identifying such knowledge from article text can be a useful tool and can act as an impetus for further research on those topics. Such an approach can support numerous applications, including building a causal knowledge graph, constructing pipelines for root cause analysis, and identifying opportunities for drug discovery; more generally, it offers a scalable building block for turning large pieces of text into organized information. However, it requires robust methods to identify and aggregate causal knowledge from a large set of articles. The main challenge in designing new methods is the absence of a large labeled dataset. As a result, existing methods, trained on existing datasets of limited size and with limited variation in linguistic patterns, are unable to generalize well to unseen text. In this paper, we explore multiple unsupervised approaches, including a reinforcement learning-based model that learns to identify causal sentences from a small set of labeled sentences. We describe and discuss in detail our experiments for each approach to further encourage exploration of methods that can also be reused for different tasks, as opposed to simply exploring a supervised learning process which, although superior in performance, lacks the versatility to be repurposed for slightly different tasks. We evaluate our methods on a custom-created dataset and show unique techniques to extract cause-effect relationships from the English language.
  5. Instrumental variable methods are among the most commonly used causal inference approaches to deal with unmeasured confounders in observational studies. The presence of invalid instruments is the primary concern for practical applications, and a fast-growing area of research is inference for the causal effect with possibly invalid instruments. This paper illustrates that the existing confidence intervals may undercover when the valid and invalid instruments are hard to separate in a data-dependent way. To address this, we construct uniformly valid confidence intervals that are robust to the mistakes in separating valid and invalid instruments. We propose to search for a range of treatment effect values that lead to sufficiently many valid instruments. We further devise a novel sampling method, which, together with searching, leads to a more precise confidence interval. Our proposed searching and sampling confidence intervals are uniformly valid and achieve the parametric length under the finite-sample majority and plurality rules. We apply our proposal to examine the effect of education on earnings. The proposed method is implemented in the R package RobustIV available from CRAN.

     