Title: Data-driven causal analysis of observational biological time series
Complex systems are challenging to understand, especially when they defy manipulative experiments for practical or ethical reasons. Several fields have developed parallel approaches to infer causal relations from observational time series. Yet, these methods are easy to misunderstand and often controversial. Here, we provide an accessible and critical review of three statistical causal discovery approaches (pairwise correlation, Granger causality, and state space reconstruction), using examples inspired by ecological processes. For each approach, we ask what it tests for, what causal statement it might imply, and when it could lead us astray. We devise new ways of visualizing key concepts, describe some novel pathologies of existing methods, and point out how so-called ‘model-free’ causality tests are not assumption-free. We hope that our synthesis will facilitate thoughtful application of methods, promote communication across different fields, and encourage explicit statements of assumptions. A video walkthrough is available (Video 1 or https://youtu.be/AlV0ttQrjK8).
Award ID(s):
1917258
NSF-PAR ID:
10403134
Author(s) / Creator(s):
;
Date Published:
Journal Name:
eLife
Volume:
11
ISSN:
2050-084X
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
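As a rough illustration of what one of the three approaches named in the abstract above, Granger causality, actually tests, the following sketch may help. It is not taken from the article: the simulated system, its coefficients, the lag order `p`, and the helper names `lagged` and `granger_f` are all assumptions chosen for illustration. The test asks only whether past values of one series improve a linear prediction of another series beyond that series' own past, which is a statement about predictability and not, by itself, about mechanism.

```python
import numpy as np

def lagged(series, p):
    """Columns are the series lagged by 1..p, aligned with series[p:]."""
    n = len(series)
    return np.column_stack([series[p - k : n - k] for k in range(1, p + 1)])

def granger_f(cause, effect, p=2):
    """F statistic comparing two least-squares autoregressions of `effect`:
    restricted (own lags only) vs. full (own lags plus lags of `cause`)."""
    y = effect[p:]
    ones = np.ones((len(y), 1))
    x_restricted = np.hstack([ones, lagged(effect, p)])
    x_full = np.hstack([x_restricted, lagged(cause, p)])

    def rss(design):
        # Residual sum of squares of the ordinary least-squares fit.
        coef = np.linalg.lstsq(design, y, rcond=None)[0]
        return np.sum((y - design @ coef) ** 2)

    rss_r, rss_f = rss(x_restricted), rss(x_full)
    dof = len(y) - x_full.shape[1]
    return ((rss_r - rss_f) / p) / (rss_f / dof)

# Hypothetical two-variable system (illustration only): x drives y with a
# one-step delay, and y has no influence on x.
rng = np.random.default_rng(0)
n = 500
x = np.zeros(n)
y = np.zeros(n)
for t in range(1, n):
    x[t] = 0.5 * x[t - 1] + rng.normal()
    y[t] = 0.5 * y[t - 1] + 0.8 * x[t - 1] + rng.normal()

f_x_to_y = granger_f(x, y)  # expected to be large: lags of x help predict y
f_y_to_x = granger_f(y, x)  # expected to be small: lags of y add little for x
print(f"F(x -> y) = {f_x_to_y:.1f}, F(y -> x) = {f_y_to_x:.1f}")
```

In this sketch a large F statistic in one direction and a small one in the other matches how the data were generated, but only because the simulation satisfies the test's assumptions (linear dynamics, no hidden common driver, a suitable lag order). When those assumptions fail, the same statistic can point the wrong way.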
More Like this
  1. Abstract

    Recent years have witnessed rapid growth of machine learning methods on graph data, especially those powered by effective neural networks. Despite their success in different real-world scenarios, the majority of these methods on graphs focus only on predictive or descriptive tasks and lack consideration of causality. Causal inference can reveal the causality inside data, promote human understanding of the learning process and model prediction, and serve as a significant component of artificial intelligence (AI). An important problem in causal inference is causal effect estimation, which aims to estimate the causal effects of a certain treatment (e.g., prescription of medicine) on an outcome (e.g., cure of disease) at an individual level (e.g., each patient) or a population level (e.g., a group of patients). In this paper, we introduce the background of causal effect estimation from observational data, envision the challenges of causal effect estimation with graphs, and then summarize representative approaches of causal effect estimation with graphs in recent years. Furthermore, we provide some insights for future research directions in related areas. Link to video abstract: https://youtu.be/BpDPOOqw-ns

     
  2. Granger causality is among the widely used data-driven approaches for causal analysis of time series data with applications in various areas including economics, molecular biology, and neuroscience. Two of the main challenges of this methodology are: 1) over-fitting as a result of limited data duration, and 2) correlated process noise as a confounding factor, both leading to errors in identifying the causal influences. Sparse estimation via the LASSO has successfully addressed these challenges for parameter estimation. However, the classical statistical tests for Granger causality resort to asymptotic analysis of ordinary least squares, which require long data duration to be useful and are not immune to confounding effects. In this work, we address this disconnect by introducing a LASSO-based statistic and studying its non-asymptotic properties under the assumption that the true models admit sparse autoregressive representations. We establish fundamental limits for reliable identification of Granger causal influences using the proposed LASSO-based statistic. We further characterize the false positive error probability and test power of a simple thresholding rule for identifying Granger causal effects and provide two methods to set the threshold in a data-driven fashion. We present simulation studies and application to real data to compare the performance of our proposed method to ordinary least squares and existing LASSO-based methods in detecting Granger causal influences, which corroborate our theoretical results. 
  3. The Arctic sea ice has retreated rapidly in the past few decades, which is believed to be driven by various dynamic and thermodynamic processes in the atmosphere. The newly opened water resulting from sea ice decline in turn exerts a large influence on the atmosphere. Therefore, this study aims to investigate the causality between multiple atmospheric processes and sea ice variations using three distinct data-driven causality approaches that have been proposed recently: Temporal Causality Discovery Framework, Non-combinatorial Optimization via Trace Exponential and Augmented lagrangian for Structure learning (NOTEARS), and Directed Acyclic Graph-Graph Neural Networks (DAG-GNN). We apply these three algorithms to 39 years of historical time-series data sets, which include 11 atmospheric variables from the ERA-5 reanalysis product and passive microwave satellite-retrieved sea ice extent. Comparing the causality graphs from these approaches with what we summarized from the literature shows that the static graphs produced by NOTEARS and DAG-GNN are relatively reasonable. The results from NOTEARS indicate that relative humidity and precipitation dominate sea ice changes among all variables, while the results from DAG-GNN suggest that the horizontal and meridional wind are more important for driving sea ice variations. However, both approaches produce some unrealistic cause-effect relationships. Additionally, these three methods do not detect the delayed impact of one variable on another in the Arctic well. It also turns out that the results are rather sensitive to the choice of hyperparameters of the three methods. As a pioneering study, this work paves the way to disentangle the complex causal relationships in the Earth system by taking advantage of cutting-edge Artificial Intelligence technologies.
  4. Over the past several years, multiple different methods to measure the causal fairness of machine learning models have been proposed. However, despite the growing number of publications and implementations, there is still a critical lack of literature that explains the interplay of causality-based fairness notions with the social sciences of philosophy, sociology, and law. We hope to remedy this issue by accumulating and expounding upon the thoughts and discussions of causality-based fairness notions produced by both social and formal (specifically machine learning) sciences in this field guide. In addition to giving the mathematical backgrounds of several popular causality-based fair machine learning notions, we explain their connection to and interplay with the fields of philosophy and law. Further, we explore several criticisms of the current approaches to causality-based fair machine learning from a sociological viewpoint as well as from a technical standpoint. It is our hope that this field guide will help fair machine learning practitioners better understand how their causality-based fairness notions align with important humanistic values (such as fairness) and how we can, as a field, design methods and metrics to better serve oppressed and marginalized populaces. 
  5. Abstract

    Motivation

    Computer inference of biological mechanisms is increasingly approachable due to dynamically rich data sources such as single-cell genomics. Inferred molecular interactions can prioritize hypotheses for wet-lab experiments to expedite biological discovery. However, complex data often come with unwanted biological or technical variations, exposing biases over marginal distribution and sample size in current methods to favor spurious causal relationships.

    Results

    Considering function direction and strength as evidence for causality, we present an adapted functional chi-squared test (AdpFunChisq) that rewards functional patterns over non-functional or independent patterns. On synthetic and three biology datasets, we demonstrate the advantages of AdpFunChisq over 10 methods on overcoming biases that give rise to wide fluctuations in the performance of alternative approaches. On single-cell multiomics data of multiple phenotype acute leukemia, we found that the T-cell surface glycoprotein CD3 delta chain may causally mediate specific genes in the viral carcinogenesis pathway. Using the causality-by-functionality principle, AdpFunChisq offers a viable option for robust causal inference in dynamical systems.

    Availability and implementation

    The AdpFunChisq test is implemented in the R package ‘FunChisq’ (2.5.2 or above) at https://cran.r-project.org/package=FunChisq. All other source code along with pre-processed data is available at Code Ocean https://doi.org/10.24433/CO.2907738.v1

    Supplementary information

    Supplementary materials are available at Bioinformatics online.

     