Data-driven causality discovery is a common way to understand causal relationships among different components of a system. We study how to achieve scalable data-driven causality discovery on the Amazon Web Services (AWS) and Microsoft Azure clouds and propose a causality discovery as a service (CDaaS) framework. With this framework, users can easily rerun previous causality discovery experiments or run causality discovery with different setups (such as new datasets or causality discovery parameters). Our CDaaS leverages the Cloud Container Registry service and the Virtual Machine service to achieve scalable causality discovery with different discovery algorithms. We further conducted extensive experiments and benchmarking of our CDaaS to understand the effects of seven factors (big data engine parameter setting, virtual machine instance number, type, subtype, size, cloud service, cloud provider) and how best to provision cloud resources for our causality discovery service based on goals including execution time, budgetary cost, and cost-performance ratio. We report our findings from the benchmarking, which can help obtain optimal configurations based on each application's characteristics. The findings show that proper configurations can lead to both shorter execution time and lower budgetary cost.
Parallel Gradient Boosting based Granger Causality Learning
Granger causality and its learning algorithms have been widely used in many disciplines to study cause-effect relationships among time series variables. In this paper, we address the computing challenges of state-of-the-art Granger causality learning algorithms, especially when facing the increasing dimensionality of available datasets. We study how to leverage gradient boosting meta machine learning techniques to achieve accurate causality discovery, and big data parallel techniques to achieve efficient causality discovery from large temporal datasets. We propose two main algorithms: gradient boosting based causality learning and parallel gradient boosting based causality learning. Our experiments show that the proposed algorithms can achieve efficient learning in distributed environments with good learning accuracy.
- Award ID(s): 1730250
- PAR ID: 10179454
- Date Published:
- Journal Name: 2019 IEEE International Conference on Big Data (Big Data)
- Page Range / eLocation ID: 2845 to 2854
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
-
Learning nonlinear functions from input-output data pairs is one of the most fundamental problems in machine learning. Recent work has formulated the problem of learning a general nonlinear multivariate function of discrete inputs as a tensor completion problem with smooth latent factors. We build upon this idea and utilize two ensemble learning techniques to enhance its prediction accuracy. Ensemble methods can be divided into two main groups, parallel and sequential. Bagging, also known as bootstrap aggregation, is a parallel ensemble method in which multiple base models are trained in parallel on different subsets of the data, chosen randomly with replacement from the original training data. The outputs of these models are usually combined into a single prediction by averaging. One of the most popular bagging techniques is random forests. Boosting is a sequential ensemble method in which a sequence of base models is fit sequentially to modified versions of the data. Popular boosting algorithms include AdaBoost and Gradient Boosting. We develop two approaches based on these ensemble learning techniques for learning multivariate functions using the Canonical Polyadic Decomposition. We showcase the effectiveness of the proposed ensemble models on several regression tasks and report significant improvements compared to the single model.
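The bagging procedure described in this abstract (bootstrap samples drawn with replacement, one base model per sample, predictions averaged) can be sketched as follows. This is only an illustration under stated assumptions: the abstract's base model is a Canonical Polyadic Decomposition, whereas this sketch swaps in an ordinary least-squares linear model to stay short, and the names `fit_linear`, `bagging_fit`, and `bagging_predict` are hypothetical helpers, not the authors' code.

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares fit with an intercept (stand-in base model, an assumption)."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

def predict_linear(beta, X):
    """Predict with coefficients returned by fit_linear."""
    return np.column_stack([np.ones(len(X)), X]) @ beta

def bagging_fit(X, y, n_models=25, seed=0):
    """Train one base model per bootstrap sample (rows drawn with replacement)."""
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, len(X), size=len(X))  # bootstrap row indices
        models.append(fit_linear(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Combine the base models by averaging their predictions."""
    return np.mean([predict_linear(beta, X) for beta in models], axis=0)

# Synthetic regression data, made up purely for illustration.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y_true = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1]
y = y_true + 0.1 * rng.normal(size=200)

models = bagging_fit(X, y)
y_hat = bagging_predict(models, X)
mse = float(np.mean((y_hat - y_true) ** 2))
```

Because each base model is trained independently of the others, the loop in `bagging_fit` could be distributed across workers, which is what makes bagging a parallel ensemble method in contrast to boosting.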
-
Testing for Granger causality relies on estimating the capacity of dynamics in one time series to forecast dynamics in another. The canonical test for such temporal predictive causality is based on fitting multivariate time series models and is cast in the classical null hypothesis testing framework. In this framework, we are limited to rejecting the null hypothesis or failing to reject it: we can never validly accept the null hypothesis of no Granger causality. This is poorly suited for many common purposes, including evidence integration, feature selection, and other cases where it is useful to express evidence against, rather than for, the existence of an association. Here we derive and implement the Bayes factor for Granger causality in a multilevel modeling framework. This Bayes factor summarizes information in the data in terms of a continuously scaled evidence ratio between the presence of Granger causality and its absence. We also introduce this procedure for the multilevel generalization of Granger causality testing. This facilitates inference when information is scarce or noisy or if we are interested primarily in population-level trends. We illustrate our approach with an application exploring causal relationships in affect using a daily life study.
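The classical test mentioned in this abstract can be sketched as a comparison of two lagged regressions: one that predicts a series from its own past, and one that also uses the past of the candidate cause. The sketch below computes only the F statistic of that comparison with NumPy. It illustrates the classical null-hypothesis approach, not the Bayes factor the abstract derives; the function names, the lag order of 2, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def lagged_design(target, sources, lag):
    """Build a regression matrix from lags 1..lag of each source series."""
    n = len(target)
    cols = [np.ones(n - lag)]  # intercept
    for s in sources:
        for k in range(1, lag + 1):
            cols.append(s[lag - k : n - k])  # value of s at time t - k
    return np.column_stack(cols), target[lag:]

def rss(X, y):
    """Residual sum of squares of the least-squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

def granger_f_stat(x, y, lag=2):
    """F statistic for the hypothesis that past x improves forecasts of y."""
    X_restricted, target = lagged_design(y, [y], lag)   # own past only
    X_full, _ = lagged_design(y, [y, x], lag)           # own past + past of x
    rss_r, rss_f = rss(X_restricted, target), rss(X_full, target)
    dof = len(target) - X_full.shape[1]
    return ((rss_r - rss_f) / lag) / (rss_f / dof)

# Synthetic example (made up): x drives y with a one-step delay, not vice versa.
rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.5 * y[t - 1] + 0.8 * x[t - 1] + 0.1 * rng.normal()

f_x_to_y = granger_f_stat(x, y)
f_y_to_x = granger_f_stat(y, x)
```

A large `f_x_to_y` lets us reject the null of no Granger causality from `x` to `y`. A small `f_y_to_x`, however, only means we fail to reject the null in the other direction, which is exactly the asymmetry the abstract identifies as a limitation.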
-
Complex systems are challenging to understand, especially when they defy manipulative experiments for practical or ethical reasons. Several fields have developed parallel approaches to infer causal relations from observational time series. Yet, these methods are easy to misunderstand and often controversial. Here, we provide an accessible and critical review of three statistical causal discovery approaches (pairwise correlation, Granger causality, and state space reconstruction), using examples inspired by ecological processes. For each approach, we ask what it tests for, what causal statement it might imply, and when it could lead us astray. We devise new ways of visualizing key concepts, describe some novel pathologies of existing methods, and point out how so-called 'model-free' causality tests are not assumption-free. We hope that our synthesis will facilitate thoughtful application of methods, promote communication across different fields, and encourage explicit statements of assumptions. A video walkthrough is available (Video 1 or https://youtu.be/AlV0ttQrjK8).
-
Climate system teleconnections are crucial for improving climate predictability, but difficult to quantify. Standard approaches to identify teleconnections are often based on correlations between time series. Here we present a novel method leveraging Granger causality, which can detect relationships between any two fields. We compare teleconnections identified by correlation and Granger causality at different timescales. We find that both Granger causality and correlation consistently recover known seasonal precipitation responses to the sea surface temperature pattern associated with the El Niño Southern Oscillation. Such findings are robust across multiple time resolutions. In addition, we identify candidates for unexplored teleconnection responses.