

Title: Parallel Gradient Boosting based Granger Causality Learning
Granger causality and its learning algorithms have been widely used in many disciplines to study cause-effect relationships among time series variables. In this paper, we address the computing challenges of state-of-the-art Granger causality learning algorithms, especially in the face of the increasing dimensionality of available datasets. We study how to leverage gradient boosting, a meta machine learning technique, to achieve accurate causality discovery, and big data parallel techniques for efficient causality discovery from large temporal datasets. We propose two main algorithms: gradient boosting based causality learning and parallel gradient boosting based causality learning. Our experiments show that the proposed algorithms can achieve efficient learning in distributed environments with good learning accuracy.
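As an illustrative companion to the abstract, the sketch below shows the general idea behind testing Granger causality with gradient boosting: a candidate series x is judged to Granger-cause y when adding lagged values of x to a boosted regressor noticeably reduces out-of-sample prediction error. This is a minimal sketch using scikit-learn on synthetic data, not the paper's algorithm; the lag count, train/test split, and error-reduction score are illustrative assumptions.

```python
# Minimal sketch of gradient-boosting-based Granger causality testing.
# Not the paper's algorithm; it only illustrates the general idea that
# x Granger-causes y if lagged x improves out-of-sample prediction of y.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

def lagged(series, lags):
    """Stack `lags` lagged copies of a 1-D series into a feature matrix."""
    return np.column_stack(
        [series[lags - k - 1 : len(series) - k - 1] for k in range(lags)]
    )

def gb_granger_score(x, y, lags=3, test_frac=0.3, random_state=0):
    """Relative out-of-sample error reduction when lagged x is added to lagged y."""
    target = y[lags:]
    restricted = lagged(y, lags)                     # past of y only
    full = np.hstack([restricted, lagged(x, lags)])  # past of y and x
    split = int(len(target) * (1 - test_frac))
    errors = []
    for feats in (restricted, full):
        model = GradientBoostingRegressor(random_state=random_state)
        model.fit(feats[:split], target[:split])
        errors.append(mean_squared_error(target[split:], model.predict(feats[split:])))
    return (errors[0] - errors[1]) / errors[0]       # > 0 suggests x helps predict y

# Toy check: x drives y with a one-step delay, so the score should be clearly positive.
rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 0.8 * np.roll(x, 1) + rng.normal(scale=0.1, size=500)
print(gb_granger_score(x, y))
```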
Award ID(s):
1730250
NSF-PAR ID:
10179454
Author(s) / Creator(s):
Date Published:
Journal Name:
2019 IEEE International Conference on Big Data (Big Data)
Page Range / eLocation ID:
2845 to 2854
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Data-driven causality discovery is a common way to understand causal relationships among different components of a system. We study how to achieve scalable data-driven causality discovery on the Amazon Web Services (AWS) and Microsoft Azure clouds and propose a causality discovery as a service (CDaaS) framework. With this framework, users can easily re-run previous causality discovery experiments or run causality discovery with different setups (such as new datasets or causality discovery parameters). Our CDaaS leverages Cloud Container Registry and Virtual Machine services to achieve scalable causality discovery with different discovery algorithms. We further conducted extensive experiments and benchmarking of our CDaaS to understand the effects of seven factors (big data engine parameter setting, virtual machine instance number, type, subtype, size, cloud service, and cloud provider) and how to best provision cloud resources for our causality discovery service based on goals including execution time, budgetary cost, and cost-performance ratio. We report our findings from the benchmarking, which can help obtain optimal configurations based on each application's characteristics. The findings show that proper configurations can lead to both faster execution time and lower budgetary cost.
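To make the cost-performance goal above concrete, here is a small, hypothetical sketch of ranking benchmarked configurations by execution time, budgetary cost, or a cost-performance product. The configuration names, runtimes, and hourly prices are placeholders, not results from the paper, and the paper's exact definition of the cost-performance ratio may differ.

```python
# Hypothetical sketch (not part of the CDaaS framework) of ranking cloud
# configurations by different provisioning goals.
from dataclasses import dataclass

@dataclass
class RunResult:
    name: str               # configuration label (instance type, VM count, ...)
    runtime_hours: float    # measured execution time of the causality job
    price_per_hour: float   # provider's hourly price for that configuration

    @property
    def cost(self) -> float:
        return self.runtime_hours * self.price_per_hour

def rank(results, goal):
    """Rank configurations by 'time', 'cost', or 'cost_performance' (lower is better)."""
    keys = {
        "time": lambda r: r.runtime_hours,
        "cost": lambda r: r.cost,
        # One common definition: budgetary cost multiplied by execution time.
        "cost_performance": lambda r: r.cost * r.runtime_hours,
    }
    return sorted(results, key=keys[goal])

# Placeholder benchmark rows, for illustration only.
runs = [
    RunResult("aws-small-4vm", 3.0, 0.20),
    RunResult("aws-large-2vm", 1.2, 0.80),
    RunResult("azure-xlarge-1vm", 0.9, 1.60),
]
print([r.name for r in rank(runs, "cost_performance")])
```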
  2. Learning nonlinear functions from input-output data pairs is one of the most fundamental problems in machine learning. Recent work has formulated the problem of learning a general nonlinear multivariate function of discrete inputs as a tensor completion problem with smooth latent factors. We build upon this idea and utilize two ensemble learning techniques to enhance its prediction accuracy. Ensemble methods can be divided into two main groups, parallel and sequential. Bagging, also known as bootstrap aggregation, is a parallel ensemble method in which multiple base models are trained in parallel on different subsets of the data, chosen randomly with replacement from the original training data. The outputs of these models are usually combined into a single prediction by averaging. One of the most popular bagging techniques is random forests. Boosting is a sequential ensemble method in which a sequence of base models is fit sequentially to modified versions of the data. Popular boosting algorithms include AdaBoost and Gradient Boosting. We develop two approaches based on these ensemble learning techniques for learning multivariate functions using the Canonical Polyadic Decomposition. We showcase the effectiveness of the proposed ensemble models on several regression tasks and report significant improvements compared to a single model.
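The two ensemble strategies described above can be sketched with plain decision-tree regressors from scikit-learn standing in for the paper's CPD-based base models; this is only an illustration of bagging and squared-loss gradient boosting, not the proposed tensor-completion ensembles.

```python
# Minimal sketch of the two ensemble strategies, with decision trees as
# stand-in base models (the abstract's CPD-based models are not reproduced).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagging_predict(X_train, y_train, X_test, n_models=20, seed=0):
    """Bagging: train base models on bootstrap samples, average their predictions."""
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(n_models):
        idx = rng.integers(0, len(X_train), size=len(X_train))  # sample with replacement
        model = DecisionTreeRegressor(max_depth=3).fit(X_train[idx], y_train[idx])
        preds.append(model.predict(X_test))
    return np.mean(preds, axis=0)

def boosting_predict(X_train, y_train, X_test, n_models=50, lr=0.1):
    """Gradient boosting with squared loss: each base model fits the current residuals."""
    residual = y_train.astype(float).copy()
    prediction = np.zeros(len(X_test))
    for _ in range(n_models):
        model = DecisionTreeRegressor(max_depth=2).fit(X_train, residual)
        residual -= lr * model.predict(X_train)   # sequentially shrink the residuals
        prediction += lr * model.predict(X_test)
    return prediction

# Tiny demo on synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=300)
print(bagging_predict(X[:200], y[:200], X[200:])[:3])
print(boosting_predict(X[:200], y[:200], X[200:])[:3])
```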
  3. Current development of high-performance fiber-reinforced cementitious composites (HPFRCC) mainly relies on intensive experiments. The main purpose of this study is to develop a machine learning method for effective and efficient discovery and development of HPFRCC. Specifically, this research develops machine learning models to predict the mechanical properties of HPFRCC through innovative incorporation of micromechanics, aiming to increase the prediction accuracy and generalization performance by enriching and improving the datasets through data cleaning, principal component analysis (PCA), and K-fold cross-validation. This study considers a total of 14 different mix design variables and predicts the ductility of HPFRCC for the first time, in addition to the compressive and tensile strengths. Different types of machine learning methods are investigated and compared, including artificial neural network (ANN), support vector regression (SVR), classification and regression tree (CART), and extreme gradient boosting tree (XGBoost). The results show that the developed machine learning models can reasonably predict the concerned mechanical properties and can be applied to perform parametric studies for the effects of different mix design variables on the mechanical properties. This study is expected to greatly promote efficient discovery and development of HPFRCC.
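A hedged sketch of the model-comparison step follows: K-fold cross-validation of several regressors on a 14-variable tabular dataset. The data here is a random placeholder rather than the HPFRCC mix-design dataset, and scikit-learn's gradient boosting regressor stands in for XGBoost (which comes from the separate xgboost package).

```python
# Hedged sketch of comparing regressors with PCA preprocessing and 5-fold CV.
# The dataset is a synthetic placeholder; column meanings are hypothetical.
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 14))                 # 14 mix-design variables (placeholder values)
y = 2 * X[:, 0] + X[:, 1] ** 2 + rng.normal(scale=0.1, size=200)  # placeholder target property

models = {
    "ANN": MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=0),
    "SVR": SVR(),
    "CART": DecisionTreeRegressor(random_state=0),
    "GBT": GradientBoostingRegressor(random_state=0),  # stand-in for XGBoost
}

cv = KFold(n_splits=5, shuffle=True, random_state=0)
for name, model in models.items():
    pipe = make_pipeline(StandardScaler(), PCA(n_components=10), model)
    scores = cross_val_score(pipe, X, y, cv=cv, scoring="r2")
    print(f"{name}: mean R^2 = {scores.mean():.3f}")
```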
  4. Spiking neural networks (SNNs) are well suited to spatio-temporal learning and to energy-efficient, event-driven neuromorphic hardware. As an important class of SNNs, recurrent spiking neural networks (RSNNs) possess great computational power. However, the practical application of RSNNs is severely limited by challenges in training. Biologically inspired unsupervised learning has limited capability in boosting the performance of RSNNs. On the other hand, existing backpropagation (BP) methods suffer from the high complexity of unfolding in time, vanishing and exploding gradients, and the approximate differentiation of discontinuous spiking activities when applied to RSNNs. To enable supervised training of RSNNs under a well-defined loss function, we present a novel Spike-Train level RSNNs Backpropagation (ST-RSBP) algorithm for training deep RSNNs. The proposed ST-RSBP directly computes the gradient of a rate-coded loss function, defined at the output layer of the network, with respect to the tunable parameters. The scalability of ST-RSBP is achieved by the proposed spike-train level computation, during which the temporal effects of the SNN are captured in both the forward and backward passes of BP. Our ST-RSBP algorithm can be broadly applied to RSNNs with a single recurrent layer or to deep RSNNs with multiple feedforward and recurrent layers. On challenging speech and image datasets, including TI46, N-TIDIGITS, Fashion-MNIST, and MNIST, ST-RSBP is able to train SNNs with an accuracy surpassing that of current state-of-the-art SNN BP algorithms and conventional non-spiking deep learning models.
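The difficulty of differentiating discontinuous spiking activity, mentioned above, can be illustrated with a simplified surrogate-gradient example. The code below is not the ST-RSBP algorithm; it is a single leaky integrate-and-fire layer trained in PyTorch with a rate-coded MSE loss, a rectangular surrogate derivative, and random placeholder data.

```python
# Simplified surrogate-gradient illustration (not ST-RSBP): spikes are hard
# thresholds in the forward pass, but backprop uses a smooth stand-in derivative.
import torch

class SurrogateSpike(torch.autograd.Function):
    @staticmethod
    def forward(ctx, membrane):
        ctx.save_for_backward(membrane)
        return (membrane > 1.0).float()          # spike if membrane exceeds threshold 1

    @staticmethod
    def backward(ctx, grad_output):
        (membrane,) = ctx.saved_tensors
        # Rectangular surrogate derivative around the threshold.
        surrogate = (membrane.sub(1.0).abs() < 0.5).float()
        return grad_output * surrogate

spike = SurrogateSpike.apply

def run_lif(inputs, weight, decay=0.9, steps=20):
    """Simulate a leaky integrate-and-fire layer and return output spike rates."""
    v = torch.zeros(inputs.shape[0], weight.shape[1])
    rate = torch.zeros(inputs.shape[0], weight.shape[1])
    for _ in range(steps):
        v = decay * v + inputs @ weight          # leak plus constant input current
        s = spike(v)
        v = v * (1 - s)                          # reset membrane after a spike
        rate = rate + s / steps
    return rate

torch.manual_seed(0)
x = torch.rand(32, 10)                           # placeholder input rates
target = torch.rand(32, 4)                       # placeholder target rates
w = (torch.randn(10, 4) * 0.3).requires_grad_()
opt = torch.optim.Adam([w], lr=0.05)

for epoch in range(100):
    loss = torch.mean((run_lif(x, w) - target) ** 2)   # rate-coded MSE loss
    opt.zero_grad()
    loss.backward()
    opt.step()
print(float(loss))
```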
  5. Complex systems are challenging to understand, especially when they defy manipulative experiments for practical or ethical reasons. Several fields have developed parallel approaches to infer causal relations from observational time series. Yet, these methods are easy to misunderstand and often controversial. Here, we provide an accessible and critical review of three statistical causal discovery approaches (pairwise correlation, Granger causality, and state space reconstruction), using examples inspired by ecological processes. For each approach, we ask what it tests for, what causal statement it might imply, and when it could lead us astray. We devise new ways of visualizing key concepts, describe some novel pathologies of existing methods, and point out how so-called ‘model-free’ causality tests are not assumption-free. We hope that our synthesis will facilitate thoughtful application of methods, promote communication across different fields, and encourage explicit statements of assumptions. A video walkthrough is available (Video 1 or https://youtu.be/AlV0ttQrjK8).
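For the Granger causality approach surveyed above, an off-the-shelf bivariate test is readily available; the sketch below applies statsmodels' grangercausalitytests to synthetic data and inherits the assumptions (e.g. linearity, no hidden confounders) that the review cautions against treating as free.

```python
# Standard bivariate Granger causality test on synthetic data; the data and
# lag choice are illustrative, not taken from the review above.
import numpy as np
from statsmodels.tsa.stattools import grangercausalitytests

rng = np.random.default_rng(1)
x = rng.normal(size=300)
y = 0.7 * np.roll(x, 1) + rng.normal(scale=0.2, size=300)  # x drives y with lag 1

# Column order matters: the test asks whether the 2nd column helps predict the 1st.
data = np.column_stack([y, x])
results = grangercausalitytests(data, maxlag=2)
# Small p-values on the F-tests suggest x Granger-causes y, conditional on the
# usual assumptions of the test.
```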