Cloud computing has become a major approach to help reproduce computational experiments. Yet there are still two main difficulties in reproducing batch based big data analytics (including descriptive and predictive analytics) in the cloud. The first is how to automate end-to-end scalable execution of analytics including distributed environment provisioning, analytics pipeline description, parallel execution, and resource termination. The second is that an application developed for one cloud is difficult to be reproduced in another cloud, a.k.a. vendor lock-in problem. To tackle these problems, we leverage serverless computing and containerization techniques for automated scalable execution and reproducibility, and utilize the adapter design pattern to enable application portability and reproducibility across different clouds. We propose and develop an open-source toolkit that supports 1) fully automated end-to-end execution and reproduction via a single command, 2) automated data and configuration storage for each execution, 3) flexible client modes based on user preferences, 4) execution history query, and 5) simple reproduction of existing executions in the same environment or a different environment. We did extensive experiments on both AWS and Azure using four big data analytics applications that run on virtual CPU/GPU clusters. The experiments show our toolkit can achieve good execution performance, scalability, and efficient reproducibility for cloud-based big data analytics.
more »
« less
Large-Scale Causality Discovery Analytics as a Service
Data-driven causality discovery is a common way to understand causal relationships among different components of a system. We study how to achieve scalable data-driven causal- ity discovery on Amazon Web Services (AWS) and Microsoft Azure cloud and propose a causality discovery as a service (CDaaS) framework. With this framework, users can easily re- run previous causality discovery experiments or run causality discovery with different setups (such as new datasets or causality discovery parameters). Our CDaaS leverages Cloud Container Registry service and Virtual Machine service to achieve scal- able causality discovery with different discovery algorithms. We further did extensive experiments and benchmarking of our CDaaS to understand the effects of seven factors (big data engine parameter setting, virtual machine instance number, type, subtype, size, cloud service, cloud provider) and how to best provision cloud resources for our causality discovery service based on certain goals including execution time, budgetary cost and cost-performance ratio. We report our findings from the benchmarking, which can help obtain optimal configurations based on each application’s characteristics. The findings show proper configurations could lead to both faster execution time and less budgetary cost.
more »
« less
- PAR ID:
- 10303949
- Date Published:
- Journal Name:
- 2021 IEEE Big Data Conference
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
Granger causality and its learning algorithms have been widely used in many disciplines to study cause-effect relationship among time series variables. In this paper, we address computing challenges of state-of-art Granger causality learning algorithms, specially when facing increasing dimensionality of available datasets. We study how to leverage gradient boosting meta machine learning techniques to achieve accurate causality discovery and big data parallel techniques for efficient causality discovery from large temporal datasets. We propose two main algorithms for gradient boosting based causality learning, and parallel gradient boosting based causality learning. Our experiments show our proposed algorithms can achieve efficient learning in distributed environments with good learning accuracy.more » « less
-
Equivalent services deliver the same functionality with dissimilar non-functional characteristics, including latency, accuracy, and cost. With these dissimilarities in mind, developers can exploit the combined execution of equivalent services to increase accuracy, shorten latency, or reduce cost. However, it remains unknown how to effectively combine equivalent services to satisfy application requirements. With the recent surge in popularity of machine learning, different vendors offer a plethora of equivalent services, whose characteristics are mostly undocumented. As a result, developers cannot make an informed decision about which service to select from a set of equivalent services. To address this problem, we explore different service combination strategies (i.e., majority voting, weighted-majority voting, stacking, and custom) to ascertain their impact on nonfunctional characteristics. In particular, we study how these strategies impact the accuracy, cost, and latency of the face detection task and validate our findings on the sentiment analysis task. We consider the combined executions of commercial web services, deployed in the cloud, and open-source implementations, deployed as edge services. Our evaluation reveals that the combined execution of equivalent services is most effective for improving cost and latency. Informed by our experimental results, we formulate practical guidelines to help developers identify the best execution strategy for a given set of services.more » « less
-
null (Ed.)Causal profiling is a novel and powerful profiling technique that quantifies the potential impact of optimizing a code segment on the program runtime. A key application of causal profiling is to analyze what-if scenarios which typically require a large number of experiments. Besides, the execution of a program highly depends on the underlying machine resources, e.g., CPU, network, storage, so profiling results on one device does not translate directly to another. This is a major bottleneck in our ability to perform scalable performance analysis and greatly limits cross-platform software development. In this paper, we address the above challenges by leveraging a unique property of causal profiling: only relative performance of different resources affects the result of causal profiling, not their absolute performance. We first analytically model and prove causal profiling, the missing piece in the seminal paper. Then, we assert the necessary condition to achieve virtual causal profiling on a secondary device. Building upon the theory, we design VCoz, a virtual causal profiler that enables profiling applications on target devices using measurements on the host device. We implement a prototype of VCoz by tuning multiple hardware components to preserve the relative execution speeds of code segments. Our experiments on benchmarks that stress different system resources demonstrate that VCoz can generate causal profiling reports of Nexus 6P (an ARM-based device) on a host MacBook (x86 architecture) with less than 16% variance.more » « less
-
Complex systems are challenging to understand, especially when they defy manipulative experiments for practical or ethical reasons. Several fields have developed parallel approaches to infer causal relations from observational time series. Yet, these methods are easy to misunderstand and often controversial. Here, we provide an accessible and critical review of three statistical causal discovery approaches (pairwise correlation, Granger causality, and state space reconstruction), using examples inspired by ecological processes. For each approach, we ask what it tests for, what causal statement it might imply, and when it could lead us astray. We devise new ways of visualizing key concepts, describe some novel pathologies of existing methods, and point out how so-called ‘model-free’ causality tests are not assumption-free. We hope that our synthesis will facilitate thoughtful application of methods, promote communication across different fields, and encourage explicit statements of assumptions. A video walkthrough is available (Video 1 or https://youtu.be/AlV0ttQrjK8 ).more » « less
An official website of the United States government

