skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Dads: dynamic slicing continuously-running distributed programs with budget constraints
We present Dads, the first distributed, online, scalable, and cost-effective dynamic slicer for continuously-running distributed programs with respect to user-specified budget constraints. Dads is distributed by design to exploit distributed and parallel computing resources. With an online analysis, it avoids tracing hence the associated time and space costs. Most importantly, Dads achieves and maintains practical scalability and cost-effectiveness tradeoffs according to a given budget on analysis time by continually and automatically adjusting the configuration of its analysis algorithm on the fly via reinforcement learning. Against eight real-world Java distributed systems, we empirically demonstrated the scalability and cost-effectiveness merits of Dads. The open-source tool package of Dads with a demo video is publicly available.  more » « less
Award ID(s):
1936522
PAR ID:
10251189
Author(s) / Creator(s):
; ;
Date Published:
Journal Name:
Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering
Page Range / eLocation ID:
1566 to 1570
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. null (Ed.)
    Distributed software systems are increasingly developed and deployed today. Many of these systems are supposed to run continuously. Given their critical roles in our society and daily lives, assuring the quality of distributed systems is crucial. Analyzing runtime program dependencies has long been a fundamental technique underlying numerous tool support for software quality assurance. Yet conventional approaches to dynamic dependence analysis face severe scalability barriers when they are applied to real-world distributed systems, due to the unbounded executions to be analyzed in addition to common efficiency challenges suffered by dynamic analysis in general. In this article, we present S EADS , a distributed , online , and cost-effective dynamic dependence analysis framework that aims at scaling the analysis to real-world distributed systems. The analysis itself is distributed to exploit the distributed computing resources (e.g., a cluster) of the system under analysis; it works online to overcome the problem with unbounded execution traces while running continuously with the system being analyzed to provide timely querying of analysis results (i.e., runtime dependence set of any given query). Most importantly, given a user-specified time budget, the analysis automatically adjusts itself to better cost-effectiveness tradeoffs (than otherwise) while respecting the budget by changing various analysis parameters according to the time being spent by the dependence analysis. At the core of the automatic adjustment is our application of a reinforcement learning method for the decision making—deciding which configuration to adjust to according to the current configuration and its associated analysis cost with respect to the user budget. We have implemented S EADS for Java and applied it to eight real-world distributed systems with continuous executions. Our empirical results revealed the efficiency and scalability advantages of our framework over a conventional dynamic analysis, at least for dynamic dependence computation at method level. While we demonstrate it in the context of dynamic dependence analysis in this article, the methodology for achieving and maintaining scalability and greater cost-effectiveness against continuously running systems is more broadly applicable to other dynamic analyses. 
    more » « less
  2. Deep reinforcement learning (DRL) has demonstrated significant potential in various applications, including gaming AI, robotics, and system scheduling. DRL algorithms produce, sample, and learn from training data online through a trial-and-error process, demanding considerable time and computational resources. To address this, distributed DRL algorithms and paradigms have been developed to expedite training using extensive resources. Through carefully designed experiments, we are the first to observe that strategically increasing the actor-environment interactions by spawning more concurrent actors at certain training rounds within ephemeral time frames can significantly enhance training efficiency. Yet, current distributed DRL solutions, which are predominantly server-based (or serverful), fail to capitalize on these opportunities due to their long startup times, limited adaptability, and cumbersome scalability. This paper proposesNitro, a generic training engine for distributed DRL algorithms that enforces timely and effective boosting with concurrent actors instantaneously spawned by serverless computing. With serverless functions,Nitroadjusts data sampling strategies dynamically according to the DRL training demands.Nitroseizes the opportunity of real-time boosting by accurately and swiftly detecting an empirical metric. To achieve cost efficiency, we design a heuristic actor scaling algorithm to guideNitrofor cost-aware boosting budget allocation. We integrateNitrowith state-of-the-art DRL algorithms and frameworks and evaluate them on AWS EC2 and Lambda. Experiments with Mujoco and Atari benchmarks show thatNitroimproves the final rewards (i.e., training quality) by up to 6× and reduces training costs by up to 42%. 
    more » « less
  3. The Phasor measurement unit (PMU) measurements are mandatory to monitor the power system’s voltage stability margin in an online manner. Monitoring is key to the secure operation of the grid. Traditionally, online monitoring of voltage stability using synchrophasors required a centralized communication architecture, which leads to the high investment cost and cyber-security concerns. The increasing importance of cyber-security and low investment costs have recently led to the development of distributed algorithms for online monitoring of the grid that are inherently less prone to malicious attacks. In this work, we proposed a novel distributed non-iterative voltage stability index (VSI) by recasting the power flow equations as circles. The processors embedded at each bus in the smart grid with the help of PMUs and communication of voltage phasors between neighboring buses perform simultaneous online computations of VSI. The distributed nature of the index enables the real-time identification of the critical bus of the system with minimal communication infrastructure. The effectiveness of the proposed distributed index is demonstrated on IEEE test systems and contrasted with existing methods to show the benefits of the proposed method in speed, interpretability, identification of outage location, and low sensitivity to noisy measurements. 
    more » « less
  4. Large-scale distributed storage systems, such as object stores, usually apply hashing-based placement and lookup methods to achieve scalability and resource efficiency. However, when object locations are determined by hash values, placement becomes inflexible, failing to optimize or satisfy application requirements such as load balance, failure tolerance, parallelism, and network/system performance. This work presents a novel solution to achieve the best of two worlds: flexibility while maintaining cost-effectiveness and scalability. The proposed method Smash is an object placement and lookup method that achieves full placement flexibility, balanced load, low resource cost, and short latency. Smash utilizes a recent space-efficient data structure and applies it to object-location lookups. We implement Smash as a prototype system and evaluate it in a public cloud. The analysis and experimental results show that Smash achieves full placement flexibility, fast storage operations, fast recovery from node dynamics, and lower DRAM cost (<60%) compared to existing hash-based solutions such as Ceph and MapX. 
    more » « less
  5. In distributed machine learning, while a great deal of attention has been paid on centralized systems that include a central parameter server, decentralized systems have not been fully explored. Decentralized systems have great potentials in the future practical use as they have multiple useful attributes such as less vulnerable to privacy and security issues, better scalability, and less prone to single point of bottleneck and failure. In this paper, we focus on decentralized learning systems and aim to achieve differential privacy with good convergence rate and low communication cost. To achieve this goal, we propose a new algorithm, Leader-Follower Elastic Averaging Stochastic Gradient Descent (LEASGD), driven by a novel Leader-Follower topology and differential privacy model. We also provide a theoretical analysis of the convergence rate of LEASGD and the trade-off between the performance and privacy in the private setting. We evaluate LEASGD in real distributed testbed with poplar deep neural network models MNIST-CNN, MNIST-RNN, and CIFAR-10. Extensive experimental results show that LEASGD outperforms state-of-the-art decentralized learning algorithm DPSGD by achieving nearly 40% lower loss function within same iterations and by 30% reduction of communication cost. Moreover, it spends less differential privacy budget and has final higher accuracy result than DPSGD under private setting. 
    more » « less