This content will become publicly available on November 20, 2025

Title: Demystifying the Fight Against Complexity: A Comprehensive Study of Live Debugging Activities in Production Cloud Systems
Debugging in production cloud systems (or live debugging) is a critical yet challenging task for on-call developers due to the financial impact of cloud service downtime and the inherent complexity of cloud systems. Unfortunately, how debugging is performed, and the unique challenges faced in the production cloud environment have not been investigated in detail. In this paper, we perform the first fine-grained, observational study of 93 real-world debugging experiences of production cloud failures in 15 widely adopted open-source distributed systems including distributed storage systems, databases, computing frameworks, message passing systems, and container orchestration systems. We examine each debugging experience with a fine-grained lens and categorize over 1700 debugging steps across all incidents. Our study provides a detailed picture of how developers perform various diagnosis activities including failure reproduction, anomaly analysis, program analysis, hypothesis formulation, information collection and online experiments. Highlights of our study include: (1) Analyses of the taxonomies and distributions of both live debugging activities and the underlying reasons for hypothesis forking, which confirm the presence of expert debugging strategies in production cloud systems, and offer insights to guide the training of novice developers and the development of tools that emulate expert behavior. (2) The identification of the primary challenge in anomaly detection (or, observability) for end-to-end debugging: the collection of system-specific data (17.1% of data collected). In comparison, nearly all (96%) invariants utilized to detect anomalies are already present in existing monitoring tools. (3) The identification of the importance of online interventions (i.e., in-production experiments that alter system execution) for live debugging - they are performed as frequently as information collection - with an investigation of different types of interventions and challenges. (4) An examination of novel debugging techniques developers utilized to overcome debugging challenges inherent to or amplified in cloud systems, which offer insights for the development of enhanced debugging tools.
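To make the notion of invariant-based anomaly detection concrete, here is a minimal Python sketch of the kind of check the abstract notes is already expressible in existing monitoring tools; the metric names, thresholds, and helper function are illustrative assumptions, not artifacts of the study.

```python
# Minimal sketch (not from the paper): checking monitoring metrics against
# simple invariants of the kind existing monitoring tools already expose.
# Metric names and thresholds are made up for illustration.
from typing import Callable, Dict

INVARIANTS: Dict[str, Callable[[dict], bool]] = {
    "replication_target_met": lambda m: m["live_replicas"] >= m["target_replicas"],
    "p99_latency_bounded": lambda m: m["p99_latency_ms"] < 500,
    "no_under_replicated_partitions": lambda m: m["under_replicated_partitions"] == 0,
}

def violated_invariants(snapshot: dict) -> list:
    """Return the names of invariants that the current metrics snapshot violates."""
    return [name for name, holds in INVARIANTS.items() if not holds(snapshot)]

snapshot = {
    "live_replicas": 2,
    "target_replicas": 3,
    "p99_latency_ms": 820,
    "under_replicated_partitions": 4,
}
for name in violated_invariants(snapshot):
    print(f"anomaly: invariant '{name}' violated")
```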
Award ID(s):
2140305
PAR ID:
10588846
Author(s) / Creator(s):
Publisher / Repository:
ACM
Date Published:
ISBN:
9798400712869
Page Range / eLocation ID:
341 to 360
Format(s):
Medium: X
Location:
Redmond WA USA
Sponsoring Org:
National Science Foundation
More Like this
  1. In Federated Learning (FL), clients independently train local models and share them with a central aggregator to build a global model. The inability to access clients' data and the collaborative nature of training make FL appealing for applications with data-privacy concerns, such as medical imaging. However, these FL characteristics pose unprecedented challenges for debugging. When a global model's performance deteriorates, identifying the responsible rounds and clients is a major pain point. Developers resort to trial-and-error debugging with subsets of clients, hoping to increase the global model's accuracy or to let future FL rounds retune the model; both approaches are time-consuming and costly. We design a systematic fault localization framework, FedDebug, that advances FL debugging on two novel fronts. First, FedDebug enables interactive debugging of real-time collaborative training in FL by leveraging record-and-replay techniques to construct a simulation that mirrors live FL. FedDebug's breakpoint can help inspect an FL state (round, client, and global model) and move between rounds and clients' models seamlessly, enabling fine-grained, step-by-step inspection. Second, FedDebug automatically identifies the client(s) responsible for lowering the global model's performance without any testing data or labels, both of which are essential for existing debugging techniques. FedDebug's strengths come from adapting differential testing in conjunction with neuron activations to determine the client(s) deviating from normal behavior. FedDebug achieves 100% accuracy in finding a single faulty client and 90.3% accuracy in finding multiple faulty clients. FedDebug's interactive debugging incurs 1.2% overhead during training, while it localizes a faulty client in only 2.1% of a round's training time. With FedDebug, we bring effective debugging practices to federated learning, improving the quality and productivity of FL application developers.
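As a rough illustration of the differential-testing idea described above (comparing clients' neuron activations on generated inputs to spot the deviating model, without test data or labels), here is a minimal PyTorch sketch. It is not FedDebug's actual implementation; the `penultimate` layer attribute and the probe-input shape are assumptions for illustration.

```python
# Illustrative sketch only: flag the client whose model's neuron activations on
# random probe inputs deviate most from the other clients' (no test data or
# labels needed). The `penultimate` attribute and probe shape are assumptions.
import torch

def activation_signature(model: torch.nn.Module, probe: torch.Tensor) -> torch.Tensor:
    """Mean activation of an inner layer over a batch of random probe inputs."""
    captured = []
    hook = model.penultimate.register_forward_hook(          # assumed layer handle
        lambda _m, _in, out: captured.append(out.detach().flatten(1).mean(dim=0)))
    with torch.no_grad():
        model(probe)
    hook.remove()
    return captured[0]

def localize_faulty_client(client_models: dict) -> str:
    """Return the client id whose signature is farthest from the others' mean."""
    probe = torch.randn(32, 3, 32, 32)                        # random inputs, no labels
    sigs = {cid: activation_signature(m, probe) for cid, m in client_models.items()}
    deviation = {}
    for cid, sig in sigs.items():
        others = torch.stack([s for c, s in sigs.items() if c != cid]).mean(dim=0)
        deviation[cid] = torch.norm(sig - others).item()
    return max(deviation, key=deviation.get)                  # largest deviation is suspect
```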
  2. Systematically reasoning about the fine-grained causes of events in a real-world distributed system is challenging. Causality, from the distributed systems literature, can be used to compute the causal history of an arbitrary event in a distributed system, but the event's causal history is an over-approximation of the true causes. Data provenance, from the database literature, precisely describes why a particular tuple appears in the output of a relational query, but data provenance is limited to the domain of static relational databases. In this paper, we present wat-provenance: a novel form of provenance that provides the benefits of causality and data provenance. Given an arbitrary state machine, wat-provenance describes why the state machine produces a particular output when given a particular input. This enables system developers to reason about the causes of events in real-world distributed systems. We observe that automatically extracting the wat-provenance of a state machine is often infeasible. Fortunately, many distributed systems components have simple interfaces from which a developer can directly specify wat-provenance using a technique we call wat-provenance specifications. Leveraging the theoretical foundations of wat-provenance, we implement a prototype distributed debugging framework called Watermelon. 
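To illustrate what a wat-provenance specification for a simple interface might look like, here is a minimal Python sketch for a key-value store: the witness for a get(k) is taken to be the most recent set to that key in the input trace. This is an illustrative reading of the idea above, not the Watermelon prototype's API.

```python
# Illustrative sketch of a wat-provenance specification for a key-value store:
# the witness for a read is the single most recent write to the same key in
# the input trace. Names and types here are assumptions, not Watermelon's API.
from dataclasses import dataclass
from typing import List, Optional

@dataclass(frozen=True)
class Request:
    op: str                      # "set" or "get"
    key: str
    value: Optional[str] = None

def kv_wat_provenance(trace: List[Request], query: Request) -> List[Request]:
    """Return a witness: the subset of the trace that explains the get's output."""
    assert query.op == "get"
    for req in reversed(trace):
        if req.op == "set" and req.key == query.key:
            return [req]         # the latest write to the key explains the read
    return []                    # key never written: the empty trace explains the output

trace = [Request("set", "x", "1"), Request("set", "y", "2"), Request("set", "x", "3")]
print(kv_wat_provenance(trace, Request("get", "x")))   # -> [Request(op='set', key='x', value='3')]
```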
  3. Despite advances in debugging tools, systems debugging today remains largely manual. A developer typically follows an iterative and time-consuming process to move from a reported bug to a bug fix. This is because developers are still responsible for making sense of system-wide semantics, bridging together outputs and features from existing debugging tools, and extracting information from many diverse data sources (e.g., bug reports, source code, comments, documentation, and execution traces). We believe that the latest statistical natural language processing (NLP) techniques can help automatically analyze these data sources and significantly improve the systems debugging experience. We present early results to highlight the promise of NLP-powered debugging, and discuss systems and learning challenges that must be overcome to realize this vision. 
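As one concrete, hedged example of the kind of NLP-assisted triage this vision describes, the sketch below ranks source files by TF-IDF similarity to a bug report; the file names, their summarized contents, and the report text are invented for illustration and are not part of the authors' system.

```python
# Illustrative sketch: rank source files by textual similarity to a bug report
# using off-the-shelf TF-IDF. The report text and file summaries are invented;
# this is one simple instance of NLP-assisted triage, not the authors' system.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

bug_report = "NullPointerException when the region server reassigns a split region under load"
source_files = {
    "RegionSplitter.java": "split region assign daughter regions region server balance load",
    "WALReplayer.java": "write ahead log replay recovery edits",
    "MetricsEmitter.java": "emit jmx metrics counters gauges",
}

matrix = TfidfVectorizer().fit_transform([bug_report, *source_files.values()])
scores = cosine_similarity(matrix[0:1], matrix[1:]).ravel()

# Print the files most related to the report first.
for name, score in sorted(zip(source_files, scores), key=lambda pair: -pair[1]):
    print(f"{score:.2f}  {name}")
```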
  4. Network monitoring is an increasingly important task in the operation of today’s large and complex computer networks. In recent years, technologies leveraging software defined networking and programmable hardware have been proposed. These innovations enable operators to get fine-grained insight into every single packet traversing their network at high rates. They generate packet or flow records of all or a subset of traffic in the network and send them to an analytics system that runs specific applications to detect performance or security issues at line rate in a live manner. Unexplored, however, remains the area of detailed, interactive, and retrospective analysis of network records for debugging or auditing purposes. This is likely due to technical challenges in storing and querying large amounts of network monitoring data efficiently. In this work, we study these challenges in more detail. In particular, we explore recent advances in time series databases and find that these systems not only scale to millions of records per second but also allow for expressive queries significantly simplifying practical network debugging and data analysis in the context of computer network monitoring.
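To give a flavor of the retrospective queries such a record store enables, here is a small sketch that aggregates flow records over a time window; the schema and data are invented, and SQLite stands in for the purpose-built time series databases discussed above.

```python
# Illustrative sketch of a retrospective query over stored flow records:
# "which sources sent the most bytes to port 443 in a one-minute window?".
# The schema and data are invented; SQLite is only a stand-in backend.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flow_records (ts INTEGER, src TEXT, dst TEXT, dst_port INTEGER, bytes INTEGER)")
conn.executemany(
    "INSERT INTO flow_records VALUES (?, ?, ?, ?, ?)",
    [
        (1700000005, "10.0.0.5", "10.0.1.9", 443, 48000),
        (1700000012, "10.0.0.7", "10.0.1.9", 443, 125000),
        (1700000030, "10.0.0.5", "10.0.1.9", 443, 9000),
    ],
)

top_talkers = conn.execute(
    """
    SELECT src, SUM(bytes) AS total_bytes
    FROM flow_records
    WHERE dst_port = 443 AND ts BETWEEN 1700000000 AND 1700000059
    GROUP BY src
    ORDER BY total_bytes DESC
    """
).fetchall()
for src, total_bytes in top_talkers:
    print(src, total_bytes)
```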
  5. Telemetry systems are widely used to collect data from distributed endpoints, analyze the collected data together to gain valuable insights, and store it for historical analytics. These systems consist of four stages (Figure 1): collection, transmission, analysis, and storage. Collectors at the endpoints collect various types of data, which is then transmitted to a central server for analysis. This data is used for multiple downstream tasks, such as dashboard monitoring and anomaly detection. Finally, the data is stored in long-term storage to aid retrospective analytics and debugging.
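For orientation, here is a toy Python sketch that walks a single record through the four stages named above (collection, transmission, analysis, storage); all names, thresholds, and the in-process queue and list stand-ins are illustrative assumptions rather than components of a real telemetry system.

```python
# Toy sketch of the four telemetry stages; a deque stands in for transmission
# and a list for long-term storage. All names and thresholds are illustrative.
import time
from collections import deque

def collect() -> dict:
    """Collection: an endpoint collector samples one metric."""
    return {"host": "node-1", "metric": "cpu_util", "value": 0.87, "ts": time.time()}

transmission_queue = deque()    # Transmission: stand-in for the path to the central server
long_term_store = []            # Storage: stand-in for historical/retrospective storage

def analyze(record: dict) -> None:
    """Analysis: a downstream task such as threshold-based anomaly detection."""
    if record["metric"] == "cpu_util" and record["value"] > 0.8:
        print(f"alert: high CPU on {record['host']} ({record['value']:.0%})")

transmission_queue.append(collect())    # endpoint -> central server
while transmission_queue:
    record = transmission_queue.popleft()
    analyze(record)                     # dashboards / anomaly detection
    long_term_store.append(record)      # kept for retrospective analytics and debugging
```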