skip to main content


Title: Debugging Distributed Systems with Why-Across-Time Provenance
Systematically reasoning about the fine-grained causes of events in a real-world distributed system is challenging. Causality, from the distributed systems literature, can be used to compute the causal history of an arbitrary event in a distributed system, but the event's causal history is an over-approximation of the true causes. Data provenance, from the database literature, precisely describes why a particular tuple appears in the output of a relational query, but data provenance is limited to the domain of static relational databases. In this paper, we present wat-provenance: a novel form of provenance that provides the benefits of causality and data provenance. Given an arbitrary state machine, wat-provenance describes why the state machine produces a particular output when given a particular input. This enables system developers to reason about the causes of events in real-world distributed systems. We observe that automatically extracting the wat-provenance of a state machine is often infeasible. Fortunately, many distributed systems components have simple interfaces from which a developer can directly specify wat-provenance using a technique we call wat-provenance specifications. Leveraging the theoretical foundations of wat-provenance, we implement a prototype distributed debugging framework called Watermelon.  more » « less
Award ID(s):
1652368
NSF-PAR ID:
10080652
Author(s) / Creator(s):
; ; ;
Date Published:
Journal Name:
Proceedings of the ACM Symposium on Cloud Computing
Page Range / eLocation ID:
333 to 346
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Recent advances in causality analysis have enabled investigators to trace multi-stage attacks using provenance graphs. Based on system-layer audit logs (e.g., syscalls), these approaches omit vital sources of application context (e.g., email addresses, HTTP response codes) that can be found in higher layers of the system. Although such information is often essential to understanding attack behaviors, it is difficult to incorporate this evidence into causal analysis engines because of the semantic gap that exists between system layers. To address that shortcoming, we propose the notion of universal provenance, which encodes all forensically relevant causal dependencies regardless of their layer of origin. To transparently realize that vision on commodity systems, we present OmegaLog, a provenance tracker that bridges the semantic gap between system and application logging contexts. OmegaLog analyzes program binaries to identify and model application-layer logging behaviors, enabling accurate reconciliation of application events with system-layer accesses. OmegaLog then intercepts applications’ runtime logging activities and grafts those events onto the system-layer provenance graph, allowing investigators to reason more precisely about the nature of attacks. We demonstrate that our system is widely applicable to existing software projects and can transparently facilitate execution partitioning of provenance graphs without any training or developer intervention. Evaluation on real-world attack scenarios shows that our technique generates concise provenance graphs with rich semantic information relative to the state-of-the-art, with an average runtime overhead of 4% 
    more » « less
  2. Recent advances in causality analysis have enabled investigators to trace multi-stage attacks using whole- system provenance graphs. Based on system-layer audit logs (e.g., syscalls), these approaches omit vital sources of application context (e.g., email addresses, HTTP response codes) that can found in higher layers of the system. Although this information is often essential to understanding attack behaviors, incorporating this evidence into causal analysis engines is difficult due to the semantic gap that exists between system layers. To address this shortcoming, we propose the notion of universal provenance, which encodes all forensically-relevant causal dependencies regardless of their layer of origin. To transparently realize this vision on commodity systems, we present ωLOG (“Omega Log”), a provenance tracking mechanism that bridges the semantic gap between system and application logging contexts. ωLOG analyzes program binaries to identify and model application-layer logging behaviors, enabling application events to be accurately reconciled with system-layer accesses. ωLOG then intercepts applications’ runtime logging activities and grafts those events onto the system-layer provenance graph, allowing investigators to reason more precisely about the nature of attacks. We demonstrate that ωLOG is widely-applicable to existing software projects and can transparently facilitate execution partitioning of dependency graphs without any training or developer intervention. Evaluation on real-world attack scenarios shows that universal provenance graphs are concise and rich with semantic information as compared to the state-of-the-art, with 12% average runtime overhead. 
    more » « less
  3. Data provenance tools capture the steps used to produce analyses. However, scientists must choose among workflow provenance systems, which allow arbitrary code but only track provenance at the granularity of files; provenance APIs, which provide tuple-level provenance, but incur overhead in all computations; and database provenance tools, which track tuple-level provenance through relational operators and support optimization, but support a limited subset of data science tasks. None of these solutions are well suited for tracing errors introduced during common ETL, record alignment, and matching tasks - for data types such as strings, images, etc. Scientists need new capabilities to identify the sources of errors, find why different code versions produce different results, and identify which parameter values affect output. We propose PROVision, a provenance-driven troubleshooting tool that supports ETL and matching computations and traces extraction of content within data objects. PROVision extends database-style provenance techniques to capture equivalences, support optimizations, and enable selective evaluation. We formalize our extensions, implement them in the PROVision system, and validate their effectiveness and scalability for common ETL and matching tasks. 
    more » « less
  4. To guard against machine failures, modern internet services store multiple replicas of the same application data within and across data centers, which introduces the problem of keeping geodistributed replicas consistent with one another in the face of network partitions and unpredictable message latency. To avoid costly and conservative synchronization protocols, many real-world systems provide only weak consistency guarantees (e.g., eventual, causal, or PRAM consistency), which permit certain kinds of disagreement among replicas. There has been much recent interest in language support for specifying and verifying such consistency properties. Although these properties are usually beyond the scope of what traditional type checkers or compiler analyses can guarantee, solver-aided languages are up to the task. Inspired by systems like Liquid Haskell [43] and Rosette [42], we believe that close integration between a language and a solver is the right path to consistent-by-construction distributed applications. Unfortunately, verifying distributed consistency properties requires reasoning about transitive relations (e.g., causality or happens-before), partial orders (e.g., the lattice of replica states under a convergent merge operation), and properties relevant to message processing or API invocation (e.g., commutativity and idempotence) that cannot be easily or efficiently carried out by general-purpose SMT solvers that lack native support for this kind of reasoning. We argue that domain-specific SMT-based tools that exploit the mathematical foundations of distributed consistency would enable both more efficient verification and improved ease of use for domain experts. The principle of exploiting domain knowledge for efficiency and expressivity that has borne fruit elsewhere — such as in the development of high-performance domain-specific languages that trade off generality to gain both performance and productivity — also applies here. Languages augmented with domain-specific, consistency-aware solvers would support the rapid implementation of formally verified programming abstractions that guarantee distributed consistency. In the long run, we aim to democratize the development of such domain-specific solvers by creating a framework for domain-specific solver development that brings new theory solver implementation within the reach of programmers who are not necessarily SMT solver internals experts. 
    more » « less
  5. Safety violations in programmable logic controllers (PLCs), caused either by faults or attacks, have recently garnered significant attention. However, prior efforts at PLC code vetting suffer from many drawbacks. Static analyses and verification cause significant false positives and cannot reveal specific runtime contexts. Dynamic analyses and symbolic execution, on the other hand, fail due to their inability to handle real-world PLC programs that are event-driven and timing sensitive. In this paper, we propose VetPLC, a temporal context-aware, program analysis-based approach to produce timed event sequences that can be used for automatic safety vetting. To this end, we (a) perform static program analysis to create timed event causality graphs in order to understand causal relations among events in PLC code and (b) mine temporal invariants from data traces collected in Industrial Control System (ICS) testbeds to quantitatively gauge temporal dependencies that are constrained by machine operations. Our VetPLC prototype has been implemented in 15K lines of code. We evaluate it on 10 real-world scenarios from two different ICS settings. Our experiments show that VetPLC outperforms state-of-the-art techniques and can generate event sequences that can be used to automatically detect hidden safety violations. 
    more » « less