skip to main content

Title: Debugging Distributed Systems with Why-Across-Time Provenance
Systematically reasoning about the fine-grained causes of events in a real-world distributed system is challenging. Causality, from the distributed systems literature, can be used to compute the causal history of an arbitrary event in a distributed system, but the event's causal history is an over-approximation of the true causes. Data provenance, from the database literature, precisely describes why a particular tuple appears in the output of a relational query, but data provenance is limited to the domain of static relational databases. In this paper, we present wat-provenance: a novel form of provenance that provides the benefits of causality and data provenance. Given an arbitrary state machine, wat-provenance describes why the state machine produces a particular output when given a particular input. This enables system developers to reason about the causes of events in real-world distributed systems. We observe that automatically extracting the wat-provenance of a state machine is often infeasible. Fortunately, many distributed systems components have simple interfaces from which a developer can directly specify wat-provenance using a technique we call wat-provenance specifications. Leveraging the theoretical foundations of wat-provenance, we implement a prototype distributed debugging framework called Watermelon.
Authors:
; ; ;
Award ID(s):
1652368
Publication Date:
NSF-PAR ID:
10080652
Journal Name:
Proceedings of the ACM Symposium on Cloud Computing
Page Range or eLocation-ID:
333 to 346
Sponsoring Org:
National Science Foundation
More Like this
  1. Recent advances in causality analysis have enabled investigators to trace multi-stage attacks using provenance graphs. Based on system-layer audit logs (e.g., syscalls), these approaches omit vital sources of application context (e.g., email addresses, HTTP response codes) that can be found in higher layers of the system. Although such information is often essential to understanding attack behaviors, it is difficult to incorporate this evidence into causal analysis engines because of the semantic gap that exists between system layers. To address that shortcoming, we propose the notion of universal provenance, which encodes all forensically relevant causal dependencies regardless of their layer of origin. To transparently realize that vision on commodity systems, we present OmegaLog, a provenance tracker that bridges the semantic gap between system and application logging contexts. OmegaLog analyzes program binaries to identify and model application-layer logging behaviors, enabling accurate reconciliation of application events with system-layer accesses. OmegaLog then intercepts applications’ runtime logging activities and grafts those events onto the system-layer provenance graph, allowing investigators to reason more precisely about the nature of attacks. We demonstrate that our system is widely applicable to existing software projects and can transparently facilitate execution partitioning of provenance graphs without any training or developer intervention.more »Evaluation on real-world attack scenarios shows that our technique generates concise provenance graphs with rich semantic information relative to the state-of-the-art, with an average runtime overhead of 4%« less
  2. Recent advances in causality analysis have enabled investigators to trace multi-stage attacks using whole- system provenance graphs. Based on system-layer audit logs (e.g., syscalls), these approaches omit vital sources of application context (e.g., email addresses, HTTP response codes) that can found in higher layers of the system. Although this information is often essential to understanding attack behaviors, incorporating this evidence into causal analysis engines is difficult due to the semantic gap that exists between system layers. To address this shortcoming, we propose the notion of universal provenance, which encodes all forensically-relevant causal dependencies regardless of their layer of origin. To transparently realize this vision on commodity systems, we present ωLOG (“Omega Log”), a provenance tracking mechanism that bridges the semantic gap between system and application logging contexts. ωLOG analyzes program binaries to identify and model application-layer logging behaviors, enabling application events to be accurately reconciled with system-layer accesses. ωLOG then intercepts applications’ runtime logging activities and grafts those events onto the system-layer provenance graph, allowing investigators to reason more precisely about the nature of attacks. We demonstrate that ωLOG is widely-applicable to existing software projects and can transparently facilitate execution partitioning of dependency graphs without any training or developer intervention.more »Evaluation on real-world attack scenarios shows that universal provenance graphs are concise and rich with semantic information as compared to the state-of-the-art, with 12% average runtime overhead.« less
  3. Data provenance tools capture the steps used to produce analyses. However, scientists must choose among workflow provenance systems, which allow arbitrary code but only track provenance at the granularity of files; provenance APIs, which provide tuple-level provenance, but incur overhead in all computations; and database provenance tools, which track tuple-level provenance through relational operators and support optimization, but support a limited subset of data science tasks. None of these solutions are well suited for tracing errors introduced during common ETL, record alignment, and matching tasks - for data types such as strings, images, etc. Scientists need new capabilities to identify the sources of errors, find why different code versions produce different results, and identify which parameter values affect output. We propose PROVision, a provenance-driven troubleshooting tool that supports ETL and matching computations and traces extraction of content within data objects. PROVision extends database-style provenance techniques to capture equivalences, support optimizations, and enable selective evaluation. We formalize our extensions, implement them in the PROVision system, and validate their effectiveness and scalability for common ETL and matching tasks.
  4. Abstract—Safety violations in programmable logic controllers (PLCs), caused either by faults or attacks, have recently garnered significant attention. However, prior efforts at PLC code vetting suffer from many drawbacks. Static analyses and verification cause significant false positives and cannot reveal specific runtime contexts. Dynamic analyses and symbolic execution, on the other hand, fail due to their inability to handle real-world PLC pro- grams that are event-driven and timing sensitive. In this paper, we propose VETPLC, a temporal context-aware, program analysis- based approach to produce timed event sequences that can be used for automatic safety vetting. To this end, we (a) perform static program analysis to create timed event causality graphs in order to understand causal relations among events in PLC code and (b) mine temporal invariants from data traces collected in Industrial Control System (ICS) testbeds to quantitatively gauge temporal dependencies that are constrained by machine operations. Our VETPLC prototype has been implemented in 15K lines of code. We evaluate it on 10 real-world scenarios from two different ICS settings. Our experiments show that VETPLC outperforms state-of-the-art techniques and can generate event sequences that can be used to automatically detect hidden safety violations.
  5. Safety violations in programmable logic controllers (PLCs), caused either by faults or attacks, have recently garnered significant attention. However, prior efforts at PLC code vetting suffer from many drawbacks. Static analyses and verification cause significant false positives and cannot reveal specific runtime contexts. Dynamic analyses and symbolic execution, on the other hand, fail due to their inability to handle real-world PLC programs that are event-driven and timing sensitive. In this paper, we propose VetPLC, a temporal context-aware, program analysis-based approach to produce timed event sequences that can be used for automatic safety vetting. To this end, we (a) perform static program analysis to create timed event causality graphs in order to understand causal relations among events in PLC code and (b) mine temporal invariants from data traces collected in Industrial Control System (ICS) testbeds to quantitatively gauge temporal dependencies that are constrained by machine operations. Our VetPLC prototype has been implemented in 15K lines of code. We evaluate it on 10 real-world scenarios from two different ICS settings. Our experiments show that VetPLC outperforms state-of-the-art techniques and can generate event sequences that can be used to automatically detect hidden safety violations.