skip to main content


Title: CrowdTrace: Visualizing Provenance in Distributed Sensemaking
Capturing analytic provenance is important for refining sensemaking analysis. However, understanding this provenance can be difficult. First, making sense of the reasoning in intermediate steps is time-consuming. Especially in distributed sensemaking, the provenance is less cohesive because each analyst only sees a small portion of the data without an understanding of the overall collaboration workflow. Second, analysis errors from one step can propagate to later steps. Furthermore, in exploratory sensemaking, it is difficult to define what an error is since there are no correct answers to reference. In this paper, we explore provenance analysis for distributed sensemaking in the context of crowdsourcing, where distributed analysis contributions are captured in microtasks. We propose crowd auditing as a way to help individual analysts visualize and trace provenance to debug distributed sensemaking. To evaluate this concept, we implemented a crowd auditing tool, CrowdTrace. Our user study-based evaluation demonstrates that CrowdTrace offers an effective mechanism to audit and refine multi-step crowd sensemaking  more » « less
Award ID(s):
1651969 1527453
NSF-PAR ID:
10208158
Author(s) / Creator(s):
; ; ;
Date Published:
Journal Name:
Proceedings of IEEE VIS 2020
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Investigating the nature of system intrusions in large distributed systems remains a notoriously difficult challenge. While monitoring tools (e.g., Firewalls, IDS) provide preliminary alerts through easy-to-use administrative interfaces, attack reconstruction still requires that administrators sift through gigabytes of system audit logs stored locally on hundreds of machines. At present, two fundamental obstacles prevent synergy between system-layer auditing and modern cluster monitoring tools: 1) the sheer volume of audit data generated in a data center is prohibitively costly to transmit to a central node, and 2) system- layer auditing poses a “needle-in-a-haystack” problem, such that hundreds of employee hours may be required to diagnose a single intrusion. This paper presents Winnower, a scalable system for audit-based cluster monitoring that addresses these challenges. Our key insight is that, for tasks that are replicated across nodes in a distributed application, a model can be defined over audit logs to succinctly summarize the behavior of many nodes, thus eliminating the need to transmit redundant audit records to a central monitoring node. Specifically, Winnower parses audit records into provenance graphs that describe the actions of individual nodes, then performs grammatical inference over individual graphs using a novel adaptation of Deterministic Finite Automata (DFA) Learning to produce a behavioral model of many nodes at once. This provenance model can be efficiently transmitted to a central node and used to identify anomalous events in the cluster. We have implemented Winnower for Docker Swarm container clusters and evaluate our system against real-world applications and attacks. We show that Winnower dramatically reduces storage and network overhead associated with aggregating system audit logs, by as much as 98%, without sacrificing the important information needed for attack investigation. Winnower thus represents a significant step forward for security monitoring in distributed systems. 
    more » « less
  2. Auditing, a central pillar of operating system security, has only recently come into its own as an active area of public research. This resurgent interest is due in large part to the notion of data provenance, a technique that iteratively parses audit log entries into a dependency graph that explains the history of system execution. Provenance facilitates precise threat detection and investigation through causal analysis of sophisticated intrusion behaviors. However, the absence of a foundational audit literature, combined with the rapid publication of recent findings, makes it difficult to gain a holistic picture of advancements and open challenges in the area.In this work, we survey and categorize the provenance-based system auditing literature, distilling contributions into a layered taxonomy based on the audit log capture and analysis pipeline. Recognizing that the Reduction Layer remains a key obstacle to the further proliferation of causal analysis technologies, we delve further on this issue by conducting an ambitious independent evaluation of 8 exemplar reduction techniques against the recently-released DARPA Transparent Computing datasets. Our experiments uncover that past approaches frequently prune an overlapping set of activities from audit logs, reducing the synergistic benefits from applying them in tandem; further, we observe an inverse relation between storage efficiency and anomaly detection performance. However, we also observe that log reduction techniques are able to synergize effectively with data compression, potentially reducing log retention costs by multiple orders of magnitude. We conclude by discussing promising future directions for the field. 
    more » « less
  3. Recent work in HCI suggests that users can be powerful in surfacing harmful algorithmic behaviors that formal auditing approaches fail to detect. However, it is not well understood how users are often able to be so effective, nor how we might support more effective user-driven auditing. To investigate, we conducted a series of think-aloud interviews, diary studies, and workshops, exploring how users find and make sense of harmful behaviors in algorithmic systems, both individually and collectively. Based on our findings, we present a process model capturing the dynamics of and influences on users’ search and sensemaking behaviors. We find that 1) users’ search strategies and interpretations are heavily guided by their personal experiences with and exposures to societal bias; and 2) collective sensemaking amongst multiple users is invaluable in user-driven algorithm audits. We offer directions for the design of future methods and tools that can better support user-driven auditing. 
    more » « less
  4. null (Ed.)
    Data provenance tools aim to facilitate reproducible data science and auditable data analyses, by tracking the processes and inputs responsible for each result of an analysis. Fine-grained provenance further enables sophisticated reasoning about why individual output results appear or fail to appear. However, for reproducibility and auditing, we need a provenance archival system that is tamper-resistant , and efficiently stores provenance for computations computed over time (i.e., it compresses repeated results). We study this problem, developing solutions for storing fine-grained provenance in relational storage systems while both compressing and protecting it via cryptographic hashes. We experimentally validate our proposed solutions using both scientific and OLAP workloads. 
    more » « less
  5. Explaining why an answer is (or is not) returned by a query is important for many applications including auditing, debugging data and queries, and answering hypothetical questions about data. In this work, we present the first practical approach for answering such questions for queries with negation (first-order queries). Specifically, we introduce a graph-based provenance model that, while syntactic in nature, supports reverse reasoning and is proven to encode a wide range of provenance models from the literature. The implementation of this model in our PUG (Provenance Unification through Graphs) system takes a provenance question and Datalog query as an input and generates a Datalog program that computes an explanation, i.e., the part of the provenance that is relevant to answer the question. Furthermore, we demonstrate how a desirable factorization of provenance can be achieved by rewriting an input query. We experimentally evaluate our approach demonstrating its efficiency. 
    more » « less