skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Compact, tamper-resistant archival of fine-grained provenance
Data provenance tools aim to facilitate reproducible data science and auditable data analyses, by tracking the processes and inputs responsible for each result of an analysis. Fine-grained provenance further enables sophisticated reasoning about why individual output results appear or fail to appear. However, for reproducibility and auditing, we need a provenance archival system that is tamper-resistant , and efficiently stores provenance for computations computed over time (i.e., it compresses repeated results). We study this problem, developing solutions for storing fine-grained provenance in relational storage systems while both compressing and protecting it via cryptographic hashes. We experimentally validate our proposed solutions using both scientific and OLAP workloads.  more » « less
Award ID(s):
1640813 1910108 1547360 1763514
PAR ID:
10290599
Author(s) / Creator(s):
;
Date Published:
Journal Name:
Proceedings of the VLDB Endowment
Volume:
14
Issue:
4
ISSN:
2150-8097
Page Range / eLocation ID:
485 to 497
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. null (Ed.)
    Debugging big data analytics often requires a root cause analysis to pinpoint the precise culprit records in an input dataset responsible for incorrect or anomalous output. Existing debugging or data provenance approaches do not track fine-grained control and data flows in user-defined application code; thus, the returned culprit data is often too large for manual inspection and expensive post-mortem analysis is required. We design FlowDebug to identify a highly precise set of input records based on two key insights. First, FlowDebug precisely tracks control and data flow within user-defined functions to propagate taints at a fine-grained level by inserting custom data abstractions through automated source to source transformation. Second, it introduces a novel notion of influence-based provenance for many-to-one dependencies to prioritize which input records are more responsible than others by analyzing the semantics of a user-defined function used for aggregation. By design, our approach does not require any modification to the framework's runtime and can be applied to existing applications easily. FlowDebug significantly improves the precision of debugging results by up to 99.9 percentage points and avoids repetitive re-runs required for post-mortem analysis by a factor of 33 while incurring an instrumentation overhead of 0.4X - 6.1X on vanilla Spark. 
    more » « less
  2. Systematically reasoning about the fine-grained causes of events in a real-world distributed system is challenging. Causality, from the distributed systems literature, can be used to compute the causal history of an arbitrary event in a distributed system, but the event's causal history is an over-approximation of the true causes. Data provenance, from the database literature, precisely describes why a particular tuple appears in the output of a relational query, but data provenance is limited to the domain of static relational databases. In this paper, we present wat-provenance: a novel form of provenance that provides the benefits of causality and data provenance. Given an arbitrary state machine, wat-provenance describes why the state machine produces a particular output when given a particular input. This enables system developers to reason about the causes of events in real-world distributed systems. We observe that automatically extracting the wat-provenance of a state machine is often infeasible. Fortunately, many distributed systems components have simple interfaces from which a developer can directly specify wat-provenance using a technique we call wat-provenance specifications. Leveraging the theoretical foundations of wat-provenance, we implement a prototype distributed debugging framework called Watermelon. 
    more » « less
  3. Humans convey their intentions through the usage of both verbal and nonverbal behaviors during face-to-face communication. Speaker intentions often vary dynamically depending on different nonverbal contexts, such as vocal patterns and facial expressions. As a result, when modeling human language, it is essential to not only consider the literal meaning of the words but also the nonverbal contexts in which these words appear. To better model human language, we first model expressive nonverbal representations by analyzing the fine-grained visual and acoustic patterns that occur during word segments. In addition, we seek to capture the dynamic nature of nonverbal intents by shifting word representations based on the accompanying nonverbal behaviors. To this end, we propose the Recurrent Attended Variation Embedding Network (RAVEN) that models the fine-grained structure of nonverbal subword sequences and dynamically shifts word representations based on nonverbal cues. Our proposed model achieves competitive performance on two publicly available datasets for multimodal sentiment analysis and emotion recognition. We also visualize the shifted word representations in different nonverbal contexts and summarize common patterns regarding multimodal variations of word representations. 
    more » « less
  4. Fine-grained urban flow inference (FUFI), which involves inferring fine-grained flow maps from their coarse-grained counterparts, is of tremendous interest in the realm of sustainable urban traffic services. To address the FUFI, existing solutions mainly concentrate on investigating spatial dependencies, introducing external factors, reducing excessive memory costs, etc., -- while rarely considering the catastrophic forgetting (CF) problem. Motivated by recent operator learning, we present an Urban Neural Operator solution with Incremental learning (UNOI), primarily seeking to learn grained-invariant solutions for FUFI in addition to addressing CF. Specifically, we devise an urban neural operator (UNO) in UNOI that learns mappings between approximation spaces by treating the different-grained flows as continuous functions, allowing a more flexible capture of spatial correlations. Furthermore, the phenomenon of CF behind time-related flows could hinder the capture of flow dynamics. Thus, UNOI mitigates CF concerns as well as privacy issues by placing UNO blocks in two incremental settings, i.e., flow-related and task-related. Experimental results on large-scale real-world datasets demonstrate the superiority of our proposed solution against the baselines. 
    more » « less
  5. In safety-critical environments, robots need to reliably recognize human activity to be effective and trust-worthy partners. Since most human activity recognition (HAR) approaches rely on unimodal sensor data (e.g. motion capture or wearable sensors), it is unclear how the relationship between the sensor modality and motion granularity (e.g. gross or fine) of the activities impacts classification accuracy. To our knowledge, we are the first to investigate the efficacy of using motion capture as compared to wearable sensor data for recognizing human motion in manufacturing settings. We introduce the UCSD-MIT Human Motion dataset, composed of two assembly tasks that entail either gross or fine-grained motion. For both tasks, we compared the accuracy of a Vicon motion capture system to a Myo armband using three widely used HAR algorithms. We found that motion capture yielded higher accuracy than the wearable sensor for gross motion recognition (up to 36.95%), while the wearable sensor yielded higher accuracy for fine-grained motion (up to 28.06%). These results suggest that these sensor modalities are complementary, and that robots may benefit from systems that utilize multiple modalities to simultaneously, but independently, detect gross and fine-grained motion. Our findings will help guide researchers in numerous fields of robotics including learning from demonstration and grasping to effectively choose sensor modalities that are most suitable for their applications. 
    more » « less