

Title: Fine-Grained Provenance for Matching & ETL
Data provenance tools capture the steps used to produce analyses. However, scientists must choose among workflow provenance systems, which allow arbitrary code but only track provenance at the granularity of files; provenance APIs, which provide tuple-level provenance but incur overhead in all computations; and database provenance tools, which track tuple-level provenance through relational operators and support optimization, but cover only a limited subset of data science tasks. None of these solutions is well suited for tracing errors introduced during common ETL, record alignment, and matching tasks over data types such as strings and images. Scientists need new capabilities to identify the sources of errors, determine why different code versions produce different results, and identify which parameter values affect the output. We propose PROVision, a provenance-driven troubleshooting tool that supports ETL and matching computations and traces the extraction of content within data objects. PROVision extends database-style provenance techniques to capture equivalences, support optimizations, and enable selective evaluation. We formalize our extensions, implement them in the PROVision system, and validate their effectiveness and scalability for common ETL and matching tasks.
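To make the tuple-level provenance idea concrete, here is a minimal, hypothetical Python sketch (not PROVision's actual API or formalism): a toy string-matching step annotates each output pair with the input record identifiers, the similarity score, and the threshold that produced it, so a suspicious match can be traced back to its sources and the parameter values responsible.

```python
# Hypothetical sketch (not PROVision's actual API): propagate tuple-level
# provenance through a simple string-matching step so each output can be
# traced back to the input records and parameters that produced it.
from difflib import SequenceMatcher

def similarity(a, b):
    """Plain string similarity in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()

def match_with_provenance(left, right, threshold=0.8):
    """left/right: lists of (record_id, value). Returns provenance-annotated matches."""
    matches = []
    for lid, lval in left:
        for rid, rval in right:
            score = similarity(lval, rval)
            if score >= threshold:
                matches.append({
                    "pair": (lval, rval),
                    # provenance annotation: which inputs and which parameter
                    # values explain this output tuple
                    "provenance": {
                        "sources": [("left", lid), ("right", rid)],
                        "score": round(score, 3),
                        "threshold": threshold,
                    },
                })
    return matches

if __name__ == "__main__":
    left = [("L1", "Jane Smith"), ("L2", "J. Doe")]
    right = [("R1", "Jane Smyth"), ("R2", "John Doe")]
    for m in match_with_provenance(left, right, threshold=0.7):
        print(m["pair"], "<-", m["provenance"])
```

PROVision's contribution, per the abstract, is to capture this kind of annotation through database-style operators with equivalences, optimization, and selective evaluation rather than by hand as above.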
Award ID(s): 1640813
NSF-PAR ID: 10111015
Journal Name: 2019 IEEE 35th International Conference on Data Engineering (ICDE)
Page Range / eLocation ID: 184 to 195
Sponsoring Org: National Science Foundation
More Like this
  1. Sudeepa Roy and Jun Yang (Ed.)
    Data scientists use a wide variety of systems with different user interfaces, such as spreadsheets and notebooks, for their data exploration, discovery, preprocessing, and analysis tasks. While this wide selection of tools offers data scientists the freedom to pick the right tool for each task, each of these tools has limitations (e.g., the lack of reproducibility of notebooks), data needs to be translated between tool-specific formats, and common functionality such as versioning, provenance, and dealing with data errors often has to be implemented anew for each system. We argue that rather than alternating between task-specific tools, a superior approach is to build multiple user interfaces on top of a single incremental workflow/dataflow platform with built-in support for versioning, provenance, error tracking, and data cleaning. We discuss Vizier, a notebook system that implements this approach, introduce the challenges that arose in building such a system, and highlight how our work on Vizier led to novel research in uncertain data management and incremental execution of workflows.
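    As a rough illustration of the incremental workflow/dataflow idea described above, here is a hypothetical sketch (not Vizier's actual engine): cells declare their inputs, results are cached, and editing one cell invalidates only its downstream dependents.

```python
# Hypothetical sketch of an incremental dataflow (not Vizier's engine):
# cells declare their inputs, results are cached, and editing one cell
# invalidates only its downstream dependents.
class Dataflow:
    def __init__(self):
        self.cells = {}     # name -> (func, list of input cell names)
        self.cache = {}     # name -> last computed value
        self.dirty = set()  # names whose cached value is stale

    def add_cell(self, name, func, inputs=()):
        self.cells[name] = (func, list(inputs))
        self._invalidate(name)

    def _invalidate(self, name):
        self.dirty.add(name)
        # transitively mark every cell that (directly or indirectly) reads `name`
        for other, (_, inputs) in self.cells.items():
            if name in inputs and other not in self.dirty:
                self._invalidate(other)

    def get(self, name):
        func, inputs = self.cells[name]
        if name in self.dirty or name not in self.cache:
            args = [self.get(i) for i in inputs]  # the inputs double as provenance
            self.cache[name] = func(*args)
            self.dirty.discard(name)
        return self.cache[name]

if __name__ == "__main__":
    df = Dataflow()
    df.add_cell("raw", lambda: [1, 2, 3, None, 5])
    df.add_cell("clean", lambda xs: [x for x in xs if x is not None], ["raw"])
    df.add_cell("total", lambda xs: sum(xs), ["clean"])
    print(df.get("total"))                 # 11: computes raw -> clean -> total
    df.add_cell("raw", lambda: [10, 20])   # edit: only downstream cells recompute
    print(df.get("total"))                 # 30
```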
    Many research codes assume a user’s proficiency with high-performance computing tools, which often hinders their adoption by a community of users. Our goal is to create a user-friendly gateway that allows such users to leverage the new capabilities brought to the fracture mechanics community by the phase-field approach to fracture, implemented in the open-source code vDef. We leveraged popular existing tools for building such frameworks: Agave, Django, and Docker, to build a Science Gateway that allows a user to submit a large number of jobs at once. We use the Agave framework to run jobs and handle all communications with the high-performance computers, as well as data sharing and tracking of provenance. Django was used to create a web application, and Docker provided an easily deployable image of the system, simplifying setup by the user. The result is a system that masks all interactions with the high-performance computing environment and provides a graphical interface that makes sense to scientists. In the common situation of parameter sweeps, our gateway also helps scientists compare the outputs of various computations using a matrix view that links to individual computations.
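    The gateway's internals are not given above, so the following is only a hypothetical sketch of the parameter-sweep pattern it describes: one job per parameter combination, collected into a matrix keyed by the parameter values. The endpoint URL, payload fields, and response format are invented placeholders, not the actual Agave or vDef interface.

```python
# Hypothetical sketch of submitting a parameter sweep through a gateway's
# REST API. The endpoint URL, payload fields, and response format are
# invented for illustration; they are not the actual Agave or vDef interface.
import itertools
import requests  # third-party: pip install requests

GATEWAY_URL = "https://gateway.example.org/api/jobs"  # placeholder

def sweep(toughness_values, mesh_sizes, auth_token):
    """Submit one job per parameter combination; return job ids keyed by parameters."""
    job_matrix = {}
    for gc, h in itertools.product(toughness_values, mesh_sizes):
        payload = {
            "app": "vdef",  # assumed application name on the gateway
            "parameters": {"Gc": gc, "mesh_size": h},
        }
        resp = requests.post(
            GATEWAY_URL,
            json=payload,
            headers={"Authorization": f"Bearer {auth_token}"},
            timeout=30,
        )
        resp.raise_for_status()
        job_matrix[(gc, h)] = resp.json().get("id")  # assumed response field
    return job_matrix

if __name__ == "__main__":
    ids = sweep([0.5, 1.0, 2.0], [0.01, 0.005], auth_token="TOKEN")
    # a matrix view like the one the gateway provides: one cell per run
    for (gc, h), job_id in sorted(ids.items()):
        print(f"Gc={gc:<4} h={h:<6} -> job {job_id}")
```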
    The database community has largely focused on providing improved transaction management and query capabilities over records (and generalizations thereof). Yet such capabilities address only a small part of today’s data science tasks, which are often much more focused on discovery, linking, comparative analysis, and collaboration across holistic datasets and data products. Data scientists frequently point to a strong need for data management with respect to their many datasets and data products. We propose the development of a dataset relationship management system (DRMS) to support five main classes of operations on datasets: reuse of schema, data, curation, and work across many datasets; revelation of provenance, context, and assumptions; rapid revision of data and processing steps; system-assisted retargeting of computation to alternative execution environments; and metrics to reward individuals’ contributions to the broader data ecosystem. We argue that the recent adoption of computational notebooks (particularly JupyterLab and Jupyter Notebook), as a unified interface over data tools, provides an ideal way of gathering detailed information about how data is being used, i.e., of transparently capturing dataset provenance and relationships, and thus such notebooks provide an attractive mechanism for integrating dataset relationship management into the data science ecosystem. We briefly outline our experiences in building towards JuNEAU, the first prototype DRMS.
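    As a hypothetical sketch of the dataset-relationship idea above (not JuNEAU's implementation), the snippet below records simple derived-from edges as an analysis runs, the kind of information a notebook front end could capture transparently, and answers lineage questions by walking those edges.

```python
# Hypothetical sketch (not JuNEAU's implementation): record dataset
# relationships as "derived-from" edges while an analysis runs, so later
# questions like "what went into report_2024?" can be answered.
from collections import defaultdict

class DatasetRegistry:
    def __init__(self):
        self.derived_from = defaultdict(set)  # dataset -> its direct sources

    def record(self, output, inputs, step=""):
        """Register that `output` was produced from `inputs` by `step`."""
        for src in inputs:
            self.derived_from[output].add((src, step))

    def lineage(self, dataset):
        """All upstream datasets, found by walking derived-from edges."""
        seen, stack = set(), [dataset]
        while stack:
            for src, _ in self.derived_from[stack.pop()]:
                if src not in seen:
                    seen.add(src)
                    stack.append(src)
        return seen

if __name__ == "__main__":
    reg = DatasetRegistry()
    reg.record("sales_clean", ["raw_sales"], step="drop nulls")
    reg.record("report_2024", ["sales_clean", "regions"], step="join + aggregate")
    print(reg.lineage("report_2024"))  # {'sales_clean', 'raw_sales', 'regions'}
```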
  4.
    Deep learning now offers state-of-the-art accuracy for many prediction tasks. Deep convolutional neural networks (CNNs), a form of deep learning, are especially popular on image, video, and time series data. Due to its high computational cost, CNN inference is often a bottleneck in analytics tasks on such data. Thus, much work in the computer architecture, systems, and compilers communities studies how to make CNN inference faster. In this work, we show that by elevating the abstraction level and re-imagining CNN inference as queries, we can bring to bear database-style query optimization techniques to improve CNN inference efficiency. We focus on tasks that perform CNN inference repeatedly on inputs that are only slightly different. We identify two popular CNN tasks with this behavior: occlusion-based explanations (OBE) and object recognition in videos (ORV). OBE is a popular method for “explaining” CNN predictions. It outputs a heatmap over the input to show which regions (e.g., image pixels) mattered most for a given prediction. It leads to many re-inference requests on locally modified inputs. ORV uses CNNs to identify and track objects across video frames. It also leads to many re-inference requests. We cast such tasks in a unified manner as a novel instance of the incremental view maintenance problem and create a comprehensive algebraic framework for incremental CNN inference that reduces computational costs. We produce materialized views of features produced inside a CNN and connect them with a novel multi-query optimization scheme for CNN re-inference. Finally, we also devise novel OBE-specific and ORV-specific approximate inference optimizations exploiting their semantics. We prototype our ideas in Python to create a tool called Krypton that supports both CPUs and GPUs. Experiments with real data and CNNs show that Krypton reduces runtimes by up to 5× (respectively, 35×) to produce exact (respectively, high-quality approximate) results without raising resource requirements.
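    The following is a minimal NumPy sketch of the naive occlusion-based explanation (OBE) workload described above; the classifier is a stand-in placeholder. It makes the many re-inference requests explicit, one full model call per occluded patch, which is exactly the cost Krypton's incremental view maintenance and multi-query optimization are designed to reduce (the sketch itself does not implement those optimizations).

```python
# Minimal sketch of naive occlusion-based explanation (OBE) with NumPy.
# `predict` is a placeholder for any image classifier returning a probability
# for the target class; the sketch shows the repeated, slightly different
# inputs that Krypton optimizes, not Krypton's optimizations themselves.
import numpy as np

def predict(image):
    """Placeholder model (invented here) so the sketch runs end to end."""
    return float(image.mean())

def occlusion_heatmap(image, patch=8, stride=8, fill=0.0):
    h, w = image.shape[:2]
    base = predict(image)  # prediction on the unmodified input
    heat = np.zeros(((h - patch) // stride + 1, (w - patch) // stride + 1))
    for i, y in enumerate(range(0, h - patch + 1, stride)):
        for j, x in enumerate(range(0, w - patch + 1, stride)):
            occluded = image.copy()
            occluded[y:y + patch, x:x + patch] = fill  # blank out one patch
            # one full re-inference per patch position: the cost OBE incurs
            heat[i, j] = base - predict(occluded)
    return heat

if __name__ == "__main__":
    img = np.random.rand(64, 64)
    hm = occlusion_heatmap(img)
    print(hm.shape, "re-inference calls:", hm.size)
```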
    As data-driven methods are becoming pervasive in a wide variety of disciplines, there is an urgent need to develop scalable and sustainable tools to simplify the process of data science, to make it easier to keep track of the analyses being performed and datasets being generated, and to enable introspection of the workflows. In this paper, we describe our vision of a unified provenance and metadata management system to support lifecycle management of complex collaborative data science workflows. We argue that a large amount of information about the analysis processes and data artifacts can, and should, be captured in a semi-passive manner; and we show that querying and analyzing this information can not only simplify bookkeeping and debugging tasks for data analysts but can also enable a rich new set of capabilities like identifying flaws in the data science process itself. It can also significantly reduce the time spent fixing post-deployment problems through automated analysis and monitoring. We have implemented an initial prototype of our system, called ProvDB, on top of git (a version control system) and Neo4j (a graph database), and we describe its key features and capabilities.
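    As a hypothetical sketch of the capture step (not ProvDB's actual schema or ingestion pipeline), the snippet below records one analysis step as a small provenance subgraph in Neo4j using the official Python driver; the node labels and properties are invented for illustration.

```python
# Hypothetical sketch of storing one analysis step as provenance in Neo4j
# using the official `neo4j` Python driver (pip install neo4j). The node
# labels and properties are invented; they are not ProvDB's actual schema.
from datetime import datetime, timezone
from neo4j import GraphDatabase

CYPHER = """
MERGE (i:Artifact {path: $input})
MERGE (o:Artifact {path: $output})
CREATE (s:Step {command: $command, at: $at})
CREATE (i)-[:INPUT_TO]->(s)-[:PRODUCED]->(o)
"""

def record_step(driver, command, input_path, output_path):
    """Write an (input)-[:INPUT_TO]->(step)-[:PRODUCED]->(output) chain."""
    with driver.session() as session:
        session.run(
            CYPHER,
            command=command,
            input=input_path,
            output=output_path,
            at=datetime.now(timezone.utc).isoformat(),
        )

if __name__ == "__main__":
    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
    record_step(driver, "python clean.py raw.csv clean.csv", "raw.csv", "clean.csv")
    driver.close()
```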