Title: Data Debugging and Exploration with Vizier
We present Vizier, a multi-modal data exploration and debugging tool. The system supports a wide range of operations by seamlessly integrating Python, SQL, and automated data curation and debugging methods. Using Spark as an execution backend, Vizier handles large datasets in multiple formats. Ease of use is attained through integration of a notebook with a spreadsheet-style interface and with visualizations that guide and support the user in the loop. In addition, native support for provenance and versioning enables collaboration and uncertainty management. In this demonstration, we will illustrate the diverse features of the system using several realistic data science tasks based on real data.
Award ID(s):
1640864 1750460
PAR ID:
10129194
Journal Name:
SIGMOD
Page Range / eLocation ID:
1877 to 1880
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Notebook and spreadsheet systems are currently the de facto standard for data collection, preparation, and analysis. However, these systems have been criticized for their lack of reproducibility, versioning, and support for sharing. These shortcomings are particularly detrimental for data curation, where data scientists iteratively build workflows to clean up and integrate data as a prerequisite for analysis. We present Vizier, an open-source tool that helps analysts build and refine data pipelines. Vizier combines the flexibility of notebooks with the easy-to-use data manipulation interface of spreadsheets. Combined with advanced provenance tracking for both data and computational steps, this enables reproducibility, versioning, and streamlined data exploration. Unique to Vizier is that it exposes potential issues with data, whether they already exist in the input or are introduced by the operations of a notebook. We refer to such potential errors as data caveats. Caveats are propagated alongside data using principled techniques from uncertain data management. Vizier provides extensive user interface support for caveats, e.g., exposing them as summaries in a dedicated error view and highlighting cells with caveats in spreadsheets.
  2. Sudeepa Roy and Jun Yang (Ed.)
    Data scientists use a wide variety of systems with a wide variety of user interfaces, such as spreadsheets and notebooks, for their data exploration, discovery, preprocessing, and analysis tasks. While this wide selection of tools offers data scientists the freedom to pick the right tool for each task, each of these tools has limitations (e.g., the lack of reproducibility of notebooks), data needs to be translated between tool-specific formats, and common functionality such as versioning, provenance, and dealing with data errors often has to be implemented for each system. We argue that rather than alternating between task-specific tools, a superior approach is to build multiple user interfaces on top of a single incremental workflow / dataflow platform with built-in support for versioning, provenance, error tracking, and data cleaning. We discuss Vizier, a notebook system that implements this approach, introduce the challenges that arose in building such a system, and highlight how our work on Vizier led to novel research in uncertain data management and incremental execution of workflows.
  3. Dataflow systems have an increasing need to support a wide range of tasks in data-centric applications using the latest techniques such as machine learning. These tasks often involve custom functions with complex internal states. Consequently, users need enhanced debugging support to understand runtime behaviors and investigate internal states of dataflows. Traditional forward debuggers allow users to follow the chronological order of operations in an execution. Therefore, a user cannot easily identify a past runtime behavior after an unexpected result is produced. In this paper, we present a novel time-travel debugging paradigm called IcedTea, which supports reverse debugging. In particular, in a dataflow's execution, which is inherently distributed across multiple operators, the user can periodically interact with the job and retrieve the global states of the operators. After the execution, the system allows the user to roll back the dataflow state to any past interaction. The user can use step instructions to repeat the past execution to understand how data was processed in the original execution. We give a full specification of this powerful paradigm, study how to reduce its runtime overhead, and develop techniques to support debugging instructions responsively. Our experiments on real-world datasets and workflows show that IcedTea can support responsive time-travel debugging with low time and space overhead.
  4. Objectives. Physical computing systems are increasingly being integrated into secondary school science and STEM instruction, yet little is known about how teachers, especially those with little background and experience in computing, help students during the inevitable debugging moments that arise. In this article, we describe a framework, comprising two dimensions, for characterizing how teachers support students as they debug a physical computing system called the Data Sensor Hub (DASH). The DASH enables students to program sensors to measure, analyze, and visualize data as they engage in science inquiry activities. Participants. Five secondary school teachers implemented an inquiry-oriented instructional unit designed to introduce students to working with the DASH as a tool for scientific inquiry. Study Method. Findings drew on video analysis of the teachers’ classroom implementations of the unit. A review of the data corpus led to the selection of 23 moments where the teachers supported an individual or small groups of students engaged in debugging. These moments were analyzed using a grounded perspective based on Interaction Analysis to characterize the teachers’ varied interactional approaches. Findings. Our analysis revealed how teachers’ moves during debugging moments fell along two dimensions. The first dimension characterizes teachers’ positioning during the debugging interactions, ranging from a positioning for teacher understanding to a positioning for student understanding of the bug. The second dimension characterizes the inquiry orientation of the teachers’ questions and guidance, ranging from focusing on the debugging process to focusing on the product—or fixing the bug. Further, teachers’ moves often fell along different points on these dimensions given nuances in the instructional context. Conclusions. The framework offers a first step toward characterizing teachers’ debugging pedagogy as they support students during debugging moments. 
    It also calls attention to how teachers do not necessarily need to be programming experts to effectively help students learn independent and generalizable debugging strategies. Further, it illustrates the variety of expertise that teachers can bring to debugging moments to support students learning to debug. Finally, the framework provides implications for the design of professional learning and supports for teachers as they increasingly are asked to support students in computing—and debugging—activities across a range of disciplines.
  5. VizieR online Data Catalogue associated with article published in journal Astrophysical Journal with title 'Anomalous orbital characteristics of the AQ Col (EC 05217-3914) system.' (bibcode: 2022ApJ...926...17O) 