

Title: Facilitating Dependency Exploration in Computational Notebooks
Computational notebooks promote exploration by structuring code, output, and explanatory text into cells. The input code and rich outputs help users iteratively investigate ideas as they explore or analyze data. The links between these cells, that is, how the cells depend on each other, are important for understanding how analyses have been developed and how their results can be reproduced. Specifically, a code cell that uses a particular identifier depends on the cell where that identifier is defined or mutated. Because notebooks promote fluid editing, where cells can be moved and run in any order, cell dependencies are not always clear or easy to follow. We examine tools that seek to address this problem by extending Jupyter notebooks and evaluate how well they support users in accomplishing tasks that require understanding dependencies. We also evaluate visualization techniques that provide views of the dependencies to help users navigate them.
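The dependency rule the abstract states (a cell that uses an identifier depends on the cell that last defined or mutated it) can be sketched with Python's standard `ast` module. This is an illustrative sketch, not the paper's implementation; real tools must also handle attribute mutation, `%magic` lines, and deletions:

```python
import ast

def cell_names(source):
    """Return (defined, used) identifier sets for one cell's code."""
    defined, used = set(), set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name):
            # Store contexts define a name; Load contexts use one.
            (defined if isinstance(node.ctx, ast.Store) else used).add(node.id)
        elif isinstance(node, (ast.FunctionDef, ast.ClassDef)):
            defined.add(node.name)
        elif isinstance(node, (ast.Import, ast.ImportFrom)):
            for alias in node.names:
                defined.add(alias.asname or alias.name.split(".")[0])
    return defined, used

def cell_dependencies(cells):
    """Map each cell index to the earlier cells it depends on,
    assuming cells were run once, top to bottom."""
    deps, definers = {}, {}  # definers: identifier -> last defining cell
    for i, src in enumerate(cells):
        defined, used = cell_names(src)
        deps[i] = {definers[name] for name in used if name in definers}
        for name in defined:
            definers[name] = i
    return deps

cells = [
    "import math\nraw = [1, 4, 9]",
    "roots = [math.sqrt(x) for x in raw]",
    "total = sum(roots)",
]
print(cell_dependencies(cells))  # {0: set(), 1: {0}, 2: {1}}
```

The top-to-bottom assumption is exactly what out-of-order execution breaks, which is why the surveyed tools must track the actual execution history rather than the cell layout.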
Award ID(s):
2022443
PAR ID:
10482744
Author(s) / Creator(s):
; ;
Publisher / Repository:
ACM
Date Published:
Journal Name:
HILDA '23: Proceedings of the Workshop on Human-In-the-Loop Data Analytics
Subject(s) / Keyword(s):
Computational notebooks, output-driven analysis, cell dependencies, notebook visualization, reproducibility
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Computational notebooks have gained much popularity as a way of documenting research processes; they allow users to express a research narrative by integrating ideas expressed as text, process expressed as code, and results in one executable document. However, the environments in which the code can run are currently limited, often containing only a fraction of the resources of one node, posing a barrier to many computations. In this paper, we make the case that integrating complex experimental environments, such as virtual clusters or complex networking environments that can be provisioned via infrastructure clouds, into computational notebooks will significantly broaden their reach and at the same time help realize the potential of clouds as a platform for repeatable research. To support our argument, we describe the integration of Jupyter notebooks into the Chameleon cloud testbed, which allows the user to define complex experimental environments and then assign processes to elements of this environment similarly to the way a laptop user may switch between different desktops. We evaluate our approach on an actual experiment from both the development and replication perspectives.
  2. Data scientists have embraced computational notebooks to author analysis code and accompanying visualizations within a single document. Currently, although these media may be interleaved, they remain siloed: interactive visualizations must be manually specified as they are divorced from the analysis provenance expressed via dataframes, while code cells have no access to users' interactions with visualizations, and hence no way to operate on the results of interaction. To bridge this divide, we present B2, a set of techniques grounded in treating data queries as a shared representation between the code and interactive visualizations. B2 instruments dataframes to track the queries expressed in code and synthesize corresponding visualizations. These visualizations are displayed in a dashboard to facilitate interactive analysis. When an interaction occurs, B2 reifies it as a data query and generates a history log in a new code cell. Subsequent cells can use this log to further analyze interaction results and, when marked as reactive, to ensure that code is automatically recomputed when new interaction occurs. In an evaluative study with data scientists, we find that B2 promotes a tighter feedback loop between coding and interacting with visualizations. All participants frequently moved from code to visualization and vice versa, which facilitated their exploratory data analysis in the notebook.
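The core idea of a query as a shared representation can be illustrated in a few lines. This is a hedged sketch, not B2's actual API: both a code-side filter and a visualization-side brush reduce to the same query record, which can then be emitted back as code for a new cell:

```python
# Hypothetical query record shared by code and visualization (illustrative only).
def query_from_code(column, value):
    return {"column": column, "op": "==", "value": value}

def query_from_brush(brush):
    # A brush over a categorical axis selects one category.
    return {"column": brush["axis"], "op": "==", "value": brush["category"]}

def to_code(q):
    """Emit the query as a line of code suitable for a generated history cell."""
    return f"df[df[{q['column']!r}] {q['op']} {q['value']!r}]"

q1 = query_from_code("species", "setosa")
q2 = query_from_brush({"axis": "species", "category": "setosa"})
assert q1 == q2  # same representation from either direction
print(to_code(q1))  # df[df['species'] == 'setosa']
```

Because both directions produce the same record, code cells can operate on interaction results and visualizations can reflect code-side filters without either side knowing about the other.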
  3. Computational notebooks, such as Jupyter, support rich data visualization. However, even when visualizations in notebooks are interactive, they are a dead end: interactive data manipulations, such as selections, applying labels, filters, categorizations, or fixes to column or cell values, could be efficiently applied in interactive visual components, but interactive components typically cannot manipulate Python data structures. Furthermore, actions performed in interactive plots are lost as soon as the cell is re-run, prohibiting reusability and reproducibility. To remedy this problem, we introduce Persist, a family of techniques to (a) capture interaction provenance, enabling the persistence of interactions, and (b) map interactions to data manipulations that can be applied to dataframes. We implement our approach as a JupyterLab extension that supports tracking interactions in Vega-Altair plots and in a data table view. Persist can re-execute interaction provenance when a notebook or a cell is re-executed, enabling reproducibility and re-use. We evaluate Persist in a user study targeting data manipulations with 11 participants skilled in Python and Pandas, comparing it to traditional code-based approaches. Participants were consistently faster and were able to correctly complete more tasks with Persist.
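Mapping an interaction to a replayable data manipulation, the heart of the approach described above, can be sketched generically. This is an illustrative sketch, not Persist's implementation; the record names and structure are assumptions:

```python
rows = [
    {"species": "setosa", "petal": 1.4},
    {"species": "virginica", "petal": 5.1},
    {"species": "setosa", "petal": 1.3},
]

# A selection in a plot, captured as a provenance record (hypothetical format).
interaction = {"kind": "filter", "column": "species", "equals": "setosa"}

def apply_interaction(rows, rec):
    """Replay one captured interaction as a plain data manipulation."""
    if rec["kind"] == "filter":
        return [r for r in rows if r[rec["column"]] == rec["equals"]]
    raise ValueError(f"unsupported interaction kind: {rec['kind']}")

selected = apply_interaction(rows, interaction)
print(len(selected))  # 2
```

Because the interaction lives as data rather than as ephemeral widget state, re-running the cell can replay the provenance record and reproduce the same result.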
  4. Data science pipelines to train and evaluate models with machine learning may contain bugs just like any other code. Leakage between training and test data can lead to overestimating the model’s accuracy during offline evaluations, possibly leading to deployment of low-quality models in production. Such leakage can happen easily by mistake or by following poor practices, but may be tedious and challenging to detect manually. We develop a static analysis approach to detect common forms of data leakage in data science code. Our evaluation shows that our analysis accurately detects data leakage and that such leakage is pervasive among over 100,000 analyzed public notebooks. We discuss how our static analysis approach can help both practitioners and educators, and how leakage prevention can be designed into the development process. 
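A concrete instance of the leakage described above is preprocessing with statistics computed before the train/test split. This hedged sketch (not from the paper) shows the bug pattern such a static analysis could flag, using plain Python for clarity:

```python
def mean(xs):
    return sum(xs) / len(xs)

data = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]

# Leaky: center using the mean of ALL rows, then split. Information about
# the held-out rows has leaked into the training features.
m_all = mean(data)                      # 6.5, influenced by test rows
leaky_train = [x - m_all for x in data[:3]]

# Correct: split first, compute statistics on the training rows only,
# and apply the same transformation to the test rows.
train, test = data[:3], data[3:]
m_train = mean(train)                   # 2.0, training rows only
clean_train = [x - m_train for x in train]
clean_test = [x - m_train for x in test]

print(leaky_train)   # [-5.5, -4.5, -3.5]
print(clean_train)   # [-1.0, 0.0, 1.0]
```

The same ordering mistake applies to scalers, imputers, and feature selectors; a dataflow analysis can detect it by checking whether any statistic flows from the full dataset into features seen during training.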
  5. Jupyter Notebooks are an enormously popular tool for creating and narrating computational research projects. They also have enormous potential for creating reproducible scientific research artifacts. Capturing the complete state of a notebook has additional benefits; for instance, the notebook execution may be split between local and remote resources, where the latter may have more powerful processing capabilities or store large or access-limited data. There are several challenges for making notebooks fully reproducible when examined in detail. The notebook code must be replicated entirely, and the underlying Python runtime environments must be identical. More subtle problems arise in replicating referenced data, external library dependencies, and runtime variable states. This paper presents solutions to these problems using Jupyter's standard extension mechanisms to create an archivable system state for a running notebook. We show that these additional mechanisms, which involve interacting with the underlying Linux kernel, do not introduce substantial execution-time overhead, demonstrating the approach's feasibility.
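One of the subtler problems the abstract names, replicating runtime variable states, can be sketched with the stdlib `pickle` module. This is a simplified illustration, not the paper's mechanism; a real system must also handle unpicklable objects (open files, sockets) and the environment around the kernel process:

```python
import os
import pickle
import tempfile

def _picklable(obj):
    try:
        pickle.dumps(obj)
        return True
    except Exception:
        return False

def snapshot(namespace, path):
    """Persist every public, picklable variable in the namespace to disk."""
    saved = {k: v for k, v in namespace.items()
             if not k.startswith("_") and _picklable(v)}
    with open(path, "wb") as f:
        pickle.dump(saved, f)
    return sorted(saved)

def restore(path):
    """Load a saved variable snapshot back into a dict."""
    with open(path, "rb") as f:
        return pickle.load(f)

# A toy stand-in for a kernel's user namespace.
ns = {"raw": [1, 2, 3], "total": 6, "_internal": object()}
path = os.path.join(tempfile.mkdtemp(), "state.pkl")
print(snapshot(ns, path))       # ['raw', 'total']
print(restore(path)["total"])   # 6
```

Even this toy version shows why state capture is harder than copying the `.ipynb` file: the variables live in the kernel process, not in the document.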