Title: PRUNE: A preserving run environment for reproducible scientific computing
Computing as a whole suffers from a crisis of reproducibility. Programs executed in one context are astonishingly hard to reproduce in another context, resulting in wasted effort by people and general distrust of results produced by computer. The root of the problem lies in the fact that every program has implicit dependencies on data and execution environment which are rarely understood by the end user. To address this problem, we present PRUNE, the Preserving Run Environment. In PRUNE, every task to be executed is wrapped in a functional interface and coupled with a strictly defined environment. The task is then executed by PRUNE rather than the user to ensure reproducibility. As a scientific workflow evolves in PRUNE, a growing but immutable tree of derived data is created. The provenance of every item in the system can be precisely described, facilitating sharing and modification between collaborating researchers, along with efficient management of limited storage space. We present the user interface and the initial prototype of PRUNE, and demonstrate its application in matching records and comparing surnames in U.S. Censuses.
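The idea in the abstract (tasks wrapped in a functional interface, tied to a declared environment, producing an immutable tree of derived data with traceable provenance) can be illustrated with a minimal Python sketch. This is not PRUNE's actual interface: the `Store` class, its methods, and the environment strings are hypothetical names chosen for illustration, and the environment is only recorded as a label here rather than enforced.

```python
import hashlib
import json


class Store:
    """Hypothetical append-only store (not PRUNE's API); items are never overwritten."""

    def __init__(self):
        self._items = {}       # key -> bytes
        self._provenance = {}  # derived key -> record describing how it was made

    def put(self, data: bytes) -> str:
        """Add a raw input, named by the hash of its content."""
        key = hashlib.sha256(data).hexdigest()
        self._items.setdefault(key, data)
        return key

    def get(self, key: str) -> bytes:
        return self._items[key]

    def run(self, func, input_keys, environment: str) -> str:
        """Run func on stored inputs; the store, not the caller, executes the task.

        The derived item is named by the hash of (function name, input keys,
        environment label), so repeating a task reuses the stored result.
        """
        record = {
            "function": func.__name__,
            "inputs": list(input_keys),
            "environment": environment,  # assumption: a label only, not enforced
        }
        key = hashlib.sha256(
            json.dumps(record, sort_keys=True).encode()
        ).hexdigest()
        if key not in self._items:
            self._items[key] = func(*[self._items[k] for k in input_keys])
            self._provenance[key] = record
        return key

    def provenance(self, key: str) -> dict:
        """Describe an item by walking back through the tasks that produced it."""
        record = self._provenance.get(key)
        if record is None:
            return {"key": key, "source": "raw input"}
        return {
            "key": key,
            "function": record["function"],
            "environment": record["environment"],
            "inputs": [self.provenance(k) for k in record["inputs"]],
        }
```

In this sketch, changing the function, any input, or the environment label produces a new key, so earlier results are never modified and the set of derived items only grows.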
Award ID(s):
1642409
PAR ID:
10047189
Author(s) / Creator(s):
;
Date Published:
Journal Name:
IEEE Conference on e-Science
Page Range / eLocation ID:
61 to 70
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Computational notebooks, such as Jupyter, support rich data visualization. However, even when visualizations in notebooks are interactive, they are a dead end: Interactive data manipulations, such as selections, applying labels, filters, categorizations, or fixes to column or cell values, could be efficiently applied in interactive visual components, but interactive components typically cannot manipulate Python data structures. Furthermore, actions performed in interactive plots are lost as soon as the cell is re‐run, prohibiting reusability and reproducibility. To remedy this problem, we introduce Persist, a family of techniques to (a) capture interaction provenance, enabling the persistence of interactions, and (b) map interactions to data manipulations that can be applied to dataframes. We implement our approach as a JupyterLab extension that supports tracking interactions in Vega‐Altair plots and in a data table view. Persist can re‐execute interaction provenance when a notebook or a cell is re‐executed, enabling reproducibility and re‐use. We evaluate Persist in a user study targeting data manipulations with 11 participants skilled in Python and Pandas, comparing it to traditional code‐based approaches. Participants were consistently faster and were able to correctly complete more tasks with Persist. 
  2. Mobile technologies that drive just-in-time ecological momentary assessments and interventions provide an unprecedented view into user behaviors and opportunities to manage chronic conditions. The success of these methods relies on engaging the user at the appropriate moment, so as to maximize questionnaire and task completion rates. However, mobile operating systems provide little support to precisely specify the contextual conditions in which to notify and engage the user, and study designers often lack the expertise to build context-aware software themselves. To address this problem, we have developed Emu, a framework that eases the development of context-aware study applications by providing a concise and powerful interface for specifying temporal and contextual constraints for task notifications. In this paper we present the design of the Emu API and demonstrate its use in capturing a range of scenarios common to smartphone-based study applications.
  3. Notebook and spreadsheet systems are currently the de facto standard for data collection, preparation, and analysis. However, these systems have been criticized for their lack of reproducibility, versioning, and support for sharing. These shortcomings are particularly detrimental for data curation, where data scientists iteratively build workflows to clean up and integrate data as a prerequisite for analysis. We present Vizier, an open-source tool that helps analysts to build and refine data pipelines. Vizier combines the flexibility of notebooks with the easy-to-use data manipulation interface of spreadsheets. Combined with advanced provenance tracking for both data and computational steps, this enables reproducibility, versioning, and streamlined data exploration. Unique to Vizier is that it exposes potential issues with data, no matter whether they already exist in the input or are introduced by the operations of a notebook. We refer to such potential errors as data caveats. Caveats are propagated alongside data using principled techniques from uncertain data management. Vizier provides extensive user interface support for caveats, e.g., exposing them as summaries in a dedicated error view and highlighting cells with caveats in spreadsheets.
  4. An increasing number of distributed applications operate by dispatching function invocations across the nodes of a distributed system. To operate correctly, the code and data dependencies of the function must be distributed along with the invocations in some way. When translating applications to work on large-scale distributed systems, managing these dependencies becomes challenging: delivery must be scalable to thousands of nodes; the dependencies must be consistent across the system; and the method must be usable by an unprivileged developer. As a solution, in this paper we present PONCHO, a lightweight Python-based toolkit that allows users to discover, package, and deploy dependencies as an integral part of distributed applications. PONCHO encapsulates a set of commands to be executed within an environment. PONCHO offers a lightweight solution to create and manage environments, increasing the portability of scientific applications as well as reproducibility. In this paper, we evaluate PONCHO with real-world applications in the fields of physics, computational chemistry, and hyperparameter optimization. We observe the challenges that arise when creating and distributing an environment and measure the overheads that emerge as a result.
  5. Due to the proliferation of Internet of Things (IoT) and application/user demands that challenge communication and computation, edge computing has emerged as the paradigm to bring computing resources closer to users. In this paper, we present Whispering, an analytical model for the migration of services (service offloading) from the cloud to the edge, in order to minimize the completion time of computational tasks offloaded by user devices and improve the utilization of resources. We also empirically investigate the impact of reusing the results of previously executed tasks for the execution of newly received tasks (computation reuse) and propose an adaptive task offloading scheme between edge and cloud. Our evaluation results show that Whispering achieves up to 35% and 97% (when coupled with computation reuse) lower task completion times than cases where tasks are executed exclusively at the edge or the cloud.