Dataflow systems increasingly need to support a wide range of tasks in data-centric applications that use the latest techniques, such as machine learning. These tasks often involve custom functions with complex internal states. Consequently, users need enhanced debugging support to understand runtime behaviors and investigate the internal states of dataflows. Traditional forward debuggers only allow users to follow the chronological order of operations in an execution. Therefore, a user cannot easily identify a past runtime behavior after an unexpected result is produced. In this paper, we present a novel time-travel debugging paradigm called IcedTea, which supports reverse debugging. In particular, during a dataflow's execution, which is inherently distributed across multiple operators, the user can periodically interact with the job and retrieve the global states of the operators. After the execution, the system allows the user to roll back the dataflow state to any past interaction. The user can then use step instructions to repeat the past execution and understand how data was processed in the original execution. We give a full specification of this powerful paradigm, study how to reduce its runtime overhead, and develop techniques to support debugging instructions responsively. Our experiments on real-world datasets and workflows show that IcedTea can support responsive time-travel debugging with low time and space overhead.
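The interact, roll back, and step cycle described in this abstract can be illustrated with a small sketch. It is not IcedTea's implementation: it assumes a single in-process operator rather than a distributed dataflow, and the names `CountingOperator`, `ToyTimeTravelDebugger`, `interact`, `rollback`, and `step` are hypothetical, chosen only for illustration.

```python
import copy


class CountingOperator:
    """Hypothetical stateful operator: counts occurrences of each input item."""

    def __init__(self):
        self.state = {}

    def process(self, item):
        self.state[item] = self.state.get(item, 0) + 1


class ToyTimeTravelDebugger:
    """Toy single-operator debugger (an assumption, not the paper's design)."""

    def __init__(self, operator, inputs):
        self.operator = operator
        self.inputs = list(inputs)  # input log, retained so execution can be replayed
        self.position = 0           # number of inputs processed so far
        self.checkpoints = []       # one (position, state snapshot) per interaction

    def step(self):
        """Process exactly one input; return False when the input is exhausted."""
        if self.position >= len(self.inputs):
            return False
        self.operator.process(self.inputs[self.position])
        self.position += 1
        return True

    def run(self, n):
        """Process up to n inputs in their original order."""
        for _ in range(n):
            if not self.step():
                break

    def interact(self):
        """Record a snapshot of the operator state; return the interaction index."""
        snapshot = copy.deepcopy(self.operator.state)
        self.checkpoints.append((self.position, snapshot))
        return len(self.checkpoints) - 1

    def rollback(self, index):
        """Restore the state and input position recorded at a past interaction."""
        position, snapshot = self.checkpoints[index]
        self.position = position
        self.operator.state = copy.deepcopy(snapshot)
```

After a rollback, calling `step()` repeatedly replays the logged inputs one at a time, so the state can be inspected after each item exactly as it evolved in the original run.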
This content will become publicly available on May 19, 2026
POPPER: A Dataflow System for In-Flight Error Handling in Machine Learning Workflows
We present POPPER, a dataflow system for building Machine Learning (ML) workflows. A novel aspect of POPPER is its built-in support for in-flight error handling, which is crucial in developing effective ML workflows. POPPER provides a convenient API that allows users to create and execute complex workflows comprising traditional data processing operations (such as map, filter, and join) and user-defined error handlers. The latter enable in-flight detection and correction of errors introduced by ML models in the workflows. Inside POPPER, we model the workflow as a reactive dataflow, a directed cyclic graph, to achieve efficient execution through pipeline parallelization. We demonstrate the in-flight error-handling capabilities of POPPER, for which we have built a graphical interface that allows users to specify workflows, visualize and interact with the corresponding reactive dataflow, and delve into the internals of POPPER.
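As a rough illustration of what a user-defined error handler placed between ordinary dataflow operations might look like, here is a minimal sketch in plain Python. It does not use POPPER's API, which this record does not show. The functions `toy_model`, `age_error_handler`, and `run_workflow` are hypothetical, and the "model" is a stand-in that simply extracts digits from text.

```python
def toy_model(text):
    """Stand-in for an ML model (an assumption): extracts an age from text.

    It can emit implausible values, which the error handler must catch.
    """
    digits = "".join(ch for ch in text if ch.isdigit())
    return int(digits) if digits else -1


def age_error_handler(record):
    """Hypothetical handler: detect an implausible age and mark it as corrected."""
    if not 0 <= record["age"] <= 120:
        return {**record, "age": None, "corrected": True}
    return record


def run_workflow(texts):
    """Chain a map stage and an error-handling stage as lazy generators.

    Each record passes through both stages before the next one is pulled,
    so the handler acts on records while the workflow is still running.
    """
    records = ({"text": t, "age": toy_model(t)} for t in texts)  # map stage
    handled = (age_error_handler(r) for r in records)            # error handler
    return list(handled)
```

Calling `run_workflow(["age 34", "age 340", "unknown"])` keeps the first prediction and replaces the two implausible ones with `None`, flagging them as corrected.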
- Award ID(s):
- 2415216
- PAR ID:
- 10599641
- Publisher / Repository:
- IEEE
- Date Published:
- Format(s):
- Medium: X
- Location:
- Hong Kong, China
- Sponsoring Org:
- National Science Foundation
More Like this
-
Sudeepa Roy and Jun Yang (Eds.) Data scientists use a wide variety of systems with a wide variety of user interfaces, such as spreadsheets and notebooks, for their data exploration, discovery, preprocessing, and analysis tasks. While this wide selection of tools offers data scientists the freedom to pick the right tool for each task, each of these tools has limitations (e.g., the lack of reproducibility of notebooks), data needs to be translated between tool-specific formats, and common functionality such as versioning, provenance, and dealing with data errors often has to be implemented for each system. We argue that rather than alternating between task-specific tools, a superior approach is to build multiple user interfaces on top of a single incremental workflow / dataflow platform with built-in support for versioning, provenance, error tracking, and data cleaning. We discuss Vizier, a notebook system that implements this approach, introduce the challenges that arose in building such a system, and highlight how our work on Vizier led to novel research in uncertain data management and incremental execution of workflows.
-
Chemical reaction networks (CRNs) are an important tool for molecular programming. This field is rapidly expanding our ability to deploy computer programs into biological systems for various applications. However, CRNs are also difficult to work with due to their massively parallel nature, leading to the need for higher-level languages that allow for more straightforward computation with CRNs. Recently, research has been conducted into various higher-level languages for deterministic CRNs but modeling CRN parallelism, managing error accumulation, and finding natural CRN representations are ongoing challenges. We introduce Reactamole, a higher-level language for deterministic CRNs that utilizes the functional reactive programming (FRP) paradigm to represent CRNs as a reactive dataflow network. Reactamole equates a CRN with a functional reactive program, implementing the key primitives of the FRP paradigm directly as CRNs. The functional nature of Reactamole makes reasoning about molecular programs easier, and its strong static typing allows us to ensure that a CRN is well-formed by virtue of being well-typed. In this paper, we describe the design of Reactamole and how we use CRNs to represent the common datatypes and operations found in FRP. We demonstrate the potential of this functional reactive approach to molecular programming by giving an extended example where a CRN is constructed using FRP to modulate and demodulate an amplitude-modulated signal. We also show how Reactamole can be used to specify abstract CRNs whose structure depends on the reactions and species of its input, allowing users to specify more general CRN behaviors.
-
We consider the question: what is the abstraction that should be implemented by the computational engine of a machine learning system? Current machine learning systems typically push whole tensors through a series of compute kernels such as matrix multiplications or activation functions, where each kernel runs on an AI accelerator (ASIC) such as a GPU. This implementation abstraction provides little built-in support for ML systems to scale past a single machine, or for handling large models with matrices or tensors that do not easily fit into the RAM of an ASIC. In this paper, we present an alternative implementation abstraction called the tensor relational algebra (TRA). The TRA is a set-based algebra based on the relational algebra. Expressions in the TRA operate over binary tensor relations, where keys are multi-dimensional arrays and values are tensors. The TRA is easily executed with high efficiency in a parallel or distributed environment, and amenable to automatic optimization. Our empirical study shows that the optimized TRA-based back-end can significantly outperform alternatives for running ML workflows in distributed clusters.
-
The continuous growth of CNN complexity not only intensifies the need for hardware acceleration but also presents a huge challenge: the solution space for CNN hardware design and dataflow mapping becomes enormously large, in addition to being discrete and lacking a well-behaved structure. Most previous works either are stochastic metaheuristics, such as genetic algorithms, which are typically very slow for solving large problems, or rely on expensive sampling, e.g., Gumbel Softmax-based differentiable optimization and Bayesian optimization. We propose an analytical model for evaluating the power and performance of CNN hardware design and dataflow solutions. Based on this model, we introduce a co-optimization method consisting of nonlinear programming and parallel local search. A key innovation in this model is its matrix form, which enables the use of deep learning toolkits for highly efficient computation of power/performance values and gradients in the optimization. In handling the power-performance tradeoff, our method can lead to better solutions than minimizing a weighted sum of power and latency. The average relative error of our model compared with Timeloop is as small as 1%. Compared to state-of-the-art methods, our approach achieves solutions with up to 1.7× shorter inference latency, 37.5% less power consumption, and 3× less area on ResNet 18. Moreover, it provides a 6.2× speedup of optimization
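For the tensor relational algebra item in the list above, the idea of operating over relations of (key, tensor) pairs can be sketched with a blocked matrix multiplication expressed as a join followed by an aggregation. This is only an illustration under stated assumptions, not the TRA's actual operators or syntax: a tensor relation is modeled as a Python dict, keys are tuples of block indices, blocks are nested lists, and the names `block_matmul`, `block_add`, and `relational_matmul` are hypothetical.

```python
from collections import defaultdict


def block_matmul(x, y):
    """Kernel applied to each joined pair: multiply two dense blocks."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def block_add(x, y):
    """Kernel used by the aggregation: add two blocks elementwise."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]


def relational_matmul(A, B):
    """Multiply two blocked matrices stored as {(row_block, col_block): block}."""
    # Join: pair every block A[i, k] with every block B[k, j] on the shared index k.
    partials = defaultdict(list)
    for (i, k), a in A.items():
        for (k2, j), b in B.items():
            if k == k2:
                partials[(i, j)].append(block_matmul(a, b))

    # Aggregation: sum the partial products that share the output key (i, j).
    result = {}
    for key, blocks in partials.items():
        total = blocks[0]
        for blk in blocks[1:]:
            total = block_add(total, blk)
        result[key] = total
    return result
```

Because each joined pair and each output key can be processed independently, the same join-then-aggregate structure is what makes this style of computation easy to spread across machines.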
