-
Many big data systems are written in languages such as C, C++, Java, and Scala to process large amounts of data efficiently, while data analysts often use Python for data wrangling, statistical analysis, and machine learning. User-defined functions (UDFs) are commonly used in these systems to bridge the gap between the two ecosystems. In this paper, we propose Udon, a novel debugger that supports fine-grained debugging of UDFs. Udon provides modern line-by-line debugging primitives, such as setting breakpoints, inspecting state, and modifying code while executing a UDF on a single tuple. It includes a novel debug-aware UDF execution model to keep the operator responsive during debugging, uses advanced state-transfer techniques to satisfy breakpoint conditions that span multiple UDFs, and incorporates various optimization techniques to reduce runtime overhead. We conduct experiments with multiple UDF workloads on various datasets and show Udon's high efficiency and scalability.
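The tuple-level breakpoint primitive the abstract describes can be sketched with Python's built-in tracing hook. This is a minimal illustration, not Udon's actual implementation; the UDF, its data, and all helper names are hypothetical.

```python
import sys

# Hypothetical tuple-level UDF; names and data are illustrative,
# not part of Udon's actual API.
def extract_sentiment(tweet):
    text = tweet["text"]
    score = text.count("good") - text.count("bad")
    return score  # breakpoint target

def make_tracer(target_func, target_line, on_pause):
    """Build a trace function that calls on_pause(frame) when
    execution reaches target_line inside target_func."""
    def tracer(frame, event, arg):
        if event == "call":
            # Only trace frames belonging to the target UDF.
            return tracer if frame.f_code is target_func.__code__ else None
        if event == "line" and frame.f_lineno == target_line:
            on_pause(frame)  # inspect the UDF's state mid-tuple
        return tracer
    return tracer

inspected = {}

def on_pause(frame):
    # "Code inspection": snapshot the UDF's locals at the breakpoint.
    inspected.update(frame.f_locals)

# Set a line breakpoint on the `return score` statement, then run
# the UDF on a single tuple under the tracer.
bp_line = extract_sentiment.__code__.co_firstlineno + 3
sys.settrace(make_tracer(extract_sentiment, bp_line, on_pause))
result = extract_sentiment({"text": "good stuff, good vibes, bad luck"})
sys.settrace(None)
```

A production debugger would of course add stepping, code modification, and the debug-aware execution model the paper describes; the sketch only shows the pause-and-inspect core.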
-
Abstract Climate communication scientists search for effective message strategies to engage the ambivalent public in support of climate advocacy. The personal experience of wildfire is expected to render climate change impacts more concrete, pointing to a potential message strategy for engaging the public. This study examined Twitter discourse related to climate change during the onset of 20 wildfires in California between 2017 and 2021. In this mixed-method study, we analyzed tweets geographically and temporally proximal to the occurrence of wildfires to discover framings and examined how the frequencies of climate framings changed before and after fires. Results identified three predominant climate framings: linking wildfire to climate change, suggesting climate actions, and attributing adversities other than wildfires to climate change. Mean frequencies of tweets linking wildfire to climate change and attributing adversities increased significantly after the onset of fire. Tweets suggesting climate action also increased, but not significantly. Temporal analysis of tweet frequencies for the three themes showed that discussion increased after the onset of a fire but typically persisted no more than two weeks. For fires that burned longer than a month, external events triggered climate discussions. Our findings help identify how the personal experience of wildfire shapes Twitter discussion related to climate change and how these framings change over time during wildfire events, yielding insights into critical time points after wildfire for implementing message strategies to increase public engagement on climate change impacts and policy.
-
Collaborative data analytics is becoming increasingly important due to the growing complexity of data science, more diverse skills from different disciplines, more common asynchronous schedules of team members, and the global trend of working remotely. In this demo we will show how Texera supports this emerging computing paradigm to achieve high productivity among collaborators with various backgrounds. Based on our active joint projects on the system, we use a scenario of social media analysis to show how a data science task can be conducted on a user-friendly yet powerful platform by a multi-disciplinary team including domain scientists with limited coding skills and experienced machine learning experts. We will present collaborative editing and collaborative execution of a workflow in Texera, focusing on data-centric features such as synchronization of operator schemas among users during the construction phase, and monitoring and controlling the shared runtime during the execution phase.
-
We will demonstrate a prototype query-processing engine that utilizes correlations among predicates to accelerate machine learning (ML) inference queries on unstructured data. Expensive operators such as feature extractors and classifiers are deployed as user-defined functions (UDFs), which are not penetrable by classic query optimization techniques such as predicate push-down. Recent optimization schemes (e.g., Probabilistic Predicates, or PP) build a cheap proxy model for each predicate offline and, under an independence assumption, inject the proxy models in front of the expensive ML UDFs in queries. Input records that do not satisfy the query predicates are filtered early by the proxy models to bypass the ML UDFs. But enforcing the independence assumption may result in sub-optimal plans. We use correlative proxy models to better exploit predicate correlations and accelerate ML queries. We will demonstrate our query optimizer, CORE, which builds proxy models online, allocates parameters to each model, and reorders them. We will also show end-to-end query processing with and without proxy models.
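The proxy-injection idea behind PP and CORE can be sketched in a few lines: a cheap, high-recall proxy screens each record before the expensive ML UDF runs. All names, data, and the string-matching "models" below are hypothetical stand-ins, not the demo's actual code.

```python
# Records that actually reach the expensive UDF are logged here,
# so we can see how many calls the proxy saves.
expensive_calls = []

def expensive_udf(record):
    # Stand-in for a costly ML classifier invoked per record.
    expensive_calls.append(record)
    return "dog" in record

def cheap_proxy(record):
    # A cheap model tuned for high recall: it may pass records the
    # UDF later rejects, but should rarely drop true matches.
    return "do" in record

def run_query(records, use_proxy=True):
    out = []
    for r in records:
        if use_proxy and not cheap_proxy(r):
            continue  # filtered early; the expensive UDF is bypassed
        if expensive_udf(r):
            out.append(r)
    return out

data = ["a dog", "a cat", "a bird", "a fish"]
matched = run_query(data)
# Only "a dog" survives the proxy and reaches the expensive UDF,
# so one expensive call replaces four, with the same final answer.
```

The correctness of this rewrite hinges on the proxy's recall: any true match the proxy drops is lost from the result, which is why PP-style schemes train proxies to favor false positives over false negatives.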
-
We consider accelerating machine learning (ML) inference queries on unstructured datasets. Expensive operators such as feature extractors and classifiers are deployed as user-defined functions (UDFs), which are not penetrable by classic query optimization techniques such as predicate push-down. Recent optimization schemes (e.g., Probabilistic Predicates, or PP) assume independence among the query predicates, build a proxy model for each predicate offline, and rewrite a new query by injecting these cheap proxy models in front of the expensive ML UDFs. In such a manner, unlikely inputs that do not satisfy the query predicates are filtered early to bypass the ML UDFs. We show that enforcing the independence assumption in this context may result in sub-optimal plans. In this paper, we propose CORE, a query optimizer that better exploits predicate correlations and accelerates ML inference queries. Our solution builds the proxy models online for a new query and leverages a branch-and-bound search process to reduce the building costs. Results on three real-world text, image, and video datasets show that CORE improves query throughput by up to 63% compared to PP and up to 80% compared to running the queries as-is.
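Why correlations change the best plan can be seen with a tiny cost model for ordering proxy filters. The numbers, filter names, and the pairwise-chain approximation below are all illustrative; this is not CORE's actual optimizer, which searches plans with branch-and-bound.

```python
from itertools import permutations

def expected_cost(order, costs, sel):
    """Expected per-record cost of running proxy filters in `order`.
    sel[(None, f)] is f's marginal pass rate; sel[(g, f)] approximates
    P(f passes | g passed). Conditioning only on the immediately
    preceding filter is a deliberate simplification."""
    total, prev, pass_prob = 0.0, None, 1.0
    for f in order:
        total += pass_prob * costs[f]  # pay f's cost on survivors only
        pass_prob *= sel[(prev, f)]    # surviving fraction shrinks
        prev = f
    return total

# Hypothetical statistics: p1 and p2 are strongly positively
# correlated, while p3 is independent of both.
costs = {"p1": 1.0, "p2": 1.0, "p3": 1.0}
sel = {(None, "p1"): 0.5, (None, "p2"): 0.5, (None, "p3"): 0.5,
       ("p1", "p2"): 0.9, ("p2", "p1"): 0.9,
       ("p1", "p3"): 0.5, ("p3", "p1"): 0.5,
       ("p2", "p3"): 0.5, ("p3", "p2"): 0.5}

best = min(permutations(costs), key=lambda o: expected_cost(o, costs, sel))
# Under independence every order looks identical (1.75 expected cost),
# but running the correlated pair back-to-back actually costs 1.95:
# p2 right after p1 pays its cost while filtering almost nothing.
```

The gap between the independence estimate and the true cost is exactly the sub-optimality the abstract refers to: an independence-based optimizer cannot distinguish these orders, while a correlation-aware one can.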