-
Successful supervised learning models rely on predictive features, which rarely come from a single dataset. As a result, relevant datasets must be integrated before the actual model is trained. This raises a natural question: "how can one efficiently search for predictive features from relevant datasets for integration, with responsible AI guarantees?" This paper formalizes the question as the data augmentation search problem, with the objective of minimizing search latency. We propose \sys, an interactive system that takes a supervised learning task as input and searches for a set of join-compatible datasets that optimally improve the task's performance. Specifically, \sys manages a corpus of relational datasets, uses linear regression as a proxy model to evaluate augmentation candidates, and applies factorized machine learning to accelerate model training and evaluation algorithmically. Furthermore, \sys leverages system and hardware optimizations to maximize parallelism across augmentation searches. Together, these allow \sys to search for a good augmentation plan over one million datasets with a latency of 1.4 seconds.
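To make the factorized evaluation above concrete, below is a minimal sketch (our illustration, not \sys's actual implementation) of scoring a linear-regression proxy over a key join without materializing it: each table is reduced to per-key counts, sums, and sums of products, and these sufficient statistics are combined into the gram matrix of the virtual join. All names are hypothetical.

import numpy as np
import pandas as pd

def semiring_stats(df, cols, key):
    """Per-key sufficient statistics: (count, sum, sum of outer products)."""
    stats = {}
    for k, g in df.groupby(key):
        X = g[cols].to_numpy(dtype=float)
        stats[k] = (len(X), X.sum(axis=0), X.T @ X)
    return stats

def factorized_gram(base_stats, cand_stats):
    """Gram matrix of base JOIN candidate, computed from per-key stats only.

    For a key matching c_b base rows and c_a candidate rows, the join has
    c_b * c_a rows; base-base products scale by c_a, candidate-candidate
    products scale by c_b, and cross terms are outer products of the sums.
    """
    gram = None
    for k, (cb, sb, qb) in base_stats.items():
        if k not in cand_stats:
            continue
        ca, sa, qa = cand_stats[k]
        top = np.hstack([ca * qb, np.outer(sb, sa)])
        bot = np.hstack([np.outer(sa, sb), cb * qa])
        block = np.vstack([top, bot])
        gram = block if gram is None else gram + block
    return gram

# Include y among the base columns so the gram matrix also yields X^T y,
# from which the proxy regression can be solved in closed form.
base = pd.DataFrame({"id": [1, 1, 2], "x1": [1.0, 2.0, 3.0], "y": [2.0, 3.0, 5.0]})
cand = pd.DataFrame({"id": [1, 2], "z1": [0.5, 1.5]})
G = factorized_gram(semiring_stats(base, ["x1", "y"], "id"),
                    semiring_stats(cand, ["z1"], "id"))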
-
Recent platforms use ML task performance metrics, rather than metadata keywords, to search large data corpora. Requesters provide an initial dataset, and the platform searches for additional datasets that augment (join or union with) the requester's dataset to most improve model (e.g., linear regression) performance. Although effective, current task-based data searches are hampered by (1) high latency, which deters users, (2) privacy concerns under regulatory standards, and (3) low data quality, which yields low utility. We introduce Mileena, a fast, private, and high-quality task-based dataset search platform. At its heart, Mileena is built on pre-computed semi-ring sketches for efficient ML training and evaluation. On top of these semi-rings, we develop a novel Factorized Privacy Mechanism that makes the search differentially private and scales to arbitrary corpus sizes and numbers of requests without major quality degradation. We also demonstrate early promise in using LLM-based agents for automatic data transformation and in applying semi-rings to support causal discovery and treatment-effect estimation.
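As a rough illustration of how pre-computed semi-ring sketches can be combined with differential privacy, the toy sketch below adds calibrated Laplace noise to count/sum/sum-of-squares aggregates once, so the noisy sketch can be reused across any number of search requests. This shows only the noise-on-aggregates idea; Mileena's Factorized Privacy Mechanism is more involved, and all names here are hypothetical.

import numpy as np

def dp_sketch(count, sums, sums_sq, epsilon):
    """Release (count, sum, sum-of-squares) aggregates under epsilon-DP.

    Assumes each value is pre-clipped to [0, 1], so adding or removing one
    row changes each aggregate by at most 1; the privacy budget is split
    evenly across the three releases. Illustrative, not Mileena's mechanism.
    """
    scale = 3.0 / epsilon  # Laplace scale = sensitivity / (epsilon / 3)
    noise = lambda a: a + np.random.laplace(0.0, scale, np.shape(a))
    return noise(count), noise(sums), noise(sums_sq)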
-
Although tree models are dominant for tabular data, the ML libraries that train them (e.g., LightGBM, XGBoost) require normalized databases to be denormalized into a single table, materialized, and exported. This process is slow, does not scale, and poses security risks. In-DB ML aims to train models inside DBMSes to avoid data movement and provide data governance. Rather than modifying a DBMS to support In-DB ML, is it possible to offer tree-training performance competitive with specialized ML libraries...with only SQL? We present JoinBoost, a Python library that rewrites tree training algorithms over normalized databases into pure SQL. It is portable to any DBMS, offers performance competitive with specialized ML libraries, and scales with the underlying DBMS's capabilities. JoinBoost extends prior work from both algorithmic and systems perspectives. Algorithmically, we support factorized gradient boosting by updating the Y variable to the residual in the non-materialized join result. Although this view-update problem is generally ambiguous, we identify addition-to-multiplication preserving, the key property of the variance semi-ring, to support RMSE, the most widely used criterion. System-wise, we identify residual updates as a performance bottleneck. This overhead can be minimized natively on columnar DBMSes by creating a new column of residual values and adding it as a projection. We validate this with two implementations on DuckDB, with no or minimal modifications to its internals for portability. Our experiments show that JoinBoost is 3× (1.1×) faster for random forests (gradient boosting) than LightGBM, and over an order of magnitude faster than state-of-the-art In-DB ML systems. Further, JoinBoost scales well beyond LightGBM in the number of features, database size (TPC-DS SF=1000), and join graph complexity (galaxy schemas).
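The residual-as-projection idea can be sketched in a few lines of DuckDB SQL (a standalone toy with made-up table and column names, not JoinBoost's actual schema or API): instead of UPDATE-ing residuals in place each boosting round, every round defines a view that projects a fresh residual column over the previous one, which a columnar engine evaluates cheaply.

import duckdb

con = duckdb.connect()
# Toy fact table standing in for the (non-materialized) join result.
con.execute("""
    CREATE TABLE sales AS
    SELECT * FROM (VALUES (1, 10.0), (2, 12.0), (3, 7.0)) AS t(id, y);
""")

# Round 0: for squared loss, the residual starts at y minus the mean.
con.execute("""
    CREATE VIEW round0 AS
    SELECT id, y, y - (SELECT AVG(y) FROM sales) AS residual FROM sales;
""")

# Round 1: subtract the previous tree's prediction; a constant stands in
# here for the per-row prediction a real gradient-boosting round produces.
con.execute("""
    CREATE VIEW round1 AS
    SELECT id, y, residual - 0.5 AS residual FROM round0;
""")
print(con.execute("SELECT * FROM round1 ORDER BY id").fetchall())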
-
Modern database management systems employ sophisticated query optimization techniques that enable the generation of efficient plans for queries over very large data sets. A variety of other applications also process large data sets, but cannot leverage database-style query optimization for their code. We therefore identify an opportunity to enhance an open-source programming-language compiler with database-style query optimization. Our system dynamically generates execution plans at query time and runs those plans on one chunk of data at a time. Based on feedback from earlier chunks, alternative plans may be used for later chunks. The compiler extension could be used for a variety of data-intensive applications, allowing all of them to benefit from this class of performance optimizations.
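The chunk-at-a-time adaptivity described above can be illustrated with a small sketch (ours, not the system's actual compiler machinery): several candidate plans compute the same query, per-chunk timings are recorded, and later chunks are routed to the plan that has been cheapest so far. All names are hypothetical.

import time

def run_adaptive(chunks, plans):
    """Run the same query over successive chunks, re-picking the plan.

    `plans` is a list of functions that all implement the same query;
    each plan is tried once, then the cheapest average runtime wins.
    """
    cost = {p: 0.0 for p in plans}
    runs = {p: 0 for p in plans}
    results = []
    for i, chunk in enumerate(chunks):
        if i < len(plans):        # explore: give every plan one chunk
            plan = plans[i]
        else:                     # exploit: cheapest plan so far
            plan = min(plans, key=lambda p: cost[p] / runs[p])
        start = time.perf_counter()
        results.append(plan(chunk))
        cost[plan] += time.perf_counter() - start
        runs[plan] += 1
    return results

# Two equivalent "plans" for a filter-and-sum query over numeric chunks:
plans = [lambda c: sum(x for x in c if x > 0),
         lambda c: sum(filter(lambda x: x > 0, c))]
print(run_adaptive([list(range(-5, 5))] * 8, plans))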