NSF PAR Search | NSF Public Access Repository

FastPDB: Towards Bag-Probabilistic Queries at Interactive Speeds

https://doi.org/10.1145/3709691

Huber, Aaron; Kennedy, Oliver; Rudra, Atri; Zhao, Zhuoyue; Feng, Su; Glavic, Boris (February 2025, Proceedings of the ACM on Management of Data)

Probabilistic databases (PDBs) provide users with a principled way to query data that is incomplete or imprecise. In this work, we study computing expected multiplicities of query results over probabilistic databases under bag semantics, a problem that has PTIME data complexity. However, does this imply that bag probabilistic databases are practical? We strive to answer this question from both a theoretical and a systems perspective. We employ concepts from fine-grained complexity to demonstrate that exact bag probabilistic query processing is fundamentally less efficient than deterministic bag query evaluation, but that fast approximations are possible by sampling monomials from a circuit representation of a result tuple's lineage. A remaining issue, however, is that constructing such circuits, while in PTIME, can nonetheless have significant overhead. To avoid this cost, we utilize approximate query processing techniques to directly sample monomials without materializing lineage upfront. Our implementation in FastPDB provides accurate anytime approximation of probabilistic query answers and scales to datasets orders of magnitude larger than competing methods.
Free, publicly-accessible full text available February 10, 2026
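The monomial-sampling idea in the FastPDB abstract above can be illustrated with a minimal sketch. This is not the FastPDB implementation: it assumes the lineage polynomial is already expanded into a list of monomials with positive integer coefficients, that each variable is an independent 0/1 tuple indicator with a known probability, and all names (`exact_expected_multiplicity`, `sampled_expected_multiplicity`, the example data) are hypothetical.

```python
import random
from math import prod

# Assumption: a lineage polynomial is a list of (coefficient, variables)
# monomials. Variables are 0/1 indicators, so X**k == X and a set suffices.

def exact_expected_multiplicity(monomials, p):
    """Sum of coefficient * product of variable probabilities (independence assumed)."""
    return sum(c * prod(p[v] for v in vs) for c, vs in monomials)

def sampled_expected_multiplicity(monomials, p, n_samples, rng):
    """Estimate the same quantity by sampling monomials proportionally to
    their coefficients and averaging each sample's probability product."""
    weights = [c for c, _ in monomials]
    total = sum(weights)
    picks = rng.choices(monomials, weights=weights, k=n_samples)
    mean = sum(prod(p[v] for v in vs) for _, vs in picks) / n_samples
    return total * mean

# Hypothetical example: 2*a*b + a + 3*c
example_monomials = [
    (2, frozenset({"a", "b"})),
    (1, frozenset({"a"})),
    (3, frozenset({"c"})),
]
example_probs = {"a": 0.5, "b": 0.4, "c": 0.1}

exact = exact_expected_multiplicity(example_monomials, example_probs)
estimate = sampled_expected_multiplicity(
    example_monomials, example_probs, 20000, random.Random(0)
)
```

The estimator is unbiased under these assumptions, and more samples tighten it, which is one way to read the "anytime approximation" the abstract mentions. The sketch samples from an explicit monomial list, whereas the abstract describes sampling without materializing lineage upfront.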
Drag, Drop, Merge: A Tool for Streamlining Integration of Longitudinal Survey Instruments

https://doi.org/10.1145/3665939.3665965

Pokharel, Pratik; Lee, Juseung; Kennedy, Oliver; Markatou, Marianthi; Talal, Andrew; Good, Jeff; Mukhopadhyay, Raktim (June 2024, ACM)

We explore data management for longitudinal study survey instruments: (i) Survey instrument evolution presents a unique data integration challenge; and (ii) Longitudinal study data frequently requires repeated, task-specific integration efforts. We present DDM (Drag, Drop, Merge), a user interface for documenting relationships among attributes of source schemas into a form that can streamline subsequent efforts to generate task-specific datasets. DDM employs a "human-in-the-loop" approach, allowing users to validate and refine semantic mappings. Through a simulation of user interactions with DDM, we demonstrate its viability as a way to reduce cognitive overhead for longitudinal study data curators.
Full Text Available
Overlay Spreadsheets

https://doi.org/10.1145/3597465.3605220

Kennedy, Oliver; Glavic, Boris; Brachmann, Michael (June 2023, HILDA '23: Proceedings of the Workshop on Human-In-the-Loop Data Analytics)

Full Text Available
The Right Tool for the Job: Data-Centric Workflows in Vizier

Oliver Kennedy, Boris Glavic (September 2022, Bulletin of the Technical Committee on Data Engineering)
Sudeepa Roy and Jun Yang (Eds.)
Data scientists use a wide variety of systems with a wide variety of user interfaces, such as spreadsheets and notebooks, for their data exploration, discovery, preprocessing, and analysis tasks. While this wide selection of tools offers data scientists the freedom to pick the right tool for each task, each of these tools has limitations (e.g., the lack of reproducibility of notebooks), data needs to be translated between tool-specific formats, and common functionality such as versioning, provenance, and dealing with data errors often has to be implemented for each system. We argue that rather than alternating between task-specific tools, a superior approach is to build multiple user interfaces on top of a single incremental workflow / dataflow platform with built-in support for versioning, provenance, error tracking, and data cleaning. We discuss Vizier, a notebook system that implements this approach, introduce the challenges that arose in building such a system, and highlight how our work on Vizier led to novel research in uncertain data management and incremental execution of workflows.
Full Text Available
Runtime provenance refinement for notebooks

https://doi.org/10.1145/3530800.3534535

Deo, Nachiket; Glavic, Boris; Kennedy, Oliver (June 2022, Proceedings of the 14th International Workshop on the Theory and Practice of Provenance)

Full Text Available
Efficient Uncertainty Tracking for Complex Queries with Attribute-level Bounds

https://doi.org/10.1145/3448016.3452791

Feng, Su; Glavic, Boris; Huber, Aaron; Kennedy, Oliver A. (June 2021, SIGMOD '21: International Conference on Management of Data)
Incomplete and probabilistic database techniques are principled methods for coping with uncertainty in data. Unfortunately, the class of queries that can be answered efficiently over such databases is severely limited, even when advanced approximation techniques are employed. We introduce attribute-annotated uncertain databases (AU-DBs), an uncertain data model that annotates tuples and attribute values with bounds to compactly approximate an incomplete database. AU-DBs are closed under relational algebra with aggregation using an efficient evaluation semantics. Using optimizations that trade accuracy for performance, our approach scales to complex queries and large datasets, and produces accurate results.
Full Text Available
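The attribute-level bounds described in the AU-DB abstract above can be illustrated with a small interval-arithmetic sketch. It is only an illustration of propagating bounds through arithmetic and a SUM aggregate, not the authors' evaluation semantics. The `RangeValue` class, its `sg` ("selected guess") field, and `bounded_sum` are hypothetical names introduced here.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RangeValue:
    """Assumed representation: an attribute value with a lower bound,
    one chosen guess, and an upper bound."""
    lb: float
    sg: float
    ub: float

    def __add__(self, other):
        # Bounds of a sum are the sums of the bounds.
        return RangeValue(self.lb + other.lb, self.sg + other.sg, self.ub + other.ub)

    def __mul__(self, other):
        # Bounds of a product come from the extreme corner combinations.
        corners = [
            self.lb * other.lb, self.lb * other.ub,
            self.ub * other.lb, self.ub * other.ub,
        ]
        return RangeValue(min(corners), self.sg * other.sg, max(corners))

def bounded_sum(values):
    """SUM aggregate over range-annotated values."""
    total = RangeValue(0, 0, 0)
    for v in values:
        total = total + v
    return total

# Hypothetical usage: two uncertain values, summed and multiplied.
summed = bounded_sum([RangeValue(1, 2, 3), RangeValue(-1, 0, 4)])
product = RangeValue(-1, 1, 2) * RangeValue(3, 4, 5)
```

Because each operation returns another `RangeValue`, results can feed further operations, which loosely mirrors the closure property the abstract claims for AU-DBs.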
