

Search for: All records

Award ID contains: 2420691

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full-text articles may not yet be available free of charge during the embargo (administrative interval).

Some links on this page may take you to non-federal websites. Their policies may differ from those of this site.

  1. Free, publicly-accessible full text available August 3, 2026
  2. Structured data-quality issues—such as missing values correlated with demographics, culturally biased labels, or systemic selection biases—routinely degrade the reliability of machine-learning pipelines. Regulators now increasingly demand evidence that high-stakes systems can withstand these realistic, interdependent errors, yet current robustness evaluations typically use random or overly simplistic corruptions, leaving worst-case scenarios unexplored. We introduce Savage, a causally inspired framework that (i) formally models realistic data-quality issues through dependency graphs and flexible corruption templates, and (ii) systematically discovers corruption patterns that maximally degrade a target performance metric. Savage employs a bi-level optimization approach to efficiently identify vulnerable data subpopulations and fine-tune corruption severity, treating the full ML pipeline, including preprocessing and potentially non-differentiable models, as a black box. Extensive experiments across multiple datasets and ML tasks (data cleaning, fairness-aware learning, uncertainty quantification) demonstrate that even a small fraction (around 5%) of structured corruptions identified by Savage severely impacts model performance, far exceeding random or manually crafted errors, and invalidating core assumptions of existing techniques. Thus, Savage provides a practical tool for rigorous pipeline stress-testing, a benchmark for evaluating robustness methods, and actionable guidance for designing more resilient data workflows. 
    Free, publicly-accessible full text available July 1, 2026
  3. Free, publicly-accessible full text available June 22, 2026
4. Transfer learning is an effective technique for tuning a deep learning model when training data or computational resources are limited. Instead of training a new model from scratch, the parameters of an existing base model are adjusted for the new task. The accuracy of such a fine-tuned model depends on the suitability of the base model chosen. Model search automates the selection of such a base model by evaluating the suitability of candidate models for a specific task. This entails inference with each candidate model on task-specific data. With thousands of models available through model stores, the computational cost of model search is a major bottleneck for efficient transfer learning. In this work, we present Alsatian, a novel model search system. Based on the observation that many candidate models overlap to a significant extent, and following a careful bottleneck analysis, we propose optimization techniques that are applicable to many model search frameworks. These optimizations include: (i) splitting models into individual blocks that can be shared across models, (ii) caching of intermediate inference results and model blocks, and (iii) selecting a beneficial search order for models to maximize sharing of cached results. In our evaluation on state-of-the-art deep learning models from computer vision and natural language processing, we show that Alsatian outperforms baselines by up to 14x.
    Free, publicly-accessible full text available June 17, 2026
5. Given a self-join-free conjunctive query Q and a set of tuples S, a synthetic witness D is a database instance such that the result of Q on D is S. In this work, we are interested in two problems. First, the existence problem ESW decides whether any synthetic witness D exists. Second, given that a synthetic witness exists, the minimization problem SSW computes a synthetic witness of minimal size. The SSW problem is related to the smallest witness problem recently studied by Hu and Sintos [22]; however, the objective and the results are inherently different. More specifically, we show that SSW is poly-time solvable for a wider range of queries. Interestingly, in some cases, SSW is related to optimization problems in other domains, such as the role mining problem in data mining and the edge concentration problem in graph drawing. Solutions to ESW and SSW are of practical interest, e.g., for test database generation for applications accessing a database and for data compression by encoding a dataset S as a pair of a query Q and database D. We prove that ESW is in P, presenting a simple algorithm that, given any S, decides whether a synthetic witness exists in polynomial time in the size of S. Next, we focus on the SSW problem. We show an algorithm that computes a minimal synthetic witness in polynomial time with respect to the size of S for any query Q that has the head-domination property. If Q does not have such a property, then SSW is generally hard. More specifically, we show that for the class of path queries (of any constant length), SSW cannot be solved in polynomial time unless P = NP. We then extend this hardness result to the class of Berge-acyclic queries that do not have the head-domination property, obtaining a full dichotomy of SSW for Berge-acyclic queries. Finally, we investigate the hardness of SSW beyond Berge-acyclic queries by showing that SSW cannot be solved in polynomial time for some cyclic queries unless P = NP.
    Free, publicly-accessible full text available June 9, 2026
  6. We introduce an efficient method for learning linear models from uncertain data, where uncertainty is represented as a set of possible variations in the data, leading to predictive multiplicity. Our approach leverages abstract interpretation and zonotopes, a type of convex polytope, to compactly represent these dataset variations, enabling the symbolic execution of gradient descent on all possible worlds simultaneously. We develop techniques to ensure that this process converges to a fixed point and derive closed-form solutions for this fixed point. Our method provides sound over-approximations of all possible optimal models and viable prediction ranges. We demonstrate the effectiveness of our approach through theoretical and empirical analysis, highlighting its potential to reason about model and prediction uncertainty due to data quality issues in training data. 
    Free, publicly-accessible full text available February 13, 2026
7. Probabilistic databases (PDBs) provide users with a principled way to query data that is incomplete or imprecise. In this work, we study computing expected multiplicities of query results over probabilistic databases under bag semantics, which has PTIME data complexity. However, does this imply that bag probabilistic databases are practical? We strive to answer this question from both a theoretical and a systems perspective. We employ concepts from fine-grained complexity to demonstrate that exact bag probabilistic query processing is fundamentally less efficient than deterministic bag query evaluation, but that fast approximations are possible by sampling monomials from a circuit representation of a result tuple's lineage. A remaining issue, however, is that constructing such circuits, while possible in PTIME, can nonetheless incur significant overhead. To avoid this cost, we utilize approximate query processing techniques to directly sample monomials without materializing lineage upfront. Our implementation in FastPDB provides accurate anytime approximation of probabilistic query answers and scales to datasets orders of magnitude larger than competing methods.
    Free, publicly-accessible full text available February 10, 2026
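
The abstracts above each describe a concrete technique; the short sketches below illustrate a few of them. They are toy illustrations under stated assumptions, not the papers' implementations.

For Savage (item 2), a minimal sketch of the core idea that a small, structured corruption aimed at a specific subpopulation can be injected through a black-box pipeline and its metric impact measured. The dataset, subpopulation predicate, and 5% severity cap are invented for illustration; Savage itself discovers such targets via bi-level optimization rather than hand-picking them.

```python
# Hypothetical sketch: inject a structured corruption into a targeted
# subpopulation and compare black-box pipeline accuracy before and after.
# All names and thresholds here are illustrative, not Savage's API.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
n = 2000
df = pd.DataFrame({
    "age": rng.integers(18, 80, n),
    "income": rng.normal(50, 15, n),
    "group": rng.integers(0, 2, n),  # a demographic attribute
})
df["label"] = ((df.income + 5 * df.group + rng.normal(0, 5, n)) > 55).astype(int)
train, test = df.iloc[:1500].copy(), df.iloc[1500:]

def train_and_eval(tr, te):
    """The whole pipeline is treated as a black box: fit, predict, score."""
    feats = ["age", "income", "group"]
    model = LogisticRegression(max_iter=1000).fit(tr[feats], tr.label)
    return accuracy_score(te.label, model.predict(te[feats]))

# Corruption template: flip labels only inside a vulnerable subpopulation
# (group == 1 with above-average income), capped at 5% of training rows.
target = train[(train.group == 1) & (train.income > 55)].index[:75]
corrupted = train.copy()
corrupted.loc[target, "label"] = 1 - corrupted.loc[target, "label"]

print(f"clean accuracy:     {train_and_eval(train, test):.3f}")
print(f"corrupted accuracy: {train_and_eval(corrupted, test):.3f}")
```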
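
For Alsatian (item 4), a toy sketch of its three optimizations: models split into shareable blocks, caching of intermediate results keyed by block-chain prefix, and a search order that keeps shared prefixes adjacent. The block functions and model chains are stand-ins, not Alsatian's real interfaces.

```python
# Toy sketch: models as chains of shareable blocks; intermediate results
# cached per block-chain prefix; prefix-grouped search order maximizes reuse.
blocks = {  # block_id -> stand-in for an expensive layer
    "stem":  lambda v: v + 1,
    "mid_a": lambda v: v * 2,
    "mid_b": lambda v: v * 3,
    "head1": lambda v: v - 5,
    "head2": lambda v: v - 7,
}
# Candidate models; note that m1 and m2 share the ("stem", "mid_a") prefix.
models = {
    "m1": ("stem", "mid_a", "head1"),
    "m2": ("stem", "mid_a", "head2"),
    "m3": ("stem", "mid_b", "head1"),
}

cache, evals = {}, 0

def run(chain, x):
    """Evaluate a block chain, reusing any cached prefix result."""
    global evals
    out = x
    for i in range(len(chain)):
        key = (chain[:i + 1], x)
        if key not in cache:
            cache[key] = blocks[chain[i]](out)
            evals += 1
        out = cache[key]
    return out

# Evaluate candidates in an order that groups shared prefixes together.
for name in sorted(models, key=models.get):
    print(name, "->", run(models[name], x=10))
print(f"{evals} block evaluations instead of {sum(map(len, models.values()))}")
```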
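
For the synthetic-witness work (item 5), a hedged sketch of a canonical existence-style construction for the two-atom query Q(x, z) :- R(x, y), S(y, z): give each desired output tuple a fresh join value so tuples cannot cross-join, then verify the query result is exactly S. This illustrates ESW feasibility only; the paper's SSW minimization algorithms are not reproduced here.

```python
# Hedged sketch of a canonical witness for Q(x, z) :- R(x, y), S(y, z):
# a fresh join value per desired output tuple guarantees no extra results.
desired = {("a", "1"), ("a", "2"), ("b", "1")}  # the target result set S

R, S_rel = set(), set()
for i, (x, z) in enumerate(sorted(desired)):
    y = f"v{i}"  # fresh value: output tuples cannot cross-join
    R.add((x, y))
    S_rel.add((y, z))

def eval_Q(R, S_rel):
    """Evaluate Q by a nested-loop join of R and S."""
    return {(x, z) for (x, y) in R for (y2, z) in S_rel if y == y2}

assert eval_Q(R, S_rel) == desired
print("witness size:", len(R) + len(S_rel))  # 6 tuples; not minimal
```

For this S, a smaller witness of size 5 exists (share one join value among the tuples with x = a), which is exactly the kind of saving SSW optimizes.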
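
For the zonotope approach (item 6), a minimal sketch, assuming a 1-D linear model without bias: each uncertain label carries a noise symbol in [-1, 1], and because the squared-loss gradient is affine in the labels, every gradient step keeps the weight an affine form (a zonotope), yielding a sound range for the learned weight over all data variations. The learning rate, data, and ±0.2 label uncertainty are invented.

```python
# Minimal sketch: labels are affine forms y_i = c_i + 0.2 * eps_i with
# eps_i in [-1, 1]; gradient descent on w for y = w * x stays affine in
# eps, so the final affine form soundly bounds w over all label variations.
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y_center = np.array([1.1, 1.9, 3.2])  # nominal labels
y_gen = np.eye(3) * 0.2               # one +/-0.2 noise symbol per label

w_c, w_g = 0.0, np.zeros(3)           # weight as affine form w_c + w_g @ eps
lr, n = 0.05, len(x)
a = (2 / n) * np.dot(x, x)            # gradient coefficient on w
for _ in range(500):
    grad_c = a * w_c - (2 / n) * np.dot(x, y_center)
    grad_g = a * w_g - (2 / n) * (x @ y_gen)
    w_c, w_g = w_c - lr * grad_c, w_g - lr * grad_g

radius = np.abs(w_g).sum()            # worst case over eps in [-1, 1]^3
print(f"w is guaranteed to lie in [{w_c - radius:.4f}, {w_c + radius:.4f}]")
print("fixed-point center x.y / x.x =", np.dot(x, y_center) / np.dot(x, x))
```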
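
For FastPDB (item 7), a toy sketch of monomial sampling: the expected multiplicity of a result tuple is the expectation of its lineage polynomial, a sum of monomials over independent base-tuple multiplicities, so sampling monomials uniformly and rescaling gives an unbiased anytime estimate. The lineage and expected multiplicities below are fabricated toy values; FastPDB additionally avoids materializing the lineage upfront.

```python
# Toy sketch: estimate a result tuple's expected multiplicity by sampling
# monomials of its lineage polynomial instead of expanding it fully.
import random

# Expected multiplicities of independent base tuples (fabricated values).
E_R = {"r1": 1.0, "r2": 0.5, "r3": 2.0}
E_S = {"s1": 0.8, "s2": 1.5}
# Toy lineage of one result tuple: a sum of r * s monomials; by
# independence, E[r * s] = E[r] * E[s] for each monomial.
monomials = [(r, s) for r in E_R for s in E_S]

exact = sum(E_R[r] * E_S[s] for r, s in monomials)

random.seed(0)
k = 1000  # sample monomials uniformly with replacement, then rescale
sample = (random.choice(monomials) for _ in range(k))
estimate = len(monomials) / k * sum(E_R[r] * E_S[s] for r, s in sample)

print(f"exact: {exact:.3f}  estimate: {estimate:.3f}")
```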