

Search for: All records

Award ID contains: 2420691

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full-text articles may not yet be available free of charge during the embargo (administrative interval).

Some links on this page may take you to non-federal websites. Their policies may differ from those of this site.

  1. Free, publicly-accessible full text available August 3, 2026
  2. Structured data-quality issues—such as missing values correlated with demographics, culturally biased labels, or systemic selection biases—routinely degrade the reliability of machine-learning pipelines. Regulators now increasingly demand evidence that high-stakes systems can withstand these realistic, interdependent errors, yet current robustness evaluations typically use random or overly simplistic corruptions, leaving worst-case scenarios unexplored. We introduce Savage, a causally inspired framework that (i) formally models realistic data-quality issues through dependency graphs and flexible corruption templates, and (ii) systematically discovers corruption patterns that maximally degrade a target performance metric. Savage employs a bi-level optimization approach to efficiently identify vulnerable data subpopulations and fine-tune corruption severity, treating the full ML pipeline, including preprocessing and potentially non-differentiable models, as a black box. Extensive experiments across multiple datasets and ML tasks (data cleaning, fairness-aware learning, uncertainty quantification) demonstrate that even a small fraction (around 5%) of structured corruptions identified by Savage severely impacts model performance, far exceeding random or manually crafted errors, and invalidating core assumptions of existing techniques. Thus, Savage provides a practical tool for rigorous pipeline stress-testing, a benchmark for evaluating robustness methods, and actionable guidance for designing more resilient data workflows. 
    Free, publicly-accessible full text available July 1, 2026
  3. Free, publicly-accessible full text available June 22, 2026
4. Transfer learning is an effective technique for tuning a deep learning model when training data or computational resources are limited. Instead of training a new model from scratch, the parameters of an existing base model are adjusted for the new task. The accuracy of such a fine-tuned model depends on the suitability of the base model chosen. Model search automates the selection of such a base model by evaluating the suitability of candidate models for a specific task. This entails inference with each candidate model on task-specific data. With thousands of models available through model stores, the computational cost of model search is a major bottleneck for efficient transfer learning. In this work, we present Alsatian, a novel model search system. Based on the observation that many candidate models overlap to a significant extent, and following a careful bottleneck analysis, we propose optimization techniques that are applicable to many model search frameworks. These optimizations include: (i) splitting models into individual blocks that can be shared across models, (ii) caching of intermediate inference results and model blocks, and (iii) selecting a beneficial search order for models to maximize sharing of cached results. In our evaluation on state-of-the-art deep learning models from computer vision and natural language processing, we show that Alsatian outperforms baselines by up to 14x.
    Free, publicly-accessible full text available June 17, 2026
5. Given a self-join-free conjunctive query Q and a set of tuples S, a synthetic witness D is a database instance such that the result of Q on D is S. In this work, we are interested in two problems. First, the existence problem ESW decides whether any synthetic witness D exists. Second, given that a synthetic witness exists, the minimization problem SSW computes a synthetic witness of minimal size. The SSW problem is related to the smallest witness problem recently studied by Hu and Sintos [22]; however, the objective and the results are inherently different. More specifically, we show that SSW is poly-time solvable for a wider range of queries. Interestingly, in some cases, SSW is related to optimization problems in other domains, such as the role mining problem in data mining and the edge concentration problem in graph drawing. Solutions to ESW and SSW are of practical interest, e.g., for test database generation for applications accessing a database and for data compression by encoding a dataset S as a pair of a query Q and database D. We prove that ESW is in P, presenting a simple algorithm that, given any S, decides whether a synthetic witness exists in polynomial time in the size of S. Next, we focus on the SSW problem. We show an algorithm that computes a minimal synthetic witness in polynomial time with respect to the size of S for any query Q that has the head-domination property. If Q does not have such a property, then SSW is generally hard. More specifically, we show that for the class of path queries (of any constant length), SSW cannot be solved in polynomial time unless P = NP. We then extend this hardness result to the class of Berge-acyclic queries that do not have the head-domination property, obtaining a full dichotomy of SSW for Berge-acyclic queries. Finally, we investigate the hardness of SSW beyond Berge-acyclic queries by showing that SSW cannot be solved in polynomial time for some cyclic queries unless P = NP.
    Free, publicly-accessible full text available June 9, 2026
  6. We introduce an efficient method for learning linear models from uncertain data, where uncertainty is represented as a set of possible variations in the data, leading to predictive multiplicity. Our approach leverages abstract interpretation and zonotopes, a type of convex polytope, to compactly represent these dataset variations, enabling the symbolic execution of gradient descent on all possible worlds simultaneously. We develop techniques to ensure that this process converges to a fixed point and derive closed-form solutions for this fixed point. Our method provides sound over-approximations of all possible optimal models and viable prediction ranges. We demonstrate the effectiveness of our approach through theoretical and empirical analysis, highlighting its potential to reason about model and prediction uncertainty due to data quality issues in training data. 
    Free, publicly-accessible full text available February 13, 2026
7. Probabilistic databases (PDBs) provide users with a principled way to query data that is incomplete or imprecise. In this work, we study computing expected multiplicities of query results over probabilistic databases under bag semantics, which has PTIME data complexity. However, does this imply that bag probabilistic databases are practical? We strive to answer this question from both a theoretical and a systems perspective. We employ concepts from fine-grained complexity to demonstrate that exact bag probabilistic query processing is fundamentally less efficient than deterministic bag query evaluation, but that fast approximations are possible by sampling monomials from a circuit representation of a result tuple's lineage. A remaining issue, however, is that constructing such circuits, while possible in PTIME, can nonetheless incur significant overhead. To avoid this cost, we utilize approximate query processing techniques to directly sample monomials without materializing lineage upfront. Our implementation in FastPDB provides accurate anytime approximation of probabilistic query answers and scales to datasets orders of magnitude larger than competing methods.
    Free, publicly-accessible full text available February 10, 2026
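
The abstracts above each describe a concrete technique; the short sketches below illustrate a few of them. They are toy illustrations under stated assumptions, not the papers' implementations.

For Savage (item 2), a minimal sketch of the core idea that a small, structured corruption aimed at a specific subpopulation can be injected through a black-box pipeline and its metric impact measured. The dataset, subpopulation predicate, and 5% severity cap are invented for illustration; Savage itself discovers such targets via bi-level optimization rather than hand-picking them.

```python
# Hypothetical sketch: inject a structured corruption into a targeted
# subpopulation and compare black-box pipeline accuracy before and after.
# All names and thresholds here are illustrative, not Savage's API.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
n = 2000
df = pd.DataFrame({
    "age": rng.integers(18, 80, n),
    "income": rng.normal(50, 15, n),
    "group": rng.integers(0, 2, n),  # a demographic attribute
})
df["label"] = ((df.income + 5 * df.group + rng.normal(0, 5, n)) > 55).astype(int)
train, test = df.iloc[:1500].copy(), df.iloc[1500:]

def train_and_eval(tr, te):
    """The whole pipeline is treated as a black box: fit, predict, score."""
    feats = ["age", "income", "group"]
    model = LogisticRegression(max_iter=1000).fit(tr[feats], tr.label)
    return accuracy_score(te.label, model.predict(te[feats]))

# Corruption template: flip labels only inside a vulnerable subpopulation
# (group == 1 with above-average income), capped at 5% of training rows.
target = train[(train.group == 1) & (train.income > 55)].index[:75]
corrupted = train.copy()
corrupted.loc[target, "label"] = 1 - corrupted.loc[target, "label"]

print(f"clean accuracy:     {train_and_eval(train, test):.3f}")
print(f"corrupted accuracy: {train_and_eval(corrupted, test):.3f}")
```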
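
For Alsatian (item 4), a toy sketch of its three optimizations: models split into shareable blocks, caching of intermediate results keyed by block-chain prefix, and a search order that keeps shared prefixes adjacent. The block functions and model chains are stand-ins, not Alsatian's real interfaces.

```python
# Toy sketch: models as chains of shareable blocks; intermediate results
# cached per block-chain prefix; prefix-grouped search order maximizes reuse.
blocks = {  # block_id -> stand-in for an expensive layer
    "stem":  lambda v: v + 1,
    "mid_a": lambda v: v * 2,
    "mid_b": lambda v: v * 3,
    "head1": lambda v: v - 5,
    "head2": lambda v: v - 7,
}
# Candidate models; note that m1 and m2 share the ("stem", "mid_a") prefix.
models = {
    "m1": ("stem", "mid_a", "head1"),
    "m2": ("stem", "mid_a", "head2"),
    "m3": ("stem", "mid_b", "head1"),
}

cache, evals = {}, 0

def run(chain, x):
    """Evaluate a block chain, reusing any cached prefix result."""
    global evals
    out = x
    for i in range(len(chain)):
        key = (chain[:i + 1], x)
        if key not in cache:
            cache[key] = blocks[chain[i]](out)
            evals += 1
        out = cache[key]
    return out

# Evaluate candidates in an order that groups shared prefixes together.
for name in sorted(models, key=models.get):
    print(name, "->", run(models[name], x=10))
print(f"{evals} block evaluations instead of {sum(map(len, models.values()))}")
```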
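
For the synthetic-witness work (item 5), a hedged sketch of a canonical existence-style construction for the two-atom query Q(x, z) :- R(x, y), S(y, z): give each desired output tuple a fresh join value so tuples cannot cross-join, then verify the query result is exactly S. This illustrates ESW feasibility only; the paper's SSW minimization algorithms are not reproduced here.

```python
# Hedged sketch of a canonical witness for Q(x, z) :- R(x, y), S(y, z):
# a fresh join value per desired output tuple guarantees no extra results.
desired = {("a", "1"), ("a", "2"), ("b", "1")}  # the target result set S

R, S_rel = set(), set()
for i, (x, z) in enumerate(sorted(desired)):
    y = f"v{i}"  # fresh value: output tuples cannot cross-join
    R.add((x, y))
    S_rel.add((y, z))

def eval_Q(R, S_rel):
    """Evaluate Q by a nested-loop join of R and S."""
    return {(x, z) for (x, y) in R for (y2, z) in S_rel if y == y2}

assert eval_Q(R, S_rel) == desired
print("witness size:", len(R) + len(S_rel))  # 6 tuples; not minimal
```

For this S, a smaller witness of size 5 exists (share one join value among the tuples with x = a), which is exactly the kind of saving SSW optimizes.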
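
For the zonotope approach (item 6), a minimal sketch, assuming a 1-D linear model without bias: each uncertain label carries a noise symbol in [-1, 1], and because the squared-loss gradient is affine in the labels, every gradient step keeps the weight an affine form (a zonotope), yielding a sound range for the learned weight over all data variations. The learning rate, data, and ±0.2 label uncertainty are invented.

```python
# Minimal sketch: labels are affine forms y_i = c_i + 0.2 * eps_i with
# eps_i in [-1, 1]; gradient descent on w for y = w * x stays affine in
# eps, so the final affine form soundly bounds w over all label variations.
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y_center = np.array([1.1, 1.9, 3.2])  # nominal labels
y_gen = np.eye(3) * 0.2               # one +/-0.2 noise symbol per label

w_c, w_g = 0.0, np.zeros(3)           # weight as affine form w_c + w_g @ eps
lr, n = 0.05, len(x)
a = (2 / n) * np.dot(x, x)            # gradient coefficient on w
for _ in range(500):
    grad_c = a * w_c - (2 / n) * np.dot(x, y_center)
    grad_g = a * w_g - (2 / n) * (x @ y_gen)
    w_c, w_g = w_c - lr * grad_c, w_g - lr * grad_g

radius = np.abs(w_g).sum()            # worst case over eps in [-1, 1]^3
print(f"w is guaranteed to lie in [{w_c - radius:.4f}, {w_c + radius:.4f}]")
print("fixed-point center x.y / x.x =", np.dot(x, y_center) / np.dot(x, x))
```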
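
For FastPDB (item 7), a toy sketch of monomial sampling: the expected multiplicity of a result tuple is the expectation of its lineage polynomial, a sum of monomials over independent base-tuple multiplicities, so sampling monomials uniformly and rescaling gives an unbiased anytime estimate. The lineage and expected multiplicities below are fabricated toy values; FastPDB additionally avoids materializing the lineage upfront.

```python
# Toy sketch: estimate a result tuple's expected multiplicity by sampling
# monomials of its lineage polynomial instead of expanding it fully.
import random

# Expected multiplicities of independent base tuples (fabricated values).
E_R = {"r1": 1.0, "r2": 0.5, "r3": 2.0}
E_S = {"s1": 0.8, "s2": 1.5}
# Toy lineage of one result tuple: a sum of r * s monomials; by
# independence, E[r * s] = E[r] * E[s] for each monomial.
monomials = [(r, s) for r in E_R for s in E_S]

exact = sum(E_R[r] * E_S[s] for r, s in monomials)

random.seed(0)
k = 1000  # sample monomials uniformly with replacement, then rescale
sample = (random.choice(monomials) for _ in range(k))
estimate = len(monomials) / k * sum(E_R[r] * E_S[s] for r, s in sample)

print(f"exact: {exact:.3f}  estimate: {estimate:.3f}")
```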