NSF PAR Search | NSF Public Access Repository

Zorro: Quantifying Uncertainty in Models & Predictions Arising from Dirty Data

https://doi.org/10.1145/3722212.3725143

Hu, Kaiyuan; Zhu, Jiongli; Glavic, Boris; Salimi, Babak (June 2025, ACM)

Free, publicly-accessible full text available June 22, 2026

Stress-Testing ML Pipelines with Adversarial Data Corruption

https://doi.org/10.14778/3749646.3749721

Zhu, Jiongli; Xu, Geyang; Lorenzi, Felipe; Glavic, Boris; Salimi, Babak (July 2025, Proceedings of the VLDB Endowment)

Structured data-quality issues—such as missing values correlated with demographics, culturally biased labels, or systemic selection biases—routinely degrade the reliability of machine-learning pipelines. Regulators now increasingly demand evidence that high-stakes systems can withstand these realistic, interdependent errors, yet current robustness evaluations typically use random or overly simplistic corruptions, leaving worst-case scenarios unexplored. We introduce Savage, a causally inspired framework that (i) formally models realistic data-quality issues through dependency graphs and flexible corruption templates, and (ii) systematically discovers corruption patterns that maximally degrade a target performance metric. Savage employs a bi-level optimization approach to efficiently identify vulnerable data subpopulations and fine-tune corruption severity, treating the full ML pipeline, including preprocessing and potentially non-differentiable models, as a black box. Extensive experiments across multiple datasets and ML tasks (data cleaning, fairness-aware learning, uncertainty quantification) demonstrate that even a small fraction (around 5%) of structured corruptions identified by Savage severely impacts model performance, far exceeding random or manually crafted errors, and invalidating core assumptions of existing techniques. Thus, Savage provides a practical tool for rigorous pipeline stress-testing, a benchmark for evaluating robustness methods, and actionable guidance for designing more resilient data workflows.
Free, publicly-accessible full text available July 1, 2026

Learning from Uncertain Data: From Possible Worlds to Possible Models

Zhu, Jiongli; Feng, Su; Glavic, Boris; Salimi, Babak (February 2025, NeurIPS 2024)

We introduce an efficient method for learning linear models from uncertain data, where uncertainty is represented as a set of possible variations in the data, leading to predictive multiplicity. Our approach leverages abstract interpretation and zonotopes, a type of convex polytope, to compactly represent these dataset variations, enabling the symbolic execution of gradient descent on all possible worlds simultaneously. We develop techniques to ensure that this process converges to a fixed point and derive closed-form solutions for this fixed point. Our method provides sound over-approximations of all possible optimal models and viable prediction ranges. We demonstrate the effectiveness of our approach through theoretical and empirical analysis, highlighting its potential to reason about model and prediction uncertainty due to data quality issues in training data.
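The idea of a sound prediction range can be illustrated with a much simpler abstract domain than the one in the abstract. The sketch below assumes the set of possible linear models has already been over-approximated by a box (one interval per weight) and computes the exact range of predictions for a single test point. It uses plain intervals rather than zonotopes, and the function name `prediction_range` and the example numbers are illustrative assumptions, not the paper's algorithm.

```python
def prediction_range(w_lo, w_hi, x):
    """Range of the dot product w . x over all w with w_lo[i] <= w[i] <= w_hi[i].

    Assumption: the possible models are over-approximated by a box of
    weights. This is coarser than a zonotope, which also tracks
    correlations between weights.
    """
    lo = hi = 0.0
    for l, h, xi in zip(w_lo, w_hi, x):
        # An interval times a scalar attains its extremes at the endpoints;
        # which endpoint is the minimum depends on the sign of xi.
        lo += min(l * xi, h * xi)
        hi += max(l * xi, h * xi)
    return lo, hi


# Hypothetical box of models and test point.
w_lo = [1.0, -0.5]
w_hi = [2.0, 0.5]
x = [3.0, -2.0]
result = prediction_range(w_lo, w_hi, x)
```

Because each weight varies independently inside the box, the returned range is exact for the box but may be wider than the range over the true set of possible models.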
Free, publicly-accessible full text available February 13, 2026

OTClean: Data Cleaning for Conditional Independence Violations using Optimal Transport

https://doi.org/10.1145/3654963

Pirhadi, Alireza; Moslemi, Mohammad Hossein; Cloninger, Alexander; Milani, Mostafa; Salimi, Babak (May 2024, Proceedings of the ACM on Management of Data (SIGMOD))

Ensuring Conditional Independence (CI) constraints is pivotal for the development of fair and trustworthy machine learning models. In this paper, we introduce OTClean, a framework that harnesses optimal transport theory for data repair under CI constraints. Optimal transport theory provides a rigorous framework for measuring the discrepancy between probability distributions, thereby ensuring control over data utility. We formulate the data repair problem concerning CIs as a Quadratically Constrained Linear Program (QCLP) and propose an alternating method for its solution. However, this approach faces scalability issues due to the computational cost associated with computing optimal transport distances, such as the Wasserstein distance. To overcome these scalability challenges, we reframe our problem as a regularized optimization problem, enabling us to develop an iterative algorithm inspired by Sinkhorn's matrix scaling algorithm, which efficiently addresses high-dimensional and large-scale data. Through extensive experiments, we demonstrate the efficacy and efficiency of our proposed methods, showcasing their practical utility in real-world data cleaning and preprocessing tasks. Furthermore, we provide comparisons with traditional approaches, highlighting the superiority of our techniques in terms of preserving data utility while ensuring adherence to the desired CI constraints.
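For readers unfamiliar with the Sinkhorn matrix scaling algorithm mentioned above, the following is a minimal sketch of the textbook version for entropy-regularized optimal transport between two discrete distributions. It is not OTClean's repair algorithm, which the abstract describes only as inspired by Sinkhorn scaling; the function name, the regularization strength `reg`, the iteration count, and the toy inputs are all assumptions chosen for illustration.

```python
import math


def sinkhorn(a, b, cost, reg=0.1, n_iters=500):
    """Approximate entropy-regularized transport plan from histogram a to b.

    a, b: lists of nonnegative weights, each summing to 1.
    cost: cost[i][j] is the cost of moving mass from bin i of a to bin j of b.
    reg:  entropic regularization strength (assumed value, not from the source).
    """
    n, m = len(a), len(b)
    # Gibbs kernel derived from the cost matrix.
    K = [[math.exp(-cost[i][j] / reg) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Rescale rows so the plan's row sums match a.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        # Rescale columns so the plan's column sums match b.
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    # The plan is diag(u) * K * diag(v).
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]


# Toy example: two bins on each side, unit cost for moving across bins.
a = [0.5, 0.5]
b = [0.25, 0.75]
cost = [[0.0, 1.0], [1.0, 0.0]]
plan = sinkhorn(a, b, cost)
```

Each iteration needs only matrix-vector products with the kernel, which is what makes this style of scaling attractive compared with solving an exact transport problem. For very small `reg` the kernel entries can underflow, and practical implementations typically work in the log domain.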
Full Text Available

Causal What-If and How-To Analysis Using HypeR

https://doi.org/10.1109/ICDE55515.2023.00293

Shen, Fangzhu; Heravi, Kayvon; Gomez, Oscar; Galhotra, Sainyam; Gilad, Amir; Roy, Sudeepa; Salimi, Babak (April 2023, 2023 IEEE 39th International Conference on Data Engineering (ICDE))

Full Text Available

SAFE-PASS: Stewardship, Advocacy, Fairness and Empowerment in Privacy, Accountability, Security, and Safety for Vulnerable Groups

https://doi.org/10.1145/3589608.3593830

Ray, Indrajit; Thuraisingham, Bhavani; Vaidya, Jaideep; Mehrotra, Sharad; Atluri, Vijayalakshmi; Ray, Indrakshi; Kantarcioglu, Murat; Raskar, Ramesh; Salimi, Babak; Simske, Steve; et al (May 2023, ACM)

Full Text Available

Generating Interpretable Data-Based Explanations for Fairness Debugging using Gopher

https://doi.org/10.1145/3514221.3520170

Zhu, Jiongli; Pradhan, Romila; Glavic, Boris; Salimi, Babak (June 2022, ACM SIGMOD)

Full Text Available

Interpretable Data-Based Explanations for Fairness Debugging

https://doi.org/10.1145/3514221.3517886

Pradhan, Romila; Zhu, Jiongli; Glavic, Boris; Salimi, Babak (June 2022, ACM SIGMOD)

Full Text Available

Through the Data Management Lens: Experimental Analysis and Evaluation of Fair Classification

https://doi.org/10.1145/3514221.3517841

Islam, Maliha Tashfia; Fariha, Anna; Meliou, Alexandra; Salimi, Babak (June 2022, Proceedings of the 2022 International Conference on Management of Data (SIGMOD))

Full Text Available

HypeR: Hypothetical Reasoning With What-If and How-To Queries Using a Probabilistic Causal Approach

https://doi.org/10.1145/3514221.3526149

Galhotra, Sainyam; Gilad, Amir; Roy, Sudeepa; Salimi, Babak (January 2022, SIGMOD'22: International Conference on Management of Data)

Full Text Available
