Context. Realistic synthetic observations of theoretical source models are essential for our understanding of real observational data. Using synthetic data, one can verify the extent to which source parameters can be recovered and evaluate how various data corruption effects can be calibrated. These studies are most important when proposing observations of new sources, when characterizing the capabilities of new or upgraded instruments, and when verifying model-based theoretical predictions in a direct comparison with observational data. Aims. We present the SYnthetic Measurement creator for long Baseline Arrays (SYMBA), a novel synthetic data generation pipeline for Very Long Baseline Interferometry (VLBI) observations. SYMBA takes into account several realistic atmospheric, instrumental, and calibration effects. Methods. We used SYMBA to create synthetic observations for the Event Horizon Telescope (EHT), a millimetre VLBI array, which has recently captured the first image of a black hole shadow. After testing SYMBA with simple source and corruption models, we studied the importance of including all corruption and calibration effects, compared to the addition of thermal noise only. Using synthetic data based on two example general relativistic magnetohydrodynamics (GRMHD) model images of M 87, we performed case studies to assess the image quality that can be obtained with the current and future EHT array under different weather conditions. Results. Our synthetic observations show that the effects of atmospheric and instrumental corruptions on the measured visibilities are significant. Despite these effects, we demonstrate that the overall structure of our GRMHD source models can be recovered robustly with the EHT2017 array after performing calibration steps, which include fringe fitting, a priori amplitude and network calibration, and self-calibration. With the planned addition of new stations to the EHT array in the coming years, images could be reconstructed with higher angular resolution and dynamic range. In our case study, these improvements allowed for a distinction between a thermal and a non-thermal GRMHD model based on salient features in the reconstructed images.
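To illustrate the kind of corruption such a pipeline applies, here is a minimal sketch of the baseline thermal-noise step common to synthetic-VLBI generators, using the standard radiometer estimate sigma = sqrt(SEFD1 * SEFD2 / (2 * bandwidth * t_int)) / eta. The function name and signature are illustrative assumptions, not SYMBA's actual API.

```python
import numpy as np

def add_thermal_noise(vis, sefd1, sefd2, bandwidth, t_int, eta=0.88, rng=None):
    """Add baseline-dependent thermal noise to complex visibilities (Jy).

    Uses the standard radiometer estimate for the per-quadrature noise,
    sigma = sqrt(SEFD1 * SEFD2 / (2 * bandwidth * t_int)) / eta,
    where eta is the quantization efficiency (~0.88 for 2-bit sampling).
    """
    rng = np.random.default_rng() if rng is None else rng
    sigma = np.sqrt(sefd1 * sefd2 / (2.0 * bandwidth * t_int)) / eta
    noise = rng.normal(0.0, sigma, vis.shape) + 1j * rng.normal(0.0, sigma, vis.shape)
    return vis + noise
```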
Stress-Testing ML Pipelines with Adversarial Data Corruption

This content will become publicly available on July 1, 2026.
Structured data-quality issues, such as missing values correlated with demographics, culturally biased labels, or systemic selection biases, routinely degrade the reliability of machine-learning pipelines. Regulators increasingly demand evidence that high-stakes systems can withstand these realistic, interdependent errors, yet current robustness evaluations typically use random or overly simplistic corruptions, leaving worst-case scenarios unexplored. We introduce Savage, a causally inspired framework that (i) formally models realistic data-quality issues through dependency graphs and flexible corruption templates, and (ii) systematically discovers corruption patterns that maximally degrade a target performance metric. Savage employs a bi-level optimization approach to efficiently identify vulnerable data subpopulations and fine-tune corruption severity, treating the full ML pipeline, including preprocessing and potentially non-differentiable models, as a black box. Extensive experiments across multiple datasets and ML tasks (data cleaning, fairness-aware learning, uncertainty quantification) demonstrate that even a small fraction (around 5%) of structured corruptions identified by Savage severely impacts model performance, with an effect far exceeding that of random or manually crafted errors, and invalidates core assumptions of existing techniques. Savage thus provides a practical tool for rigorous pipeline stress-testing, a benchmark for evaluating robustness methods, and actionable guidance for designing more resilient data workflows.
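To make the search procedure concrete, the sketch below shows the shape of the stress-testing loop the abstract describes: an outer search over candidate subpopulations and corruption templates, and an inner sweep over corruption severity, with the pipeline treated as a black box. The exhaustive grid search here is a brute-force stand-in for Savage's bi-level optimizer, and all names (stress_test, pipeline_fn, corrupt) are assumptions, not the actual Savage API.

```python
import itertools
import numpy as np

def stress_test(pipeline_fn, X, y, predicates, corruptions, severities, budget=0.05):
    """Schematic black-box stress test.

    pipeline_fn(X, y) -> scalar metric (higher is better).
    predicates: functions pred(X) -> boolean mask selecting a subpopulation.
    corruptions: functions corrupt(X, mask, severity) -> corrupted copy of X.
    budget: maximum fraction of rows that may be corrupted.
    """
    worst = (None, np.inf)
    n_budget = int(budget * len(X))
    for pred, corrupt in itertools.product(predicates, corruptions):
        mask = pred(X)
        if mask.sum() == 0:
            continue
        # Respect the corruption budget: corrupt at most n_budget rows.
        idx = np.flatnonzero(mask)[:n_budget]
        sel = np.zeros(len(X), dtype=bool)
        sel[idx] = True
        for s in severities:  # inner loop: tune corruption severity
            metric = pipeline_fn(corrupt(X.copy(), sel, s), y)
            if metric < worst[1]:
                worst = ((pred, corrupt, s), metric)
    return worst  # most damaging (subpopulation, corruption, severity) found
```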
- PAR ID: 10639846
- Publisher / Repository: VLDB Endowment
- Date Published:
- Journal Name: Proceedings of the VLDB Endowment
- Volume: 18
- Issue: 11
- ISSN: 2150-8097
- Page Range / eLocation ID: 4668 to 4681
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
- Robust estimation is much more challenging in high dimensions than in one dimension: most techniques either lead to intractable optimization problems or to estimators that can tolerate only a tiny fraction of errors. Recent work in theoretical computer science has shown that, in appropriate distributional models, it is possible to robustly estimate the mean and covariance with polynomial-time algorithms that can tolerate a constant fraction of corruptions, independent of the dimension. However, the sample and time complexity of these algorithms is prohibitively large for high-dimensional applications. In this work, we address both of these issues by establishing sample complexity bounds that are optimal up to logarithmic factors, as well as giving various refinements that allow the algorithms to tolerate a much larger fraction of corruptions. Finally, we show on both synthetic and real data that our algorithms have state-of-the-art performance and make high-dimensional robust estimation a realistic possibility.
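A deliberately simplified sketch of the filtering idea behind such polynomial-time robust estimators follows, assuming inliers with roughly identity covariance; the stopping threshold and trimming quantile are heuristic choices for illustration, not the paper's tuned constants.

```python
import numpy as np

def filtered_mean(X, eps=0.1, max_iter=50):
    """Iterative filtering for robust mean estimation (simplified).

    While the empirical covariance has a suspiciously large top eigenvalue,
    trim the points that score highest along the corresponding eigenvector.
    """
    X = np.asarray(X, dtype=float).copy()
    for _ in range(max_iter):
        mu = X.mean(axis=0)
        w, V = np.linalg.eigh(np.cov(X, rowvar=False))
        if w[-1] <= 1.0 + 10.0 * eps:          # heuristic stopping rule
            break
        scores = ((X - mu) @ V[:, -1]) ** 2    # energy along worst direction
        keep = scores < np.quantile(scores, 1.0 - eps / 2.0)
        if keep.all():
            break
        X = X[keep]
    return X.mean(axis=0)
```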
- As machine learning (ML) systems become pervasive, safeguarding their security is critical. However, it has recently been demonstrated that motivated adversaries are able to mislead ML systems by perturbing test data using semantic transformations. While there exists a rich body of research providing provable robustness guarantees for ML models against ℓp-norm-bounded adversarial perturbations, guarantees against semantic perturbations remain largely underexplored. In this paper, we provide TSS -- a unified framework for certifying ML robustness against general adversarial semantic transformations. First, depending on the properties of each transformation, we divide common transformations into two categories, namely resolvable (e.g., Gaussian blur) and differentially resolvable (e.g., rotation) transformations. For the former, we propose transformation-specific randomized smoothing strategies and obtain strong robustness certification. The latter category covers transformations that involve interpolation errors, and we propose a novel approach based on stratified sampling to certify the robustness. Our framework TSS leverages these certification strategies and combines them with consistency-enhanced training to provide rigorous certification of robustness. We conduct extensive experiments on over ten types of challenging semantic transformations and show that TSS significantly outperforms the state of the art. Moreover, to the best of our knowledge, TSS is the first approach that achieves nontrivial certified robustness on the large-scale ImageNet dataset. For instance, our framework achieves 30.4% certified robust accuracy against rotation attacks (within ±30°) on ImageNet. Moreover, to consider a broader range of transformations, we show that TSS is also robust against adaptive attacks and unforeseen image corruptions such as those in CIFAR-10-C and ImageNet-C.
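As a toy illustration of transformation-specific smoothing for a resolvable transform such as Gaussian blur, the sketch below majority-votes over randomly blurred copies of an input; the certified robustness radius that TSS actually derives is omitted, and model_fn plus the parameter choices are hypothetical.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def smoothed_predict(model_fn, image, n=100, sigma_max=2.0, rng=None):
    """Majority-vote prediction over randomly blurred copies of an image.

    model_fn(image) -> class id. This mimics smoothing over a resolvable
    semantic transform (Gaussian blur) without the certification step.
    """
    rng = np.random.default_rng() if rng is None else rng
    votes = {}
    for _ in range(n):
        pred = model_fn(gaussian_filter(image, sigma=rng.uniform(0.0, sigma_max)))
        votes[pred] = votes.get(pred, 0) + 1
    return max(votes, key=votes.get)
```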
- Electroencephalography (EEG)-based systems utilize machine learning (ML) and deep learning (DL) models in various applications such as seizure detection, emotion recognition, cognitive workload estimation, and brain-computer interfaces (BCIs). However, the security and robustness of such intelligent systems under analog-domain threats have received limited attention. This paper presents the first demonstration of physical signal injection attacks on ML and DL models utilizing EEG data. We investigate how an adversary can degrade the performance of different models by non-invasively injecting signals into EEG recordings. We show that the attacks can mislead or manipulate the models and diminish the reliability of EEG-based systems. Overall, this research sheds light on the need for more trustworthy physiological-signal-based intelligent systems in the healthcare field and opens up avenues for future work.
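A minimal illustration of this analog-domain threat model, assuming the injected signal can be approximated as an additive narrowband tone; the function and its parameters are illustrative, not the paper's attack implementation.

```python
import numpy as np

def inject_interference(eeg, fs, freq=50.0, amp_uv=5.0):
    """Superimpose a narrowband interference tone on EEG data.

    eeg: array of shape (channels, samples) in microvolts; fs: sampling
    rate in Hz. Models an additive analog-domain injection at `freq` Hz.
    """
    t = np.arange(eeg.shape[-1]) / fs
    return eeg + amp_uv * np.sin(2.0 * np.pi * freq * t)
```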
- Deep Learning (DL) models achieve great successes in many domains. However, DL models increasingly face safety and robustness concerns, including noisy labeling in the training stage and feature distribution shifts in the testing stage. Previous works made significant progress in addressing these problems, but the focus has largely been on developing solutions for only one problem at a time. For example, recent work has argued for the use of tunable robust loss functions to mitigate label noise, and data augmentation (e.g., AugMix) to combat distribution shifts. As a step towards addressing both problems simultaneously, we introduce AugLoss, a simple but effective methodology that achieves robustness against both train-time noisy labeling and test-time feature distribution shifts by unifying data augmentation and robust loss functions. We conduct comprehensive experiments in varied settings of real-world dataset corruption to showcase the gains achieved by AugLoss compared to previous state-of-the-art methods. Lastly, we hope this work will open new directions for designing more robust and reliable DL models under real-world corruptions.
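As an illustration of the combination this abstract describes, the sketch below pairs an augmentation callable with a generalized cross-entropy loss, one common family of tunable noise-robust losses; it is a hypothetical PyTorch sketch, not the paper's AugLoss code, and all names are assumptions.

```python
import torch
import torch.nn.functional as F

def gce_loss(logits, targets, q=0.7):
    """Generalized cross-entropy, a tunable noise-robust loss:
    L = (1 - p_y^q) / q, interpolating between CE (q -> 0) and MAE (q = 1)."""
    p_y = F.softmax(logits, dim=1).gather(1, targets.unsqueeze(1)).squeeze(1)
    return ((1.0 - p_y.clamp_min(1e-8) ** q) / q).mean()

def train_step(model, x, y, augment, optimizer, q=0.7):
    """One training step pairing data augmentation (e.g., an AugMix-style
    transform passed in as `augment`) with the robust loss above."""
    optimizer.zero_grad()
    loss = gce_loss(model(augment(x)), y, q=q)
    loss.backward()
    optimizer.step()
    return loss.item()
```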