In this paper, we interrogate whether data quality issues track demographic group membership (based on sex, race and age) and whether automated data cleaning — of the kind commonly used in production ML systems — impacts the fairness of predictions made by these systems. To the best of our knowledge, the impact of data cleaning on fairness in downstream tasks has not been investigated in the literature. We first analyse the tuples flagged by common error detection strategies in five research datasets. We find that, while specific data quality issues, such as higher rates of missing values, are associated with membership in historically disadvantaged groups, poor data quality does not generally track demographic group membership. As a follow-up, we conduct a large-scale empirical study on the impact of automated data cleaning on fairness, involving more than 26,000 model evaluations. We observe that, while automated data cleaning is unlikely to worsen accuracy, it is more likely to worsen fairness than to improve it, especially when the cleaning techniques are not carefully chosen. Furthermore, we find that the positive or negative impact of a particular cleaning technique often depends on the choice of fairness metric and group definition (single-attribute or intersectional). We make our code and experimental results publicly available. The analysis we conducted in this paper is difficult, primarily because it requires that we think holistically about disparities in data quality, disparities in the effectiveness of data cleaning methods, and impacts of such disparities on ML model performance for different demographic groups. Such holistic analysis can and should be supported by data engineering tools, and requires substantial data engineering research. Towards this goal, we discuss open research questions, envision the development of fairness-aware data cleaning methods, and their integration into complex pipelines for ML-based decision making.
more »
« less
Automated Data Cleaning Can Hurt Fairness in Machine Learning-based Decision Making
In this paper, we interrogate whether data quality issues track demographic characteristics such as sex, race and age, and whether automated data cleaning — of the kind commonly used in production ML systems — impacts the fairness of predictions made by these systems. To the best of our knowledge, the impact of data cleaning on fairness in downstream tasks has not been investigated in the literature.We first analyze the tuples flagged by common error detection strategies in five research datasets. We find that, while specific data quality issues, such as higher rates of missing values, are associated with membership in historically disadvantaged groups, poor data quality does not generally track demographic group membership. As a follow-up, we conduct a large-scale empirical study on the impact of automated data cleaning on fairness, involving more than 26,000 model evaluations on five datasets. We observe that, while automated data cleaning has an insignificant impact on both accuracy and fairness in the majority of cases, it is more likely to worsen fairness than to improve it, especially when the cleaning techniques are not carefully chosen. This finding is both significant and worrying, given that it potentially implicates many production ML systems. We make our code and experimental results publicly available.The analysis we conducted in this paper is difficult, primarily because it requires that we think holistically about disparities in data quality, disparities in the effectiveness of data cleaning methods, and impacts of such disparities on ML model performance for different demographic groups. Such holistic analysis can and should be supported with the help of data engineering research. Towards this goal, we envision the development of fairness-aware data cleaning methods, and their integration into complex pipelines for ML-based decision making.
more »
« less
- PAR ID:
- 10437306
- Date Published:
- Journal Name:
- 2023 IEEE 39th International Conference on Data Engineering (ICDE)
- Page Range / eLocation ID:
- 3747 to 3754
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
Structured data-quality issues—such as missing values correlated with demographics, culturally biased labels, or systemic selection biases—routinely degrade the reliability of machine-learning pipelines. Regulators now increasingly demand evidence that high-stakes systems can withstand these realistic, interdependent errors, yet current robustness evaluations typically use random or overly simplistic corruptions, leaving worst-case scenarios unexplored. We introduce Savage, a causally inspired framework that (i) formally models realistic data-quality issues through dependency graphs and flexible corruption templates, and (ii) systematically discovers corruption patterns that maximally degrade a target performance metric. Savage employs a bi-level optimization approach to efficiently identify vulnerable data subpopulations and fine-tune corruption severity, treating the full ML pipeline, including preprocessing and potentially non-differentiable models, as a black box. Extensive experiments across multiple datasets and ML tasks (data cleaning, fairness-aware learning, uncertainty quantification) demonstrate that even a small fraction (around 5%) of structured corruptions identified by Savage severely impacts model performance, far exceeding random or manually crafted errors, and invalidating core assumptions of existing techniques. Thus, Savage provides a practical tool for rigorous pipeline stress-testing, a benchmark for evaluating robustness methods, and actionable guidance for designing more resilient data workflows.more » « less
-
Current algorithmic fairness tools focus on auditing completed models, neglecting the potential downstream impacts of iterative decisions about cleaning data and training machine learning models. In response, we developed Retrograde, a JupyterLab environment extension for Python that generates real-time, contextual notifications for data scientists about decisions they are making regarding protected classes, proxy variables, missing data, and demographic differences in model performance. Our novel framework uses automated code analysis to trace data provenance in JupyterLab, enabling these notifications. In a between-subjects online experiment, 51 data scientists constructed loan-decision models with Retrograde providing notifications continuously throughout the process, only at the end, or never. Retrograde’s notifications successfully nudged participants to account for missing data, avoid using protected classes as predictors, minimize demographic differences in model performance, and exhibit healthy skepticism about their models.more » « less
-
null (Ed.)The increasing impact of algorithmic decisions on people’s lives compels us to scrutinize their fairness and, in particular, the disparate impacts that ostensibly color-blind algorithms can have on different groups. Examples include credit decisioning, hiring, advertising, criminal justice, personalized medicine, and targeted policy making, where in some cases legislative or regulatory frameworks for fairness exist and define specific protected classes. In this paper we study a fundamental challenge to assessing disparate impacts in practice: protected class membership is often not observed in the data. This is particularly a problem in lending and healthcare. We consider the use of an auxiliary data set, such as the U.S. census, to construct models that predict the protected class from proxy variables, such as surname and geolocation. We show that even with such data, a variety of common disparity measures are generally unidentifiable, providing a new perspective on the documented biases of popular proxy-based methods. We provide exact characterizations of the tightest possible set of all possible true disparities that are consistent with the data (and possibly additional assumptions). We further provide optimization-based algorithms for computing and visualizing these sets and statistical tools to assess sampling uncertainty. Together, these enable reliable and robust assessments of disparities—an important tool when disparity assessment can have far-reaching policy implications. We demonstrate this in two case studies with real data: mortgage lending and personalized medicine dosing. This paper was accepted by Hamid Nazerzadeh, Guest Editor for the Special Issue on Data-Driven Prescriptive Analytics.more » « less
-
Federated Survival Analysis (FSA) is an emerging Federated Learning (FL) paradigm that enables training survival models on decentralized data while preserving privacy. However, existing FSA approaches largely overlook the potential risk of bias in predictions arising from demographic and censoring disparities across clients' datasets, which impacts the fairness and performance of federated survival models, especially for underrepresented groups. To address this gap, we introduce FairFSA, a novel FSA framework that adapts existing fair survival models to the federated setting. FairFSA jointly trains survival models using distributionally robust optimization, penalizing worst-case errors across subpopulations that exceed a specified probability threshold. Partially observed survival outcomes in clients are reconstructed with federated pseudo values (FPV) before model training to address censoring. Furthermore, we design a weight aggregation strategy by enhancing the FedAvg algorithm with a fairness-aware concordance index-based aggregation method to foster equitable performance distribution across clients. To the best of our knowledge, this is the first work to study and integrate fairness into Federated Survival Analysis. Comprehensive experiments on distributed non-IID datasets demonstrate FairFSA's superiority in fairness and accuracy over state-of-the-art FSA methods, establishing it as a robust FSA approach capable of handling censoring while providing equitable and accurate survival predictions for all subjects.more » « less
An official website of the United States government

