Machine learning models are increasingly being used in important decision-making software such as approving bank loans, recommending criminal sentencing, hiring employees, and so on. It is important to ensure the fairness of these models so that no discrimination is made between different groups in a protected attribute (e.g., race, sex, age) while decision making. Algorithms have been developed to measure unfairness and mitigate them to a certain extent. In this paper, we have focused on the empirical evaluation of fairness and mitigations on real-world machine learning models. We have created a benchmark of 40 top-rated models from Kaggle used for 5 different tasks, and then using a comprehensive set of fairness metrics evaluated their fairness. Then, we have applied 7 mitigation techniques on these models and analyzed the fairness, mitigation results, and impacts on performance. We have found that some model optimization techniques result in inducing unfairness in the models. On the other hand, although there are some fairness control mechanisms in machine learning libraries, they are not documented. The mitigation algorithm also exhibit common patterns such as mitigation in the post-processing is often costly (in terms of performance) and mitigation in the pre-processing stage is preferred in most cases. We have also presented different trade-off choices of fairness mitigation decisions. Our study suggests future research directions to reduce the gap between theoretical fairness aware algorithms and the software engineering methods to leverage them in practice.
more »
« less
Fair Preprocessing: Towards Understanding Compositional Fairness of Data Transformers in Machine Learning Pipeline
In recent years, many incidents have been reported where machine learning
models exhibited discrimination among people based on race, sex, age, etc.
Research has been conducted to measure and mitigate unfairness in machine
learning models. For a machine learning task, it is a common practice to
build a pipeline that includes an ordered set of data preprocessing stages
followed by a classifier. However, most of the research on fairness has
considered a single classifier based prediction task. What are the fairness
impacts of the preprocessing stages in machine learning pipeline? Furthermore,
studies showed that often the root cause of unfairness is ingrained in the
data itself, rather than the model. But no research has been conducted to
measure the unfairness caused by a specific transformation made in the data
preprocessing stage. In this paper, we introduced the causal method of fairness
to reason about the fairness impact of data preprocessing stages in ML pipeline.
We leveraged existing metrics to define the fairness measures of the stages.
Then we conducted a detailed fairness evaluation of the preprocessing stages
in 37 pipelines collected from three different sources. Our results show
that certain data transformers are causing the model to exhibit unfairness.
We identified a number of fairness patterns in several categories of data
transformers. Finally, we showed how the local fairness of a preprocessing
stage composes in the global fairness of the pipeline. We used the fairness
composition to choose appropriate downstream transformer that mitigates
unfairness in the machine learning pipeline.
more »
« less
- NSF-PAR ID:
- 10287421
- Date Published:
- Journal Name:
- ESEC/FSE’2021: The 29th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (Aug. 2021).
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
null (Ed.)Machine learning models are increasingly being used in important decision-making software such as approving bank loans, recommending criminal sentencing, hiring employees, and so on. It is important to ensure the fairness of these models so that no discrimination is made based on protected attribute (e.g., race, sex, age) while decision making. Algorithms have been developed to measure unfairness and mitigate them to a certain extent. In this paper, we have focused on the empirical evaluation of fairness and mitigations on real-world machine learning models. We have created a benchmark of 40 top-rated models from Kaggle used for 5 different tasks, and then using a comprehensive set of fairness metrics, evaluated their fairness. Then, we have applied 7 mitigation techniques on these models and analyzed the fairness, mitigation results, and impacts on performance. We have found that some model optimization techniques result in inducing unfairness in the models. On the other hand, although there are some fairness control mechanisms in machine learning libraries, they are not documented. The mitigation algorithm also exhibit common patterns such as mitigation in the post-processing is often costly (in terms of performance) and mitigation in the pre-processing stage is preferred in most cases. We have also presented different trade-off choices of fairness mitigation decisions. Our study suggests future research directions to reduce the gap between theoretical fairness aware algorithms and the software engineering methods to leverage them in practice.more » « less
-
Roth, A (Ed.)It is well understood that a system built from individually fair components may not itself be individually fair. In this work, we investigate individual fairness under pipeline composition. Pipelines differ from ordinary sequential or repeated composition in that individuals may drop out at any stage, and classification in subsequent stages may depend on the remaining “cohort” of individuals. As an example, a company might hire a team for a new project and at a later point promote the highest performer on the team. Unlike other repeated classification settings, where the degree of unfairness degrades gracefully over multiple fair steps, the degree of unfairness in pipelines can be arbitrary, even in a pipeline with just two stages. Guided by a panoply of real-world examples, we provide a rigorous framework for evaluating different types of fairness guarantees for pipelines. We show that naïve auditing is unable to uncover systematic unfairness and that, in order to ensure fairness, some form of dependence must exist between the design of algorithms at different stages in the pipeline. Finally, we provide constructions that permit flexibility at later stages, meaning that there is no need to lock in the entire pipeline at the time that the early stage is constructed.more » « less
-
Fairness metrics have become a useful tool to measure how fair or unfair a machine learning system may be for its stakeholders. In the context of recommender systems, previous research has explored how various stakeholders experience algorithmic fairness or unfairness, but it is also important to capture these experiences in the design of fairness metrics. Therefore, we conducted four focus groups with providers (those whose items, content, or profiles are being recommended) of two different domains: content creators and dating app users. We explored how our participants experience unfairness on their associated platforms, and worked with them to co-design fairness goals, definitions, and metrics that might capture these experiences. This work represents an important step towards designing fairness metrics with the stakeholders who will be impacted by their operationalizations. We analyze the efficacy and challenges of enacting these metrics in practice and explore how future work might benefit from this methodology.more » « less
-
Koyejo, S. ; Mohamed, S. ; Agarwal, A. ; Belgrave, D. ; Cho, K. ; Oh, A. (Ed.)Machine Learning (ML) research has focused on maximizing the accuracy of predictive tasks. ML models, however, are increasingly more complex, resource intensive, and costlier to deploy in resource-constrained environments. These issues are exacerbated for prediction tasks with sequential classification on progressively transitioned stages with “happens-before” relation between them.We argue that it is possible to “unfold” a monolithic single multi-class classifier, typically trained for all stages using all data, into a series of single-stage classifiers. Each single- stage classifier can be cascaded gradually from cheaper to more expensive binary classifiers that are trained using only the necessary data modalities or features required for that stage. UnfoldML is a cost-aware and uncertainty-based dynamic 2D prediction pipeline for multi-stage classification that enables (1) navigation of the accuracy/cost tradeoff space, (2) reducing the spatio-temporal cost of inference by orders of magnitude, and (3) early prediction on proceeding stages. UnfoldML achieves orders of magnitude better cost in clinical settings, while detecting multi- stage disease development in real time. It achieves within 0.1% accuracy from the highest-performing multi-class baseline, while saving close to 20X on spatio- temporal cost of inference and earlier (3.5hrs) disease onset prediction. We also show that UnfoldML generalizes to image classification, where it can predict different level of labels (from coarse to fine) given different level of abstractions of a image, saving close to 5X cost with as little as 0.4% accuracy reduction.more » « less