Ranking evaluation metrics play an important role in information retrieval, providing optimization objectives during development and a means of assessing performance after deployment. Recently, the fairness of rankings has been recognized as crucial, especially as automated systems are increasingly used for high-impact decisions. While numerous fairness metrics have been proposed, a comparative analysis of their interrelationships is lacking. Even for fundamental statistical parity metrics, which measure group advantage, it remains unclear whether different metrics measure the same phenomena, or when one metric may produce different results from another. To address these open questions, we formulate a conceptual framework for the analytical comparison of metrics. We prove that, under reasonable assumptions, popular metrics in the literature exhibit the same behavior and that optimizing for one optimizes for all. However, our analysis also shows that the metrics vary in the degree of unfairness measured, in particular when one group has a strong majority. Based on this analysis, we design a practical statistical test to identify whether observed data is likely to exhibit predictable group bias. We provide a set of recommendations for practitioners to guide the choice of an appropriate fairness metric.
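For intuition, here is a small illustration (ours, not the paper's) of two statistical-parity-style ranking metrics of the kind the abstract compares. Both relate the exposure a group receives under a logarithmic position discount to its share of the population; the function names and the toy data are hypothetical. The point is that both metrics flag the same direction of group advantage but report different magnitudes, which is where the abstract's "degree of unfairness" distinction bites.

```python
# A hedged sketch of two statistical-parity ranking metrics (not from the paper).
import math

def exposure(rank):
    # Standard logarithmic position discount, as in DCG-style exposure models.
    return 1.0 / math.log2(rank + 1)

def group_exposure_metrics(ranking, group):
    """ranking: list of item ids, best first; group: dict id -> 'A' or 'B'."""
    total = sum(exposure(r) for r in range(1, len(ranking) + 1))
    exp_a = sum(exposure(r) for r, item in enumerate(ranking, 1)
                if group[item] == 'A') / total
    share_a = sum(1 for g in group.values() if g == 'A') / len(group)
    diff = exp_a - share_a    # difference-based parity metric
    ratio = exp_a / share_a   # ratio-based parity metric
    return diff, ratio

ranking = [1, 2, 3, 4, 5, 6]
group = {1: 'A', 2: 'A', 3: 'A', 4: 'A', 5: 'A', 6: 'B'}  # strong A majority
print(group_exposure_metrics(ranking, group))  # same sign, different scale
```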
Testing Group Fairness via Optimal Transport Projections
@InProceedings{pmlr-v139-si21a,
  title     = {Testing Group Fairness via Optimal Transport Projections},
  author    = {Si, Nian and Murthy, Karthyek and Blanchet, Jose and Nguyen, Viet Anh},
  booktitle = {Proceedings of the 38th International Conference on Machine Learning},
  editor    = {Meila, Marina and Zhang, Tong},
  volume    = {139},
  pages     = {9649--9659},
  year      = {2021},
  publisher = {PMLR},
}
We have developed a statistical testing framework to detect whether a given machine learning classifier fails to satisfy a wide range of group fairness notions. Our test is a flexible, interpretable, and statistically rigorous tool for auditing whether exhibited biases are intrinsic to the algorithm or simply due to randomness in the data. The statistical challenges, arising from the multiple impact criteria that define group fairness and from their discontinuity in the model parameters, are conveniently tackled by projecting the empirical measure onto the set of group-fair probability models using optimal transport. The resulting test statistic is efficiently computed using linear programming, and its asymptotic distribution is explicitly obtained. The proposed framework can also be used to test composite fairness hypotheses and fairness with multiple sensitive attributes. The optimal transport formulation improves interpretability by characterizing the minimal covariate perturbations that eliminate the bias observed in the audit.
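The core construction can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: it projects an empirical distribution over (sensitive attribute, prediction) cells onto the set of demographic-parity-fair models via an optimal transport linear program. Binary attribute and prediction, a prediction-flip transport cost, and the function name `ot_fairness_distance` are all simplifying assumptions of the sketch.

```python
# A minimal sketch of the OT-projection idea (not the authors' code): find the
# cheapest transport plan that moves the empirical measure onto a distribution
# satisfying demographic parity, keeping each point's group membership fixed.
import numpy as np
from scipy.optimize import linprog

def ot_fairness_distance(p):
    """p: empirical probabilities over cells (a, yhat) in the order
    [(0,0), (0,1), (1,0), (1,1)]; returns the minimal mass of predictions
    that must be flipped to reach a demographic-parity-fair distribution."""
    cells = [(0, 0), (0, 1), (1, 0), (1, 1)]
    n = len(cells)
    # Cost of moving mass from cell i to cell j: flip the prediction (cost 1).
    cost = np.array([[abs(yi - yj) for (_, yj) in cells] for (_, yi) in cells])
    c = cost.flatten()  # LP objective over the coupling pi[i, j], row-major
    # Forbid transport across groups via zero upper bounds, so group masses
    # are preserved and the parity constraint below stays linear.
    bounds = [(0, 0) if cells[i][0] != cells[j][0] else (0, None)
              for i in range(n) for j in range(n)]
    A_eq, b_eq = [], []
    # Source marginals: row i of the coupling must sum to p[i].
    for i in range(n):
        row = np.zeros((n, n)); row[i, :] = 1
        A_eq.append(row.flatten()); b_eq.append(p[i])
    # Demographic parity on the projection q[j] = sum_i pi[i, j]:
    # q(Yhat=1 | A=0) == q(Yhat=1 | A=1), i.e. pA1*q[1] - pA0*q[3] == 0.
    pA0, pA1 = p[0] + p[1], p[2] + p[3]
    dp = np.zeros((n, n))
    dp[:, 1] = -pA1
    dp[:, 3] = pA0
    A_eq.append(dp.flatten()); b_eq.append(0.0)
    res = linprog(c, A_eq=np.array(A_eq), b_eq=b_eq, bounds=bounds)
    return res.fun

# Example: group A=1 receives positive predictions far more often (0.75 vs 0.125).
p_hat = np.array([0.35, 0.05, 0.15, 0.45])
print(ot_fairness_distance(p_hat))  # 0.25: mass whose prediction must change
```

The paper goes further than this sketch: the projection distance is scaled by the sample size to form the test statistic, and its asymptotic distribution supplies the rejection threshold.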
- Editors: Meila, Marina and Zhang, Tong
- Award ID(s): 1915967
- Publication Date: 2021
- NSF-PAR ID: 10344979
- Journal Name: Proceedings of the 38th International Conference on Machine Learning
- Volume: 139
- Issue: 2021
- Page Range or eLocation-ID: 9649--9659
- Sponsoring Org: National Science Foundation
More Like this
-
We study the fundamental problems of identity testing (goodness of fit) and closeness testing (two-sample testing) of distributions over k elements, under differential privacy. While the problems have a long history in statistics, finite-sample bounds for these problems have only been established recently. In this work, we derive upper and lower bounds on the sample complexity of both problems under (epsilon, delta)-differential privacy. We provide sample-optimal algorithms for the identity testing problem for all parameter ranges, and the first results for closeness testing. Our closeness testing bounds are optimal in the sparse regime, where the number of samples is at most k. Our upper bounds are obtained by privatizing non-private estimators for these problems; the non-private estimators are chosen to have small sensitivity. We propose a general framework for establishing lower bounds on the sample complexity of statistical tasks under differential privacy. We show a bound on differentially private algorithms in terms of a coupling between the two hypothesis classes we aim to test. By carefully constructing priors over the hypothesis classes and using Le Cam's two-point theorem, we provide a general mechanism for proving lower bounds. We believe that the framework can be…
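As a rough illustration of the "privatize a low-sensitivity estimator" recipe this abstract describes (our sketch, not the paper's algorithm; the statistic, threshold, and function name are placeholders):

```python
# Hedged sketch: an identity test made epsilon-DP via the Laplace mechanism.
import numpy as np

rng = np.random.default_rng(0)

def private_identity_test(samples, q, epsilon, threshold):
    """Test H0: samples were drawn from q over {0, ..., k-1}, under epsilon-DP."""
    k, n = len(q), len(samples)
    counts = np.bincount(samples, minlength=k)
    # Total-variation distance between the empirical distribution and q.
    stat = 0.5 * np.abs(counts / n - q).sum()
    # Changing one sample moves 1/n of mass between two bins, so the statistic
    # has sensitivity at most 1/n; Laplace noise at scale sensitivity/epsilon
    # yields epsilon-differential privacy.
    noisy_stat = stat + rng.laplace(scale=(1.0 / n) / epsilon)
    return noisy_stat > threshold  # reject H0 when the noisy distance is large

q = np.full(10, 0.1)                      # uniform null over k = 10 elements
samples = rng.integers(0, 10, size=5000)  # data genuinely drawn from q
print(private_identity_test(samples, q, epsilon=1.0, threshold=0.05))
```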
-
Particle filters avoid parametric estimates for Bayesian posterior densities, which alleviates Gaussian assumptions in nonlinear regimes. These methods, however, are more sensitive to sampling errors than Gaussian-based techniques such as ensemble Kalman filters. A recent study by the authors introduced an iterative strategy for particle filters that matches posterior moments, where iterations improve the filter's ability to draw samples from non-Gaussian posterior densities. The iterations follow from a factorization of particle weights, providing a natural framework for combining particle filters with alternative filters to mitigate the impact of sampling errors. The current study introduces a novel approach to forming an adaptive hybrid data assimilation methodology, exploiting the theoretical strengths of nonparametric and parametric filters. At each data assimilation cycle, the iterative particle filter performs a sequence of updates while the prior sample distribution is non-Gaussian, then an ensemble Kalman filter provides the final adjustment when Gaussian distributions for marginal quantities are detected. The method employs the Shapiro–Wilk test, which has outstanding power for detecting departures from normality, to determine when to transition between filter algorithms. Experiments using low-dimensional models demonstrate that the approach has significant value, especially for nonhomogeneous observation networks and unknown model process errors. Moreover, …
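A schematic of the gating idea (our simplification: the paper interleaves iterative particle-filter updates with a final EnKF adjustment, whereas this sketch simply chooses one branch per cycle for a scalar, directly observed state):

```python
# Hedged sketch: Shapiro-Wilk normality test gates between a particle-filter
# update (non-Gaussian prior) and a stochastic EnKF update (Gaussian prior).
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(0)

def hybrid_update(ensemble, obs, obs_err_var, alpha=0.05):
    """One assimilation step for a scalar, directly observed state variable."""
    _, p_value = shapiro(ensemble)
    if p_value < alpha:
        # Departure from normality detected: importance-weight and resample.
        log_w = -0.5 * (obs - ensemble) ** 2 / obs_err_var
        w = np.exp(log_w - log_w.max())
        w /= w.sum()
        idx = rng.choice(len(ensemble), size=len(ensemble), p=w)
        return ensemble[idx]  # resampled, equally weighted posterior ensemble
    # Gaussian prior detected: perturbed-observation ensemble Kalman update.
    prior_var = ensemble.var(ddof=1)
    gain = prior_var / (prior_var + obs_err_var)
    perturbed_obs = obs + rng.normal(0.0, np.sqrt(obs_err_var), ensemble.shape)
    return ensemble + gain * (perturbed_obs - ensemble)

prior = rng.exponential(size=100)  # skewed prior triggers the particle branch
posterior = hybrid_update(prior, obs=1.5, obs_err_var=0.5)
```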
-
Pantea, Casian (Ed.) Limited testing capacity for COVID-19 has hampered the pandemic response. Pooling is a testing method wherein samples from specimens (e.g., swabs) from multiple subjects are combined into a pool and screened with a single test. If the pool tests positive, then new samples from the collected specimens are individually tested, while if the pool tests negative, the subjects are classified as negative for the disease. Pooling can substantially expand COVID-19 testing capacity and throughput, without requiring additional resources. We develop a mathematical model to determine the best pool size for different risk groups, based on each group's estimated COVID-19 prevalence. Our approach takes into consideration the sensitivity and specificity of the test, and a dynamic and uncertain prevalence, and provides a robust pool size for each group. For practical relevance, we also develop a companion COVID-19 pooling design tool (through a spreadsheet). To demonstrate the potential value of pooling, we study COVID-19 screening using testing data from Iceland for the period February 28, 2020 to June 14, 2020, for subjects stratified into high- and low-risk groups. We implement the robust pooling strategy within a sequential framework, which updates pool sizes each week, for each risk group, based on the prior week's testing data. …
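For intuition, a back-of-the-envelope version of the pool-size calculation (our illustrative Dorfman-style model with assumed sensitivity and specificity values, not the paper's robust optimization):

```python
# Hedged sketch: choose the pool size minimizing worst-case tests per subject.
def tests_per_subject(n, p, sens=0.95, spec=0.99):
    """Expected tests per subject for pool size n at prevalence p: one pooled
    test per n subjects, plus n individual retests whenever the pool screens
    positive (true positives via sensitivity, false positives via specificity)."""
    p_pool_positive = sens * (1 - (1 - p) ** n) + (1 - spec) * (1 - p) ** n
    return 1.0 / n + p_pool_positive

def robust_pool_size(p_low, p_high, max_n=32):
    """Pool size minimizing worst-case tests per subject over [p_low, p_high].
    The objective is increasing in p, so checking the endpoints suffices."""
    def worst(n):
        return max(tests_per_subject(n, p) for p in (p_low, p_high))
    return min(range(2, max_n + 1), key=worst)

# Low-risk group with uncertain prevalence between 0.1% and 1%:
n_star = robust_pool_size(0.001, 0.01)
print(n_star, tests_per_subject(n_star, 0.01))  # ~0.2 tests per subject
```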
-
Much of the work in the field of group fairness addresses disparities between predefined groups based on protected features such as gender, age, and race, which must be available at training time, and often at test time as well. These approaches are static and retrospective, since algorithms designed to protect groups identified a priori cannot anticipate and protect the needs of different at-risk groups in the future. In this work, we analyze the space of solutions for worst-case fairness beyond demographics, and propose Blind Pareto Fairness (BPF), a method that leverages no-regret dynamics to recover a fair minimax classifier that reduces the worst-case risk of any potential subgroup of sufficient size, and guarantees that the remaining population receives the best possible level of service. BPF addresses fairness beyond demographics; that is, it does not rely on predefined notions of at-risk groups, neither at train nor at test time. Our experimental results show that the proposed framework improves worst-case risk in multiple standard datasets, while simultaneously providing better levels of service for the remaining population, in comparison to competing methods.
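One way to make "worst-case risk of any potential subgroup of sufficient size" concrete (our illustration, not the BPF algorithm): for a fixed classifier, the adversarial subgroup of mass at least rho collects the hardest samples, so the worst-case subgroup risk is the conditional value-at-risk (CVaR) of the per-sample losses at level rho.

```python
# Hedged sketch: worst-case subgroup risk as CVaR of per-sample losses.
import numpy as np

def worst_case_subgroup_risk(losses, rho):
    """Worst-case average loss over any subgroup containing at least a rho
    fraction of the data: the mean of the largest rho-fraction of losses."""
    losses = np.sort(losses)[::-1]        # largest losses first
    k = max(1, int(np.ceil(rho * len(losses))))
    return losses[:k].mean()              # adversary picks the hardest samples

losses = np.random.default_rng(0).exponential(size=1000)  # per-sample losses
print(worst_case_subgroup_risk(losses, rho=0.1))
```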