Title: Statistical Methods for Assessing Differences in False Non-Match Rates Across Demographic Groups
Biometric recognition is used across a variety of applications, from cyber security to border security. Recent research has focused on ensuring that biometric performance (false negatives and false positives) is fair across demographic groups. While there has been significant progress on the development of metrics, the evaluation of performance across groups, and the mitigation of any problems, there has been little work incorporating statistical variation. This is important because differences among groups can be found by chance when no difference is present; in statistics this is called a Type I error. Differences among groups may be due to sampling variation or to actual differences in system performance. Discriminating between these two sources of error is essential for good decision making about fairness and equity. This paper presents two novel statistical approaches for assessing fairness across demographic groups. The first methodology is a bootstrap-based hypothesis test, while the second is a simpler test methodology aimed at a non-statistical audience. For the latter we present the results of a simulation study on the relationship between the margin of error and factors such as the number of subjects, number of attempts, correlation between attempts, underlying false non-match rates (FNMRs), and number of groups.
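To make the idea of a bootstrap-based test concrete, the following is a minimal sketch, not the procedure from the paper. It assumes per-subject counts of false non-matches and attempts, uses the largest absolute deviation of a group FNMR from the overall FNMR as the test statistic, and builds the null distribution by resampling whole subjects from the pooled data so that correlation between a subject's attempts is preserved. The function names (`max_fnmr_deviation`, `bootstrap_fnmr_test`), the choice of statistic, and the resampling scheme are all illustrative assumptions.

```python
import numpy as np


def max_fnmr_deviation(errors, attempts, labels, n_groups):
    """Largest absolute gap between any group's FNMR and the overall FNMR."""
    overall = errors.sum() / attempts.sum()
    err_g = np.bincount(labels, weights=errors, minlength=n_groups)
    att_g = np.bincount(labels, weights=attempts, minlength=n_groups)
    return np.max(np.abs(err_g / att_g - overall))


def bootstrap_fnmr_test(errors, attempts, labels, n_boot=2000, seed=0):
    """Hypothetical subject-level bootstrap test of equal FNMRs across groups.

    errors[i]   : false non-matches observed for subject i
    attempts[i] : genuine attempts made by subject i (assumed > 0)
    labels[i]   : integer group label of subject i, assumed to cover 0..G-1
    """
    errors = np.asarray(errors, dtype=float)
    attempts = np.asarray(attempts, dtype=float)
    labels = np.asarray(labels)
    n_groups = int(labels.max()) + 1
    n_subjects = len(labels)
    rng = np.random.default_rng(seed)

    observed = max_fnmr_deviation(errors, attempts, labels, n_groups)

    exceed = 0
    for _ in range(n_boot):
        # Under the null, group membership carries no information, so each
        # group is refilled with subjects drawn from the pooled sample.
        idx = rng.integers(0, n_subjects, size=n_subjects)
        stat = max_fnmr_deviation(errors[idx], attempts[idx], labels, n_groups)
        exceed += int(stat >= observed)

    # Add-one correction keeps the estimated p-value strictly positive.
    p_value = (exceed + 1) / (n_boot + 1)
    return observed, p_value
```

A small p-value suggests the observed gap is unlikely to arise from sampling variation alone; a large one means the data are compatible with equal FNMRs, which guards against the Type I error described above.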
Award ID(s):
1650503
PAR ID:
10395558
Author(s) / Creator(s):
Date Published:
Journal Name:
2022 International Conference on Pattern Recognition (ICPR)
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Data imbalance is a fundamental challenge in applying language models to biomedical applications, particularly in ICD code prediction tasks where label and demographic distributions are uneven. While state-of-the-art language models have been increasingly adopted in biomedical tasks, few studies have systematically examined how data imbalance affects model performance and fairness across demographic groups. This study fills the gap by statistically probing the relationship between data imbalance and model performance in ICD code prediction. We analyze imbalances in a standard benchmark dataset across gender, age, ethnicity, and social determinants of health using state-of-the-art biomedical language models. By deploying diverse performance metrics and statistical analyses, we explore the influence of data imbalance on performance variations and demographic fairness. Our study shows that data imbalance significantly impacts model performance and fairness, but feature similarity to the majority class may be a more critical factor. We believe this study provides valuable insights for developing more equitable and robust language models in healthcare applications.
  2. Pollard, Tom J. (Ed.)
    Modern predictive models require large amounts of data for training and evaluation, absence of which may result in models that are specific to certain locations, populations in them and clinical practices. Yet, best practices for clinical risk prediction models have not yet considered such challenges to generalizability. Here we ask whether population- and group-level performance of mortality prediction models vary significantly when applied to hospitals or geographies different from the ones in which they are developed. Further, what characteristics of the datasets explain the performance variation? In this multi-center cross-sectional study, we analyzed electronic health records from 179 hospitals across the US with 70,126 hospitalizations from 2014 to 2015. Generalization gap, defined as difference between model performance metrics across hospitals, is computed for area under the receiver operating characteristic curve (AUC) and calibration slope. To assess model performance by the race variable, we report differences in false negative rates across groups. Data were also analyzed using a causal discovery algorithm “Fast Causal Inference” that infers paths of causal influence while identifying potential influences associated with unmeasured variables. When transferring models across hospitals, AUC at the test hospital ranged from 0.777 to 0.832 (1st-3rd quartile or IQR; median 0.801); calibration slope from 0.725 to 0.983 (IQR; median 0.853); and disparity in false negative rates from 0.046 to 0.168 (IQR; median 0.092). Distribution of all variable types (demography, vitals, and labs) differed significantly across hospitals and regions. The race variable also mediated differences in the relationship between clinical variables and mortality, by hospital/region. In conclusion, group-level performance should be assessed during generalizability checks to identify potential harms to the groups. Moreover, for developing methods to improve model performance in new environments, a better understanding and documentation of provenance of data and health processes are needed to identify and mitigate sources of variation.
  3. Increases in the deployment of machine learning algorithms for applications that deal with sensitive data have brought attention to the issue of fairness in machine learning. Many works have been devoted to applications that require different demographic groups to be treated fairly. However, algorithms that aim to satisfy inter-group fairness (also called group fairness) may inadvertently treat individuals within the same demographic group unfairly. To address this issue, this article introduces a formal definition of within-group fairness that maintains fairness among individuals from within the same group. A pre-processing framework is proposed to meet both inter- and within-group fairness criteria with little compromise in performance. The framework maps the feature vectors of members from different groups to an inter-group fair canonical domain before feeding them into a scoring function. The mapping is constructed to preserve the relative relationship between the scores obtained from the unprocessed feature vectors of individuals from the same demographic group, guaranteeing within-group fairness. This framework has been applied to the Adult, COMPAS risk assessment, and Law School datasets, and its performance is demonstrated and compared with two regularization-based methods in achieving inter-group and within-group fairness. 
  4. Breast cancer is the leading cancer affecting women globally. Despite deep learning models making significant strides in diagnosing and treating this disease, ensuring fair outcomes across diverse populations presents a challenge, particularly when certain demographic groups are underrepresented in training datasets. Addressing the fairness of AI models across varied demographic backgrounds is crucial. This study analyzes demographic representation within the publicly accessible Emory Breast Imaging Dataset (EMBED), which includes de-identified mammography and clinical data. We spotlight the data disparities among racial and ethnic groups and assess the biases in mammography image classification models trained on this dataset, specifically ResNet-50 and Swin Transformer V2. Our evaluation of classification accuracies across these groups reveals significant variations in model performance, highlighting concerns regarding the fairness of AI diagnostic tools. This paper emphasizes the imperative need for fairness in AI and suggests directions for future research aimed at increasing the inclusiveness and dependability of these technologies in healthcare settings. Code is available at: https://github.com/kuanhuang0624/EMBEDFairModels. 
  5. Recently, there has been a growing interest in developing machine learning (ML) models that can promote fairness, i.e., eliminating biased predictions towards certain populations (e.g., individuals from a specific demographic group). Most existing works learn such models based on well-designed fairness constraints in optimization. Nevertheless, in many practical ML tasks, only very few labeled data samples can be collected, which can lead to inferior fairness performance. This is because existing fairness constraints are designed to restrict the prediction disparity among different sensitive groups, but with few samples, it becomes difficult to accurately measure the disparity, thus rendering ineffective fairness optimization. In this paper, we define the fairness-aware learning task with limited training samples as the fair few-shot learning problem. To deal with this problem, we devise a novel framework that accumulates fairness-aware knowledge across different meta-training tasks and then generalizes the learned knowledge to meta-test tasks. To compensate for insufficient training samples, we propose an essential strategy to select and leverage an auxiliary set for each meta-test task. These auxiliary sets contain several labeled training samples that can enhance the model performance regarding fairness in meta-test tasks, thereby allowing for the transfer of learned useful fairness-oriented knowledge to meta-test tasks. Furthermore, we conduct extensive experiments on three real-world datasets to validate the superiority of our framework against the state-of-the-art baselines. 