

Title: Diagnosing Gender Bias in Image Recognition Systems
Image recognition systems offer the promise of learning from images at scale without requiring expert knowledge. However, past research suggests that machine learning systems often produce biased output. In this article, we evaluate potential gender biases of commercial image recognition platforms using photographs of U.S. members of Congress and a large number of Twitter images posted by these politicians. Our crowdsourced validation shows that commercial image recognition systems can produce labels that are correct and biased at the same time, because they selectively report only a subset of the many possible true labels. We find that images of women received three times as many annotations related to physical appearance. Moreover, women in images are recognized at substantially lower rates than men. We discuss how encoded biases such as these affect the visibility of women, reinforce harmful gender stereotypes, and limit the validity of the insights that can be gathered from such data.
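A minimal sketch of the kind of label audit described above, assuming a hypothetical labels.csv with one machine-generated label per row, a subject_gender column, and a hand-curated list of appearance-related terms (all names here are illustrative, not the authors' pipeline):

```python
# Minimal sketch (not the authors' code): compare how often a commercial
# tagger's labels touch on physical appearance for images of women vs. men.
# Assumes a hypothetical labels.csv with columns image_id, subject_gender, label.
import pandas as pd

# Hypothetical keyword list standing in for a validated "appearance" category.
APPEARANCE_TERMS = {"hairstyle", "smile", "beauty", "fashion", "dress", "makeup"}

labels = pd.read_csv("labels.csv")
labels["is_appearance"] = labels["label"].str.lower().isin(APPEARANCE_TERMS)

# Share of labels that are appearance-related, per image, then averaged per gender.
per_image = labels.groupby(["subject_gender", "image_id"])["is_appearance"].mean()
rate_by_gender = per_image.groupby(level="subject_gender").mean()
print(rate_by_gender)  # a ratio near 3:1 would mirror the reported gap
```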
Award ID(s):
1763642
PAR ID:
10547550
Publisher / Repository:
SAGE Publications
Date Published:
Journal Name:
Socius: Sociological Research for a Dynamic World
Volume:
6
ISSN:
2378-0231
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1.
    Automated computer vision systems have been applied in many domains, including security, law enforcement, and personal devices, but recent reports suggest that these systems may produce biased results, discriminating against people in certain demographic groups. Diagnosing and understanding the true underlying causes of model biases, however, is challenging because modern computer vision systems rely on complex black-box models whose behaviors are hard to decode. We propose to use an encoder-decoder network developed for image attribute manipulation to synthesize facial images that vary along the dimensions of gender and race while keeping other signals intact. We use these synthesized images to measure the counterfactual fairness of commercial computer vision classifiers by examining the degree to which these classifiers are affected by the gender and racial cues controlled in the images; for example, feminine faces may elicit higher scores for the concept of nurse and lower scores for STEM-related concepts.
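A minimal sketch of the counterfactual comparison described above, assuming matched lists of synthesized images that differ only in the manipulated gender cue and a hypothetical score_concept wrapper around a commercial classifier:

```python
# Minimal sketch (not the paper's pipeline): measure how a black-box classifier's
# score for a concept shifts between counterfactual image pairs that differ only
# in a manipulated gender cue. `score_concept` is a hypothetical wrapper around
# a commercial vision API; the image lists are assumed to be pre-synthesized.
from statistics import mean

def counterfactual_gap(score_concept, masc_images, fem_images, concept="nurse"):
    """Average score difference across matched (masculine, feminine) pairs."""
    gaps = [
        score_concept(fem, concept) - score_concept(masc, concept)
        for masc, fem in zip(masc_images, fem_images)
    ]
    return mean(gaps)  # > 0 means the concept is scored higher for feminine faces
```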
  2. The widespread commercial deployment of automated facial analysis systems, such as face recognition used as an authentication method, has drawn increasing scientific attention. Current machine learning algorithms can detect, recognize, and categorize face images by attributes such as age, race, and gender with relative reliability, but algorithms trained on biased data are bound to produce skewed results, leading to a significant decrease in the performance of state-of-the-art models when they are applied to images of certain gender or ethnicity groups. In this paper, we study gender bias in facial recognition with gender-balanced and gender-imbalanced training sets using five traditional machine learning algorithms. We aim to report which machine learning classifiers are inclined toward gender bias and which mitigate it. The miss-rate metric is effective for uncovering potential bias in predictions, so our study uses miss rates alongside standard metrics such as accuracy, precision, and recall to evaluate possible gender bias.
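A minimal sketch of the miss-rate comparison described above, under assumed inputs (arrays of ground-truth match labels, accept/reject decisions, and gender-group labels); this is not the paper's code:

```python
# Minimal sketch (assumed interface): miss rate per gender group for a binary
# face-verification decision. y_true is 1 when the pair is a genuine match,
# y_pred is the classifier's accept (1) / reject (0) decision.
import numpy as np

def miss_rate(y_true, y_pred):
    """Fraction of genuine matches the classifier rejects (false non-match rate)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    genuine = y_true == 1
    return float(np.mean(y_pred[genuine] == 0)) if genuine.any() else float("nan")

def miss_rate_by_group(y_true, y_pred, groups):
    """Compare miss rates across gender groups to surface potential bias."""
    y_true, y_pred, groups = map(np.asarray, (y_true, y_pred, groups))
    return {g: miss_rate(y_true[groups == g], y_pred[groups == g])
            for g in np.unique(groups)}
```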
  3.
    Existing public face image datasets are strongly biased toward Caucasian faces, and other races (e.g., Latino) are significantly underrepresented. Models trained on such datasets suffer from inconsistent classification accuracy, which limits the applicability of face analytic systems to non-White race groups. To mitigate the race bias in these datasets, we constructed a novel face image dataset containing 108,501 images that is balanced on race. We define seven race groups: White, Black, Indian, East Asian, Southeast Asian, Middle Eastern, and Latino. Images were collected from the YFCC-100M Flickr dataset and labeled with race, gender, and age groups. Evaluations were performed on existing face attribute datasets as well as novel image datasets to measure generalization performance. We find that the model trained on our dataset is substantially more accurate on novel datasets and that its accuracy is consistent across race and gender groups. We also compare several commercial computer vision APIs and report their balanced accuracy across gender, race, and age groups. Our code, data, and models are available at https://github.com/joojs/fairface.
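A minimal sketch of per-group and balanced accuracy of the kind reported above, with an assumed record layout (not the FairFace evaluation script):

```python
# Minimal sketch (assumed data layout): per-group accuracy and its unweighted
# mean across groups, mirroring how consistency across race/gender/age groups
# can be reported.
from collections import defaultdict

def accuracy_by_group(records):
    """records: iterable of (group, true_label, predicted_label) triples."""
    hits, totals = defaultdict(int), defaultdict(int)
    for group, truth, pred in records:
        totals[group] += 1
        hits[group] += int(truth == pred)
    per_group = {g: hits[g] / totals[g] for g in totals}
    balanced = sum(per_group.values()) / len(per_group)  # equal weight per group
    return per_group, balanced
```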
  4. There has been growing recognition of the crucial role that users, especially those from marginalized groups, play in uncovering harmful algorithmic biases. However, it remains unclear how users' identities and experiences might affect their ratings of harmful biases. We present an online experiment (N = 2,197) examining these factors: demographics, discrimination experiences, and social and technical knowledge. Participants were shown examples of image search results, including ones that previous literature has identified as biased against marginalized racial, gender, or sexual orientation groups. We found that participants from marginalized gender or sexual orientation groups were more likely to rate the examples as more severely harmful; belonging to a marginalized racial group did not show a similar pattern. Additional factors affecting users' ratings included discrimination experiences and having friends or family belonging to marginalized demographics. A qualitative analysis offers insights into users' bias recognition and why they see biases the way they do. We provide guidance for designing future methods to support effective user-driven auditing.
  5. Distinct scientific theories can make similar predictions. To adjudicate between theories, we must design experiments for which the theories make distinct predictions. Here we consider the problem of comparing deep neural networks as models of human visual recognition. To efficiently compare models’ ability to predict human responses, we synthesize controversial stimuli: images for which different models produce distinct responses. We applied this approach to two visual recognition tasks, handwritten digits (MNIST) and objects in small natural images (CIFAR-10). For each task, we synthesized controversial stimuli to maximize the disagreement among models which employed different architectures and recognition algorithms. Human subjects viewed hundreds of these stimuli, as well as natural examples, and judged the probability of presence of each digit/object category in each image. We quantified how accurately each model predicted the human judgments. The best-performing models were a generative analysis-by-synthesis model (based on variational autoencoders) for MNIST and a hybrid discriminative–generative joint energy model for CIFAR-10. These deep neural networks (DNNs), which model the distribution of images, performed better than purely discriminative DNNs, which learn only to map images to labels. None of the candidate models fully explained the human responses. Controversial stimuli generalize the concept of adversarial examples, obviating the need to assume a ground-truth model. Unlike natural images, controversial stimuli are not constrained to the stimulus distribution models are trained on, thus providing severe out-of-distribution tests that reveal the models’ inductive biases. Controversial stimuli therefore provide powerful probes of discrepancies between models and human perception. 
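A minimal sketch, not the authors' synthesis procedure, of one way to optimize an input so that two MNIST-style classifiers disagree; model_a and model_b are assumed pretrained PyTorch modules returning logits, and the symmetric-KL objective is an illustrative stand-in for the paper's disagreement objective:

```python
# Minimal sketch (assumptions noted above): gradient ascent on the pixels of a
# single 28x28 image to maximize the disagreement between two classifiers.
import torch
import torch.nn.functional as F

def synthesize_controversial(model_a, model_b, steps=200, lr=0.05):
    """Optimize one image so the two models' predictive distributions diverge."""
    x = torch.rand(1, 1, 28, 28, requires_grad=True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        log_pa = F.log_softmax(model_a(x), dim=1)
        log_pb = F.log_softmax(model_b(x), dim=1)
        # Symmetric KL between the two predictive distributions; maximize it
        # by minimizing its negative.
        disagreement = (F.kl_div(log_pa, log_pb.exp(), reduction="batchmean")
                        + F.kl_div(log_pb, log_pa.exp(), reduction="batchmean"))
        loss = -disagreement
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():
            x.clamp_(0.0, 1.0)  # keep pixel values in a valid image range
    return x.detach()
```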