Title: Representation Bias in Data: A Survey on Identification and Resolution Techniques
Data-driven algorithms are only as good as the data they work with, yet datasets, especially social data, often fail to represent minorities adequately. Representation bias in data can arise for various reasons, ranging from historical discrimination to selection and sampling biases in data acquisition and preparation. Given that “bias in, bias out,” one cannot expect AI-based solutions to produce equitable outcomes for societal applications without addressing issues such as representation bias. While fairness in machine learning models has been studied extensively, including in several review papers, bias in the data itself has received less attention. This article reviews the literature on identifying and resolving representation bias as a feature of a dataset, independent of how the data are consumed later. The scope of the survey is limited to structured (tabular) and unstructured (e.g., image, text, graph) data. It presents taxonomies that categorize the studied techniques along multiple design dimensions and provides a side-by-side comparison of their properties. There is still a long way to go before representation bias in data is fully addressed. The authors hope that this survey motivates researchers to approach these challenges by building on existing work within their respective domains.
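As a minimal illustration of the identification side of the survey's topic, the sketch below computes per-group representation rates in a tabular dataset against a reference population. The column name, group labels, and reference shares are hypothetical; this is not code from the paper.

```python
import pandas as pd

def representation_rates(df: pd.DataFrame, group_col: str, reference: dict) -> pd.DataFrame:
    """Compare each group's share of the dataset to its reference-population share."""
    observed = df[group_col].value_counts(normalize=True)
    rows = []
    for group, expected in reference.items():
        obs = observed.get(group, 0.0)
        rows.append({
            "group": group,
            "observed_share": obs,
            "expected_share": expected,
            # A representation rate well below 1 signals underrepresentation.
            "representation_rate": obs / expected if expected > 0 else float("nan"),
        })
    return pd.DataFrame(rows)

# Hypothetical usage: census shares serve as the reference distribution.
df = pd.DataFrame({"ethnicity": ["A"] * 80 + ["B"] * 15 + ["C"] * 5})
print(representation_rates(df, "ethnicity", {"A": 0.6, "B": 0.3, "C": 0.1}))
```

A representation rate well below 1 flags an underrepresented group that resolution techniques (e.g., targeted data acquisition or oversampling) would then address.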
Award ID(s):
2107290 2106176 1934565
PAR ID:
10438994
Author(s) / Creator(s):
Date Published:
Journal Name:
ACM Computing Surveys
Volume:
55
Issue:
13s
ISSN:
0360-0300
Page Range / eLocation ID:
1 to 39
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1.
    Weak lensing measurements suffer from well-known shear estimation biases, which can be partially corrected with the use of image simulations. In this work we present an analysis of simulated images that mimic Hubble Space Telescope/Advanced Camera for Surveys observations of high-redshift galaxy clusters, including cluster-specific issues such as non-weak shear and increased blending. Our synthetic galaxies were generated to have observed properties similar to those of the background-selected source samples studied in the real images. First, we used simulations with galaxies placed on a grid to determine a revised signal-to-noise-dependent (S/N_KSB) correction for multiplicative shear measurement bias, and to quantify the sensitivity of our KSB+ bias calibration to mismatches of galaxy or PSF properties between the real data and the simulations. Next, we studied the impact of increased blending and light contamination from cluster and foreground galaxies, finding it to be negligible for high-redshift (z > 0.7) clusters, whereas shear measurements can be affected at the ∼1% level for lower-redshift clusters given their brighter member galaxies. Finally, we studied the impact of fainter neighbours and selection bias using a set of simulated images that mimic the positions and magnitudes of galaxies in Cosmic Assembly Near-IR Deep Extragalactic Legacy Survey (CANDELS) data, thereby including realistic clustering. While the initial SExtractor object detection causes a multiplicative shear selection bias of −0.028 ± 0.002, this is reduced to −0.016 ± 0.002 by further cuts applied in our pipeline. Given the limited depth of the CANDELS data, we compared our CANDELS-based estimate for the impact of faint neighbours on the multiplicative shear measurement bias to a grid-based analysis, to which we added clustered galaxies to even fainter magnitudes based on Hubble Ultra Deep Field data, yielding a refined estimate of ∼−0.013. Our sensitivity analysis suggests that our pipeline is calibrated to an accuracy of ∼0.015 once all corrections are applied, which is fully sufficient for current and near-future weak lensing studies of high-redshift clusters. As an application, we used it for a refined analysis of three highly relaxed clusters from the South Pole Telescope Sunyaev-Zeldovich survey, where we now included measurements down to the cluster core (r > 200 kpc) as enabled by our work. Compared to previously employed scales (r > 500 kpc), this tightens the cluster mass constraints by a factor of 1.38 on average.
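    To make the calibration idea concrete, here is a minimal sketch of applying a signal-to-noise-dependent multiplicative shear bias correction under the standard model g_meas = (1 + m) · g_true. The power-law form and coefficients for m(S/N) are hypothetical placeholders, not the paper's fitted values.

```python
import numpy as np

def corrected_shear(g_measured: np.ndarray, snr: np.ndarray,
                    a: float = -0.15, b: float = 1.0) -> np.ndarray:
    """Undo an S/N-dependent multiplicative shear bias.

    Assumes the standard calibration model g_meas = (1 + m) * g_true with a
    power-law bias m(S/N) = a * (S/N)**(-b); the coefficients here are
    hypothetical placeholders, not fitted values from the paper.
    """
    m = a * np.power(snr, -b)
    return g_measured / (1.0 + m)

# Hypothetical usage: faint (low-S/N) galaxies receive the largest correction.
snr = np.array([10.0, 20.0, 50.0])
g = np.array([0.02, 0.03, 0.01])
print(corrected_shear(g, snr))
```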
  2. Biased AI models result in unfair decisions. In response, a number of algorithmic solutions have been engineered to mitigate bias, among which the Synthetic Minority Oversampling Technique (SMOTE) has been studied to some extent. Although SMOTE and its variants have great potential to help improve fairness, there is little theoretical justification for their success, and formal error and fairness bounds have not been clearly established. This paper attempts to address both issues. We prove and demonstrate that synthetic data generated by oversampling underrepresented groups can mitigate algorithmic bias in AI models while keeping the predictive errors bounded. We further compare this technique to existing state-of-the-art fair AI techniques on five datasets using a variety of fairness metrics. We show that this approach can effectively improve fairness even when there is a significant amount of label and selection bias, regardless of the baseline AI algorithm.
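    A minimal sketch of the core SMOTE interpolation step, applied here to an underrepresented group rather than a minority class label; the group, its size, and all parameters are hypothetical. Production implementations (e.g., the imbalanced-learn library) add class-aware bookkeeping and edge-case handling.

```python
import numpy as np

def smote_oversample(X_group: np.ndarray, n_new: int, k: int = 5, rng=None) -> np.ndarray:
    """SMOTE-style interpolation: each synthetic point lies on the segment
    between a real group member and one of its k nearest group neighbours."""
    rng = rng or np.random.default_rng(0)
    n = len(X_group)
    # Pairwise distances within the underrepresented group.
    d = np.linalg.norm(X_group[:, None, :] - X_group[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                 # exclude self-matches
    neighbours = np.argsort(d, axis=1)[:, :k]   # k nearest neighbours per point
    base = rng.integers(0, n, size=n_new)       # random base samples
    nbr = neighbours[base, rng.integers(0, k, size=n_new)]
    lam = rng.random((n_new, 1))                # random interpolation weights
    return X_group[base] + lam * (X_group[nbr] - X_group[base])

# Hypothetical usage: double the size of a 20-sample underrepresented group.
X_minority = np.random.default_rng(1).normal(size=(20, 4))
X_synth = smote_oversample(X_minority, n_new=20)
print(X_synth.shape)  # (20, 4)
```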
  3. Graph representation learning has attracted tremendous attention due to its remarkable performance in many real-world applications. However, prevailing supervised graph representation learning models for specific tasks often suffer from label sparsity, as data labeling is time- and resource-consuming. In light of this, few-shot learning on graphs (FSLG), which combines the strengths of graph representation learning and few-shot learning, has been proposed to tackle the performance degradation caused by limited annotated data. There have been many recent studies of FSLG. In this paper, we comprehensively survey this work as a series of methods and applications. Specifically, we first introduce the challenges and foundations of FSLG, then categorize and summarize existing work on FSLG in terms of three major graph mining tasks at different granularity levels, i.e., node, edge, and graph. Finally, we share our thoughts on some future research directions for FSLG. The authors of this survey have contributed significantly to the AI literature on FSLG over the last few years.
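    For readers unfamiliar with the few-shot setup, the sketch below builds an N-way K-shot episode from node labels, the basic sampling unit in most FSLG training loops. The label array, episode sizes, and the omission of graph structure are simplifications of this illustration, not details from the survey.

```python
import numpy as np

def sample_episode(labels: np.ndarray, n_way: int = 3, k_shot: int = 2,
                   q_query: int = 5, rng=None):
    """Sample an N-way K-shot node-classification episode: N classes,
    K labelled support nodes and Q query nodes per class."""
    rng = rng or np.random.default_rng(0)
    classes = rng.choice(np.unique(labels), size=n_way, replace=False)
    support, query = [], []
    for c in classes:
        nodes = rng.permutation(np.flatnonzero(labels == c))
        support.extend(nodes[:k_shot])                  # few labelled examples
        query.extend(nodes[k_shot:k_shot + q_query])    # held-out evaluation nodes
    return np.array(support), np.array(query), classes

# Hypothetical usage: 100 nodes over 5 classes; a real FSLG model would also
# pass the graph's edges to its encoder, which this sketch omits.
labels = np.random.default_rng(1).integers(0, 5, size=100)
s, q, c = sample_episode(labels)
print(len(s), len(q), c)
```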
  4. Ranking algorithms in online platforms serve not only users on the demand side but also items on the supply side. While ranking has traditionally presented items in an order that maximizes their utility to users, the uneven interactions that different items receive as a result of such a ranking can pose item-fairness concerns. Moreover, interaction is affected by various forms of bias, two of which have received considerable attention: position bias and selection bias. Position bias occurs because items in lower-ranked positions are less likely to be observed. Selection bias occurs because interaction is impossible with items below an arbitrary cutoff position chosen by the front-end application at deployment time (i.e., showing only the top-k items). A third, less-studied form of bias, trust bias, is equally important: it makes interaction dependent on rank even after observation by influencing the item's perceived relevance. To capture interaction disparity in the presence of all three biases, in this paper we introduce a flexible fairness metric. Using this metric, we develop a post-processing algorithm that optimizes fairness in ranking through greedy exploration and allows a tradeoff between fairness and utility. Our algorithm outperforms state-of-the-art fair ranking algorithms on several datasets.
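    The sketch below illustrates the three biases and a greedy fairness-utility tradeoff in simplified form; the exposure model, the deficit-based fairness term, and the parameter lam are this illustration's assumptions, not the paper's exact metric or algorithm.

```python
import numpy as np

def exposure(rank: int, k: int, trust: float = 0.1) -> float:
    """Interaction propensity at a 1-indexed rank under three biases:
    position bias (log discount), selection bias (zero past cutoff k),
    and trust bias (perceived relevance decays with rank)."""
    if rank > k:                                    # selection bias
        return 0.0
    position = 1.0 / np.log2(rank + 1)              # position bias
    return position * (1.0 - trust) ** (rank - 1)   # trust bias

def greedy_fair_rank(relevance: np.ndarray, deficit: np.ndarray,
                     k: int, lam: float = 0.5) -> list:
    """Greedily fill each slot, trading utility (relevance) against fairness
    (serving items with the largest accumulated exposure deficit)."""
    remaining = set(range(len(relevance)))
    ranking = []
    for rank in range(1, k + 1):
        score = {i: lam * relevance[i] + (1 - lam) * deficit[i] for i in remaining}
        best = max(score, key=score.get)
        deficit[best] -= exposure(rank, k)  # this slot pays down the deficit
        remaining.remove(best)
        ranking.append(best)
    return ranking

# Hypothetical usage: item 2 is slightly less relevant but owed exposure,
# so it is promoted ahead of the most relevant item.
rel = np.array([0.9, 0.8, 0.7, 0.2])
print(greedy_fair_rank(rel, deficit=np.array([0.0, 0.0, 1.0, 0.0]), k=3))
```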
  5.
    Sentiment detection is an important building block for multiple information retrieval tasks, such as product recommendation and the detection of cyberbullying, fake news, and misinformation. Unsurprisingly, multiple commercial APIs, each with a different level of accuracy and fairness, are now publicly available for sentiment detection, and users can easily incorporate them into their applications. While combining inputs from multiple modalities or black-box models to increase accuracy is commonly studied in the multimedia computing literature, there has been little work on combining modalities to increase the fairness of the resulting decision. In this work, we audit multiple commercial sentiment detection APIs for gender bias in a two-actor news-headline setting and report the level of bias observed. Next, we propose a "Flexible Fair Regression" approach, which ensures satisfactory accuracy and fairness by jointly learning from multiple black-box models. The results pave the way for fair yet accurate sentiment detectors for multiple applications.
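    One simplified reading of jointly learning a fair combination of black-box models is a weighted average of API scores fit with a group-gap penalty; the loss, the penalty form, and the synthetic data below are this sketch's assumptions, not the paper's Flexible Fair Regression formulation.

```python
import numpy as np

def fit_fair_combination(S: np.ndarray, y: np.ndarray, g: np.ndarray,
                         lam: float = 1.0, lr: float = 0.05,
                         steps: int = 2000) -> np.ndarray:
    """Learn weights over black-box API scores S (n_samples x n_apis) by
    gradient descent on MSE plus a penalty on the squared mean-score gap
    between the two groups g (0/1)."""
    w = np.full(S.shape[1], 1.0 / S.shape[1])
    mu0, mu1 = S[g == 0].mean(axis=0), S[g == 1].mean(axis=0)
    for _ in range(steps):
        pred = S @ w
        grad_mse = 2 * S.T @ (pred - y) / len(y)
        gap = (mu1 - mu0) @ w
        grad_fair = 2 * gap * (mu1 - mu0)   # gradient of gap**2
        w -= lr * (grad_mse + lam * grad_fair)
    return w

# Hypothetical usage: three APIs, one of which is biased by actor gender;
# the fairness penalty should push its weight down.
rng = np.random.default_rng(0)
g = rng.integers(0, 2, size=200)
y = rng.normal(size=200)
S = np.column_stack([y + rng.normal(0, 0.3, 200),
                     y + 0.8 * g + rng.normal(0, 0.3, 200),
                     y + rng.normal(0, 0.5, 200)])
print(np.round(fit_fair_combination(S, y, g), 2))
```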