Title: Auditing for Bias in Ad Delivery Using Inferred Demographic Attributes
Auditing social-media algorithms has become a focus of public-interest research and policymaking to ensure their fairness across demographic groups such as race, age, and gender in consequential domains such as the presentation of employment opportunities. However, such demographic attributes are often unavailable to auditors and platforms. When demographic data is unavailable, auditors commonly infer the attributes from other available information. In this work, we study the effects of inference error on auditing for bias in one prominent application: black-box audit of ad delivery using paired ads. We show that inference error, if not accounted for, causes audits to miss skew that exists. We then propose a way to mitigate the inference error when evaluating skew in ad delivery algorithms. Our method works by adjusting for expected error due to demographic inference, and it makes skew detection more sensitive when attributes must be inferred. Because inference is increasingly used for auditing, our results provide an important addition to the auditing toolbox to promote correct audits of ad delivery algorithms for bias. While the impact of attribute inference on accuracy has been studied in other domains, our work is the first to consider it for black-box evaluation of ad delivery bias, when only aggregate data is available to the auditor.
Award ID(s):
2319409 1956435 2344925
PAR ID:
10632613
Publisher / Repository:
ACM
ISBN:
9798400714825
Page Range / eLocation ID:
2640 to 2656
Format(s):
Medium: X
Location:
Athens Greece
Sponsoring Org:
National Science Foundation
More Like this
  1. Auditing is critical to ensuring the fairness and reliability of decision-making systems. However, auditing a black-box system for bias can be challenging due to the lack of transparency in the model’s internal workings. In many web applications, such as Yelp, it is challenging, if not impossible, to manipulate their inputs systematically to identify bias in the output. Yelp connects users and businesses, where users identify new businesses and simultaneously express their experiences through reviews. Yelp recommendation software moderates user-provided content by categorizing it into recommended and not-recommended sections. The recommended reviews, among other attributes, are used by Yelp’s ranking algorithm to rank businesses in a neighborhood. Due to Yelp’s substantial popularity and its high impact on local businesses’ success, understanding the bias of its algorithms is crucial. This data-driven study, for the first time, investigates the bias of Yelp’s business ranking and review recommendation system. We examine three hypotheses to assess if Yelp’s recommendation software shows bias against reviews of less established users with fewer friends and reviews and if Yelp’s business ranking algorithm shows bias against restaurants located in specific neighborhoods, particularly in hotspot regions, with specific demographic compositions. Our findings show that reviews of less-established users are disproportionately categorized as not-recommended. We also find a positive association between restaurants’ location in hotspot regions and their average exposure. Furthermore, we observed some cases of severe disparity bias in cities where the hotspots are in neighborhoods with less demographic diversity or higher affluence and education levels.
  2. We conduct an independent, third-party audit for bias of LinkedIn's Talent Search ranking system, focusing on potential ranking bias across two attributes: gender and race. To do so, we first construct a dataset of rankings produced by the system, collecting extensive Talent Search results across a diverse set of occupational queries. We then develop a robust labeling pipeline that infers the two demographic attributes of interest for the returned users. To evaluate potential biases in the collected dataset of real-world rankings, we utilize two exposure disparity metrics: deviation from group proportions and MinSkew@k. Our analysis reveals an under-representation of minority groups in early ranks across many queries. We further examine potential causes of this disparity, and discuss why they may be difficult or, in some cases, impossible to fully eliminate among the early ranks of queries. Beyond static metrics, we also investigate the concept of subgroup fairness over time, highlighting temporal disparities in exposure and retention, which are often more difficult to audit for in practice. In employer recruiting platforms such as LinkedIn Talent Search, the persistence of a particular candidate over multiple days in the ranking can directly impact the probability that the given candidate is selected for opportunities. Our analysis reveals demographic disparities in this temporal stability, with some groups experiencing greater volatility in their ranked positions than others. We contextualize all our findings alongside LinkedIn’s published self-audits of its Talent Search system and reflect on the methodological constraints of a black-box external evaluation, including limited observability and noisy demographic inference. Our work contributes empirical insights and practical guidance for conducting third-party audits of modern socio-technical systems which go beyond the well-studied and standard algorithmic fairness guarantees of predictors.
  3. The 2022 settlement between Meta and the U.S. Department of Justice to resolve allegations of discriminatory advertising resulted in a first-of-its-kind change to Meta's ad delivery system aimed at addressing algorithmic discrimination in its housing ad delivery. In this work, we explore direct and indirect effects of both the settlement's choice of terms and the Variance Reduction System (VRS) implemented by Meta on the actual reduction in discrimination. We first show that the settlement terms allow for an implementation that does not meaningfully improve access to opportunities for individuals. The settlement measures impact of ad delivery in terms of impressions, instead of unique individuals reached by an ad; it allows the platform to level down access, reducing disparities by decreasing the overall access to opportunities; and it allows the platform to selectively apply VRS to only small advertisers. We then conduct experiments to evaluate VRS with real-world ads, and show that while VRS does reduce variance, it also raises advertiser costs (measured per-individuals-reached), therefore decreasing user exposure to opportunity ads for a given ad budget. VRS thus passes the cost of decreasing variance to advertisers. Finally, we explore an alternative approach to achieve the settlement goals that is significantly more intuitive and transparent than VRS. We show our approach outperforms VRS by both increasing ad exposure for users from all groups and reducing cost to advertisers, thus demonstrating that the increase in cost to advertisers when implementing the settlement is not inevitable. Our methodologies use a black-box approach that relies on capabilities available to any regular advertiser, rather than on privileged access to data, allowing others to reproduce or extend our work.
  4. Modern data analysis and statistical learning are characterized by two defining features: complex data structures and black-box algorithms. The complexity of data structures arises from advanced data collection technologies and data-sharing infrastructures, such as imaging, remote sensing, wearable devices, and genomic sequencing. In parallel, black-box algorithms, particularly those stemming from advances in deep neural networks, have demonstrated remarkable success on modern datasets. This confluence of complex data and opaque models introduces new challenges for uncertainty quantification and statistical inference, a problem we refer to as "black-box inference". The difficulty of black-box inference lies in the absence of traditional parametric or nonparametric modeling assumptions, as well as the intractability of the algorithmic behavior underlying many modern estimators. These factors make it difficult to precisely characterize the sampling distribution of estimation errors. A common approach to address this issue is post-hoc randomization, which includes permutation, resampling, sample splitting, cross-validation, and noise injection. When combined with mild assumptions, such as exchangeability in the data-generating process, these methods can yield valid inference and uncertainty quantification. Post-hoc randomization methods have a rich history, ranging from classical techniques like permutation tests, the jackknife, and the bootstrap, to more recent developments such as conformal inference. These approaches typically require minimal knowledge about the underlying data distribution or the inner workings of the estimation procedure. While originally designed for varied purposes, many of these techniques rely, either implicitly or explicitly, on the assumption that the estimation procedure behaves similarly under small perturbations to the data. This idea, now formalized under the concept of stability, has become a foundational principle in modern data science. Over the past few decades, stability has emerged as a central research focus in both statistics and machine learning, playing critical roles in areas such as generalization error, data privacy, and adaptive inference. In this article, we investigate one of the most widely used resampling techniques for model comparison and evaluation, cross-validation (CV), through the lens of stability. We begin by reviewing recent theoretical developments in CV for generalization error estimation and model selection under stability assumptions. We then explore more refined results concerning uncertainty quantification for CV-based risk estimates. By integrating these research directions, we uncover new theoretical insights and methodological tools. Finally, we illustrate their utility across both classical and contemporary topics, including model selection, selective inference, and conformal prediction.
  5. Discussion of the “right to an explanation” has been increasingly relevant because of its potential utility for auditing automated decision systems, as well as for making objections to such decisions. However, most existing work on explanations focuses on collaborative environments, where designers are motivated to implement good-faith explanations that reveal potential weaknesses of a decision system. This motivation may not hold in an auditing environment. Thus, we ask: how much could explanations be used maliciously to defend a decision system? In this paper, we demonstrate how a black-box explanation system developed to defend a black-box decision system could manipulate decision recipients or auditors into accepting an intentionally discriminatory decision model. In a case-by-case scenario where decision recipients are unable to share their cases and explanations, we find that most individual decision recipients could receive a verifiable justification, even if the decision system is intentionally discriminatory. In a system-wide scenario where every decision is shared, we find that while justifications frequently contradict each other, there is no intuitive threshold to determine if these contradictions are because of malicious justifications or because of simplicity requirements of these justifications conflicting with model behavior. We end with discussion of how system-wide metrics may be more useful than explanation systems for evaluating overall decision fairness, while explanations could be useful outside of fairness auditing. 