Title: Uncertainty Quantification for Fairness in Two-Stage Recommender Systems
Many large-scale recommender systems consist of two stages. The first stage efficiently screens the complete pool of items for a small subset of promising candidates, from which the second-stage model curates the final recommendations. In this paper, we investigate how to ensure group fairness to the items in this two-stage architecture. In particular, we find that existing first-stage recommenders might select an irrecoverably unfair set of candidates such that there is no hope for the second-stage recommender to deliver fair recommendations. To this end, motivated by recent advances in uncertainty quantification, we propose two threshold-policy selection rules that can provide distribution-free and finite-sample guarantees on fairness in first-stage recommenders. More concretely, given any relevance model of queries and items and a point-wise lower confidence bound on the expected number of relevant items for each threshold-policy, the two rules find near-optimal sets of candidates that contain enough relevant items in expectation from each group of items. To instantiate the rules, we demonstrate how to derive such confidence bounds from potentially partial and biased user feedback data, which are abundant in many large-scale recommender systems. In addition, we provide both finite-sample and asymptotic analyses of how close the two threshold selection rules are to the optimal thresholds. Beyond this theoretical analysis, we show empirically that these two rules can consistently select enough relevant items from each group while minimizing the size of the candidate sets for a wide range of settings.
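To make the selection rules concrete, here is a minimal, hypothetical sketch (not the paper's implementation): given, for each item group, a point-wise lower confidence bound (LCB) on the expected number of relevant items retrieved at each score threshold, the rule returns the largest threshold, and hence the smallest candidate set, whose LCB still meets that group's requirement. The Hoeffding-style bound below is purely illustrative and stands in for the bounds the paper derives; all names and parameters are assumptions.

```python
# Illustrative sketch only: a threshold-policy selection rule that, for each
# item group, picks the largest score threshold (smallest candidate set) whose
# lower confidence bound (LCB) on the expected number of relevant items still
# meets the group's requirement alpha_g. Names and the bound are hypothetical.
import numpy as np

def hoeffding_lcb(relevance_labels, item_scores, delta=0.05):
    """Toy Hoeffding-style LCB on E[# relevant items scored >= threshold]."""
    relevance_labels = np.asarray(relevance_labels, dtype=float)
    item_scores = np.asarray(item_scores, dtype=float)

    def lcb(threshold):
        mask = item_scores >= threshold
        m = int(mask.sum())
        if m == 0:
            return 0.0
        mean_rel = relevance_labels[mask].mean()
        slack = np.sqrt(np.log(1.0 / delta) / (2.0 * m))
        return m * max(mean_rel - slack, 0.0)

    return lcb

def select_thresholds(thresholds_by_group, lcb_by_group, alpha_by_group):
    """For each group, return the largest threshold whose LCB >= alpha_g.

    Assumes the LCB is non-decreasing as the threshold is lowered (adding
    items cannot remove relevant items), so the first qualifying threshold
    when scanning from strictest to loosest yields the smallest qualifying set.
    """
    chosen = {}
    for group, thresholds in thresholds_by_group.items():
        chosen[group] = None
        for t in sorted(thresholds, reverse=True):  # strictest first
            if lcb_by_group[group](t) >= alpha_by_group[group]:
                chosen[group] = t
                break
    return chosen

# Toy usage: two groups with synthetic scores and binary relevance feedback.
rng = np.random.default_rng(0)
scores = {g: rng.random(500) for g in ("A", "B")}
labels = {g: (rng.random(500) < scores[g]).astype(int) for g in ("A", "B")}
lcbs = {g: hoeffding_lcb(labels[g], scores[g]) for g in ("A", "B")}
grid = {g: np.linspace(0.0, 1.0, 21) for g in ("A", "B")}
print(select_thresholds(grid, lcbs, {"A": 20, "B": 20}))
```

In the paper, the corresponding bounds are derived from potentially partial and biased user feedback; the toy Hoeffding bound above ignores that and simply treats the logged labels as unbiased samples.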
Award ID(s):
2008139
PAR ID:
10466320
Author(s) / Creator(s):
Date Published:
Journal Name:
ACM Conference on Web Search and Data Mining (WSDM)
Page Range / eLocation ID:
940 to 948
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. The strategy for selecting candidate sets — the set of items that the recommendation system is expected to rank for each user — is an important decision in carrying out an offline top-N recommender system evaluation. The set of candidates is composed of the union of the user's test items and an arbitrary number of non-relevant items that we refer to as decoys. Previous studies have aimed to understand the effect of different candidate set sizes and selection strategies on evaluation. In this paper, we extend this knowledge by studying the specific interaction of candidate set selection strategies with popularity bias, and use simulation to assess whether sampled candidate sets result in metric estimates that are less biased with respect to the true metric values under complete data, which is typically unavailable in ordinary experiments.
  2. Recently there has been a growing interest in fairness-aware recommender systems, including fairness in providing consistent performance across different users or groups of users. A recommender system could be considered unfair if the recommendations do not fairly represent the tastes of a certain group of users while other groups receive recommendations that are consistent with their preferences. In this paper, we use a metric called miscalibration to measure how responsive a recommendation algorithm is to users' true preferences, and we consider how various algorithms may result in different degrees of miscalibration for different users. In particular, we conjecture that popularity bias, a well-known phenomenon in recommendation, is one important factor leading to miscalibration. Our experimental results using two real-world datasets show that there is a connection between how different user groups are affected by algorithmic popularity bias and their level of interest in popular items. Moreover, we show that the more a group is affected by algorithmic popularity bias, the more its recommendations are miscalibrated (an illustrative sketch of one common miscalibration measure appears after this list).
  3. While the algorithms used by music streaming services to provide recommendations have often been studied in offline, isolated settings, little research has been conducted studying the nature of their recommendations within the full context of the system itself. This work seeks to compare the level of diversity of the real-world recommendations provided by five of the most popular music streaming services, given the same lists of low-, medium- and high-diversity input items. We contextualized our results by examining the reviews for each of the five services on the Google Play Store, focusing on users’ perception of their recommender systems and the diversity of their output. We found that YouTube Music offered the most diverse recommendations, but the perception of the recommenders was similar across the five services. Consumers had multiple perspectives on the recommendations provided by their music service—ranging from not wanting any recommendations to applauding the algorithm for helping them find new music. 
  4. Personalized news experiences powered by recommender systems permeate our lives and have the potential to influence not only our opinions, but also our decisions. At the same time, the content and viewpoints contained within news recommendations are driven by multiple factors, including both personalization and editorial selection. Explanations could help users gain a better understanding of the factors contributing to the news items selected for them to read. Indeed, recent works show that explanations are essential for users of news recommenders to understand their consumption preferences and set intentions in line with their goals, such as goals for knowledge development and increased diversity of content or viewpoints. We give examples of such works on explanation and interactive interface interventions that have been effective in influencing readers' consumption intentions and behaviors in news recommendations. However, the state-of-the-art in news recommender systems currently falls short in terms of evaluating such interventions in live systems, limiting our ability to measure their true impact on user behavior and opinions. To help understand the true benefit of these interfaces, we therefore call for improving the realism of studies of news recommendation.
  5. Offline evaluation protocols for recommender systems are intended to estimate users' satisfaction with recommendations using static data from prior user interactions. These evaluations allow researchers and production developers to carry out first-pass estimates of the likely performance of a new system and weed out bad ideas before presenting them to users. However, offline evaluations cannot accurately assess novel, relevant recommendations, because the most novel recommendations are items that were previously unknown to the user; such items are missing from the historical data, so they cannot be judged as relevant. A breakthrough that reliably produces novel, relevant recommendations would score poorly with current offline evaluation techniques. While the existence of this problem is noted in the literature, its extent is not well-understood. We present a simulation study to estimate the error that such missing data causes in commonly-used evaluation metrics in order to assess its prevalence and impact. We find that missing data in the rating or observation process causes the evaluation protocol to systematically mis-estimate metric values, and in some cases erroneously determine that a popularity-based recommender outperforms even a perfect personalized recommender. Substantial breakthroughs in recommendation quality, therefore, will be difficult to assess with existing offline techniques.
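As referenced in item 2 above, one common way to operationalize miscalibration, in the spirit of the calibrated-recommendation literature, is the divergence between the distribution of item categories (e.g., genres) in a user's interaction history and in their recommendation list. The sketch below is a minimal, hypothetical illustration; the function names, data layout, and smoothing constant are assumptions, not details from the cited paper.

```python
# Illustrative sketch of a miscalibration measure: KL divergence between the
# category distribution of a user's profile and of their recommendations.
import numpy as np

def category_distribution(items, item_categories, categories, smoothing=1e-6):
    """Distribution over categories induced by a list of items."""
    index = {c: i for i, c in enumerate(categories)}
    counts = np.full(len(categories), smoothing)
    for item in items:
        cats = item_categories[item]
        for c in cats:
            counts[index[c]] += 1.0 / len(cats)  # split weight across an item's categories
    return counts / counts.sum()

def miscalibration(profile_items, recommended_items, item_categories, categories):
    """KL(p || q); 0 means the recommendations mirror the profile's category mix."""
    p = category_distribution(profile_items, item_categories, categories)
    q = category_distribution(recommended_items, item_categories, categories)
    return float(np.sum(p * np.log(p / q)))

# Toy usage with hypothetical genre labels.
item_categories = {1: ["pop"], 2: ["rock"], 3: ["pop", "rock"], 4: ["jazz"]}
print(miscalibration([1, 1, 2, 3], [1, 4], item_categories, ["pop", "rock", "jazz"]))
```

Comparing such a quantity across user groups (for example, groups more or less affected by popularity bias) is one way to examine the connection described in item 2.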