NSF PAR Search | NSF Public Access Repository

The multisided complexity of fairness in recommender systems

https://doi.org/10.1002/aaai.12054

Sonboli, Nasim; Burke, Robin; Ekstrand, Michael; Mehrotra, Rishabh (June 2022, AI Magazine)

Recommender systems are poised at the interface between stakeholders: for example, job applicants and employers in the case of recommendations of employment listings, or artists and listeners in the case of music recommendation. In such multisided platforms, recommender systems play a key role in enabling discovery of products and information at large scales. However, as they have become more and more pervasive in society, the equitable distribution of their benefits and harms has been increasingly under scrutiny, as is the case with machine learning generally. While recommender systems can exhibit many of the biases encountered in other machine learning settings, the intersection of personalization and multisidedness makes the question of fairness in recommender systems manifest itself quite differently. In this article, we discuss recent work in the area of multisided fairness in recommendation, starting with a brief introduction to core ideas in algorithmic fairness and multistakeholder recommendation. We describe techniques for measuring fairness and algorithmic approaches for enhancing fairness in recommendation outputs. We also discuss feedback and popularity effects that can lead to unfair recommendation outcomes. Finally, we introduce several promising directions for future research in this area.
Rigor in AI: Doing Rigorous AI Work Requires a Broader, Responsible AI-Informed Conception of Rigor

Olteanu, Alexandra; Blodgett, Su Lin; Balayn, Agathe; Wang, Angelina; Diaz, Fernando; du Pin Calmon, Flavio; Mitchell, Margaret; Ekstrand, Michael D (December 2025, NeurIPS)

In AI research and practice, rigor remains largely understood in terms of methodological rigor — such as whether mathematical, statistical, or computational methods are correctly applied. We argue that this narrow conception of rigor has contributed to the concerns raised by the responsible AI community, including overblown claims about AI capabilities. Our position is that a broader conception of what rigorous AI research and practice should entail is needed. We believe such a conception — in addition to a more expansive understanding of (1) methodological rigor — should include aspects related to (2) what background knowledge informs what to work on (epistemic rigor); (3) how disciplinary, community, or personal norms, standards, or beliefs influence the work (normative rigor); (4) how clearly articulated the theoretical constructs under use are (conceptual rigor); (5) what is reported and how (reporting rigor); and (6) how well-supported the inferences from existing evidence are (interpretative rigor). In doing so, we also aim to provide useful language and a framework for much-needed dialogue about the AI community’s work by researchers, policymakers, journalists, and other stakeholders.
Free, publicly-accessible full text available December 2, 2026
Candidate Set Sampling for Evaluating Top-N Recommendation

https://doi.org/10.1109/WI-IAT59888.2023.00018

Ihemelandu, Ngozi; Ekstrand, Michael D. (October 2023, IEEE)

The strategy for selecting candidate sets — the set of items that the recommendation system is expected to rank for each user — is an important decision in carrying out an offline top-N recommender system evaluation. The set of candidates is composed of the union of the user’s test items and an arbitrary number of non-relevant items that we refer to as decoys. Previous studies have aimed to understand the effect of different candidate set sizes and selection strategies on evaluation. In this paper, we extend this knowledge by studying the specific interaction of candidate set selection strategies with popularity bias, and use simulation to assess whether sampled candidate sets result in metric estimates that are less biased with respect to the true metric values under complete data that is typically unavailable in ordinary experiments.
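
The candidate set construction described in this abstract (the user's test items plus a number of sampled decoys) can be illustrated with a minimal sketch. The function name `sample_candidate_set`, its parameters, the exclusion of already-seen items, and the two sampling strategies (uniform and popularity-weighted) are assumptions made for illustration; they are not taken from the paper and may differ from the strategies it studies.

```python
import random

def sample_candidate_set(test_items, catalog, n_decoys, rng, seen=(), popularity=None):
    """Return the user's test items followed by up to n_decoys sampled decoys.

    Hypothetical helper: decoys are drawn from catalog items that are neither
    test items nor in `seen` (an assumed exclusion rule).
    """
    excluded = set(test_items) | set(seen)
    pool = [item for item in catalog if item not in excluded]
    n = min(n_decoys, len(pool))
    if popularity is None:
        # Uniform sampling without replacement.
        decoys = rng.sample(pool, n)
    else:
        # Popularity-weighted sampling without replacement using random keys
        # u ** (1 / w); the "+ 1" keeps unseen items at a positive weight.
        keys = {
            item: rng.random() ** (1.0 / (popularity.get(item, 0) + 1))
            for item in pool
        }
        decoys = sorted(pool, key=keys.get, reverse=True)[:n]
    return list(test_items) + decoys

rng = random.Random(42)
catalog = [f"item{i}" for i in range(100)]
candidates = sample_candidate_set(
    ["item3", "item7"], catalog, n_decoys=20, rng=rng, seen=["item1"]
)
```

A recommender under evaluation would then rank only `candidates` for this user, and metrics computed on that ranking are estimates of the values that would be obtained with complete data.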
Full Text Available
Distributionally-Informed Recommender System Evaluation

https://doi.org/10.1145/3613455

Ekstrand, Michael D.; Carterette, Ben; Diaz, Fernando (August 2023, ACM Transactions on Recommender Systems)

Current practice for evaluating recommender systems typically focuses on point estimates of user-oriented effectiveness metrics or business metrics, sometimes combined with additional metrics for considerations such as diversity and novelty. In this paper, we argue for the need for researchers and practitioners to attend more closely to various distributions that arise from a recommender system (or other information access system) and the sources of uncertainty that lead to these distributions. One immediate implication of our argument is that both researchers and practitioners must report and examine more thoroughly the distribution of utility between and within different stakeholder groups. However, distributions of various forms arise in many more aspects of the recommender systems experimental process, and distributional thinking has substantial ramifications for how we design, evaluate, and present recommender systems evaluation and research results. Leveraging and emphasizing distributions in the evaluation of recommender systems is a necessary step to ensure that the systems provide appropriate and equitably-distributed benefit to the people they affect.
Full Text Available
Inference at Scale: Significance Testing for Large Search and Recommendation Experiments

Ihemelandu, Ngozi; Ekstrand, Michael D. (July 2023, Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '23))

A number of information retrieval studies have been done to assess which statistical techniques are appropriate for comparing systems. However, these studies are focused on TREC-style experiments, which typically have fewer than 100 topics. There is no similar line of work for large search and recommendation experiments; such studies typically have thousands of topics or users and much sparser relevance judgements, so it is not clear if recommendations for analyzing traditional TREC experiments apply to these settings. In this paper, we empirically study the behavior of significance tests with large search and recommendation evaluation data. Our results show that the Wilcoxon and sign tests show significantly higher Type-1 error rates for large sample sizes than the bootstrap, randomization and t-tests, which were more consistent with the expected error rate. While the statistical tests displayed differences in their power for smaller sample sizes, they showed no difference in their power for large sample sizes. We recommend that the sign and Wilcoxon tests not be used to analyze large-scale evaluation results. Our results demonstrate that with top-N recommendation and large search evaluation data, most tests would have a 100% chance of finding statistically significant results. Therefore, the effect size should be used to determine practical or scientific significance.
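
To make the closing recommendation concrete, the following minimal sketch pairs a randomization test with an effect size on simulated per-user scores. It is not the paper's experimental procedure: the sign-flip formulation of the paired randomization test, the standardized mean difference used as the effect size, the function names, and the simulated data are all assumptions chosen for illustration.

```python
import random
import statistics

def paired_randomization_test(scores_a, scores_b, n_perm=500, rng=None):
    """Two-sided paired randomization (sign-flip) test on the mean difference."""
    rng = rng or random.Random()
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(statistics.fmean(diffs))
    hits = 0
    for _ in range(n_perm):
        # Under the null hypothesis the system labels are exchangeable,
        # so each per-user difference is equally likely to have either sign.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(statistics.fmean(flipped)) >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_perm + 1)

def paired_effect_size(scores_a, scores_b):
    """Standardized mean of the paired differences (an assumed effect size)."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    return statistics.fmean(diffs) / statistics.stdev(diffs)

# Simulated data: 2000 users, system A better than B by a tiny average margin.
rng = random.Random(7)
scores_b = [rng.random() for _ in range(2000)]
scores_a = [b + rng.gauss(0.005, 0.04) for b in scores_b]

p_value = paired_randomization_test(scores_a, scores_b, rng=rng)
effect = paired_effect_size(scores_a, scores_b)
print(f"p = {p_value:.4f}, standardized effect = {effect:.3f}")
```

With thousands of users, even a small average difference tends to produce a very small p-value, which is why the standardized effect is reported alongside it.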
Full Text Available
Patterns of Gender-Specializing Query Reformulation

Raj, Amifa; Mitra, Bhaskar; Ekstrand, Michael D.; Craswell, Nick (July 2023, Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’23))

Users of search systems often reformulate their queries by adding query terms to reflect their evolving information need or to more precisely express their information need when the system fails to surface relevant content. Analyzing these query reformulations can inform us about both system and user behavior. In this work, we study a special category of query reformulations that involve specifying demographic group attributes, such as gender, as part of the reformulated query (e.g., “olympic 2021 soccer results” → “olympic 2021 women’s soccer results”). There are many ways a query, the search results, and a demographic attribute such as gender may relate, leading us to hypothesize different causes for these reformulation patterns, such as under-representation on the original result page or causes rooted in the linguistic theory of markedness. This paper reports on an observational study of gender-specializing query reformulations, including their contexts and effects, as a lens on the relationship between system results and gender, based on large-scale search log data from Bing. We find that these reformulations sometimes correct for and other times reinforce gender representation on the original result page, but typically yield better access to the ultimately-selected results. The prevalence of these reformulations, and which gender they skew towards, differ by topical context. However, we do not find evidence that either group under-representation or markedness alone adequately explains these reformulations. We hope that future research will use such reformulations as a probe for deeper investigation into gender (and other demographic) representation on the search result page.
Full Text Available
Much Ado About Gender: Current Practices and Future Recommendations for Appropriate Gender-Aware Information Access

https://doi.org/10.1145/3576840.3578316

Pinney, Christine; Raj, Amifa; Hanna, Alex; Ekstrand, Michael D. (March 2023, CHIIR '23: Proceedings of the 2023 Conference on Human Information Interaction and Retrieval)

Information access research (and development) sometimes makes use of gender, whether to report on the demographics of participants in a user study, as inputs to personalized results or recommendations, or to make systems gender-fair, amongst other purposes. This work makes a variety of assumptions about gender, however, that are not necessarily aligned with current understandings of what gender is, how it should be encoded, and how a gender variable should be ethically used. In this work, we present a systematic review of papers on information retrieval and recommender systems that mention gender in order to document how gender is currently being used in this field. We find that most papers mentioning gender do not use an explicit gender variable, but most of those that do either focus on contextualizing results of model performance, personalizing a system based on assumptions of user gender, or auditing a model’s behavior for fairness or other privacy-related issues. Moreover, most of the papers we review rely on a binary notion of gender, even if they acknowledge that gender cannot be split into two categories. We connect these findings with scholarship on gender theory and recent work on gender in human-computer interaction and natural language processing. We conclude by making recommendations for ethical and well-grounded use of gender in building and researching information access systems.
Full Text Available
Measuring Fairness in Ranked Results: An Analytical and Empirical Comparison

https://doi.org/10.1145/3477495.3532018

Raj, Amifa; Ekstrand, Michael D. (July 2022, Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval)

Information access systems, such as search and recommender systems, often use ranked lists to present results believed to be relevant to the user’s information need. Evaluating these lists for their fairness along with other traditional metrics provides a more complete understanding of an information access system’s behavior beyond accuracy or utility constructs. To measure the (un)fairness of rankings, particularly with respect to protected group(s) of producers or providers, several metrics have been proposed in the last several years. However, an empirical and comparative analysis of these metrics, showing their applicability to specific scenarios or real data and their conceptual similarities and differences, is still lacking. We aim to bridge the gap between theoretical and practical application of these metrics. In this paper we describe several fair ranking metrics from the existing literature in a common notation, enabling direct comparison of their approaches and assumptions, and empirically compare them on the same experimental setup and data sets in the context of three information access tasks. We also provide a sensitivity analysis to assess the impact of the design choices and parameter settings that go into these metrics and point to additional work needed to improve fairness measurement.
Full Text Available
Baby Shark to Barracuda: Analyzing Children’s Music Listening Behavior

https://doi.org/10.1145/3460231.3478856

Spear, Lawrence; Milton, Ashlee; Allen, Garrett; Raj, Amifa; Green, Michael; Ekstrand, Michael D; Pera, Maria Soledad (September 2021, Fifteenth ACM Conference on Recommender Systems)

Music is an important part of childhood development, with online music listening platforms being a significant channel by which children consume music. Children’s offline music listening behavior has been heavily researched, yet relatively few studies explore how their behavior manifests online. In this paper, we use data from LastFM 1 Billion and the Spotify API to explore online music listening behavior of children, ages 6–17, using education levels as lenses for our analysis. Understanding the music listening behavior of children can be used to inform the future design of recommender systems.
Full Text Available
Pink for Princesses, Blue for Superheroes: The Need to Examine Gender Stereotypes in Kid's Products in Search and Recommendations

https://doi.org/10.48550/arXiv.2105.09296

Raj, Amifa; Milton, Ashlee; Ekstrand, Michael D. (May 2021, KidRec '21: 5th International and Interdisciplinary Perspectives on Children & Recommender and Information Retrieval Systems (KidRec) Search and Recommendation Technology through the Lens of a Teacher - Co-located with ACM IDC 2021)

In this position paper, we argue for the need to investigate if and how gender stereotypes manifest in search and recommender systems. As a starting point, we particularly focus on how these systems may propagate and reinforce gender stereotypes through their results in learning environments, a context where teachers and children in their formative stage regularly interact with these systems. We provide motivating examples supporting our concerns and outline an agenda to support future research addressing the phenomena.
Full Text Available
