

Title: Estimating Error and Bias in Offline Evaluation Results
Offline evaluation protocols for recommender systems are intended to estimate users' satisfaction with recommendations using static data from prior user interactions. These evaluations allow researchers and production developers to carry out first-pass estimates of the likely performance of a new system and weed out bad ideas before presenting them to users. However, offline evaluations cannot accurately assess novel, relevant recommendations, because the most novel recommendations are items that were previously unknown to the user; such items are missing from the historical data, so they cannot be judged as relevant. A breakthrough that reliably produces novel, relevant recommendations would score poorly with current offline evaluation techniques. While the existence of this problem is noted in the literature, its extent is not well understood. We present a simulation study to estimate the error that such missing data causes in commonly used evaluation metrics in order to assess its prevalence and impact. We find that missing data in the rating or observation process causes the evaluation protocol to systematically mis-estimate metric values, and in some cases erroneously determine that a popularity-based recommender outperforms even a perfect personalized recommender. Substantial breakthroughs in recommendation quality, therefore, will be difficult to assess with existing offline techniques.
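To make the mechanism concrete, the following is a minimal, hypothetical simulation in the spirit of the study described above, not the authors' actual protocol: the item counts, the Zipf-like popularity distribution, and the observation model are illustrative assumptions. It compares a perfect personalized recommender against a popularity recommender using precision@k computed once against true relevance and once against popularity-biased observed data.

```python
# Hypothetical sketch (not the paper's actual simulation): illustrates how a
# popularity-biased observation process can distort offline precision@k.
import numpy as np

rng = np.random.default_rng(42)
n_users, n_items, n_liked, k = 2000, 500, 20, 10

# Long-tailed (Zipf-like) item popularity.
popularity = 1.0 / np.arange(1, n_items + 1)
popularity /= popularity.sum()

pop_rec = np.argsort(-popularity)[:k]   # popularity recommender: same list for everyone

true_prec = {"perfect": [], "popularity": []}
obs_prec = {"perfect": [], "popularity": []}

for _ in range(n_users):
    # Ground truth: items the user would genuinely enjoy (popularity-skewed tastes).
    liked = rng.choice(n_items, size=n_liked, replace=False, p=popularity)

    # Observation process: popular relevant items are far more likely to appear
    # in the historical data; obscure relevant items are mostly missing.
    p_obs = popularity[liked] / popularity.max()
    observed = liked[rng.random(n_liked) < p_obs]   # the "test set" we actually get

    perfect_rec = rng.permutation(liked)[:k]        # a perfect personalized top-k

    for name, rec in (("perfect", perfect_rec), ("popularity", pop_rec)):
        true_prec[name].append(len(np.intersect1d(rec, liked)) / k)
        obs_prec[name].append(len(np.intersect1d(rec, observed)) / k)

for name in ("perfect", "popularity"):
    print(f"{name:10s} precision@{k}: true={np.mean(true_prec[name]):.3f}  "
          f"measured on observed data={np.mean(obs_prec[name]):.3f}")
```

Under these illustrative assumptions, the perfect recommender's measured precision falls far below its true value of 1.0 because its relevant-but-unobserved recommendations count as misses, and the popularity recommender can even appear to score higher on the observed data despite being far worse against true relevance.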
Award ID(s): 1751278
NSF-PAR ID: 10146883
Author(s) / Creator(s):
Date Published:
Journal Name: Proceedings of the 2020 Conference on Human Information Interaction and Retrieval
Page Range / eLocation ID: 392 - 396
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this
  1. Today’s recommender systems are criticized for recommending items that are too obvious to arouse users’ interest. That is why the recommender systems research community has advocated “beyond accuracy” evaluation metrics such as novelty, diversity, coverage, and serendipity, with the hope of promoting information discovery and sustaining users’ interest over a long period of time. While bringing in new perspectives, most of these evaluation metrics have not considered individual users’ differences: an open-minded user may favor highly novel or diversified recommendations, whereas a conservative user’s appetite for novelty or diversity may be much smaller. In this paper, we developed a model to approximate an individual’s curiosity distribution over different levels of stimuli, guided by the well-known Wundt curve from psychology. We measured an item’s surprise level to assess its stimulation level and whether it falls within the range of the user’s appetite for stimulus. We then proposed a recommender system framework that considers both user preference and the appetite for stimulus at which curiosity is maximally aroused. Our framework differs from a typical recommender system in that it leverages human curiosity to promote intrinsic interest in the system. A series of evaluation experiments shows that our framework ranks higher those items with not only high ratings but also high response likelihood. The recommendation list generated by our algorithm has a higher potential of inspiring user curiosity compared to traditional approaches. The personalization factor for assessing the stimulus (surprise) strength further helps the recommender achieve smaller (better) inter-user similarity. (An illustrative sketch of this curiosity-based scoring idea appears after this list.)
  2. The strategy for selecting candidate sets (the set of items that the recommendation system is expected to rank for each user) is an important decision in carrying out an offline top-N recommender system evaluation. The set of candidates is composed of the union of the user’s test items and an arbitrary number of non-relevant items that we refer to as decoys. Previous studies have aimed to understand the effect of different candidate set sizes and selection strategies on evaluation. In this paper, we extend this knowledge by studying the specific interaction of candidate set selection strategies with popularity bias, and use simulation to assess whether sampled candidate sets result in metric estimates that are less biased with respect to the true metric values under complete data, which is typically unavailable in ordinary experiments. (A sketch of candidate-set construction appears after this list.)
  3. Traditional offline evaluations of recommender systems apply metrics from machine learning and information retrieval in settings where their underlying assumptions no longer hold. This results in significant error and bias in measures of top-N recommendation performance, such as precision, recall, and nDCG. Several of the specific causes of these errors, including popularity bias and misclassified decoy items, are well-explored in the existing literature. In this paper we survey a range of work on identifying and addressing these problems, and report on our work in progress to simulate the recommender data generation and evaluation processes to quantify the extent of evaluation metric errors and assess their sensitivity to various assumptions. 
  4. Today’s recommender systems are criticized for recommending items that are too obvious to arouse users’ interest. Therefore the research community has advocated “beyond accuracy” evaluation metrics such as novelty, diversity, and serendipity, with the hope of promoting information discovery and sustaining users’ interest over a long period of time. While bringing in new perspectives, most of these evaluation metrics have not considered individual users’ differences in their capacity to experience those “beyond accuracy” items. Open-minded users may embrace a wider range of recommendations than conservative users. In this paper, we proposed to use curiosity traits to capture such individual differences. We developed a model to approximate an individual’s curiosity distribution over different stimulus levels. We used an item’s surprise level to estimate the stimulus level and whether such a level is in the range of the user’s appetite for stimulus, called the Comfort Zone. We then proposed a recommender system framework that considers both user preference and the user’s Comfort Zone, where curiosity is maximally aroused. Our framework differs from a typical recommender system in that it leverages humans’ Comfort Zone for stimuli to promote engagement with the system. A series of evaluation experiments shows that our framework ranks higher those items with not only high ratings but also high curiosity stimulation. The recommendation list generated by our algorithm has a higher potential of inspiring user curiosity compared to state-of-the-art deep learning approaches. The personalization factor for assessing the surprise stimulus levels further helps the recommender model achieve smaller (better) inter-user similarity.
  5. Methods for making high-quality recommendations often rely on learning latent representations from interaction data. These methods, while performant, do not provide ready mechanisms for users to control the recommendations they receive. Our work tackles this problem by proposing LACE, a novel concept value bottleneck model for controllable text recommendations. LACE represents each user with a succinct set of human-readable concepts, obtained through retrieval given user-interacted documents, and learns personalized representations of the concepts based on user documents. This concept-based user profile is then leveraged to make recommendations. The design of our model affords control over the recommendations through a number of intuitive interactions with a transparent user profile. We first establish the quality of recommendations obtained from LACE in an offline evaluation on three recommendation tasks spanning six datasets in warm-start, cold-start, and zero-shot setups. Next, we validate the controllability of LACE under simulated user interactions. Finally, we implement LACE in an interactive controllable recommender system and conduct a user study to demonstrate that users are able to improve the quality of recommendations they receive through interactions with an editable user profile. (A toy sketch of a concept-profile recommender in this spirit appears after this list.)
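For item 1 (and its companion, item 4), the abstracts describe approximating each user's curiosity over stimulus (surprise) levels with a Wundt-style inverted-U curve and combining that with predicted preference. The sketch below is purely illustrative: the papers do not give the functional form or parameters, so the difference-of-sigmoids curve, the blending weight alpha, and the toy items are all assumptions.

```python
# Illustrative sketch only: approximates a Wundt-style curve as the difference
# of two logistic functions (reward for moderate novelty minus aversion to
# excessive novelty), then blends it with a predicted rating.
import numpy as np

def wundt_curiosity(surprise, reward_mid=0.4, aversion_mid=0.7, slope=12.0):
    """Inverted-U response: curiosity peaks at moderate surprise levels."""
    reward = 1.0 / (1.0 + np.exp(-slope * (surprise - reward_mid)))
    aversion = 1.0 / (1.0 + np.exp(-slope * (surprise - aversion_mid)))
    return reward - aversion

def rerank(items, alpha=0.5):
    """Blend predicted rating with curiosity arousal; `items` maps item id ->
    (predicted_rating in [0, 1], surprise in [0, 1])."""
    scored = {i: (1 - alpha) * r + alpha * wundt_curiosity(s)
              for i, (r, s) in items.items()}
    return sorted(scored, key=scored.get, reverse=True)

# Toy usage: "b" is moderately surprising, so it can outrank a safer item "a".
print(rerank({"a": (0.9, 0.05), "b": (0.8, 0.5), "c": (0.7, 0.95)}))
```

In this toy example the moderately surprising item outranks a safer, higher-rated item because it sits near the peak of the assumed curiosity curve, while the extremely surprising item falls outside the assumed comfortable range.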
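For item 2, the following is a generic sketch of candidate-set construction for one user: the union of held-out test items and sampled decoys, with either uniform or popularity-weighted sampling. Function names, sizes, and the random stand-in scores are assumptions for illustration, not the paper's exact protocol.

```python
# Illustrative sketch of candidate-set construction for an offline top-N evaluation.
import numpy as np

rng = np.random.default_rng(0)

def candidate_set(test_items, seen_items, n_items, n_decoys=100,
                  popularity=None, strategy="uniform"):
    """Union of the user's held-out test items and sampled non-relevant decoys."""
    pool = np.setdiff1d(np.arange(n_items), np.union1d(test_items, seen_items))
    if strategy == "popular" and popularity is not None:
        p = popularity[pool] / popularity[pool].sum()
        decoys = rng.choice(pool, size=n_decoys, replace=False, p=p)
    else:
        decoys = rng.choice(pool, size=n_decoys, replace=False)
    return np.concatenate([test_items, decoys])

# Toy usage: rank a candidate set with stand-in scores and compute recall@10.
n_items = 1000
popularity = 1.0 / np.arange(1, n_items + 1)
test_items = np.array([3, 250, 700])
seen_items = np.array([1, 2, 42])
cands = candidate_set(test_items, seen_items, n_items,
                      popularity=popularity, strategy="popular")
scores = rng.random(len(cands))                 # stand-in for a recommender's scores
top10 = cands[np.argsort(-scores)[:10]]
recall_at_10 = len(np.intersect1d(top10, test_items)) / len(test_items)
print(f"recall@10 on sampled candidates: {recall_at_10:.2f}")
```

Switching the strategy argument between uniform and popularity-weighted decoy sampling is exactly the kind of design choice whose interaction with popularity bias the paper investigates.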
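For item 5, the sketch below illustrates the general retrieval-based concept-profile idea in miniature; it is not the actual LACE architecture. The concept list, random embeddings, scoring rule, and the edit operation are all invented for illustration.

```python
# Toy sketch of a retrieval-based concept profile (NOT the actual LACE model).
import numpy as np

rng = np.random.default_rng(1)
dim = 16
concepts = ["reinforcement learning", "graph neural networks",
            "recommender systems", "quantum computing", "bird migration"]
concept_emb = rng.normal(size=(len(concepts), dim))      # stand-in embeddings

def build_profile(user_doc_embs, top_k=2):
    """Pick the human-readable concepts closest to the user's documents."""
    sims = user_doc_embs @ concept_emb.T                  # (docs, concepts)
    return list(np.argsort(-sims.mean(axis=0))[:top_k])   # editable list of concept ids

def score_candidates(profile, cand_embs):
    """Score candidate documents against the (possibly edited) concept profile."""
    return (cand_embs @ concept_emb[profile].T).max(axis=1)

user_docs = rng.normal(size=(3, dim))
profile = build_profile(user_docs)
print("profile:", [concepts[i] for i in profile])

cands = rng.normal(size=(5, dim))
print("scores:", np.round(score_candidates(profile, cands), 2))

# "Control": the user removes a concept they dislike and re-scores candidates.
edited = profile[:1]
print("scores after edit:", np.round(score_candidates(edited, cands), 2))
```

The profile here is just a list of concept ids, so "editing" it is trivial; the point is only to show how a human-readable, editable profile can sit between user documents and recommendation scores.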