Today’s recommender systems are criticized for recommending items that are too obvious to arouse users’ interest. That is why the recommender systems research community has advocated “beyond accuracy” evaluation metrics such as novelty, diversity, coverage, and serendipity, in the hope of promoting information discovery and sustaining users’ interest over a long period of time. While bringing in new perspectives, most of these evaluation metrics do not consider differences among individual users: an open-minded user may favor highly novel or diversified recommendations, whereas a conservative user’s appetite for novelty or diversity may be smaller. In this paper, we developed …
Estimating Error and Bias in Offline Evaluation Results
Offline evaluation protocols for recommender systems are intended to estimate users' satisfaction with recommendations using static data from prior user interactions. These evaluations allow researchers and production developers to carry out first-pass estimates of the likely performance of a new system and weed out bad ideas before presenting them to users. However, offline evaluations cannot accurately assess novel, relevant recommendations, because the most novel recommendations are items that were previously unknown to the user; such items are missing from the historical data, so they cannot be judged as relevant. A breakthrough that reliably produces novel, relevant recommendations would score poorly with current offline evaluation techniques.
While the existence of this problem is noted in the literature, its extent is not well understood. We present a simulation study to estimate the error that such missing data causes in commonly used evaluation metrics in order to assess its prevalence and impact. We find that missing data in the rating or observation process causes the evaluation protocol to systematically misestimate metric values, and in some cases to erroneously determine that a popularity-based recommender outperforms even a perfect personalized recommender. Substantial breakthroughs in recommendation quality, therefore, will be difficult to assess with existing offline techniques.
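The effect described above can be illustrated with a small toy simulation. The sketch below is not the simulation used in the paper; it is a minimal illustration under assumed conditions. The popularity curves `p_relevant` and `p_observed`, the function names, and all parameter values are hypothetical choices. Each simulated user has a set of truly relevant items, but only a popularity-skewed subset of them is observed. A "perfect" recommender that suggests only truly relevant items is then compared with a most-popular recommender, using precision computed once against the true preferences and once against the observed data.

```python
import random


def precision(recs, judged_relevant):
    """Fraction of recommended items that appear in the given relevant set."""
    if not recs:
        return 0.0
    return sum(1 for item in recs if item in judged_relevant) / len(recs)


def simulate(n_users=2000, n_items=100, k=5, seed=0):
    """Average precision@k of two recommenders against true and observed data."""
    rng = random.Random(seed)
    # Assumed (hypothetical) popularity curves; item 0 is the most popular.
    # Popular items are more likely to be relevant, and far more likely
    # to be observed when they are relevant.
    p_relevant = [0.8 / (i + 1) ** 0.5 for i in range(n_items)]
    p_observed = [1.0 / (i + 1) for i in range(n_items)]
    popular_recs = list(range(k))  # the k most popular items

    totals = {
        "oracle_true": 0.0,
        "oracle_observed": 0.0,
        "popular_true": 0.0,
        "popular_observed": 0.0,
    }
    for _ in range(n_users):
        # Ground truth: items this user would actually like.
        relevant = [i for i in range(n_items) if rng.random() < p_relevant[i]]
        truth = set(relevant)
        # Observation process: only some relevant items reach the data set.
        observed = {i for i in relevant if rng.random() < p_observed[i]}
        # The oracle recommends only truly relevant items, chosen at random.
        oracle_recs = rng.sample(relevant, min(k, len(relevant)))

        totals["oracle_true"] += precision(oracle_recs, truth)
        totals["oracle_observed"] += precision(oracle_recs, observed)
        totals["popular_true"] += precision(popular_recs, truth)
        totals["popular_observed"] += precision(popular_recs, observed)

    return {name: total / n_users for name, total in totals.items()}


if __name__ == "__main__":
    for name, value in simulate().items():
        print(f"{name}: {value:.3f}")
```

Under these assumed curves, the oracle scores highest against the true preferences, yet the most-popular recommender scores higher against the observed data, because many of the oracle's correct recommendations were never recorded. The size of the gap depends entirely on the assumed observation model.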
- Award ID(s): 1751278
- Publication Date:
- NSF-PAR ID: 10146883
- Journal Name: Proceedings of the 2020 Conference on Computer-Human Interaction and Information Retrieval
- Page Range or eLocation-ID: 392 - 396
- Sponsoring Org: National Science Foundation
More Like this
- Traditional offline evaluations of recommender systems apply metrics from machine learning and information retrieval in settings where their underlying assumptions no longer hold. This results in significant error and bias in measures of top-N recommendation performance, such as precision, recall, and nDCG. Several of the specific causes of these errors, including popularity bias and misclassified decoy items, are well explored in the existing literature. In this paper we survey a range of work on identifying and addressing these problems, and report on our work in progress to simulate the recommender data generation and evaluation processes to quantify the extent of evaluation …
- Recent work in recommender systems has emphasized the importance of fairness, with a particular interest in bias and transparency, in addition to predictive accuracy. In this paper, we focus on the state-of-the-art pairwise ranking model, Bayesian Personalized Ranking (BPR), which has previously been found to outperform pointwise models in predictive accuracy, while also being able to handle implicit feedback. Specifically, we address two limitations of BPR: (1) BPR is a black box model that does not explain its outputs, thus limiting the user's trust in the recommendations, and the analyst's ability to scrutinize a model's outputs; and …
- Context has been recognized as an important factor to consider in personalized recommender systems. Particularly in location-based services (LBSs), a fundamental task is to recommend to a mobile user, at the right time, where he/she might be interested in visiting next. Additionally, location-based social networks (LBSNs) allow users to share location-embedded information with friends who often co-occur in the same or nearby points-of-interest (POIs) or share similar POI visiting histories, due to the social homophily theory and Tobler’s first law of geography. So, both the time information and LBSN friendship relations should be utilized for POI recommendation. Tensor completion has …
- Matrix completion is a well-known approach for recommender systems. It predicts the values of the missing entries in a sparse user-item interaction matrix, based on the low-rank structure of the rating matrix. However, existing matrix completion methods do not take node polysemy and side information of social relationships into consideration, which can otherwise further improve the performance. In this paper, we propose a novel matrix completion method that employs both users’ friendships and rating entries to predict the missing values in a user-item matrix. Our approach adopts a graph-based model in which nodes are users and items, and two types of …