Personalized recommender systems play a crucial role in modern society, especially in e-commerce, news, and advertising. Correctly evaluating and comparing candidate recommendation models is as essential as constructing them. The common offline evaluation strategy is to hold out some user-interacted items from the training data and evaluate recommendation models by how many of these items they can retrieve. Specifically, for each held-out item (the so-called target item) of a user, a recommendation model predicts the probability that the user would interact with the item and ranks it among all items, which is called global evaluation.
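To make the protocol concrete, the following is a minimal sketch of global evaluation under this setup. The function names (`global_rank`, `hit_ratio_at_k`) and the use of the top-K hit ratio as the example metric are our own illustrative choices, not a prescribed implementation.

```python
import numpy as np

def global_rank(scores: np.ndarray, target_item: int) -> int:
    """Rank (1 = best) of the held-out target item among ALL items for one user."""
    # Items scoring at least as high as the target push its rank down; ties are counted pessimistically.
    return int(np.sum(scores >= scores[target_item]))

def hit_ratio_at_k(ranks, k: int) -> float:
    """Fraction of held-out target items that the model ranks within the top k."""
    return float(np.mean([r <= k for r in ranks]))

# Example: one user's predicted scores over a catalogue of 5 items; the target item has index 2.
scores = np.array([0.1, 0.7, 0.4, 0.9, 0.2])
print(global_rank(scores, target_item=2))   # -> 3 (two items score higher than the target)
print(hit_ratio_at_k([3, 1, 12, 5], k=5))   # -> 0.75
```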
In this work, we first investigate this "inconsistent" phenomenon by examining the connection between sampling evaluation and global evaluation. We reveal an approximately linear relationship between the sampled metric and its global counterpart in terms of the top-K metric.
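For contrast with the global protocol above, a hedged sketch of sampling evaluation is given below: the target item is ranked against a small pool of randomly sampled negative items rather than the whole catalogue, and top-K metrics are then computed from these sampled ranks. The function name, the uniform negative sampler, and the default of 100 negatives are assumptions made for illustration.

```python
import numpy as np

def sampled_rank(scores: np.ndarray, target_item: int, n_negatives: int = 100, rng=None) -> int:
    """Rank (1 = best) of the target item within a pool of n_negatives sampled items plus itself."""
    rng = rng or np.random.default_rng()
    # Sample negative items uniformly from all items except the target, without replacement.
    candidates = np.delete(np.arange(len(scores)), target_item)
    negatives = rng.choice(candidates, size=n_negatives, replace=False)
    # Rank of the target within the sampled pool, counting ties pessimistically.
    return int(np.sum(scores[negatives] >= scores[target_item])) + 1
```

Feeding ranks produced by `sampled_rank` into the same top-K hit ratio as above yields the sampled metric whose relationship to the global metric is studied here.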