Title: Monte Carlo Estimates of Evaluation Metric Error and Bias
Traditional offline evaluations of recommender systems apply metrics from machine learning and information retrieval in settings where their underlying assumptions no longer hold. This results in significant error and bias in measures of top-N recommendation performance, such as precision, recall, and nDCG. Several of the specific causes of these errors, including popularity bias and misclassified decoy items, are well-explored in the existing literature. In this paper we survey a range of work on identifying and addressing these problems, and report on our work in progress to simulate the recommender data generation and evaluation processes to quantify the extent of evaluation metric errors and assess their sensitivity to various assumptions.
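The kind of simulation the abstract describes can be sketched in a few lines. The following is only an illustration under assumed settings, not the authors' simulator: the observation model, the noisy recommender, and every parameter (`n_items`, `n_relevant`, `base_obs`, and so on) are hypothetical. It generates a user's true relevant items, hides some of them with a popularity-dependent probability, and compares precision@N computed against the true relevance with precision@N computed against the observed data.

```python
import random


def precision_at_n(ranked, relevant, n):
    """Fraction of the top-n ranked items that are in the relevant set."""
    return sum(1 for item in ranked[:n] if item in relevant) / n


def simulate_once(rng, n_items=200, n_relevant=20, n=10, base_obs=0.5):
    """Simulate one user; return (true precision@n, observed precision@n)."""
    items = list(range(n_items))
    relevant = set(rng.sample(items, n_relevant))
    # Hypothetical observation model: a relevant item i is recorded with a
    # probability that decays in i, so low-index ("popular") items are
    # observed more often than high-index ones.
    observed = {
        i for i in relevant if rng.random() < base_obs * (1 - i / n_items)
    }
    # Hypothetical recommender: noisy scores that favour truly relevant items.
    scores = {i: rng.random() + (0.5 if i in relevant else 0.0) for i in items}
    ranked = sorted(items, key=scores.get, reverse=True)
    return (
        precision_at_n(ranked, relevant, n),
        precision_at_n(ranked, observed, n),
    )


def monte_carlo(trials=2000, seed=0):
    """Average true and observed precision@n over many simulated users."""
    rng = random.Random(seed)
    true_sum = obs_sum = 0.0
    for _ in range(trials):
        true_p, obs_p = simulate_once(rng)
        true_sum += true_p
        obs_sum += obs_p
    return true_sum / trials, obs_sum / trials


if __name__ == "__main__":
    true_p, obs_p = monte_carlo()
    print(f"true precision@10:     {true_p:.3f}")
    print(f"observed precision@10: {obs_p:.3f}")
    print(f"estimated metric error: {obs_p - true_p:.3f}")
```

Because the observed relevant set here is always a subset of the true one, the observed metric can only understate the true metric in this sketch; varying the assumed observation model is one way to probe how sensitive that gap is.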
Authors:
;
Award ID(s):
1751278
Publication Date:
NSF-PAR ID:
10074452
Journal Name:
REVEAL 2018 Workshop on Offline Evaluation in Recommender Systems
Sponsoring Org:
National Science Foundation
More Like this
  1. Meila, Marina ; Zhang, Tong (Ed.)
    Incorporating graph side information into recommender systems has been widely used to better predict ratings, but relatively few works have focused on theoretical guarantees. Ahn et al. (2018) first characterized the optimal sample complexity in the presence of graph side information, but the results are limited due to strict, unrealistic assumptions made on the unknown latent preference matrix and the structure of user clusters. In this work, we propose a new model in which 1) the unknown latent preference matrix can have any discrete values, and 2) users can be clustered into multiple clusters, thereby relaxing the assumptions made in prior work. Under this new model, we fully characterize the optimal sample complexity and develop a computationally efficient algorithm that matches the optimal sample complexity. Our algorithm is robust to model errors and outperforms the existing algorithms in terms of prediction performance on both synthetic and real data.
  2. Abstract. The evaluation of aerosol radiative effect on broadband hemispherical solar flux is often performed using simplified spectral and directional scattering characteristics of atmospheric aerosol and underlying surface reflectance. In this study we present a rigorous yet fast computational tool that accurately accounts for detailed variability of both spectral and angular scattering properties of aerosol and surface reflectance in calculation of direct aerosol radiative effect. The tool is developed as part of the GRASP (Generalized Retrieval of Aerosol and Surface Properties) project. We use the tool to evaluate instantaneous and daily average radiative efficiencies (radiative effect per unit aerosol optical thickness) of several key atmospheric aerosol models over different surface types. We then examine the differences due to neglect of surface reflectance anisotropy, nonsphericity of aerosol particle shape and accounting only for aerosol angular scattering asymmetry instead of using full phase function. For example, it is shown that neglecting aerosol particle nonsphericity causes mainly overestimation of the aerosol cooling effect and that the magnitude of this overestimate changes significantly as a function of solar zenith angle (SZA) if the asymmetry parameter is used instead of detailed phase function. It was also found that the nonspherical–spherical differences in the calculated aerosol radiative effect are not modified significantly if detailed BRDF (bidirectional reflectance distribution function) is used instead of Lambertian approximation of surface reflectance. Additionally, calculations show that usage of only angular scattering asymmetry, even for the case of spherical aerosols, modifies the dependence of instantaneous aerosol radiative effect on SZA. This effect can be canceled for daily average values, but only if the sun reaches the zenith; otherwise a systematic bias remains.
Since the daily average radiative effect is obtained by integration over a range of SZAs, the errors vary with latitude and season. In summary, the present analysis showed that use of simplified assumptions causes systematic biases, rather than random uncertainties, in calculation of both instantaneous and daily average aerosol radiative effect. Finally, we illustrate application of the rigorous aerosol radiative effect calculations performed as part of GRASP aerosol retrieval from real POLDER/PARASOL satellite observations.
  3. Abstract

    This paper investigates the ability of the Weather Research and Forecasting (WRF) Model in simulating multiple small-scale precipitation bands (multibands) within the extratropical cyclone comma head using four winter storm cases from 2014 to 2017. Using the model output, some physical processes are explored to investigate band prediction. A 40-member WRF ensemble was constructed down to 2-km grid spacing over the Northeast United States using different physics, stochastic physics perturbations, different initial/boundary conditions from the first five perturbed members of the Global Forecast System (GFS) Ensemble Reforecast (GEFSR), and a stochastic kinetic energy backscatter scheme (SKEBS). It was found that 2-km grid spacing is adequate to resolve most snowbands. A feature-based verification is applied to hourly WRF reflectivity fields from each ensemble member and the WSR-88D radar reflectivity at 2-km height above sea level. The Method for Object-Based Diagnostic Evaluation (MODE) tool is used for identifying multibands, which are defined as two or more bands that are 5–20 km in width and that also exhibit a >2:1 aspect ratio. The WRF underpredicts the number of multibands and has a slight eastward position bias. There is no significant difference in frontogenetical forcing, vertical stability, moisture, and vertical shear between the banded versus nonbanded members. Underpredicted band members tend to have slightly stronger frontogenesis than observed, which may be consolidating the bands, but overall there is no clear linkage in ambient condition errors and band errors, thus leaving the source of the band underprediction as motivation for future work.
  4. Recent work in recommender systems has emphasized the importance of fairness, with a particular interest in bias and transparency, in addition to predictive accuracy. In this paper, we focus on the state-of-the-art pairwise ranking model, Bayesian Personalized Ranking (BPR), which has previously been found to outperform pointwise models in predictive accuracy, while also being able to handle implicit feedback. Specifically, we address two limitations of BPR: (1) BPR is a black box model that does not explain its outputs, thus limiting the user's trust in the recommendations, and the analyst's ability to scrutinize a model's outputs; and (2) BPR is vulnerable to exposure bias due to the data being Missing Not At Random (MNAR). This exposure bias usually translates into an unfairness against the least popular items because they risk being under-exposed by the recommender system. In this work, we first propose a novel explainable loss function and a corresponding Matrix Factorization-based model called Explainable Bayesian Personalized Ranking (EBPR) that generates recommendations along with item-based explanations. Then, we theoretically quantify additional exposure bias resulting from the explainability, and use it as a basis to propose an unbiased estimator for the ideal EBPR loss. The result is a ranking model that aptly captures both debiased and explainable user preferences. Finally, we perform an empirical study on three real-world datasets that demonstrates the advantages of our proposed models.
  5. Summary Instrumental variable methods can identify causal effects even when the treatment and outcome are confounded. We study the problem of imperfect measurements of the binary instrumental variable, treatment and outcome. We first consider nondifferential measurement errors, that is, the mismeasured variable does not depend on other variables given its true value. We show that the measurement error of the instrumental variable does not bias the estimate, that the measurement error of the treatment biases the estimate away from zero, and that the measurement error of the outcome biases the estimate toward zero. Moreover, we derive sharp bounds on the causal effects without additional assumptions. These bounds are informative because they exclude zero. We then consider differential measurement errors, and focus on sensitivity analyses in those settings.
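
The three bias directions stated in item 5 can be illustrated with a short derivation. This is a sketch under assumptions not given in the abstract: the estimand is taken to be the Wald ratio, the notation ($Z$, $D$, $Y$ for the true instrument, treatment, and outcome; starred versions for their mismeasured counterparts; $\mathrm{SN}$ and $\mathrm{SP}$ for sensitivity and specificity) is introduced here for illustration, and each mismeasured variable is assumed to satisfy $\mathrm{SN} + \mathrm{SP} > 1$.

```latex
% Wald ratio with correctly measured binary variables
\tau = \frac{E[Y \mid Z=1] - E[Y \mid Z=0]}{E[D \mid Z=1] - E[D \mid Z=0]}

% Nondifferential error in the outcome:
%   P(Y^*=1 \mid Y=1) = \mathrm{SN}_Y, \quad P(Y^*=0 \mid Y=0) = \mathrm{SP}_Y
E[Y^* \mid Z=z] = (1 - \mathrm{SP}_Y) + (\mathrm{SN}_Y + \mathrm{SP}_Y - 1)\, E[Y \mid Z=z]
\;\Longrightarrow\;
\tau_{Y^*} = (\mathrm{SN}_Y + \mathrm{SP}_Y - 1)\, \tau
\quad \text{(shrunk toward zero)}

% Nondifferential error in the treatment: the same scaling hits the denominator
E[D^* \mid Z=z] = (1 - \mathrm{SP}_D) + (\mathrm{SN}_D + \mathrm{SP}_D - 1)\, E[D \mid Z=z]
\;\Longrightarrow\;
\tau_{D^*} = \frac{\tau}{\mathrm{SN}_D + \mathrm{SP}_D - 1}
\quad \text{(pushed away from zero)}

% Nondifferential error in the instrument: numerator and denominator are both
% scaled by r_Z = P(Z=1 \mid Z^*=1) - P(Z=1 \mid Z^*=0), which cancels
\tau_{Z^*}
= \frac{r_Z \,\bigl(E[Y \mid Z=1] - E[Y \mid Z=0]\bigr)}
       {r_Z \,\bigl(E[D \mid Z=1] - E[D \mid Z=0]\bigr)}
= \tau
```

Since $0 < \mathrm{SN} + \mathrm{SP} - 1 \le 1$ under the stated assumption, multiplying by this factor attenuates the ratio and dividing by it inflates the ratio, which matches the directions reported in the summary.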