
Title: Monte Carlo Estimates of Evaluation Metric Error and Bias
Traditional offline evaluations of recommender systems apply metrics from machine learning and information retrieval in settings where their underlying assumptions no longer hold. This results in significant error and bias in measures of top-N recommendation performance, such as precision, recall, and nDCG. Several of the specific causes of these errors, including popularity bias and misclassified decoy items, are well-explored in the existing literature. In this paper we survey a range of work on identifying and addressing these problems, and report on our work in progress to simulate the recommender data generation and evaluation processes to quantify the extent of evaluation metric errors and assess their sensitivity to various assumptions.
Journal Name: REVEAL 2018 Workshop on Offline Evaluation in Recommender Systems
Sponsoring Org: National Science Foundation
More Like this
  1. Meila, Marina ; Zhang, Tong (Ed.)
    Incorporating graph side information into recommender systems has been widely used to better predict ratings, but relatively few works have focused on theoretical guarantees. Ahn et al. (2018) first characterized the optimal sample complexity in the presence of graph side information, but the results are limited due to strict, unrealistic assumptions made on the unknown latent preference matrix and the structure of user clusters. In this work, we propose a new model in which 1) the unknown latent preference matrix can have any discrete values, and 2) users can be clustered into multiple clusters, thereby relaxing the assumptions made in prior work. Under this new model, we fully characterize the optimal sample complexity and develop a computationally efficient algorithm that matches the optimal sample complexity. Our algorithm is robust to model errors and outperforms the existing algorithms in terms of prediction performance on both synthetic and real data.
  2. This paper investigates the ability of the Weather Research and Forecasting (WRF) Model to simulate multiple small-scale precipitation bands (multibands) within the extratropical cyclone comma head using four winter storm cases from 2014 to 2017. Using the model output, some physical processes are explored to investigate band prediction. A 40-member WRF ensemble was constructed down to 2-km grid spacing over the Northeast United States using different physics, stochastic physics perturbations, different initial/boundary conditions from the first five perturbed members of the Global Forecast System (GFS) Ensemble Reforecast (GEFSR), and a stochastic kinetic energy backscatter scheme (SKEBS). It was found that 2-km grid spacing is adequate to resolve most snowbands. A feature-based verification is applied to hourly WRF reflectivity fields from each ensemble member and the WSR-88D radar reflectivity at 2-km height above sea level. The Method for Object-Based Diagnostic Evaluation (MODE) tool is used for identifying multibands, which are defined as two or more bands that are 5–20 km in width and that also exhibit a >2:1 aspect ratio. The WRF underpredicts the number of multibands and has a slight eastward position bias. There is no significant difference in frontogenetical forcing, vertical stability, moisture, and vertical shear between the banded and nonbanded members. Underpredicted band members tend to have slightly stronger frontogenesis than observed, which may be consolidating the bands, but overall there is no clear linkage between ambient condition errors and band errors, leaving the source of the band underprediction as motivation for future work.
  3. Recent work in recommender systems has emphasized the importance of fairness, with a particular interest in bias and transparency, in addition to predictive accuracy. In this paper, we focus on the state-of-the-art pairwise ranking model, Bayesian Personalized Ranking (BPR), which has previously been found to outperform pointwise models in predictive accuracy, while also being able to handle implicit feedback. Specifically, we address two limitations of BPR: (1) BPR is a black box model that does not explain its outputs, thus limiting the user's trust in the recommendations, and the analyst's ability to scrutinize a model's outputs; and (2) BPR is vulnerable to exposure bias due to the data being Missing Not At Random (MNAR). This exposure bias usually translates into an unfairness against the least popular items because they risk being under-exposed by the recommender system. In this work, we first propose a novel explainable loss function and a corresponding Matrix Factorization-based model called Explainable Bayesian Personalized Ranking (EBPR) that generates recommendations along with item-based explanations. Then, we theoretically quantify additional exposure bias resulting from the explainability, and use it as a basis to propose an unbiased estimator for the ideal EBPR loss. The result is a ranking model that aptly captures both debiased and explainable user preferences. Finally, we perform an empirical study on three real-world datasets that demonstrates the advantages of our proposed models.
  4. Instrumental variable methods can identify causal effects even when the treatment and outcome are confounded. We study the problem of imperfect measurements of the binary instrumental variable, treatment and outcome. We first consider nondifferential measurement errors, that is, the mismeasured variable does not depend on other variables given its true value. We show that the measurement error of the instrumental variable does not bias the estimate, that the measurement error of the treatment biases the estimate away from zero, and that the measurement error of the outcome biases the estimate toward zero. Moreover, we derive sharp bounds on the causal effects without additional assumptions. These bounds are informative because they exclude zero. We then consider differential measurement errors, and focus on sensitivity analyses in those settings.
  5. The Global Ocean Data Analysis Project (GLODAP) is a synthesis effort providing regular compilations of surface-to-bottom ocean biogeochemical data, with an emphasis on seawater inorganic carbon chemistry and related variables determined through chemical analysis of seawater samples. GLODAPv2.2020 is an update of the previous version, GLODAPv2.2019. The major changes are data from 106 new cruises added, extension of time coverage to 2019, and the inclusion of available (also for historical cruises) discrete fugacity of CO2 (fCO2) values in the merged product files. GLODAPv2.2020 now includes measurements from more than 1.2 million water samples from the global oceans collected on 946 cruises. The data for the 12 GLODAP core variables (salinity, oxygen, nitrate, silicate, phosphate, dissolved inorganic carbon, total alkalinity, pH, CFC-11, CFC-12, CFC-113, and CCl4) have undergone extensive quality control with a focus on systematic evaluation of bias. The data are available in two formats: (i) as submitted by the data originator but updated to WOCE exchange format and (ii) as a merged data product with adjustments applied to minimize bias. These adjustments were derived by comparing the data from the 106 new cruises with the data from the 840 quality-controlled cruises of the GLODAPv2.2019 data product using crossover analysis. Comparisons to empirical algorithm estimates provided additional context for adjustment decisions; this is new to this version. The adjustments are intended to remove potential biases from errors related to measurement, calibration, and data-handling practices without removing known or likely time trends or variations in the variables evaluated.
    The compiled and adjusted data product is believed to be consistent to better than 0.005 in salinity, 1 % in oxygen, 2 % in nitrate, 2 % in silicate, 2 % in phosphate, 4 µmol kg−1 in dissolved inorganic carbon, 4 µmol kg−1 in total alkalinity, 0.01–0.02 in pH (depending on region), and 5 % in the halogenated transient tracers. The other variables included in the compilation, such as isotopic tracers and discrete fCO2, were not subjected to bias comparison or adjustments. The original data and their documentation and DOI codes are available at the Ocean Carbon Data System of NOAA NCEI (, last access: 20 June 2020). This site also provides access to the merged data product, which is provided as a single global file and as four regional ones – the Arctic, Atlantic, Indian, and Pacific oceans – under (Olsen et al., 2020). These bias-adjusted product files also include significant ancillary and approximated data. These were obtained by interpolation of, or calculation from, measured data. This living data update documents the GLODAPv2.2020 methods and provides a broad overview of the secondary quality control procedures and results.