Title: Monte Carlo Estimates of Evaluation Metric Error and Bias
Traditional offline evaluations of recommender systems apply metrics from machine learning and information retrieval in settings where their underlying assumptions no longer hold. This results in significant error and bias in measures of top-N recommendation performance, such as precision, recall, and nDCG. Several of the specific causes of these errors, including popularity bias and misclassified decoy items, are well-explored in the existing literature. In this paper we survey a range of work on identifying and addressing these problems, and report on our work in progress to simulate the recommender data generation and evaluation processes to quantify the extent of evaluation metric errors and assess their sensitivity to various assumptions.
Award ID(s):
1751278
NSF-PAR ID:
10074452
Author(s) / Creator(s):
;
Date Published:
Journal Name:
REVEAL 2018 Workshop on Offline Evaluation in Recommender Systems
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Currently, there is a surge of interest in fair Artificial Intelligence (AI) and Machine Learning (ML) research which aims to mitigate discriminatory bias in AI algorithms, e.g., along lines of gender, age, and race. While most research in this domain focuses on developing fair AI algorithms, in this work, we examine the challenges which arise when humans and fair AI interact. Our results show that due to an apparent conflict between human preferences and fairness, a fair AI algorithm on its own may be insufficient to achieve its intended results in the real world. Using college major recommendation as a case study, we build a fair AI recommender by employing gender debiasing machine learning techniques. Our offline evaluation showed that the debiased recommender makes fairer career recommendations without sacrificing its accuracy in prediction. Nevertheless, an online user study of more than 200 college students revealed that participants on average prefer the original biased system over the debiased system. Specifically, we found that perceived gender disparity is a determining factor for the acceptance of a recommendation. In other words, we cannot fully address the gender bias issue in AI recommendations without addressing the gender bias in humans. We conducted a follow-up survey to gain additional insights into the effectiveness of various design options that can help participants to overcome their own biases. Our results suggest that making fair AI explainable is crucial for increasing its adoption in the real world. 
  2. Meila, Marina ; Zhang, Tong (Ed.)
    Incorporating graph side information into recommender systems has been widely used to better predict ratings, but relatively few works have focused on theoretical guarantees. Ahn et al. (2018) first characterized the optimal sample complexity in the presence of graph side information, but the results are limited due to strict, unrealistic assumptions made on the unknown latent preference matrix and the structure of user clusters. In this work, we propose a new model in which 1) the unknown latent preference matrix can have any discrete values, and 2) users can be clustered into multiple clusters, thereby relaxing the assumptions made in prior work. Under this new model, we fully characterize the optimal sample complexity and develop a computationally efficient algorithm that matches the optimal sample complexity. Our algorithm is robust to model errors and outperforms the existing algorithms in terms of prediction performance on both synthetic and real data.
  3. Abstract. The impact of biomass burning (BB) on the atmospheric burden of volatile organic compounds (VOCs) is highly uncertain. Here we apply the GEOS-Chem chemical transport model (CTM) to constrain BB emissions in the western USA at ∼ 25 km resolution. Across three BB emission inventories widely used in CTMs, the inventory–inventory comparison suggests that the totals of 14 modeled BB VOC emissions in the western USA agree with each other within 30 %–40 %. However, emissions for individual VOCs can differ by a factor of 1–5, driven by the regionally averaged emission ratios (ERs, reflecting both assigned ERs for specific biome and vegetation classifications) across the three inventories. We further evaluate GEOS-Chem simulations with aircraft observations made during WE-CAN (Western Wildfire Experiment for Cloud Chemistry, Aerosol Absorption and Nitrogen) and FIREX-AQ (Fire Influence on Regional to Global Environments and Air Quality) field campaigns. Despite being driven by different global BB inventories or applying various injection height assumptions, the model–observation comparison suggests that GEOS-Chem simulations underpredict observed vertical profiles by a factor of 3–7. The model shows small to no bias for most species in low-/no-smoke conditions. We thus attribute the negative model biases mostly to underestimated BB emissions in these inventories. Tripling BB emissions in the model reproduces observed vertical profiles for primary compounds, i.e., CO, propane, benzene, and toluene. However, it shows no to less significant improvements for oxygenated VOCs, particularly for formaldehyde, formic acid, acetic acid, and lumped ≥ C3 aldehydes, suggesting the model is missing secondary sources of these compounds in BB-impacted environments.
The underestimation of primary BB emissions in inventories is likely attributable to underpredicted amounts of effective dry matter burned, rather than errors in fire detection, injection height, or ERs, as constrained by aircraft and ground measurements. We cannot rule out potential sub-grid uncertainties (i.e., not being able to fully resolve fire plumes) in the nested GEOS-Chem which could explain the negative model bias partially, though back-of-the-envelope calculation and evaluation using longer-term ground measurements help support the argument of the dry matter burned underestimation. The total ERs of the 14 BB VOCs implemented in GEOS-Chem only account for half of the total 161 measured VOCs (∼ 75 versus 150 ppb ppm⁻¹). This reveals a significant amount of missing reactive organic carbon in widely used BB emission inventories. Considering both uncertainties in effective dry matter burned (× 3) and unmodeled VOCs (× 2), we infer that BB contributed to 10 % in 2019 and 45 % in 2018 (240 and 2040 Gg C) of the total VOC primary emission flux in the western USA during these two fire seasons, compared to only 1 %–10 % in the standard GEOS-Chem.
  4. Abstract. The evaluation of aerosol radiative effect on broadband hemispherical solar flux is often performed using simplified spectral and directional scattering characteristics of atmospheric aerosol and underlying surface reflectance. In this study we present a rigorous yet fast computational tool that accurately accounts for detailed variability of both spectral and angular scattering properties of aerosol and surface reflectance in the calculation of direct aerosol radiative effect. The tool is developed as part of the GRASP (Generalized Retrieval of Aerosol and Surface Properties) project. We use the tool to evaluate instantaneous and daily average radiative efficiencies (radiative effect per unit aerosol optical thickness) of several key atmospheric aerosol models over different surface types. We then examine the differences due to neglect of surface reflectance anisotropy, nonsphericity of aerosol particle shape, and accounting only for aerosol angular scattering asymmetry instead of using the full phase function. For example, it is shown that neglecting aerosol particle nonsphericity mainly causes overestimation of the aerosol cooling effect and that the magnitude of this overestimate changes significantly as a function of solar zenith angle (SZA) if the asymmetry parameter is used instead of the detailed phase function. It was also found that the nonspherical–spherical differences in the calculated aerosol radiative effect are not modified significantly if a detailed BRDF (bidirectional reflectance distribution function) is used instead of the Lambertian approximation of surface reflectance. Additionally, calculations show that usage of only angular scattering asymmetry, even for the case of spherical aerosols, modifies the dependence of instantaneous aerosol radiative effect on SZA. This effect can be canceled for daily average values, but only if the sun reaches the zenith; otherwise a systematic bias remains.
Since the daily average radiative effect is obtained by integration over a range of SZAs, the errors vary with latitude and season. In summary, the present analysis showed that use of simplified assumptions causes systematic biases, rather than random uncertainties, in calculation of both instantaneous and daily average aerosol radiative effect. Finally, we illustrate application of the rigorous aerosol radiative effect calculations performed as part of GRASP aerosol retrieval from real POLDER/PARASOL satellite observations. 
  5. The strategy for selecting candidate sets — the set of items that the recommendation system is expected to rank for each user — is an important decision in carrying out an offline top-N recommender system evaluation. The set of candidates is composed of the union of the user’s test items and an arbitrary number of non-relevant items that we refer to as decoys. Previous studies have aimed to understand the effect of different candidate set sizes and selection strategies on evaluation. In this paper, we extend this knowledge by studying the specific interaction of candidate set selection strategies with popularity bias, and use simulation to assess whether sampled candidate sets result in metric estimates that are less biased with respect to the true metric values under complete data that is typically unavailable in ordinary experiments. 