

Title: State-Aware Meta-Evaluation of Evaluation Metrics in Interactive Information Retrieval
In interactive IR (IIR), users often pursue different goals (e.g., exploring a new topic, finding a specific known item) at different search iterations and thus may evaluate system performance differently. Without a state-aware approach, it would be extremely difficult to simulate and achieve real-time adaptive search evaluation and recommendation. To address this gap, our work identifies users' task states from interactive search sessions and meta-evaluates a series of online and offline evaluation metrics under varying states, based on a user study dataset consisting of 1,548 unique query segments from 450 search sessions. Our results indicate that: 1) users' individual task states can be identified and predicted from search behaviors and implicit feedback; and 2) the effectiveness of mainstream evaluation measures (measured by their respective correlations with user satisfaction) varies significantly across task states. This study demonstrates the implicit heterogeneity in user-oriented IR evaluation and connects studies on complex search tasks with evaluation techniques. It also informs future research on the design of state-specific, adaptive user models and evaluation metrics.
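The meta-evaluation described in the abstract can be approximated by correlating each metric's per-query scores with user satisfaction separately within each task state. The sketch below assumes per-query metric scores, satisfaction ratings, and state labels are available in a table; the column names and the choice of Kendall's tau are illustrative assumptions, not the paper's exact procedure.

```python
# Minimal sketch of a state-aware meta-evaluation: correlate each metric's
# per-query scores with user satisfaction, separately within each task state.
# Column names ("state", "satisfaction", metric columns) are assumed for
# illustration, not the dataset's actual schema.
import pandas as pd
from scipy.stats import kendalltau

def meta_evaluate(df: pd.DataFrame, metric_cols: list[str]) -> pd.DataFrame:
    """Return Kendall's tau between each metric and satisfaction, per task state."""
    rows = []
    for state, group in df.groupby("state"):
        for metric in metric_cols:
            tau, p = kendalltau(group[metric], group["satisfaction"])
            rows.append({"state": state, "metric": metric, "tau": tau, "p_value": p})
    return pd.DataFrame(rows)

# Example usage with hypothetical columns:
# df = pd.DataFrame({"state": [...], "satisfaction": [...], "nDCG@10": [...], "RBP": [...]})
# print(meta_evaluate(df, ["nDCG@10", "RBP"]))
```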
Award ID(s):
2106152
PAR ID:
10328992
Author(s) / Creator(s):
Date Published:
Journal Name:
30th ACM International Conference on Information & Knowledge Management
Page Range / eLocation ID:
3258 to 3262
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

Evaluation metrics such as precision, recall, and normalized discounted cumulative gain have been widely applied in ad hoc retrieval experiments. They have facilitated the assessment of system performance on various topics over the past decade. However, the effectiveness of such metrics in capturing users' in-situ search experience, especially in complex search tasks that trigger interactive search sessions, is limited. To address this challenge, it is necessary to adaptively adjust the evaluation strategies of search systems to better respond to users' changing information needs and evaluation criteria. In this work, we adopt a taxonomy of search task states that a user goes through in different scenarios and moments of search sessions, and perform a meta-evaluation of existing metrics to better understand their effectiveness in measuring user satisfaction. We then build models for predicting the task states behind queries based on in-session signals. Furthermore, we construct and meta-evaluate new state-aware evaluation metrics. Our analysis and experimental evaluation are performed on two datasets collected from a field study and a laboratory study, respectively. Results demonstrate that the effectiveness of individual evaluation metrics varies across task states, and that task states can be detected from in-session signals. In certain states, our new state-aware evaluation metrics reflect in-situ user satisfaction better than the extensive list of widely used measures analyzed in this work. Findings of our research can inspire the design and meta-evaluation of user-centered adaptive evaluation metrics, and also shed light on the development of state-aware interactive search systems.
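The state-prediction step mentioned above can be illustrated with a small classifier trained on in-session behavioral features to predict the task state behind each query. The feature names and the random forest model are assumptions for demonstration; they are not the features or models reported in the paper.

```python
# Illustrative sketch of predicting task states from in-session signals.
# Feature names and the RandomForest model are assumed for demonstration.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

FEATURES = ["query_length", "dwell_time", "num_clicks", "scroll_depth", "serp_time"]

def evaluate_state_predictor(X: np.ndarray, y: np.ndarray) -> float:
    """Cross-validated accuracy of a task-state classifier on in-session features."""
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    return cross_val_score(clf, X, y, cv=5, scoring="accuracy").mean()

# Example usage with synthetic data (one row per query segment):
rng = np.random.default_rng(0)
X = rng.random((200, len(FEATURES)))
y = rng.integers(0, 4, size=200)  # four hypothetical task states
print(f"Mean CV accuracy: {evaluate_state_predictor(X, y):.3f}")
```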

     
2. Previous research demonstrates that users' actions in search interaction are associated with relative gains and losses with respect to reference points, known as the reference dependence effect. However, this widely confirmed effect is not represented in most user models underpinning existing search evaluation metrics. In this study, we propose a new evaluation metric framework, the Reference Dependent Metric (ReDeM), for assessing query-level search by incorporating the effect of reference dependence into the modelling of user search behavior. To test the overall effectiveness of the proposed framework, (1) we evaluate the performance, in terms of correlation with user satisfaction, of ReDeMs built upon different reference points against that of widely used metrics on three search datasets; (2) we examine the performance of ReDeMs under different task states, such as task difficulty and task urgency; and (3) we analyze the statistical reliability of ReDeMs in terms of discriminative power. Experimental results indicate that: (1) ReDeMs integrated with a proper reference point achieve better correlations with user satisfaction than most existing metrics, such as Discounted Cumulative Gain (DCG) and Rank-Biased Precision (RBP), even when the parameters of those metrics have been well tuned; (2) ReDeMs perform relatively better than existing metrics when the task triggers a high cognitive load; and (3) the discriminative power of ReDeMs is far stronger than that of Expected Reciprocal Rank (ERR), slightly stronger than that of Precision, and similar to that of DCG, RBP, and INST. To our knowledge, this study is the first to explicitly incorporate the reference dependence effect into the user browsing model and offline evaluation metrics. Our work illustrates a promising approach to leveraging insights about user biases from cognitive psychology to better evaluate user search experience and enhance user models.
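The abstract does not give ReDeM's formula, but the core idea, re-valuing per-rank gain relative to a reference point before aggregating under a browsing model, can be sketched as follows. The prospect-theory-style value function and the RBP-style discount are illustrative assumptions; this is not the published ReDeM formulation.

```python
# Hypothetical sketch of a reference-dependent, query-level metric: per-rank
# relevance is re-valued relative to a reference point with a prospect-theory
# style value function, then aggregated under an RBP-style browsing model.
# This is NOT the published ReDeM formula; it only illustrates the idea.

def value(gain: float, reference: float, alpha: float = 0.88, lam: float = 2.25) -> float:
    """Prospect-theory-style value of a gain relative to a reference point."""
    delta = gain - reference
    if delta >= 0:
        return delta ** alpha
    return -lam * ((-delta) ** alpha)

def reference_dependent_score(rels: list[float], reference: float, p: float = 0.8) -> float:
    """RBP-style aggregation of reference-dependent values over ranks."""
    return (1 - p) * sum(value(r, reference) * (p ** i) for i, r in enumerate(rels))

# Example: graded relevance in [0, 1] for a ranked list, reference point 0.5.
print(reference_dependent_score([1.0, 0.5, 0.0, 1.0], reference=0.5))
```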
  3. ABSTRACT

User search performance is multidimensional in nature and may be better characterized by metrics that depict users' interactions with both relevant and irrelevant results. Despite previous research on one-dimensional measures, it is still unclear how to characterize different dimensions of user performance and leverage that knowledge in developing proactive recommendations. To address this gap, we propose and empirically test a framework of search performance evaluation and build early performance prediction models to simulate proactive search path recommendations. Experimental results from four datasets of diverse types (1,482 sessions and 5,140 query segments from both controlled lab and natural settings) demonstrate that: 1) cluster patterns characterized by cost-gain-based multifaceted metrics can effectively differentiate high-performing users from other searchers, which forms the empirical basis for proactive recommendations; 2) whole-session performance can be reliably predicted at early stages of sessions (e.g., the first and second queries); and 3) recommendations built upon the search paths of system-identified high-performing searchers can significantly improve the search performance of struggling users. These results demonstrate the potential of our approach for leveraging collective wisdom from automatically identified high-performance user groups in developing and evaluating proactive in-situ search recommendations.
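The clustering step described above can be illustrated with a small sketch that groups searchers by cost and gain features and inspects which cluster contains the high performers. The two features (total search time as cost, useful pages found as gain) and the choice of k = 3 are assumptions for demonstration, not the paper's actual multifaceted metrics.

```python
# Illustrative sketch: cluster searchers by cost-gain features to separate
# high-performing users from others. Features and k=3 are assumed.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# One row per session: [total_search_cost (seconds), total_gain (useful pages)]
rng = np.random.default_rng(1)
sessions = np.vstack([
    rng.normal([600, 2], [100, 1], (50, 2)),  # low gain at high cost
    rng.normal([300, 6], [80, 1], (50, 2)),   # high gain at moderate cost
])

X = StandardScaler().fit_transform(sessions)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# A cluster with high mean gain and low mean cost would be the candidate
# "high-performing" group whose search paths could seed recommendations.
for c in range(3):
    cost, gain = sessions[labels == c].mean(axis=0)
    print(f"cluster {c}: mean cost={cost:.0f}s, mean gain={gain:.1f}")
```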

     
  4. Abstract

    Understanding the roles ofsearch gainandcostin users' search decision‐making is a key topic in interactive information retrieval (IIR). While previous research has developed user models based onsimulatedgains and costs, it is unclear how users' actualperceptions of search gains and costsform and change during search interactions. To address this gap, our study adopted expectation‐confirmation theory (ECT) to investigate users' perceptions of gains and costs. We re‐analyzed data from our previous study, examining how contextual and search features affect users' perceptions and how their expectation‐confirmation states impact their following searches. Our findings include: (1) The point where users' actual dwell time meets their constant expectation may serve as a reference point in evaluating perceived gain and cost; (2) these perceptions are associated with in situ experience represented by usefulness labels, browsing behaviors, and queries; (3) users' current confirmation states affect their perceptions of Web page usefulness in the subsequent query. Our findings demonstrate possible effects of expectation‐confirmation, prospect theory, and information foraging theory, highlighting the complex relationships among gain/cost, expectations, and dwell time at the query level, and the reference‐dependent expectation at the session level. These insights enrich user modeling and evaluation in human‐centered IR.
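One way to operationalize the dwell-time reference point described above is to compare each page visit's actual dwell time against an expected dwell time and record whether the expectation was met. The 30-second expectation below is an assumed placeholder, not a value reported in the paper, and the three-way labeling is only a sketch of the idea.

```python
# Illustrative sketch: compare actual dwell time with an expected dwell-time
# reference point to derive a per-visit expectation state. The 30-second
# expectation is an assumed placeholder, not a value from the paper.
from collections import Counter

def confirmation_state(dwell_seconds: float, expected_dwell: float = 30.0) -> str:
    """Label a visit relative to the expected dwell-time reference point."""
    if dwell_seconds < expected_dwell:
        return "below_expectation"
    if dwell_seconds > expected_dwell:
        return "above_expectation"
    return "meets_expectation"

# Example: dwell times (seconds) observed within one query segment.
dwell_times = [12.0, 45.5, 30.0, 8.2, 61.0]
print(Counter(confirmation_state(d) for d in dwell_times))
```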

     
5. Obeid, Iyad; Selesnick, Ivan; Picone, Joseph (Eds.)
The evaluation of machine learning algorithms in biomedical fields for applications involving sequential data lacks both rigor and standardization. Common quantitative scalar evaluation metrics such as sensitivity and specificity can often be misleading and may not accurately reflect application requirements. Evaluation metrics must ultimately reflect the needs of users yet be sufficiently sensitive to guide algorithm development. For example, feedback from critical care clinicians who use automated event detection software in clinical applications has been overwhelmingly emphatic that a low false alarm rate, typically measured in units of the number of errors per 24 hours, is the single most important criterion for user acceptance. Though using a single metric is not often as insightful as examining performance over a range of operating conditions, there is, nevertheless, a need for a single scalar figure of merit. In this chapter, we discuss the deficiencies of existing metrics for a seizure detection task and propose several new metrics that offer a more balanced view of performance. We demonstrate these metrics on a seizure detection task based on the TUH EEG Seizure Corpus. We introduce two promising metrics: (1) a measure based on a concept borrowed from the spoken term detection literature, Actual Term-Weighted Value, and (2) a new metric, Time-Aligned Event Scoring (TAES), that accounts for the temporal alignment of the hypothesis to the reference annotation. We demonstrate that state-of-the-art technology based on deep learning, though impressive in its performance, still needs significant improvement before it will meet very strict user acceptance guidelines.
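The false-alarm criterion mentioned above can be made concrete with a small sketch that counts hypothesized events overlapping no reference event and normalizes by recording duration. The simple any-overlap matching rule is an assumption for illustration; it is not the chapter's ATWV or TAES scoring.

```python
# Illustrative sketch: compute a false-alarm rate in errors per 24 hours for
# event detection on sequential data. Events are (start_sec, end_sec) tuples.
# The any-overlap matching rule is assumed for illustration only.

def overlaps(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True if two [start, end] intervals overlap."""
    return a[0] < b[1] and b[0] < a[1]

def false_alarms_per_24h(hypotheses: list[tuple[float, float]],
                         references: list[tuple[float, float]],
                         total_duration_sec: float) -> float:
    """Hypothesized events that overlap no reference event, per 24 hours of data."""
    false_alarms = sum(
        1 for h in hypotheses if not any(overlaps(h, r) for r in references)
    )
    return false_alarms * 86400.0 / total_duration_sec

# Example: 3 hypothesized events, 2 reference seizures, 8 hours of EEG.
hyp = [(100.0, 130.0), (5000.0, 5050.0), (20000.0, 20020.0)]
ref = [(95.0, 140.0), (15000.0, 15100.0)]
print(f"{false_alarms_per_24h(hyp, ref, 8 * 3600):.2f} false alarms / 24 h")
```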