Title: Constructing and meta-evaluating state-aware evaluation metrics for interactive search systems
Abstract: Evaluation metrics such as precision, recall and normalized discounted cumulative gain have been widely applied in ad hoc retrieval experiments. They have facilitated the assessment of system performance in various topics over the past decade. However, the effectiveness of such metrics in capturing users’ in-situ search experience, especially in complex search tasks that trigger interactive search sessions, is limited. To address this challenge, it is necessary to adaptively adjust the evaluation strategies of search systems to better respond to users’ changing information needs and evaluation criteria. In this work, we adopt a taxonomy of search task states that a user goes through in different scenarios and moments of search sessions, and perform a meta-evaluation of existing metrics to better understand their effectiveness in measuring user satisfaction. We then build models for predicting task states behind queries based on in-session signals. Furthermore, we construct and meta-evaluate new state-aware evaluation metrics. Our analysis and experimental evaluation are performed on two datasets collected from a field study and a laboratory study, respectively. Results demonstrate that the effectiveness of individual evaluation metrics varies across task states, and that task states can be detected from in-session signals. In certain states, our new state-aware evaluation metrics can better reflect in-situ user satisfaction than the extensive list of widely used measures analyzed in this work. Findings of our research can inspire the design and meta-evaluation of user-centered adaptive evaluation metrics, and also shed light on the development of state-aware interactive search systems.
Award ID(s):
2106152
PAR ID:
10543480
Publisher / Repository:
Springer Nature
Journal Name:
Information Retrieval Journal
Volume:
26
Issue:
1-2
ISSN:
1386-4564
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. In interactive IR (IIR), users often seek to achieve different goals (e.g. exploring a new topic, finding a specific known item) at different search iterations and thus may evaluate system performance differently. Without a state-aware approach, it would be extremely difficult to simulate and achieve real-time adaptive search evaluation and recommendation. To address this gap, our work identifies users' task states from interactive search sessions and meta-evaluates a series of online and offline evaluation metrics under varying states based on a user study dataset consisting of 1548 unique query segments from 450 search sessions. Our results indicate that: 1) users' individual task states can be identified and predicted from search behaviors and implicit feedback; 2) the effectiveness of mainstream evaluation measures (measured by their respective correlations with user satisfaction) varies significantly across task states. This study demonstrates the implicit heterogeneity in user-oriented IR evaluation and connects studies on complex search tasks with evaluation techniques. It also informs future research on the design of state-specific, adaptive user models and evaluation metrics.
  2. Previous research demonstrates that users’ actions in search interaction are associated with relative gains and losses to reference points, known as the reference dependence effect. However, this widely confirmed effect is not represented in most user models underpinning existing search evaluation metrics. In this study, we propose a new evaluation metric framework, namely Reference Dependent Metric (ReDeM), for assessing query-level search by incorporating the effect of reference dependence into the modelling of user search behavior. To test the overall effectiveness of the proposed framework, (1) we evaluate the performance, in terms of correlation with user satisfaction, of ReDeMs built upon different reference points against that of the widely-used metrics on three search datasets; (2) we examine the performance of ReDeMs under different task states, such as task difficulty and task urgency; and (3) we analyze the statistical reliability of ReDeMs in terms of discriminative power. Experimental results indicate that: (1) ReDeMs integrated with a proper reference point achieve better correlations with user satisfaction than most of the existing metrics, such as Discounted Cumulative Gain (DCG) and Rank-Biased Precision (RBP), even though the parameters of those metrics have already been well-tuned; (2) ReDeMs reach relatively better performance compared to existing metrics when the task triggers a high-level cognitive load; (3) the discriminative power of ReDeMs is far stronger than that of Expected Reciprocal Rank (ERR), slightly stronger than that of Precision and similar to that of DCG, RBP and INST. To our knowledge, this study is the first to explicitly incorporate the reference dependence effect into the user browsing model and offline evaluation metrics. Our work illustrates a promising approach to leveraging the insights about user biases from cognitive psychology in better evaluating user search experience and enhancing user models.
  3. There is substantial evidence from behavioral economics and decision sciences demonstrating that in the context of decision-making under uncertainty, the carriers of value behind actions are gains and losses defined relative to a reference point (e.g. pre-action expectations), rather than the absolute final outcomes. Also, the capability to predict session-level search decisions and user experience early is essential for developing reactive and proactive search recommendations. To address these research gaps, our study aims to 1) develop reference dependence features based on a series of simulated user expectations or reference points in first query segments of sessions, and 2) examine the extent to which we can enhance the early prediction of session behavior and user satisfaction by constructing and employing reference dependence features. Based on the experimental results on three datasets of varying types, we found that incorporating reference dependence features developed in first query segments into prediction models achieves better performance than using baseline cost-benefit features only in the early prediction of three key session metrics (user satisfaction score, session clicks, and session dwell time). Also, when running simulations by varying the search time expectation and rate of user satisfaction decay, the results demonstrate that users tended to expect to complete their search within a minute and showed a rapid rate of satisfaction decay in a logarithmic fashion once surpassing the estimated expectation points. By factoring in a user's search time expectation and measuring their behavioral response once the expectation is not met, we can further improve the performance of early prediction models and enhance our understanding of users' behavioral patterns.
  4. User search performance is multidimensional in nature and may be better characterized by metrics that depict users' interactions with both relevant and irrelevant results. Despite previous research on one‐dimensional measures, it is still unclear how to characterize different dimensions of user performance and leverage the knowledge in developing proactive recommendations. To address this gap, we propose and empirically test a framework of search performance evaluation and build early performance prediction models to simulate proactive search path recommendations. Experimental results from four datasets of diverse types (1,482 sessions and 5,140 query segments from both controlled lab and natural settings) demonstrate that: 1) cluster patterns characterized by cost‐gain‐based multifaceted metrics can effectively differentiate high‐performing users from other searchers, which form the empirical basis for proactive recommendations; 2) whole‐session performance can be reliably predicted at early stages of sessions (e.g., first and second queries); 3) recommendations built upon the search paths of system‐identified high‐performing searchers can significantly improve the search performance of struggling users. Experimental results demonstrate the potential of our approach for leveraging collective wisdom from automatically identified high‐performance user groups in developing and evaluating proactive in‐situ search recommendations.
  5. Understanding the roles of search gain and cost in users' search decision‐making is a key topic in interactive information retrieval (IIR). While previous research has developed user models based on simulated gains and costs, it is unclear how users' actual perceptions of search gains and costs form and change during search interactions. To address this gap, our study adopted expectation‐confirmation theory (ECT) to investigate users' perceptions of gains and costs. We re‐analyzed data from our previous study, examining how contextual and search features affect users' perceptions and how their expectation‐confirmation states impact their following searches. Our findings include: (1) The point where users' actual dwell time meets their constant expectation may serve as a reference point in evaluating perceived gain and cost; (2) these perceptions are associated with in situ experience represented by usefulness labels, browsing behaviors, and queries; (3) users' current confirmation states affect their perceptions of Web page usefulness in the subsequent query. Our findings demonstrate possible effects of expectation‐confirmation, prospect theory, and information foraging theory, highlighting the complex relationships among gain/cost, expectations, and dwell time at the query level, and the reference‐dependent expectation at the session level. These insights enrich user modeling and evaluation in human‐centered IR.