Benchmarking Gaze Prediction for Categorical Visual Search
The prediction of human shifts of attention is a widely-studied question in both behavioral and computer vision, especially in the context of a free viewing task. However, search behavior, where the fixation scanpaths are highly dependent on the viewer's goals, has received far less attention, even though visual search constitutes much of a person's everyday behavior. One reason for this is the absence of real-world image datasets on which search models can be trained. In this paper we present a carefully created dataset for two target categories, microwaves and clocks, curated from the COCO2014 dataset. A total of 2183 images were presented to multiple participants, who were tasked with searching for one of the two categories. This yields a total of 16184 validated fixations used for training, making our microwave-clock dataset currently one of the largest datasets of eye fixations in categorical search. We also present a 40-image testing dataset, where images depict both a microwave and a clock target. Distinct fixation patterns emerged depending on whether participants searched for a microwave (n=30) or a clock (n=30) in the same images, meaning that models need to predict different search scanpaths from the same pixel inputs. We report the results of …
NSF-PAR ID:
10091972
Journal Name:
CVPR Workshop - Mutual Benefits of Cognitive and Computer Vision
Attention control is a basic behavioral process that has been studied for decades. The currently best models of attention control are deep networks trained on free-viewing behavior to predict bottom-up attention control – saliency. We introduce COCO-Search18, the first dataset of laboratory-quality goal-directed behavior large enough to train deep-network models. We collected eye-movement behavior from 10 people searching for each of 18 target-object categories in 6202 natural-scene images, yielding ~300,000 search fixations. We thoroughly characterize COCO-Search18, and benchmark it using three machine-learning methods: a ResNet50 object detector, a ResNet50 trained on fixation-density maps, and an inverse-reinforcement-learning model trained on behavioral search scanpaths. Models were also trained/tested on images transformed to approximate a foveated retina, a fundamental biological constraint. These models, each having a different reliance on behavioral training, collectively comprise the new state-of-the-art in predicting goal-directed search fixations. Our expectation is that future work using COCO-Search18 will far surpass these initial efforts, finding applications in domains ranging from human-computer interactive systems that can anticipate a person’s intent and render assistance to the potentially early identification of attention-related clinical disorders (ADHD, PTSD, phobia) based on deviation from neurotypical fixation behavior.