Human scene categorization is characterized by its remarkable speed. While many visual and conceptual features have been linked to this ability, significant correlations exist between feature spaces, impeding our ability to determine their relative contributions to scene categorization. Here, we used a whitening transformation to decorrelate a variety of visual and conceptual features and assess the time course of their unique contributions to scene categorization. Participants (both sexes) viewed 2250 full-color scene images drawn from 30 different scene categories while their brain activity was measured with 256-channel EEG. We examined the variance explained at each electrode and time point of visual event-related potential (vERP) data by nine different whitened encoding models. These ranged from low-level features obtained from filter outputs to high-level conceptual features requiring human annotation. The amount of category information in the vERPs was assessed through multivariate decoding methods. Behavioral similarity measures were obtained in separate crowdsourced experiments. We found that all nine models together accounted for 78% of the variance in human scene similarity assessments and were within the noise ceiling of the vERP data. Low-level models explained earlier vERP variability (88 ms after image onset), whereas high-level models explained later variance (169 ms). Critically, only high-level models shared vERP variability with behavior. Together, these results suggest that scene categorization is primarily a high-level process, but one that relies on previously extracted low-level features.
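The whitening step mentioned in the abstract above can be illustrated with a minimal sketch. This is not the authors' pipeline: the ZCA variant, the function name `zca_whiten`, the `eps` regularizer, and the synthetic data are all assumptions chosen for illustration. It only shows how a whitening transformation turns correlated feature columns into decorrelated, unit-variance ones.

```python
import numpy as np

def zca_whiten(X, eps=1e-8):
    """Whiten the columns of X (n_samples x n_features).

    Returns a matrix whose sample covariance is approximately the identity.
    ZCA whitening is one of several possible whitening transforms; it is
    used here purely as an illustrative assumption.
    """
    Xc = X - X.mean(axis=0)                      # center each feature
    cov = np.cov(Xc, rowvar=False)               # feature covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)       # symmetric eigendecomposition
    # W = E diag(1/sqrt(lambda)) E^T rotates, rescales, and rotates back
    W = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    return Xc @ W

# Hypothetical data: 500 "images" described by 3 strongly correlated features
rng = np.random.default_rng(0)
base = rng.normal(size=(500, 1))
X = np.hstack([base + 0.5 * rng.normal(size=(500, 1)) for _ in range(3)])

Xw = zca_whiten(X)
cov_before = np.cov(X, rowvar=False)   # large off-diagonal entries
cov_after = np.cov(Xw, rowvar=False)   # approximately the identity matrix
```

After whitening, each column can enter a regression without sharing variance with the others, which is the property that allows unique contributions to be separated.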
From Pixels to Scene Categories: Unique and Early Contributions of Functional and Visual Features
Human scene categorization is rapid and robust, but we have little understanding of how individual features contribute to categorization or of the time scale of their contributions. This issue is compounded by the non-independence of the many candidate features. Here, we used singular value decomposition to orthogonalize 11 different scene descriptors that included both visual and semantic features. Using high-density EEG and regression analyses, we observed that most explained variability was carried by a late layer of a deep convolutional neural network, as well as a model of a scene’s functions given by the American Time Use Survey. Furthermore, features that explained more variance also tended to explain earlier variance. These results extend previous large-scale behavioral results showing the importance of functional features for scene categorization. Furthermore, these results fail to support models of visual perception that are encapsulated from higher-level cognitive attributes.
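As a rough illustration of orthogonalizing correlated descriptors with singular value decomposition before regression, consider the sketch below. The data are synthetic and the variable names are hypothetical; the abstract does not say how the orthogonal components were mapped back to individual descriptors, so this sketch stops at per-component explained variance.

```python
import numpy as np

rng = np.random.default_rng(1)
n_obs, n_desc = 200, 11          # 11 descriptors, as in the abstract; n_obs is arbitrary

# Hypothetical correlated descriptors: a shared signal added to every column
shared = rng.normal(size=(n_obs, 1))
X = shared + rng.normal(size=(n_obs, n_desc))
X = X - X.mean(axis=0)           # center each descriptor

# Thin SVD: the columns of U form an orthonormal basis for the descriptor space
U, s, Vt = np.linalg.svd(X, full_matrices=False)
gram = U.T @ U                   # approximately the identity matrix

# Hypothetical response (e.g. one electrode at one time point), centered
y = X @ rng.normal(size=n_desc) + rng.normal(size=n_obs)
y = y - y.mean()

# With orthonormal predictors, least-squares coefficients are simple projections,
# and each component's share of variance is independent of the others
beta = U.T @ y
r2_per_component = beta**2 / np.sum(y**2)
r2_total = r2_per_component.sum()   # equals R^2 of the full regression on X
```

Because the columns of `U` are mutually orthogonal, the per-component values add up to the total variance explained, with no overlap to apportion.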
- Award ID(s): 1736274
- PAR ID: 10066328
- Date Published:
- Journal Name: Computational Cognitive Neuroscience
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
-
Visual scene category representations emerge very rapidly, yet the computational transformations that enable such invariant categorizations remain elusive. Deep convolutional neural networks (CNNs) perform visual categorization at near human-level accuracy using a feedforward architecture, providing neuroscientists with the opportunity to assess one successful series of representational transformations that enable categorization in silico. The goal of the current study is to assess the extent to which sequential scene category representations built by a CNN map onto those built in the human brain as assessed by high-density, time-resolved event-related potentials (ERPs). We found correspondence both over time and across the scalp: earlier (0–200 ms) ERP activity was best explained by early CNN layers at all electrodes. Although later activity at most electrode sites corresponded to earlier CNN layers, activity in right occipito-temporal electrodes was best explained by the later, fully-connected layers of the CNN around 225 ms post-stimulus, along with similar patterns in frontal electrodes. Taken together, these results suggest that scene category representations emerge through a dynamic interplay between early activity over occipital electrodes and later activity over temporal and frontal electrodes.
-
While viewing a visual stimulus, we often cannot tell whether it is inherently memorable or forgettable. However, the memorability of a stimulus can be quantified and partially predicted by a collection of conceptual and perceptual factors. Higher-level properties that represent the “meaningfulness” of a visual stimulus to viewers best predict whether it will be remembered or forgotten across a population. Here, we hypothesize that the feelings evoked by an image, operationalized as the valence and arousal dimensions of affect, significantly contribute to the memorability of scene images. We ran two complementary experiments to investigate the influence of affect on scene memorability, in the process creating a new image set (VAMOS) of hundreds of natural scene images for which we obtained valence, arousal, and memorability scores. From our first experiment, we found memorability to be highly reliable for scene images that span a wide range of evoked arousal and valence. From our second experiment, we found that both valence and arousal are significant but weak predictors of image memorability. Scene images were most memorable if they were slightly negatively valenced and highly arousing. Images that were extremely positive or unarousing were most forgettable. Valence and arousal together accounted for less than 8% of the variance in image memorability. These findings suggest that evoked affect contributes to the overall memorability of a scene image but, like other singular predictors, does not fully explain it. Instead, memorability is best explained by an assemblage of visual features that combine, in perhaps unintuitive ways, to predict what is likely to stick in our memory.
-
Isik, Leyla (Ed.) After years of experience, humans become experts at perceiving letters. Is this visual capacity attained by learning specialized letter features, or by reusing general visual features previously learned in service of object categorization? To explore this question, we first measured the perceptual similarity of letters in two behavioral tasks, visual search and letter categorization. Then, we trained deep convolutional neural networks on either 26-way letter categorization or 1000-way object categorization, as a way to operationalize possible specialized letter features and general object-based features, respectively. We found that the general object-based features more robustly correlated with the perceptual similarity of letters. We then operationalized additional forms of experience-dependent letter specialization by altering object-trained networks with varied forms of letter training; however, none of these forms of letter specialization improved the match to human behavior. Thus, our findings reveal that it is not necessary to appeal to specialized letter representations to account for perceptual similarity of letters. Instead, we argue that it is more likely that the perception of letters depends on domain-general visual features.
-
We train embodied agents to play Visual Hide and Seek to study the relationship between agent behaviors and environmental complexity. In Visual Hide and Seek, a prey must navigate a simulated environment to avoid capture by a predator, relying only on first-person visual observations. By probing different environmental factors, we find that agents exhibit diverse hiding strategies and even knowledge of their own visibility to other agents in the scene. Furthermore, we quantitatively analyze how agent weaknesses, such as slower speed, affect the learned policy. Our results suggest that, although agent weaknesses make the learning problem more challenging, they also cause more useful features to be learned.