

Title: Same-different conceptualization: a machine vision perspective
The goal of this review is to bring together material from cognitive psychology with recent machine vision studies to identify plausible neural mechanisms for visual same-different discrimination and relational understanding. We highlight how developments in the study of artificial neural networks provide computational evidence implicating attention and working memory in ascertaining visual relations, including same-different relations. We review some recent attempts to incorporate these mechanisms into flexible models of visual reasoning. Particular attention is given to recent models jointly trained on visual and linguistic information. These recent systems are promising, but they still fall short of the biological standard in several ways, which we outline in a final section.
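To illustrate why the same-different task stresses purely feedforward models, here is a toy sketch (grid sizes, item placement, and all names are our own illustrative assumptions, not from the review) that generates SD trials and solves them only by hard-coding the two item locations, which is exactly the alignment step that attention and working memory are proposed to supply:

```python
# Toy same-different (SD) trial generator; a minimal, illustrative sketch.
import numpy as np

rng = np.random.default_rng(0)

def make_item(size=5):
    """Random binary pattern on a small grid."""
    return (rng.random((size, size)) > 0.5).astype(np.uint8)

def make_trial(same: bool, canvas=16, size=5):
    """Place two items on a blank canvas; label 1 = same, 0 = different."""
    img = np.zeros((canvas, canvas), dtype=np.uint8)
    a = make_item(size)
    b = a.copy() if same else make_item(size)
    img[1:1 + size, 1:1 + size] = a                        # left item
    img[1:1 + size, canvas - size - 1:canvas - 1] = b      # right item
    return img, int(same)

img, label = make_trial(same=True)
# A local pixel-matching "model" only works if the two items are first
# aligned; here we cheat by hard-coding their known positions.
left = img[1:6, 1:6]
right = img[1:6, 10:15]
pred_same = int(np.array_equal(left, right))
```

Without the hard-coded alignment, a feedforward network must learn position-invariant identity comparison from pixels alone, which is the limitation the review discusses.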
Award ID(s):
1912280 1740741
NSF-PAR ID:
10205786
Journal Name:
Current Opinion in Behavioral Sciences
Volume:
37
ISSN:
2352-1546
Page Range / eLocation ID:
47-55
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

It has been debated whether salient distractors in visual search can be proactively suppressed to completely prevent attentional capture, as the occurrence of proactive suppression implies that the initial shift of attention is not entirely driven by physical salience. While the presence of a Pd component in the EEG (associated with suppression) without a preceding N2pc component (associated with selection) has been used as evidence for proactive suppression, the link between these ERPs and the underlying mechanisms is not always clear. This is exemplified in two recent articles that observed the same waveform pattern, in which an early Pd-like component flipped to an N2pc-like component, but provided vastly different interpretations (Drisdelle, B. L., & Eimer, M. PD components and distractor inhibition in visual search: New evidence for the signal suppression hypothesis. Psychophysiology, 58, e13898, 2021; Kerzel, D., & Burra, N. Capture by context elements, not attentional suppression of distractors, explains the PD with small search displays. Journal of Cognitive Neuroscience, 32, 1170–1183, 2020). Using RAGNAROC (Wyble et al., Understanding visual attention with RAGNAROC: A Reflexive Attention Gradient through Neural AttRactOr Competition. Psychological Review, 127, 1163–1198, 2020), a computational model of reflexive attention, we successfully simulated this ERP pattern with minimal changes to its existing architecture, providing a parsimonious and mechanistic explanation for this flip in the EEG that is distinct from both of the previous interpretations. Our account supports the occurrence of proactive suppression and demonstrates the benefits of incorporating computational modeling into theory building.
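For context, the N2pc and Pd are standardly quantified as contralateral-minus-ipsilateral difference waves at lateral posterior electrodes (e.g., PO7/PO8). A minimal sketch on a synthetic waveform (the amplitudes and latencies are illustrative assumptions, not data from any of the cited studies):

```python
# Contra-minus-ipsi difference wave: a positive deflection is Pd-like
# (suppression), a negative deflection is N2pc-like (selection).
# The "ERP" below is synthetic and purely illustrative.
import numpy as np

fs = 500                                   # sampling rate (Hz)
t = np.arange(0, 0.5, 1 / fs)              # 0-500 ms epoch

def gauss(t, mu, sigma, amp):
    return amp * np.exp(-0.5 * ((t - mu) / sigma) ** 2)

# Assume a lateral distractor in the left hemifield: PO8 is contralateral.
contra = gauss(t, 0.15, 0.02, 1.0) + gauss(t, 0.25, 0.03, -1.5)
ipsi = np.zeros_like(t)                    # flat ipsilateral channel (toy)
diff = contra - ipsi                       # the difference wave

early = diff[(t >= 0.10) & (t <= 0.20)].mean()   # Pd-like window
late = diff[(t >= 0.20) & (t <= 0.30)].mean()    # N2pc-like window
```

With these toy parameters, the early window mean is positive (Pd-like) and the late window mean is negative (N2pc-like), mimicking the flip pattern the abstract describes.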

     
  2.
    The development of deep convolutional neural networks (CNNs) has recently led to great successes in computer vision, and CNNs have become de facto computational models of vision. However, a growing body of work suggests that they exhibit critical limitations beyond image categorization. Here, we study one such fundamental limitation: judging whether two simultaneously presented items are the same or different (SD), compared with a baseline assessment of their spatial relationship (SR). In both human subjects and artificial neural networks, we test the prediction that SD tasks recruit additional cortical mechanisms that underlie critical aspects of visual cognition not explained by current computational models. We thus recorded EEG signals from human participants engaged in the same tasks as the computational models. Importantly, in humans the two tasks were matched for difficulty by an adaptive psychometric procedure; yet, on top of a modulation of evoked potentials, our results revealed higher activity in the low-beta (16–24 Hz) band in the SD than in the SR condition. We surmise that these oscillations reflect the crucial involvement of additional mechanisms, such as working memory and attention, which are missing in current feed-forward CNNs.
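A low-beta band-power comparison of this kind could, for instance, be computed per condition with Welch's method. A hedged sketch on synthetic data (the 20 Hz oscillation, sampling rate, and window length are illustrative assumptions, not the study's parameters):

```python
# Estimating low-beta (16-24 Hz) band power from one EEG channel with
# Welch's method. The signal is synthetic: white noise plus a 20 Hz tone.
import numpy as np
from scipy.signal import welch

fs = 250
t = np.arange(0, 10, 1 / fs)
rng = np.random.default_rng(1)
eeg = rng.standard_normal(t.size) + 0.8 * np.sin(2 * np.pi * 20 * t)

f, psd = welch(eeg, fs=fs, nperseg=2 * fs)    # 0.5 Hz resolution
beta = (f >= 16) & (f <= 24)
beta_power = psd[beta].sum() * (f[1] - f[0])  # integrate PSD over the band
```

In the study's design, one would compute this per trial and condition and contrast SD against SR band power statistically.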
  3. Graphs are powerful representations for relations among objects, which have attracted plenty of attention in both academia and industry. A fundamental challenge for graph learning is how to train an effective Graph Neural Network (GNN) encoder without labels, which are expensive and time-consuming to obtain. Contrastive Learning (CL) is one of the most popular paradigms to address this challenge, training GNNs by discriminating positive and negative node pairs. Despite the success of recent CL methods, two problems remain under-explored. First, how can the semantic error introduced by random topology-based data augmentations be reduced? Traditional CL defines positive and negative node pairs via node-level topological proximity, which is based solely on the graph topology regardless of the semantic information of node attributes, so some semantically similar nodes can be wrongly treated as negative pairs. Second, how can the multiplexity of real-world graphs be modeled effectively, where nodes are connected by various relations and each relation could form a homogeneous graph layer? To solve these problems, we propose a novel multiplex heterogeneous graph prototypical contrastive learning (X-GOAL) framework to extract node embeddings. X-GOAL is comprised of two components: the GOAL framework, which learns node embeddings for each homogeneous graph layer, and an alignment regularization, which jointly models different layers by aligning layer-specific node embeddings. Specifically, the GOAL framework captures node-level information via a succinct graph transformation technique, and captures cluster-level information by pulling nodes within the same semantic cluster closer in the embedding space. The alignment regularization aligns embeddings across layers at both the node level and the cluster level.
We evaluate X-GOAL on a variety of real-world datasets and downstream tasks, demonstrating the effectiveness of the framework.
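The positive/negative node-pair discrimination at the heart of such contrastive objectives is commonly an InfoNCE-style loss. A minimal NumPy sketch with random placeholder embeddings (this is the generic objective, not the X-GOAL implementation; the temperature value is an assumption):

```python
# InfoNCE-style contrastive loss for one anchor node embedding.
# Embeddings here are random placeholders; tau is an assumed temperature.
import numpy as np

rng = np.random.default_rng(0)

def info_nce(anchor, positive, negatives, tau=0.5):
    """Cross-entropy of picking the positive among positive + negatives."""
    def cos(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    logits = np.array([cos(anchor, positive)]
                      + [cos(anchor, n) for n in negatives]) / tau
    logits -= logits.max()                 # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])               # positive sits at index 0

d = 8
z = rng.standard_normal(d)
# Easy case: the positive is the anchor itself, negatives are random.
loss_easy = info_nce(z, z, [rng.standard_normal(d) for _ in range(5)])
# Hard case: a random "positive" while every negative equals the anchor --
# the semantic-error situation the abstract warns about.
loss_hard = info_nce(z, rng.standard_normal(d), [z.copy() for _ in range(5)])
```

The hard case shows why mislabeled pairs (semantically similar nodes treated as negatives) inflate the loss and distort training, motivating the semantics-aware pair construction described above.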
  4. Abstract

Speech processing often occurs amid competing inputs from other modalities, for example, listening to the radio while driving. We examined the extent to which dividing attention between auditory and visual modalities (bimodal divided attention) impacts neural processing of natural continuous speech from acoustic to linguistic levels of representation. We recorded electroencephalographic (EEG) responses while human participants performed a challenging primary visual task, imposing low or high cognitive load, while listening to audiobook stories as a secondary task. The two dual-task conditions were contrasted with an auditory single-task condition in which participants attended to stories while ignoring visual stimuli. Behaviorally, the high-load dual-task condition was associated with lower speech comprehension accuracy relative to the other two conditions. We fitted multivariate temporal response function encoding models to predict EEG responses from acoustic and linguistic speech features at different representation levels, including auditory spectrograms and information-theoretic models of sublexical-, word-form-, and sentence-level representations. Neural tracking of most acoustic and linguistic features remained unchanged with increasing dual-task load, despite unambiguous behavioral and neural evidence of the high-load dual-task condition being more demanding. Compared to the auditory single-task condition, dual-task conditions selectively reduced neural tracking of only some acoustic and linguistic features, mainly at latencies >200 ms, while earlier latencies were surprisingly unaffected. These findings indicate that behavioral effects of bimodal divided attention on continuous speech processing occur not because of impaired early sensory representations but likely at later cognitive processing stages. Crossmodal attention-related mechanisms may not be uniform across different speech processing levels.
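In its simplest form, a temporal response function (TRF) encoding model of this kind reduces to ridge regression from time-lagged stimulus features to an EEG channel. A sketch on synthetic data (the lag range, ridge parameter, and the "true" TRF kernel are illustrative assumptions, not the study's settings):

```python
# Ridge-regression TRF: recover the filter mapping a stimulus feature
# (e.g., the speech envelope) to an EEG channel. All data are synthetic.
import numpy as np

rng = np.random.default_rng(0)
fs = 64
n = fs * 60                                   # 60 s of "recording"
stim = rng.standard_normal(n)                 # stand-in for a speech envelope
true_trf = np.array([0.0, 0.5, 1.0, 0.5, 0.0, -0.3, 0.0, 0.0])
eeg = np.convolve(stim, true_trf)[:n] + 0.5 * rng.standard_normal(n)

# Design matrix of lagged stimulus samples (lags 0 .. n_lags-1)
n_lags = len(true_trf)
X = np.column_stack([np.concatenate([np.zeros(k), stim[:n - k]])
                     for k in range(n_lags)])
lam = 1.0                                     # ridge regularization strength
w = np.linalg.solve(X.T @ X + lam * np.eye(n_lags), X.T @ eeg)
# w now approximates true_trf; its entries are the TRF weights per lag.
```

In practice, such models use many features at once (spectrogram bands, word-level surprisal, etc.) and evaluate "neural tracking" as the cross-validated correlation between predicted and recorded EEG.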

     
  5. Individual differences in expertise with non-face objects have been positively related to neural selectivity for these objects in several brain regions, including the fusiform face area (FFA). Recently, we reported that FFA’s cortical thickness is also positively correlated with expertise for non-living objects, while FFA’s cortical thickness is negatively correlated with face recognition ability. These opposite relations between structure and visual abilities, obtained in the same subjects, were postulated to reflect the earlier experience with faces relative to cars, with different mechanisms of plasticity operating at these different developmental times. Here we predicted that variability for faces, presumably reflecting pruning, would be found selectively in deep cortical layers. In 13 men selected to vary in their performance with faces, we used ultra-high-field imaging (7 Tesla) to localize the FFA functionally and to collect and average 6 ultra-high-resolution susceptibility-weighted images (SWI). Voxel dimensions were 0.194 × 0.194 × 1.00 mm, covering 20 slices with a 0.1 mm gap. Images were then processed by two operators blind to the behavioral results to define the gray matter/white matter (deep) and gray matter/CSF (superficial) cortical boundaries. Internal boundaries between presumed deep, middle, and superficial cortical layers were obtained with an automated method based on image intensities. We used an extensive battery of behavioral tests to quantify both face and object recognition ability. We replicate prior work, with face and non-living object recognition predicting large and independent parts of the variance in cortical thickness of the right FFA, in different directions. We also find that face recognition is specifically predicted by the thickness of the deep cortical layers in FFA, whereas recognition of vehicles relates to the thickness of all cortical layers.
Our results represent the most precise structural correlate of a behavioral ability to date, linking face recognition ability to a specific layer of a functionally defined area.
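Computationally, the structure-behavior relation reported here is a correlation between a layer-thickness measure and a behavioral score. A sketch on simulated data (the sample size matches the report, but the values and the effect size are our own synthetic assumptions, chosen only to mimic the reported negative direction for faces):

```python
# Pearson correlation between a (simulated) deep-layer thickness measure
# and a (simulated) face recognition score. All values are synthetic.
import numpy as np

rng = np.random.default_rng(0)
n = 13                                     # sample size matching the report
deep_thickness = rng.normal(1.0, 0.1, n)   # hypothetical thickness (mm)
# Simulate the reported negative thickness-ability relation for faces
face_score = -5.0 * deep_thickness + rng.normal(0, 0.2, n)

def pearson_r(x, y):
    x = x - x.mean()
    y = y - y.mean()
    return (x @ y) / np.sqrt((x @ x) * (y @ y))

r = pearson_r(deep_thickness, face_score)  # negative by construction here
```

With n = 13, such a correlation has wide confidence intervals, which is one reason layer-specific structural findings like this one call for replication in larger samples.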