Title: A Phone in a Basket Looks Like a Knife in a Cup: Role-Filler Independence in Visual Processing
Abstract: When a piece of fruit is in a bowl, and the bowl is on a table, we appreciate not only the individual objects and their features, but also the relations containment and support, which abstract away from the particular objects involved. Independent representation of roles (e.g., containers vs. supporters) and “fillers” of those roles (e.g., bowls vs. cups, tables vs. chairs) is a core principle of language and higher-level reasoning. But does such role-filler independence also arise in automatic visual processing? Here, we show that it does, by exploring a surprising error that such independence can produce. In four experiments, participants saw a stream of images containing different objects arranged in force-dynamic relations—e.g., a phone contained in a basket, a marker resting on a garbage can, or a knife sitting in a cup. Participants had to respond to a single target image (e.g., a phone in a basket) within a stream of distractors presented under time constraints. Surprisingly, even though participants completed this task quickly and accurately, they false-alarmed more often to images matching the target’s relational category than to those that did not—even when those images involved completely different objects. In other words, participants searching for a phone in a basket were more likely to mistakenly respond to a knife in a cup than to a marker on a garbage can. Follow-up experiments ruled out strategic responses and also controlled for various confounding image features. We suggest that visual processing represents relations abstractly, in ways that separate roles from fillers.
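To make "role-filler independence" concrete, here is a toy Python sketch, not the study's stimuli, task, or analysis: each scene is encoded as a relational role plus independently stored object fillers, so a relational match can be detected even when no objects are shared. The Scene class, object names, and helper functions are purely illustrative.

from dataclasses import dataclass

# Toy structured scene description: one slot for the relational category and
# separate slots for the objects ("fillers") bound into that relation, so
# relation and object identity can be compared independently.
@dataclass(frozen=True)
class Scene:
    relation: str   # e.g., "containment" or "support"
    figure: str     # the placed object ("phone", "knife", "marker")
    ground: str     # the reference object ("basket", "cup", "garbage can")

target     = Scene("containment", "phone",  "basket")
relational = Scene("containment", "knife",  "cup")           # same relation, different objects
control    = Scene("support",     "marker", "garbage can")   # different relation and objects

def shares_relation(a, b):
    return a.relation == b.relation

def shares_objects(a, b):
    return a.figure == b.figure or a.ground == b.ground

# The reported false alarms track the relational match, even with no shared objects:
for probe in (relational, control):
    print(probe, shares_relation(target, probe), shares_objects(target, probe))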
Award ID(s):
2021053
PAR ID:
10559101
Author(s) / Creator(s):
; ; ;
Publisher / Repository:
Open Mind
Date Published:
Journal Name:
Open Mind
Volume:
8
ISSN:
2470-2986
Page Range / eLocation ID:
766 to 794
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. We see the external world as consisting not only of objects and their parts, but also of relations that hold between them. Visual analogy, which depends on similarities between relations, provides a clear example of how perception supports reasoning. Here we report an experiment in which we quantitatively measured the human ability to find analogical mappings between parts of different objects, where the objects to be compared were drawn either from the same category (e.g., images of two mammals, such as a dog and a horse), or from two dissimilar categories (e.g., a chair image mapped to a cat image). Humans showed systematic mapping patterns, but with greater variability in mapping responses when objects were drawn from dissimilar categories. We simulated human analogical mapping using visiPAM (visual Probabilistic Analogical Mapping), a computational model of mapping between 3D objects. VisiPAM takes point-cloud representations of two 3D objects as inputs, and outputs the mapping between analogous parts of the two objects. VisiPAM consists of a visual module that constructs structural representations of individual objects, and a reasoning module that identifies a probabilistic mapping between parts of the two 3D objects. Model simulations not only capture the qualitative pattern of human mapping performance across conditions, but also approach human-level reliability in solving visual analogy problems.
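As a rough illustration of the part-level mapping step described above, here is a minimal Python sketch; it is not the authors' visiPAM implementation. The part feature vectors, the cosine-similarity scoring, the softmax temperature, and the part_mapping helper are illustrative stand-ins for the structural representations and probabilistic reasoning module the model actually uses.

import numpy as np
from scipy.optimize import linear_sum_assignment

def part_mapping(parts_a, parts_b, temperature=0.1):
    """Map parts of object A onto parts of object B from part-level features."""
    # Cosine similarity between every part of A and every part of B.
    a = parts_a / np.linalg.norm(parts_a, axis=1, keepdims=True)
    b = parts_b / np.linalg.norm(parts_b, axis=1, keepdims=True)
    sim = a @ b.T

    # Soft (probabilistic) mapping: each row is a distribution over B's parts.
    logits = sim / temperature
    soft = np.exp(logits - logits.max(axis=1, keepdims=True))
    soft /= soft.sum(axis=1, keepdims=True)

    # Hard mapping: one-to-one assignment that maximizes total similarity.
    rows, cols = linear_sum_assignment(-sim)
    return soft, dict(zip(rows.tolist(), cols.tolist()))

# Random features standing in for embeddings of segmented 3D parts.
rng = np.random.default_rng(0)
dog_parts = rng.normal(size=(4, 16))     # e.g., head, torso, legs, tail
horse_parts = rng.normal(size=(4, 16))
soft, hard = part_mapping(dog_parts, horse_parts)
print(hard)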
  2. In natural language processing, most models try to learn semantic representations merely from text. The learned representations encode the “distributional semantics” but fail to connect to any knowledge about the physical world. In contrast, humans learn language by grounding concepts in perception and action, and the brain encodes “grounded semantics” for cognition. Inspired by this notion and by recent work in vision-language learning, we design a two-stream model for grounding language learning in vision. The model includes a VGG-based visual stream and a BERT-based language stream. The two streams merge into a joint representational space. Through cross-modal contrastive learning, the model first learns to align visual and language representations with the MS COCO dataset. The model further learns to retrieve visual objects with language queries through a cross-modal attention module and to infer the visual relations between the retrieved objects through a bilinear operator with the Visual Genome dataset. After training, the model’s language stream is a stand-alone language model capable of embedding concepts in a visually grounded semantic space. This semantic space manifests principal dimensions explainable with human intuition and neurobiological knowledge. Word embeddings in this semantic space are predictive of human-defined norms of semantic features and are segregated into perceptually distinctive clusters. Furthermore, the visually grounded language model also enables compositional language understanding based on visual knowledge and multimodal image search with queries based on images, text, or their combinations.
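As a sketch of the cross-modal contrastive objective mentioned above (aligning the two streams in a joint space), here is a symmetric InfoNCE-style loss in PyTorch. The embedding tensors stand in for VGG and BERT outputs already projected into the joint space; the function name and temperature value are assumptions, not details from the paper.

import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired image/caption embeddings."""
    # L2-normalize so dot products are cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Entry (i, j) compares image i with caption j; matches lie on the diagonal.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Pull matched pairs together and push mismatched pairs apart, in both directions.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)

# Toy usage with random embeddings standing in for the two streams' outputs.
images = torch.randn(8, 256)
captions = torch.randn(8, 256)
print(cross_modal_contrastive_loss(images, captions).item())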
  3. Tactile sensing has been increasingly utilized in robot control of unknown objects to infer physical properties and optimize manipulation. However, there is limited understanding of the contribution of different sensory modalities during interactive perception in complex interactions, both in robots and in humans. This study investigated the effect of visual and haptic information on humans’ exploratory interactions with a ‘cup of coffee’, an object with nonlinear internal dynamics. Subjects were instructed to rhythmically transport a virtual cup with a rolling ball inside between two targets at a specified frequency, using a robotic interface. The cup and targets were displayed on a screen, and force feedback from the cup-and-ball dynamics was provided via the robotic manipulandum. Subjects were encouraged to explore and prepare the dynamics by “shaking” the cup-and-ball system to find the best initial conditions prior to the task. Two groups of subjects received full haptic feedback about the cup-and-ball movement during the task; however, for one group the ball movement was visually occluded. Visual information about the ball movement had two distinctive effects on performance: it reduced the preparation time needed to understand the dynamics and, importantly, it led to simpler, more linear input-output interactions between hand and object. The results highlight how visual and haptic information regarding nonlinear internal dynamics play distinct roles in the interactive perception of complex objects.
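For readers unfamiliar with the cup-and-ball system, the sketch below simulates a common simplification: the cup is a point mass moved horizontally by the hand, and the ball behaves like a pendulum suspended from the cup's center. All parameters, the prescribed rhythmic hand trajectory, and the force expression are illustrative assumptions, not the study's virtual dynamics or robotic-interface settings.

import numpy as np

g, ell = 9.81, 0.3          # gravity (m/s^2) and effective pendulum length (m), assumed
m_ball, m_cup = 0.2, 0.5    # masses (kg), assumed
dt, T = 0.001, 4.0          # integration step and duration (s)

t = np.arange(0.0, T, dt)
# Prescribed rhythmic hand movement between two targets at ~1 Hz.
amp, freq = 0.1, 1.0
x_acc = -amp * (2 * np.pi * freq) ** 2 * np.sin(2 * np.pi * freq * t)

phi = np.zeros_like(t)      # ball angle relative to vertical
phi_dot = np.zeros_like(t)
force = np.zeros_like(t)    # horizontal force the hand must apply

for i in range(1, len(t)):
    # Pendulum driven by the horizontal acceleration of its pivot (the cup).
    phi_ddot = -(g * np.sin(phi[i - 1]) + x_acc[i - 1] * np.cos(phi[i - 1])) / ell
    phi_dot[i] = phi_dot[i - 1] + phi_ddot * dt          # semi-implicit Euler
    phi[i] = phi[i - 1] + phi_dot[i] * dt
    # Cart term plus the ball's reaction: the nonlinear load the hand feels.
    force[i] = (m_cup + m_ball) * x_acc[i] + m_ball * ell * (
        phi_ddot * np.cos(phi[i]) - phi_dot[i] ** 2 * np.sin(phi[i])
    )

print(f"peak ball excursion: {np.degrees(np.abs(phi).max()):.1f} deg, "
      f"peak hand force: {np.abs(force).max():.2f} N")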
  4. The ability to correctly determine the position of objects in space is a fundamental task of the visual system. The perceived position of briefly presented static objects can be influenced by nearby moving contours, as demonstrated by various illusions collectively known as motion-induced position shifts. Here we use a stimulus that produces a particularly strong effect of motion on perceived position. We test whether several regions of interest (ROIs), at different stages of visual processing, encode the perceived rather than the retinotopically veridical position. Specifically, we collect functional MRI data while participants experience motion-induced position shifts, and we use a multivariate pattern analysis approach to compare the activation patterns evoked by illusory position shifts with those evoked by matched physical shifts. We find that the illusory perceived position is represented at the earliest stages of the visual processing stream, including primary visual cortex. Surprisingly, we find no evidence of percept-based encoding of position in visual areas beyond area V3. This result suggests that while higher-level visual areas are likely involved in position encoding, early visual cortex also plays an important role.
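A minimal sketch of the cross-decoding logic behind this kind of multivariate pattern analysis: train a classifier on voxel patterns from physically shifted stimuli, then test it on patterns from illusory (motion-induced) shifts. The random arrays, labels, and scikit-learn pipeline are placeholders, not the authors' fMRI preprocessing or analysis code; with random data the accuracy will hover around chance.

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Hypothetical ROI data: rows are trials, columns are voxels.
rng = np.random.default_rng(0)
n_trials, n_voxels = 80, 200
physical_patterns = rng.normal(size=(n_trials, n_voxels))
physical_labels = rng.integers(0, 2, size=n_trials)        # 0 = left, 1 = right shift
illusory_patterns = rng.normal(size=(n_trials, n_voxels))
illusory_labels = rng.integers(0, 2, size=n_trials)         # perceived shift direction

# Train on physical shifts, test on illusory shifts within the same ROI.
decoder = make_pipeline(StandardScaler(), LinearSVC())
decoder.fit(physical_patterns, physical_labels)
print(f"cross-decoding accuracy: {decoder.score(illusory_patterns, illusory_labels):.2f}")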
  5. Fumero, Marco; Rodolà, Emanuele; Domine, Clementine; Locatello, Francesco; Dziugaite, Gintare Karolina; Caron, Mathilde (Ed.)
    We present an anatomically inspired neurocomputational model, including a foveated retina and the log-polar mapping from the visual field to the primary visual cortex, that recreates image inversion effects long seen in psychophysical studies. We show that visual expertise, the ability to discriminate between subordinate-level categories, changes the performance of the model on inverted images. We first explore face discrimination, which, in humans, relies on configural information. The log-polar transform disrupts configural information in an inverted image and leaves featural information relatively unaffected. We suggest this is responsible for the degradation of performance with inverted faces. We then recreate the effect with other subordinate-level category discriminators and show that the inversion effect arises as a result of visual expertise, where configural information becomes relevant as more identities are learned at the subordinate level. Our model matches the classic result: faces suffer more from inversion than mono-oriented objects, which in turn are more disrupted than non-mono-oriented objects when objects are familiar only at the basic level. At the same time, the model shows that expert-level discriminators of other subordinate-level categories respond to inversion much as face experts do.
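To make the log-polar mapping concrete, here is a small NumPy sketch that resamples an image onto a log-radius-by-angle grid centered at fixation, along with the standard property that a 180-degree rotation about that center becomes (up to sampling) a half-period shift along the angle axis. The grid sizes and random image are illustrative; this is not the neurocomputational model described above.

import numpy as np

def log_polar(image, out_shape=(64, 128)):
    """Resample a grayscale image onto a log-polar grid centered at fixation."""
    h, w = image.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    n_r, n_theta = out_shape
    max_r = min(cy, cx)

    # Log-spaced radii (crudely mimicking cortical magnification) and even angles.
    radii = np.exp(np.linspace(0.0, np.log(max_r), n_r))
    thetas = np.linspace(0.0, 2.0 * np.pi, n_theta, endpoint=False)
    rr, tt = np.meshgrid(radii, thetas, indexing="ij")

    # Back-project each (log-radius, angle) sample to Cartesian pixel coordinates.
    ys = np.clip(np.round(cy + rr * np.sin(tt)).astype(int), 0, h - 1)
    xs = np.clip(np.round(cx + rr * np.cos(tt)).astype(int), 0, w - 1)
    return image[ys, xs]

rng = np.random.default_rng(0)
img = rng.random((129, 129))                 # odd size puts fixation on a pixel
upright = log_polar(img)
inverted = log_polar(img[::-1, ::-1])        # 180-degree rotation about fixation
# Under the log-polar map that rotation is a circular shift along the angle axis,
# so local features survive while their arrangement moves; the match fraction
# printed below should be at or near 1.0 (rounding at sample boundaries aside).
shifted = np.roll(upright, upright.shape[1] // 2, axis=1)
print(float(np.mean(inverted == shifted)))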