Title: Cognitive steering in deep neural networks via long-range modulatory feedback connections
Given the rich visual information available in each glance, humans can internally direct their visual attention to enhance goal-relevant information, a capacity often absent in standard vision models. Here we introduce cognitively and biologically inspired long-range modulatory pathways to enable 'cognitive steering' in vision models. First, we show that models equipped with these feedback pathways naturally show improved image recognition, adversarial robustness, and brain alignment relative to baseline models. Further, the feedback projections from the final layer of the vision backbone provide a meaningful steering interface, where goals can be specified as vectors in the output space. We show that there are effective ways to steer the model that dramatically improve recognition of categories in composite images of multiple categories, succeeding where baseline feed-forward models without flexible steering fail. Moreover, our multiplicative modulatory motif prevents rampant hallucination of the top-down goal category, dissociating what the model is looking for from what it is looking at. Thus, these long-range modulatory pathways enable new behavioral capacities for goal-directed visual encoding, offering a flexible communication interface between cognitive and visual systems.
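As a rough illustration of the multiplicative modulatory motif described above, the sketch below (PyTorch-style, with hypothetical module and variable names; a minimal sketch, not the paper's actual implementation) shows how a goal vector expressed in the backbone's output space could be projected back to an intermediate layer and applied as a multiplicative gain on its feature maps:

```python
import torch
import torch.nn as nn

class ModulatoryFeedback(nn.Module):
    """Minimal sketch of a long-range multiplicative feedback pathway.

    A goal vector in the backbone's output space (e.g., class logits or a
    class embedding) is projected back to the channel dimension of an
    earlier feature map and applied as a multiplicative gain. Names and
    dimensions are illustrative assumptions.
    """

    def __init__(self, goal_dim: int, feature_channels: int):
        super().__init__()
        self.project = nn.Linear(goal_dim, feature_channels)

    def forward(self, features: torch.Tensor, goal: torch.Tensor) -> torch.Tensor:
        # features: (B, C, H, W) intermediate activations
        # goal:     (B, goal_dim) steering vector in the output space
        gain = torch.sigmoid(self.project(goal))   # (B, C), bounded channel gains
        gain = gain.unsqueeze(-1).unsqueeze(-1)    # (B, C, 1, 1) for broadcasting
        # Multiplicative modulation rescales what the image already evokes,
        # rather than adding the goal signal into the representation.
        return features * gain


if __name__ == "__main__":
    feedback = ModulatoryFeedback(goal_dim=1000, feature_channels=256)
    feats = torch.randn(2, 256, 14, 14)
    goal = torch.zeros(2, 1000)
    goal[:, 207] = 1.0                 # steer toward one hypothetical output class
    steered = feedback(feats, goal)
    print(steered.shape)               # torch.Size([2, 256, 14, 14])
```

Because the goal signal only rescales activations that the image itself produces, evidence for an absent category cannot be injected outright, which is consistent with the dissociation between what the model is looking for and what it is looking at.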
Award ID(s): 2309041
PAR ID: 10518307
Author(s) / Creator(s):
Publisher / Repository: https://proceedings.neurips.cc
Date Published:
Journal Name: Proceedings of the 37th International Conference on Neural Information Processing Systems
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this
  1. Visual perception involves the rapid formation of a coarse image representation at the onset of visual processing, which is iteratively refined by late computational processes. These early versus late time windows approximately map onto feedforward and feedback processes, respectively. State-of-the-art convolutional neural networks, the main engine behind recent machine vision successes, are feedforward architectures. Their successes and limitations provide critical information regarding which visual tasks can be solved by purely feedforward processes and which require feedback mechanisms. We provide an overview of recent work in cognitive neuroscience and machine vision that highlights the possible role of feedback processes for both visual recognition and beyond. We conclude by discussing important open questions for future research.
  2. The goal of this review is to bring together material from cognitive psychology with recent machine vision studies to identify plausible neural mechanisms for visual same-different discrimination and relational understanding. We highlight how developments in the study of artificial neural networks provide computational evidence implicating attention and working memory in ascertaining visual relations, including same-different relations. We review some recent attempts to incorporate these mechanisms into flexible models of visual reasoning. Particular attention is given to recent models jointly trained on visual and linguistic information. These recent systems are promising, but they still fall short of the biological standard in several ways, which we outline in a final section.
  3.
    Animals rapidly collect and act on incoming information to navigate complex environments, making the precise timing of sensory feedback critical in the context of neural circuit function. Moreover, the timing of sensory input determines the biomechanical properties of muscles that undergo cyclic length changes, as during locomotion. Both of these issues come to a head in the case of flying insects, as these animals execute steering manoeuvres at timescales approaching the upper limits of performance for neuromechanical systems. Among insects, flies stand out as especially adept given their ability to execute manoeuvres that require sub-millisecond control of steering muscles. Although vision is critical, here I review the role of rapid, wingbeat-synchronous mechanosensory feedback from the wings and structures unique to flies, the halteres. The visual system and descending interneurons of the brain employ a spike rate coding scheme to relay commands to the wing steering system. By contrast, mechanosensory feedback operates at faster timescales and in the language of motor neurons, i.e. spike timing, allowing wing and haltere input to dynamically structure the output of the wing steering system. Although the halteres have been long known to provide essential input to the wing steering system as gyroscopic sensors, recent evidence suggests that the feedback from these vestigial hindwings is under active control. Thus, flies may accomplish manoeuvres through a conserved hindwing circuit, regulating the firing phase—and thus, the mechanical power output—of the wing steering muscles. 
  4. Human gaze behavior prediction is important for behavioral vision and for computer vision applications. Most models mainly focus on predicting free-viewing behavior using saliency maps, but do not generalize to goal-directed behavior, such as when a person searches for a visual target object. We propose the first inverse reinforcement learning (IRL) model to learn the internal reward function and policy used by humans during visual search. We modeled the viewer's internal belief states as dynamic contextual belief maps of object locations. These maps were learned and then used to predict behavioral scanpaths for multiple target categories. To train and evaluate our IRL model we created COCO-Search18, which is now the largest dataset of high-quality search fixations in existence. COCO-Search18 has 10 participants searching for each of 18 target-object categories in 6202 images, making about 300,000 goal-directed fixations. When trained and evaluated on COCO-Search18, the IRL model outperformed baseline models in predicting search fixation scanpaths, both in terms of similarity to human search behavior and search efficiency. Finally, reward maps recovered by the IRL model reveal distinctive target-dependent patterns of object prioritization, which we interpret as a learned object context. (A minimal sketch of this belief-map prioritization appears after this list.)
  5. In order for robots to operate effectively in homes and workplaces, they must be able to manipulate the articulated objects common within environments built for and by humans. Kinematic models provide a concise representation of these objects that enable deliberate, generalizable manipulation policies. However, existing approaches to learning these models rely upon visual observations of an object's motion, and are subject to the effects of occlusions and feature sparsity. Natural language descriptions provide a flexible and efficient means by which humans can provide complementary information in a weakly supervised manner suitable for a variety of different interactions (e.g., demonstrations and remote manipulation). In this paper, we present a multimodal learning framework that incorporates both vision and language information acquired in situ to estimate the structure and parameters that define kinematic models of articulated objects. The visual signal takes the form of an RGB-D image stream that opportunistically captures object motion in an unprepared scene. Accompanying natural language descriptions of the motion constitute the linguistic signal. We model linguistic information using a probabilistic graphical model that grounds natural language descriptions to their referent kinematic motion. By exploiting the complementary nature of the vision and language observations, our method infers correct kinematic models for various multiple-part objects on which the previous state-of-the-art, visual-only system fails. We evaluate our multimodal learning framework on a dataset comprised of a variety of household objects, and demonstrate a 23% improvement in model accuracy over the vision-only baseline. 
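To make the belief-map idea in item 4 concrete, here is a minimal, hypothetical sketch (NumPy; the reward map below is a stand-in for what the IRL model would actually recover from human scanpaths) of greedy fixation selection over a contextual belief map with inhibition of return:

```python
import numpy as np

def next_fixation(belief: np.ndarray, reward: np.ndarray,
                  inhibition: np.ndarray) -> tuple:
    """Greedy fixation selection over a contextual belief map.

    belief:     (H, W) belief that the target occupies each cell
    reward:     (H, W) target-dependent reward map (placeholder here; in the
                IRL model this would be recovered from human scanpaths)
    inhibition: (H, W) inhibition-of-return mask for already-fixated cells
    """
    priority = belief * reward * (1.0 - inhibition)
    y, x = np.unravel_index(np.argmax(priority), priority.shape)
    return int(y), int(x)

# Toy usage: a 5x5 scene in which context suggests one likely target location.
H, W = 5, 5
belief = np.full((H, W), 0.02)
belief[3, 1] = 0.9                     # contextual belief peaks here
reward = np.ones((H, W))               # stand-in for a learned reward map
inhibition = np.zeros((H, W))

fixations = []
for _ in range(3):
    y, x = next_fixation(belief, reward, inhibition)
    fixations.append((y, x))
    inhibition[y, x] = 1.0             # inhibition of return
print(fixations)                       # first fixation lands on (3, 1)
```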