Title: Human Eyes–Inspired Recurrent Neural Networks Are More Robust Against Adversarial Noises
Abstract: Humans actively observe their visual surroundings by focusing on salient objects and ignoring trivial details. However, computer vision models based on convolutional neural networks (CNNs) often analyze visual input all at once through a single feedforward pass. In this study, we designed a dual-stream vision model inspired by the human brain. This model features retina-like input layers and includes two streams: one determines the next point of focus (the fixation), while the other interprets the visual content around the fixation. Trained for image recognition, this model examines an image through a sequence of fixations, each time focusing on different parts, thereby progressively building a representation of the image. We evaluated this model against various benchmarks in terms of object recognition, gaze behavior, and adversarial robustness. Our findings suggest that the model can attend and gaze in ways similar to humans without being explicitly trained to mimic human attention, and that the model can enhance robustness against adversarial attacks due to its retinal sampling and recurrent processing. In particular, the model can correct its perceptual errors by taking more glances, setting itself apart from all feedforward-only models. In conclusion, the interactions of retinal sampling, eye movement, and recurrent dynamics are important for human-like visual exploration and inference.
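To make the described architecture concrete, below is a minimal PyTorch sketch of a fixation-based recurrent recognizer in the spirit of the abstract: a crop-based stand-in for retinal sampling, a small CNN that interprets each glimpse, a recurrent cell that accumulates evidence across glances, and a head that proposes the next fixation. All module names, layer sizes, and the simple affine-grid "retina" are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch: recurrent recognition through a sequence of fixations.
import torch
import torch.nn as nn
import torch.nn.functional as F

def retinal_glimpse(image, fixation, size=32):
    # Crude stand-in for retinal sampling: a zoomed-in patch around the
    # fixation point; fixation is given in [-1, 1] image coordinates.
    b, c, h, w = image.shape
    theta = torch.zeros(b, 2, 3, device=image.device)
    theta[:, 0, 0] = size / w          # zoom factor along x
    theta[:, 1, 1] = size / h          # zoom factor along y
    theta[:, :, 2] = fixation          # center of the glimpse
    grid = F.affine_grid(theta, (b, c, size, size), align_corners=False)
    return F.grid_sample(image, grid, align_corners=False)

class DualStreamGlimpseNet(nn.Module):
    def __init__(self, n_classes=10, hidden=256):
        super().__init__()
        self.what = nn.Sequential(               # interprets each glimpse
            nn.Conv2d(3, 32, 3, 2, 1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, 2, 1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.rnn = nn.GRUCell(64, hidden)        # accumulates evidence over glances
        self.where = nn.Linear(hidden, 2)        # proposes the next fixation
        self.classify = nn.Linear(hidden, n_classes)

    def forward(self, image, n_glimpses=6):
        b = image.shape[0]
        h = torch.zeros(b, self.rnn.hidden_size, device=image.device)
        fix = torch.zeros(b, 2, device=image.device)  # first glance at the center
        logits = []
        for _ in range(n_glimpses):
            g = retinal_glimpse(image, fix)
            h = self.rnn(self.what(g), h)
            fix = torch.tanh(self.where(h))      # next fixation in [-1, 1]
            logits.append(self.classify(h))      # a prediction after each glance
        return logits

model = DualStreamGlimpseNet()
per_glance = model(torch.randn(4, 3, 128, 128))
print(per_glance[-1].shape)  # torch.Size([4, 10])
```

Because this sketch emits a prediction after every glance, an incorrect early classification can be revised by later fixations, which is the error-correction behavior the abstract highlights.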
Award ID(s):
2112773
PAR ID:
10627033
Author(s) / Creator(s):
; ; ; ;
Publisher / Repository:
MIT Press
Date Published:
Journal Name:
Neural Computation
Volume:
36
Issue:
9
ISSN:
0899-7667
Page Range / eLocation ID:
1713 to 1743
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like This
  1. Oh, A; Naumann, T; Globerson, A; Saenko, K; Hardt, M; Levine, S (Ed.)
    The human visual system uses two parallel pathways for spatial processing and object recognition. In contrast, computer vision systems tend to use a single feedforward pathway, rendering them less robust, adaptive, and efficient than human vision. To bridge this gap, we developed a dual-stream vision model inspired by the human eyes and brain. At the input level, the model samples two complementary visual patterns to mimic how the human eyes use magnocellular and parvocellular retinal ganglion cells to separate retinal inputs to the brain. At the backend, the model processes the separate input patterns through two branches of convolutional neural networks (CNNs) to mimic how the human brain uses the dorsal and ventral cortical pathways for parallel visual processing. The first branch (WhereCNN) samples a global view to learn spatial attention and control eye movements. The second branch (WhatCNN) samples a local view to represent the object around the fixation. Over time, the two branches interact recurrently to build a scene representation from moving fixations. We compared this model with human brains processing the same movie and evaluated their functional alignment by linear transformation. The WhereCNN and WhatCNN branches were found to differentially match the dorsal and ventral pathways of the visual cortex, respectively, primarily due to their different learning objectives rather than their distinctions in retinal sampling or sensitivity to attention-driven eye movements. These model-based results lead us to speculate that the distinct responses and representations of the ventral and dorsal streams are more influenced by their distinct goals in visual attention and object recognition than by their specific bias or selectivity in retinal inputs. This dual-stream model takes a further step in brain-inspired computer vision, enabling parallel neural networks to actively explore and understand the visual surroundings.
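A minimal sketch of the two complementary input samplings described above, assuming simple downsampling and cropping as stand-ins for magnocellular-like and parvocellular-like sampling; the sizes and crop rule are illustrative, not the model's actual front end.

```python
# Minimal sketch: complementary global (magno-like) and local (parvo-like) views.
import torch
import torch.nn.functional as F

def magno_view(image, size=32):
    # Global but low-resolution: downsample the whole frame.
    return F.interpolate(image, size=(size, size), mode='bilinear',
                         align_corners=False)

def parvo_view(image, fixation_xy, size=32):
    # Local but high-resolution: crop a patch centered on the fixation.
    b, c, h, w = image.shape
    x, y = fixation_xy
    top = max(0, min(h - size, y - size // 2))
    left = max(0, min(w - size, x - size // 2))
    return image[:, :, top:top + size, left:left + size]

frame = torch.randn(1, 3, 256, 256)
where_input = magno_view(frame)              # feeds WhereCNN (spatial attention)
what_input = parvo_view(frame, (128, 128))   # feeds WhatCNN (object identity)
print(where_input.shape, what_input.shape)   # both torch.Size([1, 3, 32, 32])
```

The two views have identical tensor shapes but carry different information: the global view preserves layout at the cost of detail, while the local view preserves detail at the cost of context, mirroring the division of labor between the two branches.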
  2. Wei, Xue-Xin (Ed.)
    Machine learning models have difficulty generalizing to data outside of the distribution they were trained on. In particular, vision models are usually vulnerable to adversarial attacks or common corruptions, to which the human visual system is robust. Recent studies have found that regularizing machine learning models to favor brain-like representations can improve model robustness, but it is unclear why. We hypothesize that the increased model robustness is partly due to the low spatial frequency preference inherited from the neural representation. We tested this simple hypothesis with several frequency-oriented analyses, including the design and use of hybrid images to probe model frequency sensitivity directly. We also examined many other publicly available robust models that were trained on adversarial images or with data augmentation, and found that all these robust models showed a greater preference for low spatial frequency information. We show that preprocessing by blurring can serve as a defense mechanism against both adversarial attacks and common corruptions, further confirming our hypothesis and demonstrating the utility of low spatial frequency information in robust object recognition.
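The blur-as-defense idea lends itself to a short sketch: low-pass filtering inputs before classification attenuates the high-frequency components that adversarial perturbations typically exploit. The code below is a generic Gaussian-blur preprocessor, with kernel size and sigma chosen for illustration; the paper's exact preprocessing may differ.

```python
# Minimal sketch: Gaussian blur as a low-pass preprocessing defense.
import torch
import torch.nn.functional as F

def gaussian_kernel(size=5, sigma=1.5):
    ax = torch.arange(size, dtype=torch.float32) - (size - 1) / 2
    g = torch.exp(-ax.pow(2) / (2 * sigma ** 2))
    k = torch.outer(g, g)
    return k / k.sum()

def blur_defense(images, size=5, sigma=1.5):
    # Depthwise convolution applies the same low-pass filter to each channel.
    c = images.shape[1]
    k = gaussian_kernel(size, sigma).to(images.device)
    weight = k.view(1, 1, size, size).repeat(c, 1, 1, 1)
    return F.conv2d(images, weight, padding=size // 2, groups=c)

# Usage: wrap any classifier so the blur becomes part of the model,
# e.g. logits = classifier(blur_defense(adversarial_images)).
x = torch.randn(2, 3, 224, 224)
print(blur_defense(x).shape)  # torch.Size([2, 3, 224, 224])
```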
    Humans routinely extract important information from images and videos, relying on their gaze. In contrast, computational systems still have difficulty annotating important visual information in a human-like manner, in part because human gaze is often not included in the modeling process. Human input is also particularly relevant for processing and interpreting affective visual information. To address this challenge, we captured human gaze, spoken language, and facial expressions simultaneously in an experiment with visual stimuli characterized by subjective and affective content. Observers described the content of complex emotional images and videos depicting positive and negative scenarios, and also their feelings about the imagery being viewed. We explore patterns of these modalities, for example by comparing the affective nature of participant-elicited linguistic tokens with image valence. Additionally, we expand a framework for generating automatic alignments between the gaze and spoken language modalities for visual annotation of images. Multimodal alignment is challenging due to the varying temporal offsets between modalities. We explore alignment robustness when images have affective content and whether image valence influences alignment results. We also study whether word-frequency-based filtering affects results: both the unfiltered and filtered scenarios perform better than baseline comparisons, and filtering substantially decreases the alignment error rate. We provide visualizations of the resulting annotations from multimodal alignment. This work has implications for areas such as image understanding, media accessibility, and multimodal data fusion.
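As a toy illustration of gaze-speech alignment under a temporal offset, the sketch below pairs each spoken word with the fixation nearest in time after shifting by an assumed eye-voice lead. The data layout, the 0.8 s lead, and the nearest-neighbor matching rule are illustrative assumptions, not the study's framework.

```python
# Toy sketch: align spoken words to gaze fixations under a temporal offset.
from dataclasses import dataclass

@dataclass
class Fixation:
    t: float        # fixation onset time in seconds
    region: str     # annotated image region under the fixation

@dataclass
class Word:
    t: float        # spoken-word onset time in seconds
    token: str

def align(words, fixations, eye_voice_lead=0.8):
    # The eye typically lands on a referent before it is named, so each
    # word is compared against fixations shifted earlier in time.
    pairs = []
    for w in words:
        target = w.t - eye_voice_lead
        best = min(fixations, key=lambda f: abs(f.t - target))
        pairs.append((w.token, best.region))
    return pairs

fixes = [Fixation(0.2, 'dog'), Fixation(1.4, 'child'), Fixation(2.9, 'ball')]
words = [Word(1.1, 'dog'), Word(2.3, 'girl'), Word(3.6, 'ball')]
print(align(words, fixes))
# [('dog', 'dog'), ('girl', 'child'), ('ball', 'ball')]
```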
    Given the rich visual information available in each glance, humans can internally direct their visual attention to enhance goal-relevant information—a capacity often absent in standard vision models. Here we introduce cognitively and biologically inspired long-range modulatory pathways to enable 'cognitive steering' in vision models. First, we show that models equipped with these feedback pathways naturally show improved image recognition, adversarial robustness, and increased brain alignment relative to baseline models. Further, these feedback projections from the final layer of the vision backbone provide a meaningful steering interface, where goals can be specified as vectors in the output space. We show that there are effective ways to steer the model that dramatically improve recognition of categories in composite images of multiple categories, succeeding where baseline feedforward models without flexible steering fail. Our multiplicative modulatory motif prevents rampant hallucination of the top-down goal category, dissociating what the model is looking for from what it is looking at. Thus, these long-range modulatory pathways enable new behavioral capacities for goal-directed visual encoding, offering a flexible communication interface between cognitive and visual systems.
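A minimal sketch of a long-range multiplicative modulatory pathway, assuming a single feedback projection from the output space to per-channel gains; the module names and sizes are illustrative, not the paper's architecture.

```python
# Minimal sketch: a goal vector in the output space multiplicatively
# gates intermediate feature channels.
import torch
import torch.nn as nn

class SteerableBackbone(nn.Module):
    def __init__(self, n_classes=10, channels=64):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, channels, 3, 2, 1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, 2, 1), nn.ReLU(),
        )
        self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(channels, n_classes))
        # Feedback projection from the output space to channel gains.
        self.feedback = nn.Linear(n_classes, channels)

    def forward(self, x, goal=None):
        feats = self.stem(x)
        if goal is not None:
            # Multiplicative gating: the goal rescales existing channel
            # activity rather than adding new activity, which curbs
            # hallucination of the goal category.
            gain = torch.sigmoid(self.feedback(goal))[:, :, None, None]
            feats = feats * gain
        return self.head(feats)

model = SteerableBackbone()
x = torch.randn(2, 3, 64, 64)
goal = torch.zeros(2, 10); goal[:, 3] = 1.0   # steer toward class 3
print(model(x).shape, model(x, goal).shape)   # both torch.Size([2, 10])
```

The design choice worth noting is the multiplication: a gain of zero can only suppress channels that were already silent, whereas an additive feedback signal could inject goal-category activity that is absent from the input.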
    Scientists have pondered the perceptual effects of ocular motion, and those of its counterpart, ocular stillness, for over 200 years. The unremitting 'trembling of the eye' that occurs even during gaze fixation was first noted by Jurin in 1738. In 1794, Erasmus Darwin documented that gaze fixation produces perceptual fading, a phenomenon rediscovered in 1804 by Ignaz Paul Vital Troxler. Studies in the twentieth century established that Jurin's 'eye trembling' consisted of three main types of 'fixational' eye movements, now called microsaccades (or fixational saccades), drifts and tremor. Yet, owing to the constant and minute nature of these motions, the study of their perceptual and physiological consequences has met significant technological challenges. Studies starting in the 1950s and continuing in the present have attempted to study vision during retinal stabilization—a technique that consists of shifting any and all visual stimuli presented to the eye in such a way as to nullify all concurrent eye movements—providing a tantalizing glimpse of vision in the absence of change. No research to date has achieved perfect retinal stabilization, however, and so other work has devised substitute ways to counteract eye motion, such as by studying the perception of afterimages or of the entoptic images formed by retinal vessels, which are completely stable with respect to the eye. Yet other research has taken the alternative tack of controlling eye motion by behavioural instruction to fix one's gaze or to keep one's gaze still, during concurrent physiological and/or psychophysical measurements. Here, we review the existing data—from historical and contemporary studies that have aimed to nullify or minimize eye motion—on the perceptual and physiological consequences of perfect versus imperfect fixation. We also discuss the accuracy, quality and stability of ocular fixation, and the bottom–up and top–down influences that affect fixation behaviour. This article is part of the themed issue 'Movement suppression: brain mechanisms for stopping and stillness'.