Search for: All records

Award ID contains: 1946308


  1. Anterior regions of the ventral visual stream encode substantial information about object categories. Are top-down category-level forces critical for arriving at this representation, or can this representation be formed purely through domain-general learning of natural image structure? Here we present a fully self-supervised model which learns to represent individual images, rather than categories, such that views of the same image are embedded nearby in a low-dimensional feature space, distinctly from other recently encountered views. We find that category information implicitly emerges in the local similarity structure of this feature space. Further, these models learn hierarchical features which capture the structure of brain responses across the human ventral visual stream, on par with category-supervised models. These results provide computational support for a domain-general framework guiding the formation of visual representation, where the proximate goal is not explicitly about category information, but is instead to learn unique, compressed descriptions of the visual world.
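    To illustrate the training objective described in the abstract above, here is a minimal sketch of instance-level contrastive learning in PyTorch: two augmented views of each image are embedded nearby, and pushed apart from the other images in the batch. The tiny encoder and the loss form below are stand-ins, not the authors' architecture or exact objective.

# Minimal sketch (not the authors' exact model) of instance-level contrastive
# learning: same-image views are positives, other batch images are negatives.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyEncoder(nn.Module):
    """Stand-in backbone + projection head; any CNN/ViT could be used."""
    def __init__(self, embed_dim: int = 128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(64, embed_dim)

    def forward(self, x):
        z = self.head(self.backbone(x))
        return F.normalize(z, dim=-1)  # unit-norm, low-dimensional embedding

def instance_contrastive_loss(z1, z2, temperature: float = 0.1):
    """NT-Xent-style loss: views of the same image are pulled together,
    all other recently encountered images in the batch are pushed away."""
    n = z1.size(0)
    z = torch.cat([z1, z2], dim=0)                    # (2n, d)
    sim = z @ z.t() / temperature                     # cosine similarities
    mask = torch.eye(2 * n, dtype=torch.bool)
    sim = sim.masked_fill(mask, float("-inf"))        # ignore self-similarity
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)

# Usage: embed two random "views" of a small batch and compute the loss.
encoder = TinyEncoder()
views_a = torch.randn(8, 3, 64, 64)                   # augmented view 1
views_b = torch.randn(8, 3, 64, 64)                   # augmented view 2
loss = instance_contrastive_loss(encoder(views_a), encoder(views_b))
loss.backward()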
  2. Self-supervised Vision Transformers (ViTs) like DINOv2 show strong holistic shape processing capabilities, a feature linked to computations in their intermediate layers. However, the specific mechanism by which these layers transform local patch information into a global, configural percept remains a black box. To dissect this process, we conduct fine-grained mechanistic analyses by disentangling patch representations into their constituent content and positional information. We find that high-performing models demonstrate a distinct multi-stage processing signature: they first preserve the spatial localization of image content through many layers while concurrently refining their positional representations. Computationally, we show that this is supported by a systematic "local-global handoff," where attention heads gradually shift to aggregating information using long-range interactions. In contrast, models with poor configural ability lose content-specific spatial information early and lack this critical positional refinement stage. This positional refinement is further stabilized by register tokens, which mitigate a common artifact in ViTs in which low-information patch tokens are repurposed into high-norm "outliers" that store global information, causing them to lose their local positional grounding. By isolating these high-norm activations in register tokens, the model better preserves the visual grounding of each patch, which we show also leads to a direct improvement in holistic processing. Overall, our findings suggest that holistic vision in ViTs arises not just from long-range attention, but from a structured pipeline that carefully manages the interplay between content and positional information.
    Free, publicly-accessible full text available December 2, 2026
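    The "local-global handoff" described in the abstract above can be probed with a simple diagnostic. The sketch below is an assumed illustration (not the paper's exact analysis): it computes the attention-weighted mean patch distance per head, a quantity that would be expected to grow with depth as heads shift from local to long-range aggregation.

# Sketch of a layer-by-layer locality diagnostic for ViT attention heads.
import torch

def mean_attention_distance(attn: torch.Tensor, grid: int) -> torch.Tensor:
    """attn: (heads, tokens, tokens) attention weights over a grid x grid
    patch layout (CLS/register tokens omitted here for simplicity).
    Returns the attention-weighted mean patch distance for each head."""
    idx = torch.arange(grid * grid)
    rows, cols = idx // grid, idx % grid
    # pairwise Euclidean distance between patch centres, in patch units
    dist = torch.sqrt((rows[:, None] - rows[None, :]).float() ** 2 +
                      (cols[:, None] - cols[None, :]).float() ** 2)
    # weight each query's distances by where its attention actually goes
    return (attn * dist).sum(dim=-1).mean(dim=-1)     # one value per head

# Usage with dummy attention maps; with real maps, low values indicate local
# aggregation and high values indicate long-range (global) interactions.
heads, grid = 12, 14
attn = torch.softmax(torch.randn(heads, grid * grid, grid * grid), dim=-1)
print(mean_attention_distance(attn, grid))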
  3. Humans are able to recognize objects based on both local texture cues and the configuration of object parts, yet contemporary vision models primarily harvest local texture cues, yielding brittle, non-compositional features. Work on shape-vs-texture bias has pitted shape and texture representations in opposition, measuring shape relative to texture, ignoring the possibility that models (and humans) can simultaneously rely on both types of cues, and obscuring the absolute quality of both types of representation. We therefore recast shape evaluation as a matter of absolute configural competence, operationalized by the Configural Shape Score (CSS), which (i) measures the ability to recognize both images in Object-Anagram pairs that preserve local texture while permuting global part arrangement to depict different object categories. Across 86 convolutional, transformer, and hybrid models, CSS (ii) uncovers a broad spectrum of configural sensitivity, with fully self-supervised and language-aligned transformers – exemplified by DINOv2, SigLIP2 and EVA-CLIP – occupying the top end of the CSS spectrum. Mechanistic probes reveal that (iii) high-CSS networks depend on long-range interactions: radius-controlled attention masks abolish performance, showing a distinctive U-shaped integration profile, and representational-similarity analyses expose a mid-depth transition from local to global coding. A BagNet control, whose receptive fields straddle patch seams, remains at chance (iv), ruling out any “border-hacking” strategies. Finally, (v) we show that the configural shape score also predicts other shape-dependent evaluations (e.g., foreground bias, spectral and noise robustness). Overall, we propose that the path toward truly robust, generalizable, and human-like vision systems may not lie in forcing an artificial choice between shape and texture, but rather in architectural and learning frameworks that seamlessly integrate both local texture and global configural shape.
    Free, publicly-accessible full text available December 2, 2026
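    As a hedged sketch of how a Configural-Shape-Score-style evaluation might be scored, the snippet below assumes (from the abstract above) that a pair counts only when both Object-Anagram images are classified as their respective categories; the actual stimuli, candidate models, and exact scoring protocol are the paper's and are not reproduced here.

# Sketch: score the fraction of anagram pairs where BOTH images are correct.
import torch

@torch.no_grad()
def configural_shape_score(model, pairs):
    """pairs: iterable of (img_a, label_a, img_b, label_b), where img_a and
    img_b share local texture but differ in global part arrangement."""
    both_correct, total = 0, 0
    for img_a, label_a, img_b, label_b in pairs:
        pred_a = model(img_a.unsqueeze(0)).argmax(dim=-1).item()
        pred_b = model(img_b.unsqueeze(0)).argmax(dim=-1).item()
        both_correct += int(pred_a == label_a and pred_b == label_b)
        total += 1
    return both_correct / max(total, 1)

# Usage with a placeholder classifier and random tensors standing in for
# the Object-Anagram stimuli (shapes only; a real evaluation would use an
# ImageNet-class classifier and the curated image pairs).
model = torch.nn.Sequential(torch.nn.Flatten(),
                            torch.nn.Linear(3 * 224 * 224, 1000))
pairs = [(torch.randn(3, 224, 224), 1, torch.randn(3, 224, 224), 2)]
print(configural_shape_score(model, pairs))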
  4. Deep neural network models provide a powerful experimental platform for exploring core mechanisms underlying human visual perception, such as perceptual grouping and contour integration—the process of linking local edge elements to arrive at a unified perceptual representation of a complete contour. Here, we demonstrate that feedforward convolutional neural networks (CNNs) fine-tuned on contour detection show this human-like capacity, but without relying on mechanisms proposed in prior work, such as lateral connections, recurrence, or top-down feedback. We identified two key properties needed for ImageNet-pretrained, feedforward models to yield human-like contour integration: first, a progressively increasing receptive field structure served as a critical architectural motif to support this capacity; and second, fine-tuning for contour detection biased toward gradual curves (~20 degrees) resulted in human-like sensitivity to curvature. We further demonstrate that fine-tuning ImageNet-pretrained models uncovers other hidden human-like capacities in feedforward networks, including uncrowding (reduced interference from distractors as the number of distractors increases), which is considered a signature of human perceptual grouping. Thus, taken together, these results provide a computational existence proof that purely feedforward hierarchical computations are capable of implementing Gestalt “good continuation” and the perceptual organization needed for human-like contour integration and uncrowding. More broadly, these results raise the possibility that in human vision, later stages of processing play a more prominent role in perceptual organization than implied by theories focused on recurrence and early lateral connections.
    Free, publicly-accessible full text available August 18, 2026
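    As a rough illustration of the fine-tuning setup described in the abstract above (an assumed framing, not the authors' exact task, stimuli, or hyperparameters), the sketch below fine-tunes an ImageNet-pretrained, purely feedforward ResNet-18 on a binary contour present/absent decision, adding no recurrence, lateral connections, or feedback.

# Sketch: fine-tune a pretrained feedforward CNN for contour detection.
# Requires torchvision >= 0.13; the pretrained weights are downloaded on
# first use.
import torch
import torch.nn as nn
from torchvision import models

# Pretrained feedforward backbone; only the final layer is re-purposed.
model = models.resnet18(weights="IMAGENET1K_V1")
model.fc = nn.Linear(model.fc.in_features, 2)   # contour present / absent

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

def finetune_step(images, labels):
    """images: (B, 3, H, W) stimuli (e.g., Gabor-element arrays);
    labels: 1 if an aligned contour path is embedded among distractor
    elements, else 0."""
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage with dummy tensors standing in for contour displays.
print(finetune_step(torch.randn(4, 3, 224, 224), torch.tensor([0, 1, 1, 0])))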