This content will become publicly available on December 2, 2026

Title: Visual Anagrams Reveal Hidden Differences in Holistic Shape Processing Across Vision Models
Humans are able to recognize objects based on both local texture cues and the configuration of object parts, yet contemporary vision models primarily harvest local texture cues, yielding brittle, non-compositional features. Work on shape-vs-texture bias has pitted shape and texture representations in opposition, measuring shape relative to texture, ignoring the possibility that models (and humans) can simultaneously rely on both types of cues, and obscuring the absolute quality of both types of representation. We therefore recast shape evaluation as a matter of absolute configural competence, operationalized by the Configural Shape Score (CSS), which (i) measures the ability to recognize both images in Object-Anagram pairs that preserve local texture while permuting global part arrangement to depict different object categories. Across 86 convolutional, transformer, and hybrid models, CSS (ii) uncovers a broad spectrum of configural sensitivity, with fully self-supervised and language-aligned transformers – exemplified by DINOv2, SigLIP2, and EVA-CLIP – occupying the top end of the CSS spectrum. Mechanistic probes reveal that (iii) high-CSS networks depend on long-range interactions: radius-controlled attention masks abolish performance, showing a distinctive U-shaped integration profile, and representational-similarity analyses expose a mid-depth transition from local to global coding. A BagNet control, whose receptive fields straddle patch seams, remains at chance (iv), ruling out any “border-hacking” strategies. Finally, (v) we show that the Configural Shape Score also predicts other shape-dependent evaluations (e.g., foreground bias, spectral and noise robustness). Overall, we propose that the path toward truly robust, generalizable, and human-like vision systems may not lie in forcing an artificial choice between shape and texture, but rather in architectural and learning frameworks that seamlessly integrate both local-texture and global configural shape cues.
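As a rough illustration of the pair-level criterion behind CSS (not the paper's released evaluation code), a score of this kind might be computed as below; the model interface, the `anagram_pairs` structure, and the strict both-correct rule are assumptions based on the abstract's description.

```python
import torch

def configural_shape_score(model, anagram_pairs):
    """Fraction of Object-Anagram pairs where BOTH images are recognized.

    anagram_pairs: list of ((image_a, label_a), (image_b, label_b)),
    where the two images share local texture but permute part layout.
    (Hypothetical interface; a sketch, not the paper's implementation.)
    """
    correct = 0
    with torch.no_grad():
        for (img_a, lab_a), (img_b, lab_b) in anagram_pairs:
            pred_a = model(img_a.unsqueeze(0)).argmax(dim=-1).item()
            pred_b = model(img_b.unsqueeze(0)).argmax(dim=-1).item()
            # Credit only when both part arrangements are resolved correctly.
            correct += int(pred_a == lab_a and pred_b == lab_b)
    return correct / len(anagram_pairs)
```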
Award ID(s):
1946308
PAR ID:
10654779
Author(s) / Creator(s):
Publisher / Repository:
Advances in Neural Information Processing Systems 38 (NeurIPS 2025)
Date Published:
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Vision can provide useful cues about the geometric properties of an object, like its size, distance, pose, and shape. But how the brain merges these properties into a complete sensory representation of a three-dimensional object is poorly understood. To address this gap, we investigated a visual illusion in which humans misperceive the shape of an object due to a small change in one eye’s retinal image. We first show that this illusion affects percepts of a highly familiar object under completely natural viewing conditions. Specifically, people perceived their own rectangular mobile phone to have a trapezoidal shape. We then investigate the perceptual underpinnings of this illusion by asking people to report both the perceived shape and pose of controlled stimuli. Our results suggest that the shape illusion results from distorted cues to object pose. In addition to yielding insights into object perception, this work informs our understanding of how the brain combines information from multiple visual cues in natural settings. The shape illusion can occur when people wear everyday prescription spectacles; thus, these findings also provide insight into the cue combination challenges that some spectacle wearers experience on a regular basis. 
  2. Self-supervised Vision Transformers (ViTs) like DINOv2 show strong holistic shape processing capabilities, a feature linked to computations in their intermediate layers. However, the specific mechanism by which these layers transform local patch information into a global, configural percept remains a black box. To dissect this process, we conduct fine-grained mechanistic analyses by disentangling patch representations into their constituent content and positional information. We find that high-performing models demonstrate a distinct multi-stage processing signature: they first preserve the spatial localization of image content through many layers while concurrently refining their positional representations. Computationally, we show that this is supported by a systematic "local-global handoff," where attention heads gradually shift to aggregating information using long-range interactions. In contrast, models with poor configural ability lose content-specific spatial information early and lack this critical positional refinement stage. This positional refinement is further stabilized by register tokens, which mitigate a common artifact in ViTs whereby low-information patch tokens are repurposed into high-norm 'outliers' that store global information, causing them to lose their local positional grounding. By isolating these high-norm activations in register tokens, the model better preserves the visual grounding of each patch, which we show also leads to a direct improvement in holistic processing. Overall, our findings suggest that holistic vision in ViTs arises not just from long-range attention, but from a structured pipeline that carefully manages the interplay between content and positional information.
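    As a loose illustration of the register-token mechanism described above, the sketch below prepends learnable registers to the patch sequence and strips them before any downstream head; the module name, register count, and placement are assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class RegisterTokens(nn.Module):
    """Minimal sketch: prepend learnable register tokens to patch tokens.

    The registers give attention a dedicated place to stash global
    information, so ordinary patch tokens keep their local grounding.
    (Hypothetical module; not taken from the paper's code.)
    """

    def __init__(self, dim: int, num_registers: int = 4):
        super().__init__()
        self.registers = nn.Parameter(torch.zeros(1, num_registers, dim))

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (batch, num_patches, dim)
        regs = self.registers.expand(patch_tokens.size(0), -1, -1)
        return torch.cat([regs, patch_tokens], dim=1)

    def strip(self, tokens: torch.Tensor) -> torch.Tensor:
        # Discard register outputs before any downstream head.
        return tokens[:, self.registers.size(1):]
```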
  3.
    Objects differ from one another along a multitude of visual features. The more distinct an object is from other objects in its surroundings, the easier it is to find. However, it is still unknown how this distinctiveness advantage emerges in human vision. Here, we studied how visual distinctiveness signals along two feature dimensions—shape and surface texture—combine to determine the overall distinctiveness of an object in the scene. Distinctiveness scores between a target object and distractors were measured separately for shape and texture using a search task. These scores were then used to predict search times when a target differed from distractors along both shape and texture. Model comparison showed that overall object distinctiveness was best predicted when shape and texture were combined using a Euclidean metric, confirming that the brain computes independent distinctiveness scores for shape and texture and combines them to direct attention.
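    To make the winning model concrete, the toy function below combines per-dimension distinctiveness scores as Euclidean vector length; the variable names and example values are illustrative assumptions, not the study's code.

```python
import math

def combined_distinctiveness(shape_score: float, texture_score: float) -> float:
    """Combine per-dimension distinctiveness with a Euclidean metric.

    Toy version of the best-fitting model in the abstract: independent
    shape and texture signals merged as the length of a 2-D vector.
    """
    return math.hypot(shape_score, texture_score)

# Example: a target moderately distinct in both dimensions is easier
# to find than one distinct along either dimension alone.
print(combined_distinctiveness(0.6, 0.8))  # -> 1.0
```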
  4. Oh, A; Naumann, T; Globerson, A; Saenko, K; Hardt, M; Levine, S (Ed.)
    Current deep-learning models for object recognition are known to be heavily biased toward texture. In contrast, human visual systems are known to be biased toward shape and structure. What could be the design principles in human visual systems that led to this difference? How could we introduce more shape bias into deep learning models? In this paper, we report that sparse coding, a ubiquitous principle in the brain, can in itself introduce shape bias into the network. We found that enforcing the sparse coding constraint using a non-differentiable Top-K operation can lead to the emergence of structural encoding in the neurons of convolutional neural networks, resulting in a smooth decomposition of objects into parts and subparts and endowing the networks with shape bias. We demonstrated this emergence of shape bias and its functional benefits for different network structures with various datasets. For object-recognition convolutional neural networks, the shape bias leads to greater robustness against style- and pattern-change distractions. For image-synthesis generative adversarial networks, the emergent shape bias leads to more coherent and decomposable structures in the synthesized images. Ablation studies suggest that sparse codes tend to encode structure, whereas more distributed codes tend to favor texture. Our code is hosted at https://topk-shape-bias.github.io/
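    A minimal sketch of the kind of Top-K sparsification the abstract describes is given below, applied across channels at each spatial location; the tensor layout, the choice of k, and where the operation sits in the network are assumptions, not the paper's implementation.

```python
import torch

def topk_sparsify(x: torch.Tensor, k: int) -> torch.Tensor:
    """Keep the k largest activations per channel vector, zeroing the rest.

    x: (batch, channels, height, width); sparsified across channels at
    each spatial location. The hard selection is non-differentiable;
    gradients flow only through the surviving activations.
    """
    b, c, h, w = x.shape
    flat = x.permute(0, 2, 3, 1).reshape(-1, c)   # one channel vector per location
    _, topk_idx = flat.topk(k, dim=-1)
    mask = torch.zeros_like(flat).scatter_(-1, topk_idx, 1.0)
    sparse = flat * mask
    return sparse.reshape(b, h, w, c).permute(0, 3, 1, 2)
```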
  5. State-of-the-art object recognition methods do not generalize well to unseen domains. Work in domain generalization has attempted to bridge domains by increasing feature compatibility, but has focused on standard, appearance-based representations. We show the potential of shape-based representations to increase domain robustness. We compare two types of shape-based representations: one trains a convolutional network over edge features, and another computes a soft, dense medial axis transform. We show the complementary strengths of these representations for different types of domains, and the effect of the amount of texture that is preserved. We show that our shape-based techniques better leverage data augmentations for domain generalization, and are more effective at texture bias mitigation than shape-inducing augmentations. Finally, we show that when the convolutional network in state-of-the-art domain generalization methods is replaced with one that explicitly captures shape, we obtain improved results. 
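    For concreteness, the sketch below computes the two shape-based inputs the comparison contrasts, using scikit-image; the Canny edge detector, Otsu thresholding, and distance weighting are illustrative choices, not the paper's exact pipeline.

```python
import numpy as np
from skimage import color, feature, filters, morphology

def shape_representations(rgb_image: np.ndarray):
    """Sketch of the two shape-based representations being compared:
    an edge map and a soft, dense medial axis transform.
    """
    gray = color.rgb2gray(rgb_image)

    # 1) Edge features, e.g. as input to a convolutional network.
    edges = feature.canny(gray, sigma=2.0)

    # 2) Medial axis of the foreground, weighted by distance to the
    #    boundary to give a soft, dense skeleton.
    binary = gray > filters.threshold_otsu(gray)
    skeleton, distance = morphology.medial_axis(binary, return_distance=True)
    soft_medial_axis = skeleton * distance

    return edges, soft_medial_axis
```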