- Author / Contributor: Alvarez, G A; Doshi, F R; Fel, T; Konkle, T
Self-supervised Vision Transformers (ViTs) like DINOv2 show strong holistic shape processing capabilities, a feature linked to computations in their intermediate layers. However, the specific mechanism by which these layers transform local patch information into a global, configural percept remains a black box. To dissect this process, we conduct fine-grained mechanistic analyses by disentangling patch representations into their constituent content and positional information. We find that high-performing models demonstrate a distinct multi-stage processing signature: they first preserve the spatial localization of image content through many layers while concurrently refining their positional representations. Computationally, we show that this is supported by a systematic "local-global handoff," where attention heads gradually shift to aggregating information using long-range interactions. In contrast, models with poor configural ability lose content-specific spatial information early and lack this critical positional refinement stage. This positional refinement is further stabilized by register tokens, which mitigate a common artifact in which ViTs repurpose low-information patch tokens into high-norm 'outliers' to store global information, causing those tokens to lose their local positional grounding. By isolating these high-norm activations in register tokens, the model better preserves the visual grounding of each patch, which we show also leads to a direct improvement in holistic processing. Overall, our findings suggest that holistic vision in ViTs arises not just from long-range attention, but from a structured pipeline that carefully manages the interpl…
Free, publicly-accessible full text available December 2, 2026
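One way to picture the "local-global handoff" mentioned in the abstract is with a mean attention distance, a common diagnostic for how far an attention head looks across the patch grid. The sketch below is illustrative only and is not taken from the paper: the function name `mean_attention_distance`, its arguments, and the choice of Euclidean distance in patch units are all assumptions. A head whose value rises across layers would be shifting from local to long-range aggregation.

```python
import math


def mean_attention_distance(attn, grid_h, grid_w):
    """Average spatial distance (in patch units) that one head attends over.

    Hypothetical helper, not the authors' method.
    attn[q][k] is the attention weight from query patch q to key patch k,
    with patches indexed row-major over a grid_h x grid_w grid and each
    row assumed to sum to 1 (special tokens such as CLS are excluded).
    """
    n = grid_h * grid_w
    if len(attn) != n or any(len(row) != n for row in attn):
        raise ValueError("attn must be an n x n matrix with n = grid_h * grid_w")

    # Row-major patch index -> (row, column) position on the grid.
    coords = [(i // grid_w, i % grid_w) for i in range(n)]

    total = 0.0
    for q, row in enumerate(attn):
        qy, qx = coords[q]
        for k, weight in enumerate(row):
            ky, kx = coords[k]
            # Weight each query-key distance by how much attention it gets.
            total += weight * math.hypot(qy - ky, qx - kx)

    # Average the expected distance over all query patches.
    return total / n
```

Under these assumptions, a head that attends only to its own patch scores 0, while a head that attends uniformly over the grid scores the average pairwise patch distance, so comparing the value layer by layer gives a rough picture of when heads move from local to global interactions.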
