Title: Grounding Visual Illusions in Language: Do Vision-Language Models Perceive Illusions Like Humans?
Vision-Language Models (VLMs) are trained on vast amounts of data captured by humans, emulating our understanding of the world. However, human perception of reality is not always faithful to the physical world: we are subject to visual illusions. This raises a key question: do VLMs experience the same kinds of illusions as humans do, or do they faithfully learn to represent reality? To investigate this question, we build a dataset containing five types of visual illusions and formulate four tasks to examine visual illusions in state-of-the-art VLMs. Our findings show that although overall alignment is low, larger models are closer to human perception and more susceptible to visual illusions. Our dataset and initial findings will promote a better understanding of visual illusions in humans and machines and provide a stepping stone for future computational models that can better align humans and machines in perceiving and communicating about the shared visual world.
Award ID(s):
1949634
PAR ID:
10472532
Publisher / Repository:
EMNLP 2023
Date Published:
Journal Name:
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Format(s):
Medium: X
Location:
Singapore
Sponsoring Org:
National Science Foundation
More Like this
  1. Augmented Reality (AR) enhances the real world by integrating virtual content, yet ensuring the quality, usability, and safety of AR experiences presents significant challenges. Could Vision-Language Models (VLMs) offer a solution for the automated evaluation of AR-generated scenes? In this study, we evaluate the capabilities of three state-of-the-art commercial VLMs -- GPT, Gemini, and Claude -- in identifying and describing AR scenes. For this purpose, we use DiverseAR, the first AR dataset specifically designed to assess VLMs' ability to analyze virtual content across a wide range of AR scene complexities. Our findings demonstrate that VLMs are generally capable of perceiving and describing AR scenes, achieving a True Positive Rate (TPR) of up to 93% for perception and 71% for description. While they excel at identifying obvious virtual objects, such as a glowing apple, they struggle when faced with seamlessly integrated content, such as a virtual pot with realistic shadows. Our results highlight both the strengths and the limitations of VLMs in understanding AR scenarios. We identify key factors affecting VLM performance, including virtual content placement, rendering quality, and physical plausibility. This study underscores the potential of VLMs as tools for evaluating the quality of AR experiences.
  2. Errors in radiologic interpretation are largely the result of failures of perception. This remains true despite the increasing use of computer-aided detection and diagnosis. We surveyed the literature on visual illusions during the viewing of radiologic images. Misperception of anatomical structures is a potential cause of error that can lead to patient harm if disease is seen when none is present. However, visual illusions can also help enhance the ability of radiologists to detect and characterize abnormalities. Indeed, radiologists have learned to exploit certain perceptual biases in diagnostic findings and as training tools. We propose that further detailed study of radiologic illusions would help clarify the mechanisms underlying radiologic performance and provide additional heuristics to improve radiologist training and reduce medical error. 
  3. Che, Wanxiang; Nabende, Joyce; Shutova, Ekaterina; Pilehvar, Mohammad Taher (Ed.)
    Vision-Language Models (VLMs) have made rapid progress in reasoning across visual and textual data. While VLMs perform well on vision tasks that they are trained on, our results highlight key challenges in abstract pattern recognition. We present GlyphPattern, a 954-item dataset that pairs 318 human-written descriptions of visual patterns from 40 writing systems with three visual presentation styles. GlyphPattern evaluates abstract pattern recognition in VLMs, requiring models to understand and judge natural language descriptions of visual patterns. GlyphPattern patterns are drawn from a large-scale cognitive science investigation of human writing systems; as a result, they are rich in spatial reference and compositionality. Our experiments show that GlyphPattern is challenging for state-of-the-art VLMs (GPT-4o achieves only 55% accuracy), with marginal gains from few-shot prompting. Our detailed analysis reveals errors at multiple levels, including visual processing, natural language understanding, and pattern generalization.
  4. Animals live in visually complex environments. As a result, visual systems have evolved mechanisms that simplify visual processing and allow animals to focus on the information that is most relevant to adaptive decision making. This review explores two key mechanisms that animals use to efficiently process visual information: categorization and specialization. Categorization occurs when an animal's perceptual system sorts continuously varying stimuli into a set of discrete categories. Specialization occurs when particular classes of stimuli are processed using distinct cognitive operations that are not used for other classes of stimuli. We also describe a nonadaptive consequence of simplifying heuristics: visual illusions, where visual perception consistently misleads the viewer about the state of the external world or objects within it. We take an explicitly comparative approach by exploring similarities and differences in visual cognition across human and nonhuman taxa. Considering areas of convergence and divergence across taxa provides insight into the evolution and function of visual systems and associated perceptual strategies. 
  5. Abstract The increasing integration of Visual Language Models (VLMs) into visualization systems demands a comprehensive understanding of their visual interpretation capabilities and constraints. While existing research has examined individual models, systematic comparisons of VLMs' visualization literacy remain unexplored. We bridge this gap through a rigorous, first‐of‐its‐kind evaluation of four leading VLMs (GPT‐4, Claude, Gemini, and Llama) using standardized assessments: the Visualization Literacy Assessment Test (VLAT) and Critical Thinking Assessment for Literacy in Visualizations (CALVI). Our methodology uniquely combines randomized trials with structured prompting techniques to control for order effects and response variability ‐ a critical consideration overlooked in many VLM evaluations. Our analysis reveals that while specific models demonstrate competence in basic chart interpretation (Claude achieving 67.9% accuracy on VLAT), all models exhibit substantial difficulties in identifying misleading visualization elements (maximum 30.0% accuracy on CALVI). We uncover distinct performance patterns: strong capabilities in interpreting conventional charts like line charts (76‐96% accuracy) and detecting hierarchical structures (80‐100% accuracy), but consistent difficulties with data‐dense visualizations involving multiple encodings (bubble charts: 18.6‐61.4%) and anomaly detection (25‐30% accuracy). Significantly, we observe distinct uncertainty management behavior across models, with Gemini displaying heightened caution (22.5% question omission) compared to others (7‐8%). These findings provide crucial insights for the visualization community by establishing reliable VLM evaluation benchmarks, identifying areas where current models fall short, and highlighting the need for targeted improvements in VLM architectures for visualization tasks. 
To promote reproducibility, encourage further research, and facilitate benchmarking of future VLMs, our complete evaluation framework, including code, prompts, and analysis scripts, is available at https://github.com/washuvis/VisLit‐VLM‐Eval.