NSF PAR Search | NSF Public Access Repository

Do Vision-Language Models Represent Space and How? Evaluating Spatial Frame of Reference Under Ambiguities

Zhang, Z; Hu, F; Lee, L; Shi, F; Kordjamshidi, P; Chai, J; Ma, Z (April 2025, International Conference on Learning Representations (ICLR))

Free, publicly-accessible full text available April 30, 2026
Vision-Language Models Are Not Pragmatically Competent in Referring Expression Generation

Ma, Z; Ding, J; Zhang, X; Luo, D; Ding, J; Xu, S; Huang, Y; Peng, R; Chai, J (April 2025, Conference on Language Modeling (COLM))

Free, publicly-accessible full text available April 30, 2026
Multi-Object Hallucination in Vision-Language Models

Chen, X; Ma, Z; Zhang, X; Xu, S; Qian, S; Yang, J; Fouhey, D; Chai, J (December 2024, Conference on Neural Information Processing Systems (NeurIPS))

Full Text Available
Teaching Embodied Reinforcement Learning Agents: Informativeness and Diversity of Language Use

Xi, J; He, Y; Dai, Y; Yang, J; Dai, Y; Chai, J (November 2024, Conference on Empirical Methods in Natural Language Processing (EMNLP))

Full Text Available
LLM-Grounder: Open-Vocabulary 3D Visual Grounding with Large Language Model as an Agent

Yang, J; Chen, X; Qian, S; Madaan, N; Iyengar, M; Fouhey, D; Chai, J (May 2024, IEEE International Conference on Robotics and Automation (ICRA))

Full Text Available
Think, Act, and Ask: Open-World Interactive Personalized Robot Navigation

Dai, Y; Peng, R; Li, S; Chai, J (May 2024, IEEE International Conference on Robotics and Automation (ICRA))

Full Text Available
Can Foundation Models Watch, Talk and Guide You Step by Step to Make a Cake?

Bao, Yuwei; Yu, Keunwoo Peter; Zhang, Yichi; Storks, Shane; Bar-Yossef, Itamar; De La Iglesia, Alexander; Su, Megan; Zheng, Xiaolin; Chai, Joyce (November 2023, Findings of Empirical Methods in Natural Language Processing)

Despite tremendous advances in AI, it remains a significant challenge to develop interactive task guidance systems that can offer situated, personalized guidance and assist humans in various tasks. These systems need to have a sophisticated understanding of the user as well as the environment, and make timely, accurate decisions on when and what to say. To address this issue, we created a new multimodal benchmark dataset, Watch, Talk and Guide (WTaG), based on natural interaction between a human user and a human instructor. We further proposed two tasks: User and Environment Understanding, and Instructor Decision Making. We leveraged several foundation models to study to what extent these models can be quickly adapted to perceptually enabled task guidance. Our quantitative, qualitative, and human evaluation results show that these models can demonstrate fair performance in some cases with no task-specific training, but fast and reliable adaptation remains a significant challenge. Our benchmark and baselines will provide a stepping stone for future work on situated task guidance.
Full Text Available
CycleNet: Rethinking Cycle Consistency in Text-Guided Diffusion for Image Manipulation

Xu, Sihan; Ma, Ziqiao; Huang, Yidong; Lee, Honglak; Chai, Joyce (November 2023, Thirty-seventh Conference on Neural Information Processing Systems)

Diffusion models (DMs) have enabled breakthroughs in image synthesis tasks but lack an intuitive interface for consistent image-to-image (I2I) translation. Various methods have been explored to address this issue, including mask-based methods, attention-based methods, and image-conditioning. However, it remains a critical challenge to enable unpaired I2I translation with pre-trained DMs while maintaining satisfactory consistency. This paper introduces CycleNet, a novel but simple method that incorporates cycle consistency into DMs to regularize image manipulation. We validate CycleNet on unpaired I2I tasks of different granularities. Besides scene-level and object-level translation, we additionally contribute a multi-domain I2I translation dataset to study the physical state changes of objects. Our empirical studies show that CycleNet is superior in translation consistency and quality, and can generate high-quality images for out-of-domain distributions with a simple change of the textual prompt. CycleNet is a practical framework that is robust even with very limited training data (around 2k) and requires minimal computational resources (1 GPU) to train.
Full Text Available
Grounding Visual Illusions in Language: Do Vision-Language Models Perceive Illusions Like Humans?

Zhang, Yichi; Pan, Jiayi; Zhou, Yuchen; Pan, Rui; Chai, Joyce (November 2023, Proceedings of Conference of Empirical Methods in Natural Language Processing)

Vision-Language Models (VLMs) are trained on vast amounts of data captured by humans emulating our understanding of the world. However, human perception of reality is not always faithful to the physical world, a phenomenon known as visual illusions. This raises a key question: do VLMs have similar kinds of illusions as humans do, or do they faithfully learn to represent reality? To investigate this question, we build a dataset containing five types of visual illusions and formulate four tasks to examine visual illusions in state-of-the-art VLMs. Our findings show that although the overall alignment is low, larger models are closer to human perception and more susceptible to visual illusions. Our dataset and initial findings will promote a better understanding of visual illusions in humans and machines and provide a stepping stone for future computational models that can better align humans and machines in perceiving and communicating about the shared visual world.
Full Text Available
Towards A Holistic Landscape of Situated Theory of Mind in Large Language Models

Ma, Ziqiao; Sansom, Jacob; Peng, Run; Chai, Joyce (November 2023, Findings of Empirical Methods in Natural Language Processing)

Large Language Models (LLMs) have generated considerable interest and debate regarding their potential emergence of Theory of Mind (ToM). Several recent inquiries reveal a lack of robust ToM in these models and pose a pressing demand to develop new benchmarks, as current ones primarily focus on different aspects of ToM and are prone to shortcuts and data leakage. In this position paper, we seek to answer two road-blocking questions: (1) How can we taxonomize a holistic landscape of machine ToM? (2) What is a more effective evaluation protocol for machine ToM? Following psychological studies, we taxonomize machine ToM into 7 mental state categories and delineate existing benchmarks to identify under-explored aspects of ToM. We argue for a holistic and situated evaluation of ToM to break ToM into individual components and treat LLMs as agents that are physically situated in environments and socially situated in interactions with humans. Such situated evaluation provides a more comprehensive assessment of mental states and potentially mitigates the risk of shortcuts and data leakage. We further present a pilot study in a grid world setup as a proof of concept. We hope this position paper can facilitate future research to integrate ToM with LLMs and offer an intuitive means for researchers to better position their work in the landscape of ToM.
Full Text Available
