NSF PAR Search | NSF Public Access Repository


Multi-Object Hallucination in Vision Language Models

https://doi.org/10.52202/079017-1409

Chai, Joyce; Chen, Xuweiyi; Fouhey, David; Ma, Ziqiao; Qian, Shengyi; Xu, Sihan; Yang, Jianing; Zhang, Xuejun (December 2024, Neural Information Processing Systems Foundation, Inc. (NeurIPS))

Full Text Available
Think, Act, and Ask: Open-World Interactive Personalized Robot Navigation

https://doi.org/10.1109/ICRA57147.2024.10610178

Dai, Yinpei; Peng, Run; Li, Sikai; Chai, Joyce (May 2024, IEEE)

Full Text Available
Natural language instructions for intuitive human interaction with robotic assistants in field construction work

https://doi.org/10.1016/j.autcon.2024.105345

Park, Somin; Wang, Xi; Menassa, Carol C; Kamat, Vineet R; Chai, Joyce Y (May 2024, Automation in Construction)

Full Text Available
LLM-Grounder: Open-Vocabulary 3D Visual Grounding with Large Language Model as an Agent

https://doi.org/10.1109/ICRA57147.2024.10610443

Yang, Jianing; Chen, Xuweiyi; Qian, Shengyi; Madaan, Nikhil; Iyengar, Madhavan; Fouhey, David F; Chai, Joyce (May 2024, IEEE)

Full Text Available
GIPCOL: Graph-Injected Soft Prompting for Compositional Zero-Shot Learning

https://doi.org/10.1109/WACV57701.2024.00567

Xu, Guangyue; Chai, Joyce; Kordjamshidi, Parisa (January 2024, IEEE)

Pre-trained vision-language models (VLMs) have achieved promising success in many fields, especially with the prompt learning paradigm. In this work, we propose GIPCOL (Graph-Injected Soft Prompting for Compositional Learning) to better explore the compositional zero-shot learning (CZSL) ability of VLMs within the prompt-based learning framework. The soft prompt in GIPCOL is structured and consists of the prefix learnable vectors, an attribute label, and an object label. In addition, the attribute and object labels in the soft prompt are designated as nodes in a compositional graph. The compositional graph is constructed based on the compositional structure of the objects and attributes extracted from the training data, and it feeds the updated concept representations into the soft prompt to capture this compositional structure for better prompting in CZSL. With the new prompting strategy, GIPCOL achieves state-of-the-art AUC results on all three CZSL benchmarks (MIT-States, UT-Zappos, and C-GQA) in both closed and open settings, compared to previous non-CLIP as well as CLIP-based methods. We analyze when and why GIPCOL operates well given the CLIP backbone and its training data limitations, and our findings shed light on designing more effective prompts for CZSL.
Full Text Available
Towards A Holistic Landscape of Situated Theory of Mind in Large Language Models

Ma, Ziqiao; Sansom, Jacob; Peng, Run; Chai, Joyce (November 2023, Findings of Empirical Methods in Natural Language Processing)

Large Language Models (LLMs) have generated considerable interest and debate regarding their potential emergence of Theory of Mind (ToM). Several recent inquiries reveal a lack of robust ToM in these models and pose a pressing demand to develop new benchmarks, as current ones primarily focus on different aspects of ToM and are prone to shortcuts and data leakage. In this position paper, we seek to answer two road-blocking questions: (1) How can we taxonomize a holistic landscape of machine ToM? (2) What is a more effective evaluation protocol for machine ToM? Following psychological studies, we taxonomize machine ToM into 7 mental state categories and delineate existing benchmarks to identify under-explored aspects of ToM. We argue for a holistic and situated evaluation of ToM that breaks ToM into individual components and treats LLMs as agents that are physically situated in environments and socially situated in interactions with humans. Such situated evaluation provides a more comprehensive assessment of mental states and potentially mitigates the risk of shortcuts and data leakage. We further present a pilot study in a grid world setup as a proof of concept. We hope this position paper can facilitate future research to integrate ToM with LLMs and offer an intuitive means for researchers to better position their work in the landscape of ToM.
Full Text Available
CycleNet: Rethinking Cycle Consistency in Text-Guided Diffusion for Image Manipulation

Xu, Sihan; Ma, Ziqiao; Huang, Yidong; Lee, Honglak; Chai, Joyce (November 2023, Thirty-seventh Conference on Neural Information Processing Systems)

Diffusion models (DMs) have enabled breakthroughs in image synthesis tasks but lack an intuitive interface for consistent image-to-image (I2I) translation. Various methods have been explored to address this issue, including mask-based methods, attention-based methods, and image-conditioning. However, it remains a critical challenge to enable unpaired I2I translation with pre-trained DMs while maintaining satisfying consistency. This paper introduces CycleNet, a novel but simple method that incorporates cycle consistency into DMs to regularize image manipulation. We validate CycleNet on unpaired I2I tasks of different granularities. Besides scene- and object-level translation, we additionally contribute a multi-domain I2I translation dataset to study the physical state changes of objects. Our empirical studies show that CycleNet is superior in translation consistency and quality, and can generate high-quality images for out-of-domain distributions with a simple change of the textual prompt. CycleNet is a practical framework that is robust even with very limited training data (around 2k) and requires minimal computational resources (1 GPU) to train.
Full Text Available
Grounding Visual Illusions in Language: Do Vision-Language Models Perceive Illusions Like Humans?

Zhang, Yichi; Pan, Jiayi; Zhou, Yuchen; Pan, Rui; Chai, Joyce (November 2023, Proceedings of Conference of Empirical Methods in Natural Language Processing)

Vision-Language Models (VLMs) are trained on vast amounts of data captured by humans emulating our understanding of the world. However, human perception of reality isn't always faithful to the physical world, a phenomenon known as visual illusions. This raises a key question: do VLMs have similar kinds of illusions as humans do, or do they faithfully learn to represent reality? To investigate this question, we build a dataset containing five types of visual illusions and formulate four tasks to examine visual illusions in state-of-the-art VLMs. Our findings show that although the overall alignment is low, larger models are closer to human perception and more susceptible to visual illusions. Our dataset and initial findings will promote a better understanding of visual illusions in humans and machines and provide a stepping stone for future computational models that can better align humans and machines in perceiving and communicating about the shared visual world.
Full Text Available
World-to-Words: Grounded Open Vocabulary Acquisition through Fast Mapping in Vision-Language Models

https://doi.org/10.18653/v1/2023.acl-long.31

Ma, Ziqiao; Pan, Jiayi; Chai, Joyce (July 2023, The 61st Annual Meeting of the Association for Computational Linguistics)

The ability to connect language units to their referents in the physical world, referred to as grounding, is crucial to learning and understanding grounded meanings of words. While humans demonstrate fast mapping in new word learning, it remains unclear whether modern vision-language models can truly represent language with their grounded meanings, and how grounding may further bootstrap new word learning. To this end, we introduce Grounded Open Vocabulary Acquisition (GOVA) to examine grounding and bootstrapping in open-world language learning. As an initial attempt, we propose object-oriented BERT (OctoBERT), a novel visually grounded language model pre-trained on image-text pairs with grounding as an objective. Through extensive experiments and analysis, we demonstrate that OctoBERT is a more coherent and faster grounded word learner, and that the grounding ability acquired during pre-training helps the model to learn unseen words more rapidly and robustly.
Full Text Available
Human Inspired Progressive Alignment and Comparative Learning for Grounded Word Acquisition

https://doi.org/10.18653/v1/2023.acl-long.863

Bao, Yuwei; Lattimer, Barrett; Chai, Joyce (July 2023, The 61st Annual Meeting of the Association for Computational Linguistics)

Human language acquisition is an efficient, supervised, and continual process. In this work, we took inspiration from how human babies acquire their first language, and developed a computational process for word acquisition through comparative learning. Motivated by cognitive findings, we generated a small dataset that enables computational models to compare the similarities and differences of various attributes and to learn to filter out and extract the common information for each shared linguistic label. We frame the acquisition of words not only as an information filtration process, but also as representation-symbol mapping. This procedure does not involve a fixed vocabulary size, nor a discriminative objective, and allows the models to continually learn more concepts efficiently. Our results in controlled experiments have shown the potential of this approach for efficient continual learning of grounded words.