NSF PAR Search | NSF Public Access Repository

Learning Action-Effect Dynamics for Hypothetical Vision-Language Reasoning Task

Sampat, Shailaja; Banerjee, Pratyay; Yang, Yezhou; Baral, Chitta (December 2022, Findings of EMNLP 2022)

‘Actions’ play a vital role in how humans interact with the world. Thus, autonomous agents that would assist us in everyday tasks also require the capability to perform ‘Reasoning about Actions & Change’ (RAC). This has been an important research direction in Artificial Intelligence (AI) in general, but the study of RAC with visual and linguistic inputs is relatively recent. CLEVR_HYP is one such testbed for hypothetical vision-language reasoning with actions as the key focus. In this work, we propose a novel learning strategy that can improve reasoning about the effects of actions. We implement an encoder-decoder architecture to learn the representation of actions as vectors. We combine this encoder-decoder architecture with existing modality parsers and a scene graph question answering model to evaluate our proposed system on the CLEVR_HYP dataset. We conduct thorough experiments to demonstrate the effectiveness of our proposed approach and discuss its advantages over previous baselines in terms of performance, data efficiency, and generalization capability.
Full Text Available
CLEVR_HYP: A Challenge Dataset and Baselines for Visual Question Answering with Hypothetical Actions over Images

https://doi.org/10.18653/v1/2021.naacl-main.289

Sampat, Shailaja Keyur; Kumar, Akshay; Yang, Yezhou; Baral, Chitta (June 2021, Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies)
Most existing research on visual question answering (VQA) is limited to information explicitly present in an image or a video. In this paper, we take visual understanding to a higher level where systems are challenged to answer questions that involve mentally simulating the hypothetical consequences of performing specific actions in a given scenario. Towards that end, we formulate a vision-language question answering task based on the CLEVR (Johnson et al., 2017) dataset. We then modify the best existing VQA methods and propose baseline solvers for this task. Finally, we motivate the development of better vision-language models by providing insights about the capability of diverse architectures to perform joint reasoning over the image and text modalities. Our dataset setup scripts and code will be made publicly available at https://github.com/shailaja183/clevr_hyp.
Full Text Available
‘Just because you are right, doesn’t mean I am wrong’: Overcoming a bottleneck in development and evaluation of Open-Ended VQA tasks

https://doi.org/10.18653/v1/2021.eacl-main.240

Luo, Man; Sampat, Shailaja Keyur; Tallman, Riley; Zeng, Yankai; Vancha, Manuha; Sajja, Akarshan; Baral, Chitta (January 2021, Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume)
Merlo, Paola; Tiedemann, Jorg; Tsarfaty, Reut (Ed.)
GQA is a dataset for real-world visual reasoning and compositional question answering. We found that many answers predicted by the best vision-language models on the GQA dataset do not match the ground-truth answer but are still semantically meaningful and correct in the given context. In fact, this is the case with most existing visual question answering (VQA) datasets, which assume only one ground-truth answer for each question. To address this limitation, we propose Alternative Answer Sets (AAS) of ground-truth answers, which are created automatically using off-the-shelf NLP tools. We introduce a semantic metric based on AAS and modify top VQA solvers to support multiple plausible answers for a question. We implement this approach on the GQA dataset and show the performance improvements.
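The idea of scoring against a set of acceptable answers can be illustrated with a minimal sketch. This is not the authors' metric. It assumes a plain binary score in which a prediction counts as correct when it appears in the question's answer set. The function name `aas_accuracy` and the example answer sets are hypothetical.

```python
from typing import List, Set


def aas_accuracy(predictions: List[str], answer_sets: List[Set[str]]) -> float:
    """Fraction of predictions found in the per-question set of accepted answers.

    Assumption: each set already holds the ground-truth answer plus its
    alternatives, and every accepted answer earns full credit.
    """
    if len(predictions) != len(answer_sets):
        raise ValueError("predictions and answer_sets must have the same length")
    if not predictions:
        return 0.0
    correct = sum(1 for pred, accepted in zip(predictions, answer_sets) if pred in accepted)
    return correct / len(predictions)


# Hypothetical answer sets, written by hand for illustration only.
example_sets = [{"couch", "sofa"}, {"cat"}, {"kid", "child", "boy"}]
score = aas_accuracy(["sofa", "dog", "child"], example_sets)
```

Under exact-match scoring against a single ground truth, "sofa" and "child" would both be marked wrong here. With answer sets, only "dog" is.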
Full Text Available
A Model-Based Approach to Visual Reasoning on CNLVR Dataset

Sampat, Shailaja; Lee, Joohyung (October 2018, Proceedings of the 16th International Conference on Principles of Knowledge Representation and Reasoning)
Thielscher, Michael; Toni, Francesca; Wolter, Frank (Ed.)
Full Text Available
Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ NLP Tasks

https://doi.org/10.18653/v1/2022.emnlp-main.340

Wang, Yizhong; Mishra, Swaroop; Alipoormolabashi, Pegah; Kordi, Yeganeh; Mirzaei, Amirreza; Naik, Atharva; Ashok, Arjun; Dhanasekaran, Arut Selvan; Arunkumar, Anjana; Stap, David; et al. (January 2022, EMNLP)

Full Text Available
