CLEVR_HYP: A Challenge Dataset and Baselines for Visual Question Answering with Hypothetical Actions over Images

Sampat, Shailaja Keyur; Kumar, Akshay; Yang, Yezhou; Baral, Chitta

doi:10.18653/v1/2021.naacl-main.289

Most existing research on visual question answering (VQA) is limited to information explicitly present in an image or a video. In this paper, we take visual understanding to a higher level where systems are challenged to answer questions that involve mentally simulating the hypothetical consequences of performing specific actions in a given scenario. Towards that end, we formulate a vision-language question answering task based on the CLEVR (Johnson et al., 2017) dataset. We then modify the best existing VQA methods and propose baseline solvers for this task. Finally, we motivate the development of better vision-language models by providing insights about the capability of diverse architectures to perform joint reasoning over the image and text modalities. Our dataset setup scripts and code will be made publicly available at https://github.com/shailaja183/clevr_hyp.

Award ID(s): 1816039

PAR ID: 10283028

Author(s) / Creator(s): Sampat, Shailaja Keyur; Kumar, Akshay; Yang, Yezhou; Baral, Chitta

Date Published: 2021-06-01

Journal Name: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Page Range / eLocation ID: 3692 to 3709

Format(s): Medium: X

Sponsoring Org: National Science Foundation

Free Publicly Accessible Full Text
Accepted Manuscript 1.0
Conference Paper:
https://doi.org/10.18653/v1/2021.naacl-main.289
