NSF PAR Search | NSF Public Access Repository

End-to-end Knowledge Retrieval with Multi-modal Queries

Luo, Man; Fang, Zhiyuan; Gokhale, Tejas; Yang, Yezhou; Baral, Chitta (July 2023, 61st Annual Meeting of the Association for Computational Linguistics)

We investigate knowledge retrieval with multi-modal queries, i.e. queries containing information split across image and text inputs, a challenging task that differs from previous work on cross-modal retrieval. We curate a new dataset called ReMuQ for benchmarking progress on this task. ReMuQ requires a system to retrieve knowledge from a large corpus by integrating contents from both text and image queries. We introduce a retriever model “ReViz” that can directly process input text and images to retrieve relevant knowledge in an end-to-end fashion without being dependent on intermediate modules such as object detectors or caption generators. We introduce a new pretraining task that is effective for learning knowledge retrieval with multimodal queries and also improves performance on downstream tasks. We demonstrate superior performance in retrieval on two datasets (ReMuQ and OK-VQA) under zero-shot settings as well as further improvements when finetuned on these datasets.
Full Text Available
CAVAN: Commonsense Knowledge Anchored Video Captioning

https://doi.org/10.1109/ICPR56361.2022.9956241

Shao, Huiliang; Fang, Zhiyuan; Yang, Yezhou (August 2022, 2022 26th International Conference on Pattern Recognition (ICPR))

Full Text Available
Injecting Semantic Concepts into End-to-End Image Captioning

https://doi.org/10.1109/CVPR52688.2022.01748

Fang, Zhiyuan; Wang, Jianfeng; Hu, Xiaowei; Liang, Lin; Gan, Zhe; Wang, Lijuan; Yang, Yezhou; Liu, Zicheng (June 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))

Full Text Available
Injecting Semantic Concepts Into End-to-End Image Captioning

Fang, Zhiyuan; Wang, Jianfeng; Hu, Xiaowei; Liang, Lin; Gan, Zhe; Wang, Lijuan; Yang, Yezhou and (January 2022, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))

Full Text Available
Video2Commonsense: Generating Commonsense Descriptions to Enrich Video Captioning

https://doi.org/10.18653/v1/2020.emnlp-main.61

Fang, Zhiyuan; Gokhale, Tejas; Banerjee, Pratyay; Baral, Chitta; Yang, Yezhou (January 2020, Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP))
Captioning is a crucial and challenging task for video understanding. In videos that involve active agents such as humans, the agent's actions can bring about myriad changes in the scene. Observable changes such as movements, manipulations, and transformations of the objects in the scene, are reflected in conventional video captioning. Unlike images, actions in videos are also inherently linked to social aspects such as intentions (why the action is taking place), effects (what changes due to the action), and attributes that describe the agent. Thus for video understanding, such as when captioning videos or when answering questions about videos, one must have an understanding of these commonsense aspects. We present the first work on generating commonsense captions directly from videos, to describe latent aspects such as intentions, effects, and attributes. We present a new dataset “Video-to-Commonsense (V2C)” that contains ~9k videos of human agents performing various actions, annotated with 3 types of commonsense descriptions. Additionally we explore the use of open-ended video-based commonsense question answering (V2C-QA) as a way to enrich our captions. Both the generation task and the QA task can be used to enrich video captions.
Full Text Available
Modularized Textual Grounding for Counterfactual Resilience

https://doi.org/10.1109/cvpr.2019.00654

Fang, Zhiyuan; Kong, Shu; Fowlkes, Charless; Yang, Yezhou (January 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))

Full Text Available
