Title: Pre-Training Multi-Modal Dense Retrievers for Outside-Knowledge Visual Question Answering
This paper studies a category of visual question answering tasks in which accessing external knowledge is necessary for answering the questions. This category is called outside-knowledge visual question answering (OK-VQA). A major step in developing OK-VQA systems is to retrieve relevant documents for the given multi-modal query. The current state-of-the-art dense retrieval model for this task uses an asymmetric architecture with a multi-modal query encoder and a uni-modal document encoder. Such an architecture requires a large amount of training data for effective performance. We propose an automatic data generation pipeline for pre-training passage retrieval models for OK-VQA tasks. The proposed approach leads to a 26.9% improvement in Precision@5 over the current state of the art. Additionally, the proposed pre-training approach performs well in zero-shot retrieval scenarios.
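To make the retrieval setup concrete, below is a minimal sketch of an asymmetric dual-encoder retriever trained with in-batch negatives: a multi-modal query encoder and a text-only document encoder map into a shared space, and relevance is a dot product. The tiny linear encoders, feature dimensions, and temperature are illustrative assumptions, not the architecture from the paper.

```python
import torch
import torch.nn.functional as F

DIM = 128  # shared embedding dimension (illustrative)

class MultiModalQueryEncoder(torch.nn.Module):
    """Fuses image and question features into a single query embedding."""
    def __init__(self, img_dim=512, txt_dim=300, dim=DIM):
        super().__init__()
        self.proj = torch.nn.Linear(img_dim + txt_dim, dim)

    def forward(self, img_feat, q_feat):
        fused = torch.cat([img_feat, q_feat], dim=-1)
        return F.normalize(self.proj(fused), dim=-1)

class TextDocumentEncoder(torch.nn.Module):
    """Uni-modal (text-only) encoder for knowledge passages."""
    def __init__(self, txt_dim=300, dim=DIM):
        super().__init__()
        self.proj = torch.nn.Linear(txt_dim, dim)

    def forward(self, d_feat):
        return F.normalize(self.proj(d_feat), dim=-1)

# In-batch contrastive step: each query's positive passage sits on the
# diagonal of the score matrix; other passages in the batch are negatives.
q_enc, d_enc = MultiModalQueryEncoder(), TextDocumentEncoder()
img, qst, docs = torch.randn(8, 512), torch.randn(8, 300), torch.randn(8, 300)
scores = q_enc(img, qst) @ d_enc(docs).T        # (8, 8) similarity matrix
loss = F.cross_entropy(scores / 0.05, torch.arange(8))
loss.backward()
```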
Award ID(s):
2106282
PAR ID:
10434902
Author(s) / Creator(s):
Date Published:
Journal Name:
Proceedings of The 13th International Conference on the Theory of Information Retrieval (ICTIR 2023)
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Knowledge-Intensive Visual Question Answering (KI-VQA) refers to answering a question about an image whose answer does not lie in the image. This paper presents a new pipeline for KI-VQA tasks, consisting of a retriever and a reader. First, we introduce DEDR, a symmetric dual encoding dense retrieval framework in which documents and queries are encoded into a shared embedding space using uni-modal (textual) and multi-modal encoders. We introduce an iterative knowledge distillation approach that bridges the gap between the representation spaces in these two encoders. Extensive evaluation on two well-established KI-VQA datasets, i.e., OK-VQA and FVQA, suggests that DEDR outperforms state-of-the-art baselines by 11.6% and 30.9% on OK-VQA and FVQA, respectively. Utilizing the passages retrieved by DEDR, we further introduce MM-FiD, an encoder-decoder multi-modal fusion-in-decoder model, for generating a textual answer for KI-VQA tasks. MM-FiD encodes the question, the image, and each retrieved passage separately and uses all passages jointly in its decoder. Compared to competitive baselines in the literature, this approach leads to 5.5% and 8.5% improvements in terms of question-answering accuracy on OK-VQA and FVQA, respectively. 
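The distillation idea behind DEDR can be sketched as aligning the passage-score distributions produced by the textual and multi-modal query encoders. This is a hedged illustration of the general technique; the exact losses and iterative schedule in the paper may differ.

```python
import torch
import torch.nn.functional as F

def distill_loss(student_scores, teacher_scores, tau=2.0):
    """KL divergence pulling the student's passage-score distribution
    toward the teacher's (the teacher is detached, so only the student learns)."""
    teacher = F.softmax(teacher_scores / tau, dim=-1).detach()
    student = F.log_softmax(student_scores / tau, dim=-1)
    return F.kl_div(student, teacher, reduction="batchmean")

# Toy scores: 4 queries, each scored against 16 candidate passages,
# once by the multi-modal encoder and once by the textual encoder.
scores_mm = torch.randn(4, 16, requires_grad=True)
scores_txt = torch.randn(4, 16, requires_grad=True)

# One distillation round in each direction; iterating such rounds
# gradually pulls the two representation spaces together.
loss = distill_loss(scores_txt, scores_mm) + distill_loss(scores_mm, scores_txt)
loss.backward()
```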
  2. Outside-knowledge visual question answering (OK-VQA) requires the agent to comprehend the image, make use of relevant knowledge from the entire web, and digest all the information to answer the question. Most previous works address the problem by first fusing the image and question in the multi-modal space, which is inflexible for further fusion with a vast amount of external knowledge. In this paper, we call for an alternative paradigm for the OK-VQA task, which transforms the image into plain text so that knowledge passage retrieval and generative question answering can be carried out in the natural language space. This paradigm takes advantage of the sheer volume of gigantic knowledge bases and the richness of pre-trained language models. We propose a Transform-Retrieve-Generate (TRiG) framework, which can be used in a plug-and-play fashion with alternative image-to-text models and textual knowledge bases. Experimental results show that our TRiG framework outperforms all state-of-the-art supervised methods by at least an 11.1% absolute margin.
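The TRiG control flow can be sketched as three pluggable stages. The functions below are hypothetical stand-ins for an image-to-text model, a text retriever, and a generative reader, not the paper's implementation.

```python
from typing import List

def caption_image(image_path: str) -> str:
    # Transform: any image-to-text model can be plugged in here (stub).
    return "a red double-decker bus on a city street"

def retrieve(query: str, k: int = 5) -> List[str]:
    # Retrieve: query a textual knowledge base in natural-language space (stub).
    return [f"passage {i} about: {query}" for i in range(k)]

def generate(question: str, context: str, passages: List[str]) -> str:
    # Generate: a seq2seq reader would condition on question + evidence (stub).
    return f"answer derived from {len(passages)} passages"

def trig(image_path: str, question: str) -> str:
    context = caption_image(image_path)           # image -> plain text
    passages = retrieve(f"{question} {context}")  # knowledge passage retrieval
    return generate(question, context, passages)  # generative QA

print(trig("bus.jpg", "Which city is this bus most associated with?"))
```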
  3. Moens, Marie-Francine; Huang, Xuanjing; Specia, Lucia; Yih, Scott Wen-tau (Ed.)
    Knowledge-based visual question answering (VQA) requires answering questions with external knowledge in addition to the content of images. The dataset most commonly used to evaluate knowledge-based VQA is OK-VQA, but it lacks a gold standard knowledge corpus for retrieval. Existing work leverages different knowledge bases (e.g., ConceptNet and Wikipedia) to obtain external knowledge. Because of the varying knowledge bases, it is hard to fairly compare models' performance. To address this issue, we collect a natural language knowledge base that can be used for any VQA system. Moreover, we propose a Visual Retriever-Reader pipeline to approach knowledge-based VQA. The visual retriever aims to retrieve relevant knowledge, and the visual reader seeks to predict answers based on the given knowledge. We introduce various ways to retrieve knowledge using text and images, and two reader styles: classification and extraction. Both the retriever and reader are trained with weak supervision. Our experimental results show that a good retriever can significantly improve the reader's performance on the OK-VQA challenge.
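The two reader styles can be illustrated with toy heads over passage-token representations: a classification reader pools tokens and scores a fixed answer vocabulary, while an extraction reader predicts an answer span. The dimensions and heads below are illustrative assumptions, not the paper's configuration.

```python
import torch

hidden = torch.randn(1, 32, 256)       # (batch, passage tokens, hidden dim)

# Classification reader: pool token states, score each candidate answer.
cls_head = torch.nn.Linear(256, 1000)  # 1000 candidate answers (assumed)
answer_id = cls_head(hidden.mean(dim=1)).argmax(-1)

# Extraction reader: start/end logits over passage tokens (SQuAD-style).
span_head = torch.nn.Linear(256, 2)
start_logits, end_logits = span_head(hidden).split(1, dim=-1)
start = start_logits.squeeze(-1).argmax(-1)
end = end_logits.squeeze(-1).argmax(-1)
print(answer_id.item(), start.item(), end.item())
```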
  4. Vision-language (VL) pre-training has recently received considerable attention. However, most existing end-to-end pre-training approaches either only aim to tackle VL tasks such as image-text retrieval, visual question answering (VQA) and image captioning that test high-level understanding of images, or only target region-level understanding for tasks such as phrase grounding and object detection. We present FIBER (Fusion-In-the-Backbone-based transformER), a new VL model architecture that can seamlessly handle both these types of tasks. Instead of having dedicated transformer layers for fusion after the uni-modal backbones, FIBER pushes multimodal fusion deep into the model by inserting cross-attention into the image and text backbones to better capture multimodal interactions. In addition, unlike previous work that is either only pre-trained on image-text data or on fine-grained data with box-level annotations, we present a two-stage pre-training strategy that uses both these kinds of data efficiently: (i) coarse-grained pre-training based on image-text data; followed by (ii) fine-grained pre-training based on image-text-box data. We conduct comprehensive experiments on a wide range of VL tasks, ranging from VQA, image captioning, and retrieval, to phrase grounding, referring expression comprehension, and object detection. Using deep multimodal fusion coupled with the two-stage pre-training, FIBER provides consistent performance improvements over strong baselines across all tasks, often outperforming methods trained on orders of magnitude more data. Code is released at https://github.com/microsoft/FIBER.
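A hedged sketch of the fusion-in-the-backbone idea: cross-attention layers inserted between the image and text streams at a given backbone depth, with residual connections, rather than a separate fusion stack on top of the uni-modal backbones. Dimensions and head counts below are illustrative, not FIBER's actual configuration.

```python
import torch

class CrossAttnFusion(torch.nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.img2txt = torch.nn.MultiheadAttention(dim, heads, batch_first=True)
        self.txt2img = torch.nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, img_tokens, txt_tokens):
        # Each stream queries the other and keeps a residual connection,
        # so fusion happens inside the backbone at this layer.
        img_out, _ = self.img2txt(img_tokens, txt_tokens, txt_tokens)
        txt_out, _ = self.txt2img(txt_tokens, img_tokens, img_tokens)
        return img_tokens + img_out, txt_tokens + txt_out

img = torch.randn(2, 49, 256)   # e.g. a 7x7 patch grid from a vision backbone
txt = torch.randn(2, 16, 256)   # 16 subword tokens from a text backbone
img_fused, txt_fused = CrossAttnFusion()(img, txt)
print(img_fused.shape, txt_fused.shape)
```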
  5. Visual Question Answering (VQA) is a fundamental task in the fields of computer vision and natural language processing. Although the "pre-training & fine-tuning" learning paradigm significantly improves VQA performance, the adversarial robustness of such a learning paradigm has not been explored. In this paper, we delve into a new problem: using a pre-trained multimodal source model to create adversarial image-text pairs and then transferring them to attack target VQA models. Correspondingly, we propose a novel VQAttack model, which can iteratively generate both image and text perturbations with the designed modules: the large language model (LLM)-enhanced image attack module and the cross-modal joint attack module. At each iteration, the LLM-enhanced image attack module first optimizes a latent-representation-based loss to generate feature-level image perturbations. Then it incorporates an LLM to further enhance the image perturbations by optimizing the designed masked-answer anti-recovery loss. The cross-modal joint attack module is triggered at a specific iteration and updates the image and text perturbations sequentially. Notably, the text perturbation updates are based on both the learned gradients in the word embedding space and word-synonym-based substitution. Experimental results on two VQA datasets with five validated models demonstrate the effectiveness of the proposed VQAttack in the transferable attack setting, compared with state-of-the-art baselines. This work reveals a significant blind spot in the "pre-training & fine-tuning" paradigm on VQA tasks. The source code can be found at https://github.com/ericyinyzy/VQAttack.
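A much-simplified stand-in for the feature-level image-perturbation step is a PGD-style loop that pushes a surrogate model's latent representation away from that of the clean image under an L-infinity budget. The encoder below is a hypothetical placeholder, and the LLM-enhanced and text-perturbation modules are beyond this sketch.

```python
import torch
import torch.nn.functional as F

def encode(x):
    # Hypothetical surrogate image encoder (stands in for the source model).
    return torch.tanh(x.flatten(1) @ torch.ones(3 * 8 * 8, 16))

image = torch.rand(1, 3, 8, 8)                   # toy image in [0, 1]
clean_feat = encode(image).detach()
delta = torch.zeros_like(image, requires_grad=True)
eps, step = 8 / 255, 2 / 255                     # budget and step size

for _ in range(10):
    # Maximize the distance between perturbed and clean representations.
    loss = -F.mse_loss(encode(image + delta), clean_feat)
    loss.backward()
    with torch.no_grad():
        delta -= step * delta.grad.sign()        # signed gradient step
        delta.clamp_(-eps, eps)                  # project back into the budget
    delta.grad.zero_()

adv_image = (image + delta).clamp(0, 1).detach() # hand off to the target model
```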