Modality-Balanced Models for Visual Dialogue

Kim, Hyounghun; Tan, Hao; Bansal, Mohit

doi:10.1609/aaai.v34i05.6320

The Visual Dialog task requires a model to exploit both im- age and conversational context information to generate the next response to the dialogue. However, via manual analysis, we find that a large number of conversational questions can be answered by only looking at the image without any access to the context history, while others still need the conversa- tion context to predict the correct answers. We demonstrate that due to this reason, previous joint-modality (history and image) models over-rely on and are more prone to memoriz- ing the dialogue history (e.g., by extracting certain keywords or patterns in the context information), whereas image-only models are more generalizable (because they cannot memo- rize or extract keywords from history) and perform substan- tially better at the primary normalized discounted cumula- tive gain (NDCG) task metric which allows multiple correct answers. Hence, this observation encourages us to explic- itly maintain two models, i.e., an image-only model and an image-history joint model, and combine their complementary abilities for a more balanced multimodal model. We present multiple methods for this integration of the two models, via ensemble and consensus dropout fusion with shared param- eters. Empirically, our models achieve strong results on the Visual Dialog challenge 2019 (rank 3 on NDCG and high bal- ance across metrics), and substantially outperform the winner of the Visual Dialog challenge 2018 on most metrics.

More Like this