Knowledge distillation aims to reduce model size without sacrificing much performance. Recent work has applied it to large vision-language (VL) Transformers and has shown that the attention maps in their multi-head attention modules contain extensive intra-modal and cross-modal co-reference relations worth distilling. The standard approach applies a one-to-one attention map distillation loss: the Teacher's first attention head instructs the Student's first head, the second teaches the second, and so forth. This only works when the Teacher and Student have the same number of attention heads. To remove this constraint, we propose a new Attention Map Alignment Distillation (AMAD) method for Transformers with multi-head attention, which works for a Teacher and a Student with different numbers of attention heads. Specifically, we soft-align the heads in the Teacher and Student attention maps using cosine similarity weighting: a Teacher head contributes more to the Student heads with which it has higher similarity, and each Teacher head contributes to all Student heads by minimizing the divergence between the attention activation distributions of the soft-aligned heads. No head is left behind. This distillation operates like cross-attention. We experiment on distilling VL-T5 and BLIP, applying the AMAD loss to their T5, BERT, and ViT sub-modules. We show, in the vision-language setting, that AMAD outperforms conventional distillation methods on the VQA-2.0, COCO captioning, and Multi30K translation datasets. We further show that even without VL pre-training, the distilled VL-T5 models outperform the corresponding VL pre-trained VL-T5 models that are further fine-tuned on ground-truth signals, and that fine-tuning distillation can also compensate to some degree for the absence of VL pre-training in BLIP models.
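To make the alignment step concrete, here is a minimal sketch of soft-aligned attention-map distillation as described above. It is not the paper's released code: the function name, the softmax-normalized cosine similarities used as alignment weights, and the KL divergence form are assumptions about one reasonable reading of the method.

```python
import torch
import torch.nn.functional as F

def amad_loss(teacher_attn: torch.Tensor, student_attn: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    """Hypothetical attention-map alignment distillation loss.

    teacher_attn: (B, Ht, L, L) attention probabilities from the Teacher.
    student_attn: (B, Hs, L, L) attention probabilities from the Student.
    The head counts Ht and Hs may differ.
    """
    B, Ht, L, _ = teacher_attn.shape
    Hs = student_attn.shape[1]

    # Flatten each head's attention map into a vector for similarity scoring.
    t_flat = teacher_attn.reshape(B, Ht, -1)             # (B, Ht, L*L)
    s_flat = student_attn.reshape(B, Hs, -1)             # (B, Hs, L*L)

    # Cosine similarity between every Teacher head and every Student head.
    t_norm = F.normalize(t_flat, dim=-1)
    s_norm = F.normalize(s_flat, dim=-1)
    sim = torch.einsum("bik,bjk->bij", t_norm, s_norm)   # (B, Ht, Hs)

    # Soft alignment: each Teacher head spreads its weight over all Student heads,
    # so no head is left behind.
    align = F.softmax(sim / tau, dim=-1)                  # (B, Ht, Hs)

    # KL divergence between each (Teacher head, Student head) pair of attention rows.
    log_t = torch.log(teacher_attn.clamp_min(1e-8))
    log_s = torch.log(student_attn.clamp_min(1e-8))
    kl = (teacher_attn.unsqueeze(2) *
          (log_t.unsqueeze(2) - log_s.unsqueeze(1))).sum(-1).mean(-1)  # (B, Ht, Hs)

    # Weight each pairwise divergence by its alignment score and average over the batch.
    return (align * kl).sum(-1).mean()
```

In a full distillation run, a loss like this would presumably be added, per sub-module (T5, BERT, or ViT), to the usual task and logit-distillation objectives.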
Cross-modal Map Learning for Vision and Language Navigation
We consider the problem of Vision-and-Language Navigation (VLN). The majority of current VLN methods are trained end-to-end using either unstructured memory such as an LSTM or cross-modal attention over the agent's egocentric observations. In contrast to other works, our key insight is that the association between language and vision is stronger when it occurs in explicit spatial representations. In this work, we propose a cross-modal map learning model for vision-and-language navigation that first learns to predict the top-down semantics on an egocentric map for both observed and unobserved regions, and then predicts a path towards the goal as a set of waypoints. In both cases, the prediction is informed by the language through cross-modal attention mechanisms. We experimentally test the basic hypothesis that language-driven navigation can be solved given a map, and then show competitive results on the full VLN-CE benchmark.
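A minimal sketch of how language could inform both the map and the waypoint predictions through cross-modal attention is given below. The module name, feature sizes, and the two linear heads are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class CrossModalMapHead(nn.Module):
    """Sketch: egocentric map features attend to instruction tokens, then
    per-cell heads predict top-down semantics and a waypoint heatmap."""

    def __init__(self, d_model: int = 256, n_heads: int = 8, n_classes: int = 27):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.sem_head = nn.Linear(d_model, n_classes)   # semantic label logits per map cell
        self.way_head = nn.Linear(d_model, 1)           # waypoint logit per map cell

    def forward(self, map_feats: torch.Tensor, text_feats: torch.Tensor):
        # map_feats: (B, H*W, D) flattened egocentric map cells
        # text_feats: (B, T, D) encoded instruction tokens
        fused, _ = self.cross_attn(query=map_feats, key=text_feats, value=text_feats)
        semantics = self.sem_head(fused)                 # (B, H*W, n_classes)
        waypoints = self.way_head(fused).squeeze(-1)     # (B, H*W) heatmap logits
        return semantics, waypoints
```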
- Award ID(s): 2212433
- PAR ID: 10464992
- Date Published:
- Journal Name: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
- Page Range / eLocation ID: 15439 to 15449
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
- Egocentric temporal action segmentation in videos is a crucial task in computer vision with applications in various fields such as mixed reality, human behavior analysis, and robotics. Although recent research has utilized advanced visual-language frameworks, transformers remain the backbone of action segmentation models. Therefore, it is necessary to improve transformers to enhance the robustness of action segmentation models. In this work, we propose two novel ideas to enhance the state-of-the-art transformer for action segmentation. First, we introduce a dual dilated attention mechanism to adaptively capture hierarchical representations in both local-to-global and global-to-local contexts. Second, we incorporate cross-connections between the encoder and decoder blocks to prevent the loss of local context by the decoder. We also utilize state-of-the-art visual-language representation learning techniques to extract richer and more compact features for our transformer. Our proposed approach outperforms other state-of-the-art methods on the Georgia Tech Egocentric Activities (GTEA) and HOI4D Office Tools datasets, and we validate our introduced components with ablation studies. The source code and supplementary materials are publicly available at https://www.sail-nu.com/dxformer. (A sketch of the dual dilated attention idea appears after this list.)
- Vision-language (VL) pre-training has recently received considerable attention. However, most existing end-to-end pre-training approaches either only aim to tackle VL tasks such as image-text retrieval, visual question answering (VQA) and image captioning that test high-level understanding of images, or only target region-level understanding for tasks such as phrase grounding and object detection. We present FIBER (Fusion-In-the-Backbone-based transformER), a new VL model architecture that can seamlessly handle both these types of tasks. Instead of having dedicated transformer layers for fusion after the uni-modal backbones, FIBER pushes multimodal fusion deep into the model by inserting cross-attention into the image and text backbones to better capture multimodal interactions. In addition, unlike previous work that is either only pre-trained on image-text data or on fine-grained data with box-level annotations, we present a two-stage pre-training strategy that uses both these kinds of data efficiently: (i) coarse-grained pre-training based on image-text data; followed by (ii) fine-grained pre-training based on image-text-box data. We conduct comprehensive experiments on a wide range of VL tasks, ranging from VQA, image captioning, and retrieval, to phrase grounding, referring expression comprehension, and object detection. Using deep multimodal fusion coupled with the two-stage pre-training, FIBER provides consistent performance improvements over strong baselines across all tasks, often outperforming methods using magnitudes more data. Code is released at https://github.com/microsoft/FIBER. (A sketch of a fusion-in-the-backbone block appears after this list.)
- Humans excel at efficiently navigating through crowds without collision by focusing on specific visual regions relevant to navigation. However, most robotic visual navigation methods rely on deep learning models pre-trained on vision tasks, which prioritize salient objects that are not necessarily relevant to navigation and can be misleading. Alternative approaches train specialized navigation models from scratch, requiring significant computation. On the other hand, self-supervised learning has revolutionized computer vision and natural language processing, but its application to robotic navigation remains underexplored due to the difficulty of defining effective self-supervision signals. Motivated by these observations, in this work, we propose a Self-Supervised Vision-Action Model for Visual Navigation Pre-Training (VANP). Instead of detecting salient objects that are beneficial for tasks such as classification or detection, VANP learns to focus only on specific visual regions that are relevant to the navigation task. To achieve this, VANP uses a history of visual observations, future actions, and a goal image for self-supervision, and embeds them using two small Transformer Encoders. Then, VANP maximizes the mutual information between the embeddings by using a mutual information maximization objective function. We demonstrate that most VANP-extracted features match with human navigation intuition. VANP achieves performance comparable to models learned end-to-end, with half the training time, and to models trained on a large-scale, fully supervised dataset, i.e., ImageNet, with only 0.08% of the data. (A sketch of an InfoNCE-style mutual-information objective appears after this list.)
- The ability to quickly learn a new task with minimal instruction - known as few-shot learning - is a central aspect of intelligent agents. Classical few-shot benchmarks make use of few-shot samples from a single modality, but such samples may not be sufficient to characterize an entire concept class. In contrast, humans use cross-modal information to learn new concepts efficiently. In this work, we demonstrate that one can indeed build a better visual dog classifier by reading about dogs and listening to them bark. To do so, we exploit the fact that recent multimodal foundation models such as CLIP are inherently cross-modal, mapping different modalities to the same representation space. Specifically, we propose a simple cross-modal adaptation approach that learns from few-shot examples spanning different modalities. By repurposing class names as additional one-shot training samples, we achieve SOTA results with an embarrassingly simple linear classifier for vision-language adaptation. Furthermore, we show that our approach can benefit existing methods such as prefix tuning, adapters, and classifier ensembling. Finally, to explore other modalities beyond vision and language, we construct the first (to our knowledge) audiovisual few-shot benchmark and use cross-modal training to improve the performance of both image and audio classification. (A sketch of the cross-modal linear probe appears after this list.)
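As referenced in the first item above, here is one hedged reading of a dual dilated attention layer for temporal action segmentation: a local branch restricted to a temporal window and a global branch that attends only to dilated (strided) frames, with their outputs averaged. The masks, default values, and fusion rule are assumptions for illustration, not the authors' code.

```python
import torch
import torch.nn as nn

def window_mask(T: int, window: int, device=None) -> torch.Tensor:
    """True = masked out: frames farther than `window` steps away are hidden."""
    idx = torch.arange(T, device=device)
    dist = (idx[None, :] - idx[:, None]).abs()
    return dist > window

def dilated_mask(T: int, dilation: int, device=None) -> torch.Tensor:
    """True = masked out: only frames at multiples of `dilation` remain visible."""
    idx = torch.arange(T, device=device)
    dist = (idx[None, :] - idx[:, None]).abs()
    return (dist % dilation) != 0

class DualDilatedAttention(nn.Module):
    """Two attention branches over per-frame features: local (windowed) context
    and global (dilated) context, averaged into one representation."""

    def __init__(self, d_model: int = 64, n_heads: int = 4,
                 local_window: int = 4, global_dilation: int = 8):
        super().__init__()
        self.local_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.local_window = local_window
        self.global_dilation = global_dilation

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, D) per-frame features
        T = x.size(1)
        local_out, _ = self.local_attn(
            x, x, x, attn_mask=window_mask(T, self.local_window, x.device))
        global_out, _ = self.global_attn(
            x, x, x, attn_mask=dilated_mask(T, self.global_dilation, x.device))
        return 0.5 * (local_out + global_out)
```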
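For the FIBER-style fusion-in-the-backbone idea, the sketch below augments a standard transformer block with cross-attention to the other modality, scaled by a learnable gate so that fusion starts as a no-op. The gating and layer layout are illustrative assumptions rather than the released FIBER code.

```python
import torch
import torch.nn as nn

class FusionBlock(nn.Module):
    """One backbone block with self-attention plus gated cross-attention
    to the tokens of the other modality."""

    def __init__(self, d_model: int = 768, n_heads: int = 12):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.gate = nn.Parameter(torch.zeros(1))   # fusion contributes nothing at init
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, x: torch.Tensor, other: torch.Tensor) -> torch.Tensor:
        # x: (B, Nx, D) tokens of this modality; other: (B, No, D) tokens of the other modality
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h)[0] + self.gate * self.cross_attn(h, other, other)[0]
        x = x + self.mlp(self.norm2(x))
        return x
```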
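For VANP's mutual-information objective, one common way to maximize information between two embeddings is an InfoNCE-style contrastive bound, sketched below as a generic stand-in; the exact estimator used in the paper may differ.

```python
import torch
import torch.nn.functional as F

def info_nce(z_a: torch.Tensor, z_b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE-style lower bound on mutual information between paired embeddings.

    z_a, z_b: (B, D) embeddings of two views of the same trajectory, e.g. the
    observation/action-history embedding and the goal-image embedding.
    """
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature              # (B, B) similarity of every pair
    targets = torch.arange(z_a.size(0), device=z_a.device)
    # Matching pairs sit on the diagonal; the rest of the batch serves as negatives.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```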
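Finally, the cross-modal few-shot adaptation in the last item can be pictured as a linear probe trained on pooled CLIP image and class-name embeddings, since both live in the same representation space. The helper below assumes pre-extracted, L2-normalized CLIP features; the function name and hyper-parameters are made up for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def train_cross_modal_probe(img_feats: torch.Tensor, img_labels: torch.Tensor,
                            txt_feats: torch.Tensor, txt_labels: torch.Tensor,
                            n_classes: int, epochs: int = 100, lr: float = 1e-3) -> nn.Linear:
    """Train a linear classifier on few-shot image embeddings plus class-name
    embeddings treated as extra one-shot training samples.

    img_feats: (Ni, D) normalized CLIP image embeddings of the few-shot images.
    txt_feats: (Nt, D) normalized CLIP text embeddings of the class names.
    """
    feats = torch.cat([img_feats, txt_feats])          # images + class-name "samples"
    labels = torch.cat([img_labels, txt_labels])
    clf = nn.Linear(feats.size(1), n_classes)
    opt = torch.optim.AdamW(clf.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = F.cross_entropy(clf(feats), labels)
        loss.backward()
        opt.step()
    return clf   # at test time, apply to normalized CLIP features of query images
```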