skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Zero Experience Required: Plug & Play Modular Transfer Learning for Semantic Visual Navigation
In reinforcement learning for visual navigation, it is common to develop a model for each new task, and train that model from scratch with task-specific interactions in 3D environments. However, this process is expensive; massive amounts of interactions are needed for the model to generalize well. Moreover, this process is repeated when-ever there is a change in the task type or the goal modality. We present a unified approach to visual navigation using a novel modular transfer learning model. Our model can effectively leverage its experience from one source task and apply it to multiple target tasks (e.g., ObjectNav, Room-Nav, Vi ewNav) with various goal modalities (e.g., image, sketch, audio, label). Furthermore, our model enables zero-shot experience learning, whereby it can solve the target tasks without receiving any task-specific interactive training. Our experiments on multiple photorealistic datasets and challenging tasks show that our approach learns faster, generalizes better, and outperforms SoTA models by a significant margin.  more » « less
Award ID(s):
2120430
PAR ID:
10376660
Author(s) / Creator(s):
; ;
Date Published:
Journal Name:
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Page Range / eLocation ID:
17010 to 17020
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Language understanding is essential for the navigation agent to follow instructions. We observe two kinds of issues in the instructions that can make the navigation task challenging: 1. The mentioned landmarks are not recognizable by the navigation agent due to the different vision abilities of the instructor and the modeled agent. 2. The mentioned landmarks are applicable to multiple targets, thus not distinctive for selecting the target among the candidate viewpoints. To deal with these issues, we design a translator module for the navigation agent to convert the original instructions into easy-to-follow sub-instruction representations at each step. The translator needs to focus on the recognizable and distinctive landmarks based on the agent’s visual abilities and the observed visual environment. To achieve this goal, we create a new synthetic sub-instruction dataset and design specific tasks to train the translator and the navigation agent. We evaluate our approach on Room2Room (R2R), Room4room (R4R), and Room2Room Last (R2R-Last) datasets and achieve state-of-the-art results on multiple benchmarks. 
    more » « less
  2. The advances in deep reinforcement learning re- cently revived interest in data-driven learning based approaches to navigation. In this paper we propose to learn viewpoint invariant and target invariant visual servoing for local mobile robot navigation; given an initial view and the goal view or an image of a target, we train deep convolutional network controller to reach the desired goal. We present a new architecture for this task which rests on the ability of establishing correspondences between the initial and goal view and novel reward structure motivated by the traditional feedback control error. The advantage of the proposed model is that it does not require calibration and depth information and achieves robust visual servoing in a variety of environments and targets without any parameter fine tuning. We present comprehensive evaluation of the approach and comparison with other deep learning architectures as well as classical visual servoing methods in visually realistic simulation environment [1]. The presented model overcomes the brittleness of classical visual servoing based methods and achieves significantly higher generalization capability compared to the previous learning approaches. 
    more » « less
  3. Few-shot knowledge graph (KG) completion task aims to perform inductive reasoning over the KG: given only a few support triplets of a new relation R (e.g., (chop, R, kitchen), (read, R, library)), the goal is to predict the query triplets of the same unseen relation R, e.g., (sleep, R, ?). Current approaches cast the problem in a meta-learning framework, where the model needs to be first jointly trained over many training few-shot tasks, each being defined by its own relation, so that learning/prediction on the target few-shot task can be effective. However, in real-world KGs, curating many training tasks is a challenging ad hoc process. We proposed Connection Subgraph Reasoner (CSR), which can make predictions for the target few-shot task directly without the need for pre-training on the human curated set of training tasks. The key to CSR is that we explicitly model a shared connection subgraph between support and query triplets, as inspired by the principle of eliminative induction. To adapt to specific KG, we design a corresponding self-supervised pretraining scheme with the objective of reconstructing automatically sampled connection subgraphs. Our pretrained model can then be directly applied to target few-shot tasks without the need for training few-shot tasks. Extensive experiments on real KGs, including NELL, FB15K-237, and ConceptNet, demonstrate the effectiveness of our framework: we have shown that even a learning-free implementation of CSR can already perform competitively to existing methods on target few-shot tasks; with pretraining, CSR can achieve significant gains of up to 52% on the more challenging inductive few-shot tasks where the entities are also unseen during (pre)training. 
    more » « less
  4. Humans excel at efficiently navigating through crowds without collision by focusing on specific visual regions relevant to navigation. However, most robotic visual navigation methods rely on deep learning models pre-trained on vision tasks, which prioritize salient objects—not necessarily relevant to navigation and potentially misleading. Alternative approaches train specialized navigation models from scratch, requiring significant computation. On the other hand, self-supervised learning has revolutionized computer vision and natural language processing, but its application to robotic navigation remains underexplored due to the difficulty of defining effective self-supervision signals. Motivated by these observations, in this work, we propose a Self-Supervised Vision-Action Model for Visual Navigation Pre-Training (VANP). Instead of detecting salient objects that are beneficial for tasks such as classification or detection, VANP learns to focus only on specific visual regions that are relevant to the navigation task. To achieve this, VANP uses a history of visual observations, future actions, and a goal image for self-supervision, and embeds them using two small Transformer Encoders. Then, VANP maximizes the information between the embeddings by using a mutual information maximization objective function. We demonstrate that most VANP-extracted features match with human navigation intuition. VANP achieves comparable performance as models learned end-to-end with half the training time and models trained on a large-scale, fully supervised dataset, i.e., ImageNet, with only 0.08% data. 
    more » « less
  5. Abstract Investigations into how individual neurons encode behavioral variables of interest have revealed specific representations in single neurons, such as place and object cells, as well as a wide range of cells with conjunctive encodings or mixed selectivity. However, as most experiments examine neural activity within individual tasks, it is currently unclear if and how neural representations change across different task contexts. Within this discussion, the medial temporal lobe is particularly salient, as it is known to be important for multiple behaviors including spatial navigation and memory, however the relationship between these functions is currently unclear. Here, to investigate how representations in single neurons vary across different task contexts in the medial temporal lobe, we collected and analyzed single‐neuron activity from human participants as they completed a paired‐task session consisting of a passive‐viewing visual working memory and a spatial navigation and memory task. Five patients contributed 22 paired‐task sessions, which were spike sorted together to allow for the same putative single neurons to be compared between the different tasks. Within each task, we replicated concept‐related activations in the working memory task, as well as target‐location and serial‐position responsive cells in the navigation task. When comparing neuronal activity between tasks, we first established that a significant number of neurons maintained the same kind of representation, responding to stimuli presentations across tasks. Further, we found cells that changed the nature of their representation across tasks, including a significant number of cells that were stimulus responsive in the working memory task that responded to serial position in the spatial task. Overall, our results support a flexible encoding of multiple, distinct aspects of different tasks by single neurons in the human medial temporal lobe, whereby some individual neurons change the nature of their feature coding between task contexts. 
    more » « less