Language is compositional; an instruction can ex- press multiple relation constraints to hold among objects in a scene that a robot is tasked to rearrange. Our focus in this work is an instructable scene-rearranging framework that gen- eralizes to longer instructions and to spatial concept compositions never seen at training time. We propose to represent language- instructed spatial concepts with energy functions over relative object arrangements. A language parser maps instructions to corresponding energy functions and an open-vocabulary visual- language model grounds their arguments to relevant objects in the scene. We generate goal scene configurations by gradient descent on the sum of energy functions, one per language predi- cate in the instruction. Local vision-based policies then re-locate objects to the inferred goal locations. We test our model on es- tablished instruction-guided manipulation benchmarks, as well as benchmarks of compositional instructions we introduce. We show our model can execute highly compositional instructions zero-shot in simulation and in the real world. It outperforms language- to-action reactive policies and Large Language Model planners by a large margin, especially for long instructions that involve compositions of multiple spatial concepts. Simulation and real- world robot execution videos, as well as our code and datasets are publicly available on our website: https://ebmplanner.github.io. 
                        more » 
                        « less   
                    
                            
                            A Modular Vision Language Navigation and Manipulation Framework for Long Horizon Compositional Tasks in Indoor Environment
                        
                    
    
            In this paper we propose a new framework—MoViLan (Modular Vision and Language) for execution of visually grounded natural language instructions for day to day indoor household tasks. While several data-driven, end-to-end learning frameworks have been proposed for targeted navigation tasks based on the vision and language modalities, performance on recent benchmark data sets revealed the gap in developing comprehensive techniques for long horizon, compositional tasks (involving manipulation and navigation) with diverse object categories, realistic instructions and visual scenarios with non reversible state changes. We propose a modular approach to deal with the combined navigation and object interaction problem without the need for strictly aligned vision and language training data (e.g., in the form of expert demonstrated trajectories). Such an approach is a significant departure from the traditional end-to-end techniques in this space and allows for a more tractable training process with separate vision and language data sets. Specifically, we propose a novel geometry-aware mapping technique for cluttered indoor environments, and a language understanding model generalized for household instruction following. We demonstrate a significant increase in success rates for long horizon, compositional tasks over recent works on the recently released benchmark data set -ALFRED. 
        more » 
        « less   
        
    
                            - Award ID(s):
- 1932505
- PAR ID:
- 10358025
- Date Published:
- Journal Name:
- Frontiers in Robotics and AI
- Volume:
- 9
- ISSN:
- 2296-9144
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
- 
            
- 
            Despite recent progress, learning new tasks through language instructions remains an extremely challenging problem. On the ALFRED benchmark for task learning, the published state-of-the-art system only achieves a task success rate of less than 10% in an unseen environment, compared to the human performance of over 90%. To address this issue, this paper takes a closer look at task learning. In a departure from a widely applied end-to-end architecture, we decomposed task learning into three sub-problems: sub-goal planning, scene navigation, and object manipulation; and developed a model HiTUT1 (stands for Hierarchical Tasks via Unified Transformers) that addresses each sub-problem in a unified manner to learn a hierarchical task structure. On the ALFRED benchmark, HiTUT has achieved the best performance with a remarkably higher generalization ability. In the unseen environment, HiTUT achieves over 160% performance gain in success rate compared to the previous state of the art. The explicit representation of task structures also enables an in-depth understanding of the nature of the problem and the ability of the agent, which provides insight for future benchmark development and evaluation.more » « less
- 
            null (Ed.)Enabling robots to learn tasks and follow instructions as easily as humans is important for many real-world robot applications. Previous approaches have applied machine learning to teach the mapping from language to low dimensional symbolic representations constructed by hand, using demonstration trajectories paired with accompanying instructions. These symbolic methods lead to data efficient learning. Other methods map language directly to high-dimensional control behavior, which requires less design effort but is data-intensive. We propose to first learning symbolic abstractions from demonstration data and then mapping language to those learned abstractions. These symbolic abstractions can be learned with significantly less data than end-to-end approaches, and support partial behavior specification via natural language since they permit planning using traditional planners. During training, our approach requires only a small number of demonstration trajectories paired with natural language—without the use of a simulator—and results in a representation capable of planning to fulfill natural language instructions specifying a goal or partial plan. We apply our approach to two domains, including a mobile manipulator, where a small number of demonstrations enable the robot to follow navigation commands like “Take left at the end of the hallway,” in environments it has not encountered before.more » « less
- 
            Enabling robots to learn tasks and follow instructions as easily as humans is important for many real-world robot applications. Previous approaches have applied machine learning to teach the mapping from language to low dimensional symbolic representations constructe by hand, using demonstration trajectories paired with accompanying instructions. These symbolic methods lead to data efficient learning. Other methods map language directly to high-dimensional control behavior, which requires less design effort but is data-intensive. We propose to first learning symbolic abstractions from demonstration data and then mapping language to those learned abstractions. These symbolic abstractions can be learned with significantly less data than end-to-end approaches, and support partial behavior specification via natural language since they permit planning using traditional planners. During training, our approach requires only a small number of demonstration trajectories paired with natural language—without the use of a simulator—and results in a representation capable of planning to fulfill natural language instructions specifying a goal or partial plan. We apply our approach to two domains, including a mobile manipulator, where a small number of demonstrations enable the robot to follow navigation commands like “Take left at the end of the hallway,” in environments it has not encountered before.more » « less
- 
            Humans excel at efficiently navigating through crowds without collision by focusing on specific visual regions relevant to navigation. However, most robotic visual navigation methods rely on deep learning models pre-trained on vision tasks, which prioritize salient objects—not necessarily relevant to navigation and potentially misleading. Alternative approaches train specialized navigation models from scratch, requiring significant computation. On the other hand, self-supervised learning has revolutionized computer vision and natural language processing, but its application to robotic navigation remains underexplored due to the difficulty of defining effective self-supervision signals. Motivated by these observations, in this work, we propose a Self-Supervised Vision-Action Model for Visual Navigation Pre-Training (VANP). Instead of detecting salient objects that are beneficial for tasks such as classification or detection, VANP learns to focus only on specific visual regions that are relevant to the navigation task. To achieve this, VANP uses a history of visual observations, future actions, and a goal image for self-supervision, and embeds them using two small Transformer Encoders. Then, VANP maximizes the information between the embeddings by using a mutual information maximization objective function. We demonstrate that most VANP-extracted features match with human navigation intuition. VANP achieves comparable performance as models learned end-to-end with half the training time and models trained on a large-scale, fully supervised dataset, i.e., ImageNet, with only 0.08% data.more » « less
 An official website of the United States government
An official website of the United States government 
				
			 
					 
					
 
                                    