Language is compositional; an instruction can express multiple relation constraints to hold among objects in a scene that a robot is tasked to rearrange. Our focus in this work is an instructable scene-rearranging framework that generalizes to longer instructions and to spatial concept compositions never seen at training time. We propose to represent language-instructed spatial concepts with energy functions over relative object arrangements. A language parser maps instructions to corresponding energy functions and an open-vocabulary visual-language model grounds their arguments to relevant objects in the scene. We generate goal scene configurations by gradient descent on the sum of energy functions, one per language predicate in the instruction. Local vision-based policies then re-locate objects to the inferred goal locations. We test our model on established instruction-guided manipulation benchmarks, as well as benchmarks of compositional instructions we introduce. We show our model can execute highly compositional instructions zero-shot in simulation and in the real world. It outperforms language-to-action reactive policies and Large Language Model planners by a large margin, especially for long instructions that involve compositions of multiple spatial concepts. Simulation and real-world robot execution videos, as well as our code and datasets, are publicly available on our website: https://ebmplanner.github.io.
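The abstract above describes an optimization procedure: one energy term per language predicate, summed and minimized over object poses. The following is a minimal illustrative sketch of that idea in Python, not the authors' released code; the two predicate energies (a "left of" hinge and a "near" distance term), the finite-difference gradients, the margins, and the step sizes are all assumptions made for this example.

```python
# Illustrative sketch only: compose per-predicate energies over 2D object
# positions and obtain a goal arrangement by gradient descent on their sum.
import numpy as np

def left_of_energy(pos, a, b, margin=0.1):
    """Penalize object a not being to the left of object b along the x-axis (assumed form)."""
    gap = pos[a, 0] - pos[b, 0] + margin   # > 0 when the constraint is violated
    return max(gap, 0.0) ** 2

def near_energy(pos, a, b, target=0.15):
    """Penalize object a deviating from a preferred distance to object b (assumed form)."""
    return (np.linalg.norm(pos[a] - pos[b]) - target) ** 2

def total_energy(pos, constraints):
    """Sum the energies, one per language predicate in the instruction."""
    return sum(fn(pos, *args) for fn, *args in constraints)

def numeric_grad(pos, constraints, eps=1e-4):
    """Finite-difference gradient of the summed energy w.r.t. every coordinate."""
    base = total_energy(pos, constraints)
    grad = np.zeros_like(pos)
    for idx in np.ndindex(*pos.shape):
        bumped = pos.copy()
        bumped[idx] += eps
        grad[idx] = (total_energy(bumped, constraints) - base) / eps
    return grad

def infer_goal(pos, constraints, steps=500, lr=0.05):
    """Gradient descent on object positions to minimize the composed energies."""
    pos = pos.copy()
    for _ in range(steps):
        pos -= lr * numeric_grad(pos, constraints)
    return pos

# "Put the mug to the left of the plate and near the bowl": two predicates, one scene.
start = np.array([[0.5, 0.0],   # mug
                  [0.0, 0.0],   # plate
                  [0.4, 0.4]])  # bowl
constraints = [(left_of_energy, 0, 1), (near_energy, 0, 2)]
print(infer_goal(start, constraints))  # positions after minimizing the composed energies
```

In the actual system, the energy functions are learned models whose arguments are grounded by an open-vocabulary visual-language model, and the optimized poses are handed to local vision-based policies for execution; the toy example above only reproduces the compose-and-descend structure.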
                            Learning of Complex-Structured Tasks from Verbal Instruction
                        
                    
    
This paper presents a novel approach to robot task learning from language-based instructions, which focuses on increasing the complexity of task representations that can be taught through verbal instruction. The major proposed contribution is the development of a framework for directly mapping a complex verbal instruction to an executable task representation, from a single training experience. The method can handle the following types of complexities: 1) instructions that use conjunctions to convey complex execution constraints (such as alternative paths of execution, sequential or non-ordering constraints, as well as hierarchical representations) and 2) instructions that use prepositions and multiple adjectives to specify action/object parameters relevant for the task. Specific algorithms have been developed for handling conjunctions, adjectives, and prepositions, as well as for translating the parsed instructions into parameterized executable task representations. The paper describes validation experiments with a PR2 humanoid robot learning new tasks from verbal instruction, as well as an additional range of utterances that can be parsed into executable controllers by the proposed system.
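Since the abstract centers on turning conjunctions into execution constraints (sequential, non-ordering, alternative paths) inside a hierarchical task representation, the hypothetical sketch below shows one way such a structure could look in code. The TaskNode/Step classes and the keyword-based splitting are invented for illustration and are not the paper's parsing algorithm.

```python
# Illustrative sketch only: a toy hierarchical task representation and a toy
# conjunction-driven parser; not the paper's actual algorithm.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Step:
    action: str                                   # e.g. "pick"
    params: dict = field(default_factory=dict)    # adjective/preposition-derived parameters

@dataclass
class TaskNode:
    ordering: str                                 # "sequence", "any" (non-ordering), or "or" (alternatives)
    children: List["TaskNode | Step"] = field(default_factory=list)

def parse(instruction: str) -> TaskNode:
    """Toy conjunction handling: 'then' -> sequential, 'or' -> alternatives, 'and' -> non-ordering."""
    if " then " in instruction:
        return TaskNode("sequence", [parse(p) for p in instruction.split(" then ")])
    if " or " in instruction:
        return TaskNode("or", [parse(p) for p in instruction.split(" or ")])
    if " and " in instruction:
        return TaskNode("any", [parse(p) for p in instruction.split(" and ")])
    verb, *rest = instruction.split()
    return TaskNode("sequence", [Step(verb, {"object": " ".join(rest)})])

print(parse("pick the red cup and pick the blue cup then place them on the tray"))
```

An executor walking this tree would run "sequence" children in order, run "any" children in any order, and choose one branch of an "or" node, which mirrors the constraint types listed in the abstract.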
- Award ID(s): 1757929
- PAR ID: 10211198
- Date Published:
- Journal Name: IEEE-RAS International Conference on Humanoid Robots
- ISSN: 2164-0572
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
            Vivi Nastase; Ellie Pavlick; Mohammad Taher Pilehvar; Jose Camacho-Collados; Alessandro Raganato (Ed.)This paper describes the evolution of the PropBank approach to semantic role labeling over the last two decades. During this time the PropBank frame files have been expanded to include non-verbal predicates such as adjectives, prepositions and multi-word expressions. The number of domains, genres and languages that have been PropBanked has also expanded greatly, creating an opportunity for much more challenging and robust testing of the generalization capabilities of PropBank semantic role labeling systems. We also describe the substantial effort that has gone into ensuring the consistency and reliability of the various annotated datasets and resources, to better support the training and evaluation of such systemsmore » « less
- External representations powerfully support and augment complex human behavior. When navigating, people often consult external representations to help them find the way to go, but do maps or verbal instructions improve spatial knowledge or support effective wayfinding? Here, we examine spatial knowledge with and without external representations in two studies where participants learn a complex virtual environment. In the first study, we asked participants to generate their own maps or verbal instructions, partway through learning. We found no evidence of improved spatial knowledge in a pointing task requiring participants to infer the direction between two targets, either on the same route or on different routes, and no differences between groups in accurately recreating a map of the target landmarks. However, as a methodological note, pointing was correlated with the accuracy of the maps that participants drew. In the second study, participants had access to an accurate map or set of verbal instructions that they could study while learning the layout of target landmarks. Again, we found no evidence of differentially improved spatial knowledge in the pointing task, although we did find that the map group could recreate a map of the target landmarks more accurately. However, overall improvement was high. There was evidence that the nature of improvement across all conditions was specific to initial navigation ability levels. Our findings add to a mixed literature on the role of external representations for navigation and suggest that more substantial intervention (more scaffolding, explicit training, enhanced visualization, perhaps with personalized sequencing) may be necessary to improve navigation ability.
- Robotic systems typically follow a rigid approach to task execution, in which they perform the necessary steps in a specific order, but fail when having to cope with issues that arise during execution. We propose an approach that handles such cases through dialogue and human-robot collaboration. The proposed approach contributes a hierarchical control architecture that 1) autonomously detects and is cognizant of task execution failures, 2) initiates a dialogue with a human helper to obtain assistance, and 3) enables collaborative human-robot task execution through extended dialogue in order to 4) ensure robust execution of hierarchical tasks with complex constraints, such as sequential, non-ordering, and multiple paths of execution. The architecture ensures that the constraints are adhered to throughout the entire task execution, including during failures. The recovery of the architecture from issues during execution is validated by a human-robot team on a building task.
- Nonverbal task learning is defined here as a variant of interactive task learning in which an agent learns the definition of a new task without any verbal information such as task instructions. Instead, the agent must 1) learn the task definition using only a single solved example problem as its training input, and then 2) generalize this definition in order to successfully parse new problems. In this paper, we present a conceptual framework for nonverbal task learning, and we compare and contrast this type of learning with existing learning paradigms in AI. We also discuss nonverbal task learning in the context of nonverbal human intelligence tests, which are standardized tests designed to be given without any verbal instructions so that they can be used by people with language difficulties.