Language is compositional; an instruction can express multiple relation constraints to hold among objects in a scene that a robot is tasked to rearrange. Our focus in this work is an instructable scene-rearranging framework that generalizes to longer instructions and to spatial concept compositions never seen at training time. We propose to represent language-instructed spatial concepts with energy functions over relative object arrangements. A language parser maps instructions to corresponding energy functions and an open-vocabulary visual-language model grounds their arguments to relevant objects in the scene. We generate goal scene configurations by gradient descent on the sum of energy functions, one per language predicate in the instruction. Local vision-based policies then re-locate objects to the inferred goal locations. We test our model on established instruction-guided manipulation benchmarks, as well as benchmarks of compositional instructions we introduce. We show our model can execute highly compositional instructions zero-shot in simulation and in the real world. It outperforms language-to-action reactive policies and Large Language Model planners by a large margin, especially for long instructions that involve compositions of multiple spatial concepts. Simulation and real-world robot execution videos, as well as our code and datasets, are publicly available on our website: https://ebmplanner.github.io.
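A minimal sketch of the goal-generation idea described above, assuming numpy and purely hypothetical energy terms and object groundings (this is an illustration, not the authors' implementation): each spatial predicate contributes one energy function over 2D object positions, and a goal arrangement is obtained by gradient descent on their sum.

```python
import numpy as np

# Hypothetical energy terms over 2D object positions; the real system grounds
# predicate arguments with a visual-language model rather than fixed indices.
def energy_left_of(pos, a, b, margin=0.1):
    # Low when object a sits at least `margin` to the left of object b.
    return max(0.0, pos[a][0] - pos[b][0] + margin) ** 2

def energy_near(pos, a, b, dist=0.2):
    # Low when objects a and b are roughly `dist` apart.
    return (np.linalg.norm(pos[a] - pos[b]) - dist) ** 2

def total_energy(pos, terms):
    # One energy term per language predicate in the instruction.
    return sum(fn(pos, *args) for fn, *args in terms)

def descend(pos, terms, steps=500, lr=0.05, eps=1e-4):
    # Numerical gradient descent on the summed energy over all coordinates.
    for _ in range(steps):
        grad = np.zeros_like(pos)
        for i in range(pos.shape[0]):
            for j in range(pos.shape[1]):
                bump = np.zeros_like(pos)
                bump[i, j] = eps
                grad[i, j] = (total_energy(pos + bump, terms)
                              - total_energy(pos - bump, terms)) / (2 * eps)
        pos = pos - lr * grad
    return pos

# "Put the mug left of the bowl and near the plate";
# object indices 0=mug, 1=bowl, 2=plate are a hypothetical grounding.
positions = np.random.rand(3, 2)
constraints = [(energy_left_of, 0, 1), (energy_near, 0, 2)]
goal = descend(positions, constraints)
print(goal)
```

In this simplified form, adding another predicate to the instruction just appends another term to `constraints`, which is what makes the formulation compose across longer instructions.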
Learning of Complex-Structured Tasks from Verbal Instruction
This paper presents a novel approach to robot task learning from language-based instructions, focused on increasing the complexity of task representations that can be taught through verbal instruction. The major proposed contribution is a framework for directly mapping a complex verbal instruction to an executable task representation from a single training experience. The method can handle the following types of complexity: 1) instructions that use conjunctions to convey complex execution constraints (such as alternative paths of execution, sequential or non-ordering constraints, as well as hierarchical representations) and 2) instructions that use prepositions and multiple adjectives to specify action/object parameters relevant to the task. Specific algorithms have been developed for handling conjunctions, adjectives, and prepositions, as well as for translating the parsed instructions into parameterized executable task representations. The paper describes validation experiments with a PR2 humanoid robot learning new tasks from verbal instruction, along with the additional range of utterances that the proposed system can parse into executable controllers.
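As a hedged illustration of the conjunction handling described above (a toy simplification, not the paper's parser; the connective mapping and class names are assumptions), the sketch below turns connectives such as "then", "and", and "or" into hierarchical task nodes carrying sequential, non-ordering, or alternative-path constraints.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class TaskNode:
    kind: str                      # "primitive", "sequence", "unordered", or "alternative"
    label: str = ""
    children: List["TaskNode"] = field(default_factory=list)

def parse_instruction(text: str) -> TaskNode:
    # Hypothetical connective handling: "then" -> ordered steps, "and" ->
    # non-ordering constraint, "or" -> alternative paths of execution.
    for token, kind in ((" then ", "sequence"), (" and ", "unordered"), (" or ", "alternative")):
        if token in text:
            parts = text.split(token)
            return TaskNode(kind, text, [parse_instruction(p) for p in parts])
    return TaskNode("primitive", text.strip())

tree = parse_instruction("pick up the red cup and wipe the table then place the cup on the shelf")
print(tree.kind, [child.kind for child in tree.children])
```

Here the "then" clause yields an ordered sequence whose first child is itself an unordered conjunction of two primitive actions, so the resulting tree already carries both sequential and non-ordering constraints.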
- Award ID(s): 1757929
- PAR ID: 10211198
- Date Published:
- Journal Name: IEEE-RAS International Conference on Humanoid Robots
- ISSN: 2164-0572
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
- Vivi Nastase; Ellie Pavlick; Mohammad Taher Pilehvar; Jose Camacho-Collados; Alessandro Raganato (Ed.) This paper describes the evolution of the PropBank approach to semantic role labeling over the last two decades. During this time the PropBank frame files have been expanded to include non-verbal predicates such as adjectives, prepositions and multi-word expressions. The number of domains, genres and languages that have been PropBanked has also expanded greatly, creating an opportunity for much more challenging and robust testing of the generalization capabilities of PropBank semantic role labeling systems. We also describe the substantial effort that has gone into ensuring the consistency and reliability of the various annotated datasets and resources, to better support the training and evaluation of such systems.
- External representations powerfully support and augment complex human behavior. When navigating, people often consult external representations to help them find the way to go, but do maps or verbal instructions improve spatial knowledge or support effective wayfinding? Here, we examine spatial knowledge with and without external representations in two studies where participants learn a complex virtual environment. In the first study, we asked participants to generate their own maps or verbal instructions, partway through learning. We found no evidence of improved spatial knowledge in a pointing task requiring participants to infer the direction between two targets, either on the same route or on different routes, and no differences between groups in accurately recreating a map of the target landmarks. However, as a methodological note, pointing was correlated with the accuracy of the maps that participants drew. In the second study, participants had access to an accurate map or set of verbal instructions that they could study while learning the layout of target landmarks. Again, we found no evidence of differentially improved spatial knowledge in the pointing task, although we did find that the map group could recreate a map of the target landmarks more accurately. However, overall improvement was high. There was evidence that the nature of improvement across all conditions was specific to initial navigation ability levels. Our findings add to a mixed literature on the role of external representations for navigation and suggest that more substantial intervention (more scaffolding, explicit training, enhanced visualization, perhaps with personalized sequencing) may be necessary to improve navigation ability.
- Robotic systems typically follow a rigid approach to task execution, in which they perform the necessary steps in a specific order, but fail when having to cope with issues that arise during execution. We propose an approach that handles such cases through dialogue and human-robot collaboration. The proposed approach contributes a hierarchical control architecture that 1) autonomously detects and is cognizant of task execution failures, 2) initiates a dialogue with a human helper to obtain assistance, and 3) enables collaborative human-robot task execution through extended dialogue in order to 4) ensure robust execution of hierarchical tasks with complex constraints, such as sequential, non-ordering, and multiple paths of execution. The architecture ensures that the constraints are adhered to throughout the entire task execution, including during failures. The recovery of the architecture from issues during execution is validated by a human-robot team on a building task.
- This paper addresses the problem of dynamic allocation of robot resources to tasks with hierarchical representations and multiple types of execution constraints, with the goal of enabling single-robot multitasking capabilities. Although the vast majority of robot platforms are equipped with more than one sensor (cameras, lasers, sonars) and several actuators (wheels/legs, two arms), which would in principle allow the robot to concurrently work on multiple tasks, existing methods are limited to allocating robots in their entirety to only one task at a time. This approach employs only a subset of a robot's sensors and actuators, leaving other robot resources unused. Our aim is to enable a robot to make full use of its capabilities by having an individual robot multitask, distributing its sensors and actuators to multiple concurrent activities. We propose a new architectural framework based on Hierarchical Task Trees that supports multitasking through a new representation of robot behaviors that explicitly encodes the robot resources (sensors and actuators) and the environmental conditions needed for execution. This architecture was validated on a two-arm, mobile, PR2 humanoid robot, performing tasks with multiple types of execution constraints.
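The last item above hinges on resource-aware scheduling; the following is a minimal sketch of that idea under assumed behavior and resource names (not the proposed Hierarchical Task Tree architecture): each behavior declares the sensors and actuators it needs, and behaviors run concurrently only when their resource sets do not overlap.

```python
# Hypothetical resource names for a two-arm mobile robot.
ACTIVE: set = set()   # resources currently claimed by running behaviors

def start(name: str, required: set) -> bool:
    # A behavior may start only if none of its resources are already in use.
    if not ACTIVE.isdisjoint(required):
        print(f"defer '{name}': busy resources {sorted(ACTIVE & required)}")
        return False
    ACTIVE.update(required)
    print(f"start '{name}' using {sorted(required)}")
    return True

def finish(required: set) -> None:
    # Release a behavior's resources so deferred behaviors can be retried.
    ACTIVE.difference_update(required)

start("hand over tool", {"right_arm", "head_camera"})
start("press elevator button", {"left_arm"})              # concurrent: disjoint resources
start("pick up large box", {"right_arm", "left_arm"})     # deferred: both arms are busy
```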
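In the same spirit, the dialogue-based recovery architecture summarized in the third related item above might be illustrated as follows; the step names and the single failing step are invented, and the real architecture handles hierarchical constraints rather than a flat sequence of steps.

```python
def execute_step(step: str) -> bool:
    # Placeholder robot skill; returns False to simulate a detected failure.
    print(f"executing: {step}")
    return step != "attach beam"

def ask_human_for_help(step: str) -> bool:
    # Stand-in for the dialogue component: request assistance from a helper.
    print(f"robot: I could not '{step}'. Could you help me with this step?")
    return True  # assume the helper resolves the issue

def run_task(ordered_steps):
    for step in ordered_steps:
        if execute_step(step):
            continue
        # Failure detected: switch to collaborative execution via dialogue,
        # then continue so the sequencing constraint is still respected.
        if not ask_human_for_help(step):
            raise RuntimeError(f"task aborted at step '{step}'")
    print("task completed with all ordering constraints satisfied")

run_task(["fetch beam", "attach beam", "fasten bolts"])
```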