skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: ArraMon: A Joint Navigation-Assembly Instruction Interpretation Task in Dynamic Environments
For embodied agents, navigation is an important ability but not an isolated goal. Agents are also expected to perform specific tasks after reaching the target location, such as picking up objects and assembling them into a particular arrangement. We combine Vision-andLanguage Navigation, assembling of collected objects, and object referring expression comprehension, to create a novel joint navigation-and-assembly task, named ARRAMON. During this task, the agent (similar to a PokeMON GO player) is asked to find and collect different target objects one-by-one by navigating based on natural language (English) instructions in a complex, realistic outdoor environment, but then also ARRAnge the collected objects part-by-part in an egocentric grid-layout environment. To support this task, we implement a 3D dynamic environment simulator and collect a dataset with human-written navigation and assembling instructions, and the corresponding ground truth trajectories. We also filter the collected instructions via a verification stage, leading to a total of 7.7K task instances (30.8K instructions and paths). We present results for several baseline models (integrated and biased) and metrics (nDTW, CTC, rPOD, and PTC), and the large model-human performance gap demonstrates that our task is challenging and presents a wide scope for future work.  more » « less
Award ID(s):
1840131
PAR ID:
10301946
Author(s) / Creator(s):
; ; ; ;
Date Published:
Journal Name:
Findings of the Association for Computational Linguistics: EMNLP 2020
Page Range / eLocation ID:
3910 to 3927
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Most existing benchmarks for grounding language in interactive environments either lack realistic linguistic elements, or prove difficult to scale up due to substantial human involvement in the collection of data or feedback signals. We develop WebShop – a simulated e-commerce website environment with 1.18 million real-world products and 12,087 crowd-sourced text instructions. In this environment, an agent needs to navigate multiple types of webpages and issue diverse actions to find, customize, and purchase a product given an instruction. WebShop provides several challenges including understanding compositional instructions, query (re-)formulation, dealing with noisy text in webpages, and performing strategic exploration. We collect over 1,600 human trajectories to first validate the benchmark, then train and evaluate a diverse range of agents using reinforcement learning, imitation learning, and pre-trained image and language models. Our best model achieves a task success rate of 29%, which significantly outperforms rule heuristics but is far lower than expert human performance (59%). We also analyze agent and human trajectories and ablate various model components to provide insights for developing future agents with stronger language understanding and decision making abilities. Finally, we show our agent trained on WebShop exhibits non-trivial sim-to-real transfer when evaluated on amazon.com and ebay.com, indicating the potential value of our benchmark for developing practical web agents that can operate in the wild. 
    more » « less
  2. null (Ed.)
    Many real-world tasks require agents to coordinate their behavior to achieve shared goals. Successful collaboration requires not only adopting the same communicative conventions, but also grounding these conventions in the same task-appropriate conceptual abstractions. We investigate how humans use natural language to collaboratively solve physical assembly problems more effectively over time. Human participants were paired up in an online environment to reconstruct scenes containing two block towers. One participant could see the target towers, and sent assembly instructions for the other participant to reconstruct. Participants provided increasingly concise instructions across repeated attempts on each pair of towers, using more abstract referring expressions that captured each scene's hierarchical structure. To explain these findings, we extend recent probabilistic models of ad hoc convention formation with an explicit perceptual learning mechanism. These results shed light on the inductive biases that enable intelligent agents to coordinate upon shared procedural abstractions. 
    more » « less
  3. null (Ed.)
    Interest in physical therapy and individual exercises such as yoga/dance has increased alongside the well-being trend, and people globally enjoy such exercises at home/office via video streaming platforms. However, such exercises are hard to follow without expert guidance. Even if experts can help, it is almost impossible to give personalized feedback to every trainee remotely. Thus, automated pose correction systems are required more than ever, and we introduce a new captioning dataset named FixMyPose to address this need. We collect natural language descriptions of correcting a “current” pose to look like a “target” pose. To support a multilingual setup, we collect descriptions in both English and Hindi. The collected descriptions have interesting linguistic properties such as egocentric relations to the environment objects, analogous references, etc., requiring an understanding of spatial relations and commonsense knowledge about postures. Further, to avoid ML biases, we maintain a balance across characters with diverse demographics, who perform a variety of movements in several interior environments (e.g., homes, offices). From our FixMyPose dataset, we introduce two tasks: the pose-correctional-captioning task and its reverse, the target-pose-retrieval task. During the correctional-captioning task, models must generate the descriptions of how to move from the current to the target pose image, whereas in the retrieval task, models should select the correct target pose given the initial pose and the correctional description. We present strong cross-attention baseline models (uni/multimodal, RL, multilingual) and also show that our baselines are competitive with other models when evaluated on other image-difference datasets. We also propose new task-specific metrics (object-match, body-part-match, direction-match) and conduct human evaluation for more reliable evaluation, and we demonstrate a large human-model performance gap suggesting room for promising future work. Finally, to verify the sim-to-real transfer of our FixMyPose dataset, we collect a set of real images and show promising performance on these images. Data and code are available: https://fixmypose-unc.github.io. 
    more » « less
  4. Abstract External representations powerfully support and augment complex human behavior. When navigating, people often consult external representations to help them find the way to go, but do maps or verbal instructions improve spatial knowledge or support effective wayfinding? Here, we examine spatial knowledge with and without external representations in two studies where participants learn a complex virtual environment. In the first study, we asked participants to generate their own maps or verbal instructions, partway through learning. We found no evidence of improved spatial knowledge in a pointing task requiring participants to infer the direction between two targets, either on the same route or on different routes, and no differences between groups in accurately recreating a map of the target landmarks. However, as a methodological note, pointing was correlated with the accuracy of the maps that participants drew. In the second study, participants had access to an accurate map or set of verbal instructions that they could study while learning the layout of target landmarks. Again, we found no evidence of differentially improved spatial knowledge in the pointing task, although we did find that the map group could recreate a map of the target landmarks more accurately. However, overall improvement was high. There was evidence that the nature of improvement across all conditions was specific to initial navigation ability levels. Our findings add to a mixed literature on the role of external representations for navigation and suggest that more substantial intervention—more scaffolding, explicit training, enhanced visualization, perhaps with personalized sequencing—may be necessary to improve navigation ability. 
    more » « less
  5. Humans often use natural language instructions to control and interact with robots for task execution. This poses a big challenge to robots that need to not only parse and understand human instructions but also realise semantic understanding of an unknown environment and its constituent elements. To address this challenge, this study presents a vision-language model (VLM)-driven approach to scene understanding of an unknown environment to enable robotic object manipulation. Given language instructions, a pretrained vision-language model built on open-sourced Llama2-chat (7B) as the language model backbone is adopted for image description and scene understanding, which translates visual information into text descriptions of the scene. Next, a zero-shot-based approach to fine-grained visual grounding and object detection is developed to extract and localise objects of interest from the scene task. Upon 3D reconstruction and pose estimate establishment of the object, a code-writing large language model (LLM) is adopted to generate high-level control codes and link language instructions with robot actions for downstream tasks. The performance of the developed approach is experimentally validated through table-top object manipulation by a robot. 
    more » « less