Title: Can Foundation Models Watch, Talk and Guide You Step by Step to Make a Cake?
Despite tremendous advances in AI, it remains a significant challenge to develop interactive task guidance systems that can offer situated, personalized guidance and assist humans in various tasks. These systems need a sophisticated understanding of the user as well as the environment, and must make timely, accurate decisions on when and what to say. To address this issue, we created a new multimodal benchmark dataset, Watch, Talk and Guide (WTaG), based on natural interaction between a human user and a human instructor. We further proposed two tasks: User and Environment Understanding, and Instructor Decision Making. We leveraged several foundation models to study to what extent these models can be quickly adapted to perceptually enabled task guidance. Our quantitative, qualitative, and human evaluation results show that these models can demonstrate fair performance in some cases with no task-specific training, but fast and reliable adaptation remains a significant challenge. Our benchmark and baselines will provide a stepping stone for future work on situated task guidance.
Award ID(s): 1949634
PAR ID: 10472541
Author(s) / Creator(s):
Publisher / Repository: Findings of EMNLP 2023
Date Published:
Journal Name: Findings of Empirical Methods in Natural Language Processing
Format(s): Medium: X
Location: Singapore
Sponsoring Org: National Science Foundation
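As a concrete illustration of the Instructor Decision Making task described in the abstract above, the sketch below frames one time step as zero-shot prompting of a foundation model: given a caption of the current video frame, the active recipe step, and recent dialogue, the model decides whether to stay silent or what to say next. This is a minimal sketch under assumed interfaces, not the paper's pipeline; `caption_frame`, `query_llm`, the prompt wording, and the "WAIT" convention are hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical stand-ins for a vision captioner and an instruction-following LLM;
# the WTaG paper's actual model choices and prompts may differ.
def caption_frame(frame) -> str:
    """Return a short text description of the current egocentric video frame."""
    raise NotImplementedError  # e.g. an off-the-shelf image captioner

def query_llm(prompt: str) -> str:
    """Return the raw text completion from a foundation model."""
    raise NotImplementedError  # e.g. a hosted chat-completion API

@dataclass
class GuidanceTurn:
    speaker: str   # "user" or "instructor"
    text: str

def decide_instructor_move(frame, recipe_step: str, history: List[GuidanceTurn]) -> str:
    """Zero-shot 'when and what to say' decision for one time step."""
    scene = caption_frame(frame)
    dialogue = "\n".join(f"{t.speaker}: {t.text}" for t in history[-6:])
    prompt = (
        "You are a cooking instructor guiding a user step by step.\n"
        f"Current recipe step: {recipe_step}\n"
        f"What the camera sees: {scene}\n"
        f"Recent dialogue:\n{dialogue}\n"
        "Should the instructor speak now? Answer 'WAIT' or give the next utterance."
    )
    reply = query_llm(prompt).strip()
    return "" if reply.upper().startswith("WAIT") else reply
```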
More Like this
  1. We introduce a novel vision-and-language navigation (VLN) task of learning to provide real-time guidance to a blind follower situated in complex dynamic navigation scenarios. Towards exploring real-time information needs and fundamental challenges in our novel modeling task, we first collect a multi-modal real-world benchmark with in-situ Orientation and Mobility (O&M) instructional guidance. Subsequently, we leverage the real-world study to inform the design of a larger-scale simulation benchmark, thus enabling comprehensive analysis of limitations in current VLN models. Motivated by how sighted O&M guides seamlessly and safely support the awareness of individuals with visual impairments when collaborating on navigation tasks, we present ASSISTER, an imitation-learned agent that can embody such effective guidance. The proposed assistive VLN agent is conditioned on navigational goals and commands for generating instructional sentences that are coherent with the surrounding visual scene, while also carefully accounting for the immediate assistive navigation task. Altogether, our introduced evaluation and training framework takes a step towards scalable development of the next generation of seamless, human-like assistive agents. 
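The imitation-learning setup sketched in the ASSISTER abstract above, generating guidance sentences conditioned on the visual scene and the navigation goal, reduces at its core to a teacher-forced likelihood objective over demonstrated utterances. The following is a minimal sketch of that objective under assumed interfaces; `GuidanceStep` and `token_log_prob` are hypothetical stand-ins, not the ASSISTER architecture.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

@dataclass
class GuidanceStep:
    """One time step from a sighted-guide demonstration (hypothetical schema)."""
    view: object          # current visual observation (e.g. an image or its features)
    goal: str             # navigation goal/command, e.g. "guide the user to the exit"
    utterance: List[str]  # tokenized guidance sentence the human guide produced

def behaviour_cloning_epoch(
    steps: Sequence[GuidanceStep],
    token_log_prob: Callable[[object, str, List[str], str], float],
) -> float:
    """One pass of the imitation objective: maximise the log-likelihood of the
    demonstrated guidance tokens given the visual scene and the goal.
    token_log_prob(view, goal, prefix, token) is a hypothetical conditional
    language-model scorer; the real model internals are not specified here."""
    total = 0.0
    for step in steps:
        for i, token in enumerate(step.utterance):
            total += token_log_prob(step.view, step.goal, step.utterance[:i], token)
    return -total / max(len(steps), 1)   # mean negative log-likelihood to minimise
```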
  2. Collaborative tasks often begin with partial task knowledge and incomplete initial plans from each partner. To complete these tasks, agents need to engage in situated communication with their partners and coordinate their partial plans towards a complete plan to achieve a joint task goal. While such collaboration seems effortless in a human-human team, it is highly challenging for human-AI collaboration. To address this limitation, this paper takes a step towards collaborative plan acquisition, where humans and agents strive to learn and communicate with each other to acquire a complete plan for joint tasks. Specifically, we formulate a novel problem for agents to predict the missing task knowledge for themselves and for their partners based on rich perceptual and dialogue history. We extend a situated dialogue benchmark for symmetric collaborative tasks in a 3D blocks world and investigate computational strategies for plan acquisition. Our empirical results suggest that predicting the partner's missing knowledge is a more viable approach than predicting one's own. We show that explicit modeling of the partner's dialogue moves and mental states produces improved and more stable results than modeling that omits them. These results provide insight for future AI agents that can predict what knowledge their partner is missing and, therefore, can proactively communicate that information to help their partner acquire the missing knowledge toward a common understanding of joint tasks.
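The plan-acquisition problem in the abstract above, predicting the partner's missing task knowledge from dialogue and perception, can be read as multi-label classification over candidate plan steps. Below is a minimal interface sketch under assumed inputs; the `score_missing` scorer, the flat string context, and the threshold are hypothetical, not the paper's method.

```python
from typing import Callable, Sequence, Set

def predict_partner_missing_knowledge(
    own_plan: Set[str],
    dialogue_history: Sequence[str],
    candidate_steps: Sequence[str],
    score_missing: Callable[[str, str], float],
    threshold: float = 0.5,
) -> Set[str]:
    """Multi-label prediction of which task steps the partner does not yet know.
    score_missing(step, context) is a hypothetical learned scorer over the agent's
    own plan plus the dialogue/perceptual history; the threshold is an assumption."""
    context = (
        "own plan: " + "; ".join(sorted(own_plan))
        + " || dialogue: " + " ".join(dialogue_history)
    )
    return {step for step in candidate_steps if score_missing(step, context) >= threshold}
```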
  3. Enabling efficient communication in artificial agents brings us closer to machines that can cooperate with each other and with human partners. Hand-engineered approaches have substantial limitations, leading to increased interest in methods for communication to emerge autonomously between artificial agents. Most of the research in the field explores unsituated communication in one-step referential tasks. The tasks are not temporally interactive and lack time pressures typically present in natural communication and language learning. In these settings, agents can successfully learn what to communicate but not when or whether to communicate. Here, we extend the literature by assessing emergence of communication between reinforcement learning agents in a temporally interactive, cooperative task of navigating a gridworld environment. We show that, through multi-step interactions, agents develop just-in-time messaging protocols that enable them to successfully solve the task. With memory—which provides flexibility around message timing—agent pairs converge to a look-ahead communication protocol, finding an optimal solution to the task more quickly than without memory. Lastly, we explore situated communication, enabling the acting agent to choose when and whether to communicate. With the opportunity cost of forgoing an action to communicate, the acting agent learns to solicit information sparingly, in line with the Gricean Maxim of quantity. Our results point towards the importance of studying language emergence through situated communication in multi-step interactions. 
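The key design point in the third abstract, that a situated agent must choose between acting and communicating and therefore pays an opportunity cost for every message, can be captured with a very small action-space sketch. The vocabulary, action names, and `policy` interface below are assumptions for illustration, not the paper's gridworld environment.

```python
from typing import Callable, List, Tuple

MOVES = ["up", "down", "left", "right"]
VOCAB = ["0", "1"]                                 # tiny discrete message set (assumed)
ACTIONS = MOVES + [f"say:{m}" for m in VOCAB]      # the policy picks one entry per step

def acting_agent_step(
    policy: Callable[[object, List[str]], str],    # hypothetical learned policy over ACTIONS
    observation: object,
    inbox: List[str],
) -> Tuple[str, List[str]]:
    """One situated-communication step: the agent either moves or sends a message,
    never both, so every message forgoes an action. That opportunity cost is what
    pushes learned protocols toward sparing, just-in-time messaging."""
    action = policy(observation, inbox)
    if action.startswith("say:"):
        # forgo movement this step and deliver the message to the partner's inbox
        return "no-op", inbox + [action.split(":", 1)[1]]
    return action, list(inbox)
```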
  4. Understanding what sequence of steps is needed to complete a goal can help artificial intelligence systems reason about human activities. Past work in NLP has examined the task of goal-step inference for text. We introduce the visual analogue. We propose the Visual Goal-Step Inference (VGSI) task, where a model is given a textual goal and must choose which of four images represents a plausible step towards that goal. With a new dataset harvested from wikiHow consisting of 772,277 images representing human actions, we show that our task is challenging for state-of-the-art multimodal models. Moreover, the multimodal representation learned from our data can be effectively transferred to other datasets like HowTo100m, increasing the VGSI accuracy by 15-20%. Our task will facilitate multimodal reasoning about procedural events.
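The VGSI task above has a natural zero-shot baseline: embed the textual goal and each of the four candidate step images in a shared space and pick the closest image. The sketch below only fixes that scoring recipe; `embed_text` and `embed_image` are hypothetical encoders (e.g. the two towers of a CLIP-style model), not the specific models evaluated in the paper.

```python
import numpy as np

def embed_text(goal: str) -> np.ndarray:
    """Hypothetical text encoder mapping a goal string to a vector."""
    raise NotImplementedError

def embed_image(image) -> np.ndarray:
    """Hypothetical image encoder sharing the embedding space with embed_text."""
    raise NotImplementedError

def vgsi_choice(goal: str, candidate_images) -> int:
    """Pick which of the candidate images best matches the goal:
    cosine similarity in the shared embedding space, highest score wins."""
    g = embed_text(goal)
    g = g / np.linalg.norm(g)
    scores = []
    for img in candidate_images:
        v = embed_image(img)
        scores.append(float(g @ (v / np.linalg.norm(v))))
    return int(np.argmax(scores))
```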
  5. Despite the phenomenal advances in the computational power and functionality of electronic systems, human-machine interaction has largely been limited to simple control panels, keyboards, mice, and displays. Consequently, these systems either rely critically on close human guidance or operate almost independently from the user. An exemplar technology integrated tightly into our lives is the smartphone. However, the term “smart” is a misnomer, since it fundamentally has no intelligence to understand its user. The users still have to type, touch, or speak (to some extent) to express their intentions in a form accessible to the phone. Hence, intelligent decision making is still almost entirely a human task. A life-changing experience can be achieved by transforming machines from passive tools into agents capable of understanding human physiology and what their user wants [1]. This can advance human capabilities in unimagined ways by building a symbiotic relationship to solve real-world problems cooperatively. One of the high-impact application areas of this approach is assistive internet of things (IoT) technologies for physically challenged individuals. The Annual World Report on Disability reveals that 15% of the world population lives with disability, while 110 to 190 million of these people have difficulty in functioning [1]. Quality of life for this population can improve significantly if we can provide accessibility to smart devices, which provide sensory inputs and assist with everyday tasks. This work demonstrates that smart IoT devices open up the possibility of alleviating the burden on the user by equipping everyday objects, such as a wheelchair, with decision-making capabilities. Moving part of the intelligent decision making to smart IoT objects requires a robust mechanism for human-machine communication (HMC). To address this challenge, we present examples of multimodal HMC mechanisms, where the modalities are electroencephalogram (EEG), speech commands, and motion sensing. We also introduce an IoT co-simulation framework developed using a network simulator (OMNeT++) and a robot simulation platform, the Virtual Robot Experimentation Platform (V-REP). We show how this framework is used to evaluate the effectiveness of different HMC strategies using automated indoor navigation as a driver application.
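The multimodal human-machine communication idea in the last abstract, combining EEG, speech, and motion-sensing inputs to drive a smart wheelchair, can be illustrated with a simple command-arbitration sketch. The confidence-weighted vote, the modality schema, and the 0.6 threshold below are assumptions for illustration; the paper's actual HMC strategies and its OMNeT++/V-REP co-simulation are not reproduced here.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class ModalityCommand:
    """A decoded command from one input channel (EEG, speech, or motion sensing)."""
    source: str        # "eeg" | "speech" | "motion"
    command: str       # e.g. "forward", "left", "right", "stop"
    confidence: float  # decoder confidence in [0, 1]

def fuse_commands(inputs: List[ModalityCommand], min_confidence: float = 0.6) -> Optional[str]:
    """Confidence-weighted vote across modalities; returns None if no input is
    trustworthy enough, so the wheelchair simply holds its current state."""
    votes: Dict[str, float] = {}
    for cmd in inputs:
        if cmd.confidence >= min_confidence:
            votes[cmd.command] = votes.get(cmd.command, 0.0) + cmd.confidence
    return max(votes, key=votes.get) if votes else None
```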