Title: IHSR: A Framework Enables Robots to Learn Novel Hand Signals From a Few Samples
This project introduces a framework that enables robots to recognize human hand signals, a reliable, device-free means of communication in noisy environments such as construction sites and airport ramps, to facilitate efficient human-robot collaboration. Various hand signal systems are used by small groups for specific purposes, such as marshalling on airport ramps and crane operation on construction sites. Robots must be robust to unpredictable conditions, including varied backgrounds and human appearances, a severe challenge imposed by open environments. To address these challenges, we propose Instant Hand Signal Recognition (IHSR), a learning-based framework with embedded world knowledge of human gestures that allows robots to learn novel hand signals from a few samples. It also offers robust zero-shot generalization for recognizing learned signals in novel scenarios. Extensive experiments show that IHSR can learn a novel hand signal from only 50 samples, more than 30 times more efficient than the state-of-the-art method, and that a learned model can be deployed in unseen environments to recognize hand signals from unseen human users.
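The paper's code is not included in this record; the sketch below only illustrates the general few-shot recognition pattern the abstract describes: a frozen, pretrained encoder (standing in for IHSR's embedded world knowledge of gestures) plus a nearest-prototype classifier fit from roughly 50 support samples per novel signal. The ResNet backbone and all names are illustrative assumptions, not the published IHSR architecture.

```python
# Hedged sketch: few-shot hand-signal learning with a frozen embedding
# (a stand-in for a gesture-aware "world knowledge" backbone) and a
# nearest-class-mean (prototype) classifier. Illustrative only.
import torch
import torchvision.models as models

backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()  # expose the 512-d embedding
backbone.eval()

@torch.no_grad()
def embed(images: torch.Tensor) -> torch.Tensor:
    """Map images (N, 3, 224, 224) to L2-normalized embeddings."""
    return torch.nn.functional.normalize(backbone(images), dim=-1)

@torch.no_grad()
def learn_signal_prototypes(support_images, support_labels, num_classes):
    """Average the ~50 support embeddings per novel hand signal."""
    z = embed(support_images)
    protos = torch.stack([z[support_labels == c].mean(0)
                          for c in range(num_classes)])
    return torch.nn.functional.normalize(protos, dim=-1)

@torch.no_grad()
def recognize(query_images, prototypes):
    """Inference: nearest prototype by cosine similarity."""
    return (embed(query_images) @ prototypes.T).argmax(dim=-1)
```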
Award ID(s):
2245607
PAR ID:
10598004
Author(s) / Creator(s):
Publisher / Repository:
IEEE
Date Published:
ISSN:
2159-6255
ISBN:
979-8-3503-5536-9
Page Range / eLocation ID:
959 to 965
Subject(s) / Keyword(s):
Zero-shot Generalization, Gesture Recognition, Hand signals, Cross-modality Embedding
Format(s):
Medium: X
Location:
Boston, MA, USA
Sponsoring Org:
National Science Foundation
More Like this
  1. Rapid advances in Deep Learning (DL) have made it a promising technique for various autonomous robotic systems. Recently, researchers have explored deploying DL models, such as Reinforcement Learning and Imitation Learning, to enable robots to perform Radio-Frequency Identification (RFID)-based inventory tasks. However, existing methods either focus on a single field or need tremendous data and time to train. To address these problems, this paper presents a Cross-Modal Reasoning Model (CMRM), designed to extract high-dimensional information from multiple sensors and to reason over spatial and historical features for latent cross-modal relations. Furthermore, CMRM aligns the learned tasking policy to high-level features to offer zero-shot generalization to unseen environments. We conduct extensive experiments in several virtual environments as well as in indoor settings with robots performing RFID inventory. The experimental results demonstrate that the proposed CMRM improves learning efficiency by around 20 times. It also demonstrates robust zero-shot generalization, deploying a learned policy in unseen environments to perform RFID inventory tasks successfully.
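The abstract does not include code; a minimal sketch of the fusion-and-reasoning pattern it describes might look like the following, where per-sensor encoders feed an attention-based fusion step and a recurrent core handles historical features. Module names and dimensions are assumptions, not the published CMRM architecture.

```python
# Hedged sketch of cross-modal reasoning in the spirit of CMRM:
# project each modality, fuse via attention, reason over history
# with a GRU, and emit a task policy. Illustrative only.
import torch
import torch.nn as nn

class CrossModalReasoner(nn.Module):
    def __init__(self, rgb_dim=512, rfid_dim=64, hidden=256, n_actions=6):
        super().__init__()
        self.rgb_proj = nn.Linear(rgb_dim, hidden)    # camera features
        self.rfid_proj = nn.Linear(rfid_dim, hidden)  # RFID signal features
        self.fuse = nn.MultiheadAttention(hidden, num_heads=4,
                                          batch_first=True)
        self.core = nn.GRU(hidden, hidden, batch_first=True)  # history
        self.policy = nn.Linear(hidden, n_actions)

    def forward(self, rgb_feats, rfid_feats, hidden_state=None):
        # Treat the two modalities as a 2-token sequence; attention
        # captures latent cross-modal relations between them.
        tokens = torch.stack(
            [self.rgb_proj(rgb_feats), self.rfid_proj(rfid_feats)], dim=1)
        fused, _ = self.fuse(tokens, tokens, tokens)
        out, hidden_state = self.core(fused.mean(dim=1, keepdim=True),
                                      hidden_state)
        return self.policy(out.squeeze(1)), hidden_state
```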
  2. Hand signals are the most widely used, feasible, and device-free communication method in manufacturing plants, on airport ramps, and in other noisy or voice-prohibiting environments. Enabling IoT agents, such as robots, to recognize and communicate by hand signals will facilitate human-machine collaboration for the emerging "Industry 5.0." While many prior works succeed at hand signal recognition, few can rigorously guarantee the accuracy of their predictions. This project proposes a method that builds on the theory of conformal prediction (CP) to provide statistical guarantees on hand signal recognition accuracy and, based on it, to measure the uncertainty in this communication process. It uses a calibration set with a few representative samples to ensure that trained models produce a conformal prediction set whose trustworthiness reaches or exceeds a user-specified level. The uncertainty in the recognition process can then be detected by measuring the size of the conformal prediction set. Furthermore, the proposed CP-based method can be used with IoT models without fine-tuning, making it a promising, lightweight, out-of-the-box approach to modeling uncertainty. Our experiments show that the proposed conformal recognition method achieves accurate hand signal prediction in novel scenarios. At an error level of α = 0.10, it provided 100% accuracy on out-of-distribution test sets.
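The calibration-then-predict recipe the abstract describes is the standard split conformal procedure; a generic sketch follows. This is the textbook recipe, not necessarily the paper's exact implementation, and the score function is an assumption.

```python
# Hedged sketch: split conformal prediction for a K-class classifier.
# A small calibration set yields prediction sets with coverage
# >= 1 - alpha; the set size serves as an uncertainty measure.
import numpy as np

def calibrate(cal_probs: np.ndarray, cal_labels: np.ndarray,
              alpha: float = 0.10) -> float:
    """cal_probs: (n, K) softmax outputs; returns the conformal threshold."""
    n = len(cal_labels)
    # Nonconformity score: 1 - probability assigned to the true class.
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    q_level = np.ceil((n + 1) * (1 - alpha)) / n
    return np.quantile(scores, min(q_level, 1.0), method="higher")

def predict_set(test_probs: np.ndarray, qhat: float) -> list:
    """Per test sample, the set of classes whose score is <= qhat.
    A large set signals high uncertainty in the communication process."""
    return [np.nonzero(1.0 - p <= qhat)[0] for p in test_probs]
```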
  3. Long-horizon tasks in unstructured environments are notoriously challenging for robots because they require predicting extensive action plans with thousands of steps while adapting to ever-changing conditions by reasoning across multimodal sensing spaces. Humans efficiently tackle such compound problems by breaking them into easily reachable abstract sub-goals, significantly reducing complexity. Inspired by this ability, we explore how to enable robots to acquire sub-goal formulation skills for long-horizon tasks and generalize them to novel situations and environments. To address these challenges, we propose the Zero-shot Abstract Sub-goal Framework (ZAS-F), which empowers robots to decompose overarching action plans into transferable abstract sub-goals, thereby providing zero-shot capability in new task conditions. ZAS-F is an imitation-learning-based method that efficiently learns a task policy from a few demonstrations. The learned policy extracts abstract features from multimodal, temporally extended observations and uses these features to predict task-agnostic sub-goals by reasoning about their latent relations. We evaluated ZAS-F on radio-frequency identification (RFID) inventory tasks across various dynamic environments, a typical long-horizon task requiring robots to handle unpredictable conditions, including unseen objects and structural layouts. Our experiments demonstrated that ZAS-F achieves a learning efficiency 30 times higher than previous methods, requiring only 8k demonstrations. Compared to prior approaches, ZAS-F achieves a 98.3% scanning accuracy while significantly reducing the training data requirement. Further, ZAS-F demonstrated strong generalization, maintaining a scan success rate of 99.4% in real-world deployment without additional fine-tuning. In long-term operations spanning 100 rooms, ZAS-F maintained performance consistent with short-term tasks, highlighting its robustness against compounding errors. These results establish ZAS-F as an efficient and adaptable solution for long-horizon robotic tasks in unstructured environments.
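One way to picture the hierarchical decomposition the abstract describes is a policy with a high-level head that predicts an abstract sub-goal embedding from temporal multimodal observations and a low-level actor conditioned on that sub-goal. The sketch below is a speculative reading of that structure; all names, dimensions, and the two-level split are assumptions, not the published ZAS-F design.

```python
# Hedged sketch: sub-goal-conditioned policy in the spirit of ZAS-F.
# A recurrent encoder summarizes multimodal history; a sub-goal head
# predicts a task-agnostic abstraction; the actor conditions on it.
import torch
import torch.nn as nn

class SubGoalPolicy(nn.Module):
    def __init__(self, obs_dim=512, goal_dim=64, hidden=256, n_actions=8):
        super().__init__()
        self.encoder = nn.GRU(obs_dim, hidden, batch_first=True)
        self.subgoal_head = nn.Linear(hidden, goal_dim)  # abstract sub-goal
        self.actor = nn.Sequential(
            nn.Linear(hidden + goal_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions))

    def forward(self, obs_seq):
        # obs_seq: (B, T, obs_dim) fused multimodal features over time.
        h, _ = self.encoder(obs_seq)
        state = h[:, -1]                    # current latent state
        subgoal = self.subgoal_head(state)  # transferable abstraction
        action_logits = self.actor(torch.cat([state, subgoal], dim=-1))
        return action_logits, subgoal
```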
  4. Reward signals in reinforcement learning are expensive to design and often require access to the true state, which is not available in the real world. Common alternatives are demonstrations or goal images, which can be labor-intensive to collect. Text descriptions, in contrast, provide a general, natural, and low-effort way of communicating the desired task. However, prior works on learning text-conditioned policies still rely on rewards defined using either the true state or labeled expert demonstrations. We use recent developments in large-scale vision-language models such as CLIP to devise a framework that generates the task reward signal from only a goal text description and raw pixel observations, which is then used to learn the task policy. We evaluate the proposed framework on control and robotic manipulation tasks. Finally, we distill the individual task policies into a single goal-text-conditioned policy that generalizes in a zero-shot manner to new tasks with unseen objects and unseen goal text descriptions.
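The core mechanism here, scoring a pixel observation against a goal text with CLIP, can be sketched directly; the checkpoint choice and any reward scaling are assumptions rather than the paper's exact setup.

```python
# Hedged sketch: a CLIP-based reward as cosine similarity between the
# goal-text embedding and the current frame's image embedding.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_reward(frame: Image.Image, goal_text: str) -> float:
    """Reward = cosine similarity of image and goal-text embeddings."""
    inputs = processor(text=[goal_text], images=frame, return_tensors="pt")
    img = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt = model.get_text_features(input_ids=inputs["input_ids"],
                                  attention_mask=inputs["attention_mask"])
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return float((img * txt).sum())
```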
  5. Few-shot classification aims to recognize novel categories with only a few labeled images in each class. Existing metric-based few-shot classification algorithms predict categories by comparing the feature embeddings of query images with those of a few labeled images (support examples) using a learned metric function. While they show promising performance, these methods often fail to generalize to unseen domains due to the large discrepancy of feature distributions across domains. In this work, we address the problem of few-shot classification under domain shift for metric-based methods. Our core idea is to use feature-wise transformation layers that augment image features with affine transforms to simulate various feature distributions under different domains during training. To capture the variation of feature distributions across domains, we further apply a learning-to-learn approach to search for the hyper-parameters of the feature-wise transformation layers. We conduct extensive experiments and ablation studies under the domain generalization setting using five few-shot classification datasets: mini-ImageNet, CUB, Cars, Places, and Plantae. Experimental results demonstrate that the proposed feature-wise transformation layer is applicable to various metric-based models and provides consistent improvements in few-shot classification performance under domain shift.
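A feature-wise transformation layer of the kind described is straightforward to sketch: during training, per-channel affine scale and bias are sampled from Gaussians whose spreads are the learnable hyper-parameters tuned by the learning-to-learn step. The initial values below are assumptions for illustration.

```python
# Hedged sketch: feature-wise transformation (FT) layer. At training
# time, sample per-channel affine scale/bias from learnable Gaussian
# hyper-parameters to simulate feature distributions of unseen domains.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureWiseTransform(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # Std-dev hyper-parameters, tunable via learning-to-learn.
        self.gamma_std = nn.Parameter(torch.full((1, channels, 1, 1), 0.3))
        self.beta_std = nn.Parameter(torch.full((1, channels, 1, 1), 0.5))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) intermediate feature map.
        if not self.training:
            return x  # FT layers are active only during training
        # gamma ~ N(1, softplus(std)), beta ~ N(0, softplus(std)),
        # sampled per sample and per channel, shared across H and W.
        gamma = 1.0 + torch.randn_like(x[:, :, :1, :1]) * F.softplus(self.gamma_std)
        beta = torch.randn_like(x[:, :, :1, :1]) * F.softplus(self.beta_std)
        return gamma * x + beta
```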