Title: Pose Generation for Social Robots in Conversational Group Formations
We study two approaches for predicting an appropriate pose for a robot to take part in group formations typical of social human conversations subject to the physical layout of the surrounding environment. One method is model-based and explicitly encodes key geometric aspects of conversational formations. The other method is data-driven. It implicitly models key properties of spatial arrangements using graph neural networks and an adversarial training regimen. We evaluate the proposed approaches through quantitative metrics designed for this problem domain and via a human experiment. Our results suggest that the proposed methods are effective at reasoning about the environment layout and conversational group formations. They can also be used repeatedly to simulate conversational spatial arrangements despite being designed to output a single pose at a time. However, the methods showed different strengths. For example, the geometric approach was more successful at avoiding poses generated in nonfree areas of the environment, but the data-driven method was better at capturing the variability of conversational spatial formations. We discuss ways to address open challenges for the pose generation problem and other interesting avenues for future work.
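To make the model-based idea concrete, below is a minimal illustrative sketch of geometric pose generation for joining a conversational group. It is not the paper's actual model: the function name, the circular o-space assumption, the candidate-sampling strategy, and the circle-shaped obstacle representation are all my own simplifications. The sketch samples candidate positions on a circle around the group centroid, rejects candidates in non-free space, and prefers candidates far from existing members, orienting the robot toward the centroid.

```python
import math

def propose_pose(member_positions, obstacles, radius=1.0, samples=36):
    """Propose a robot pose (x, y, theta) for joining a conversational group.

    member_positions: list of (x, y) tuples for current group members.
    obstacles: list of (x, y, r) circles treated as non-free space.
    Returns the free candidate on the formation circle that is farthest
    from existing members, oriented toward the group centroid.
    """
    cx = sum(p[0] for p in member_positions) / len(member_positions)
    cy = sum(p[1] for p in member_positions) / len(member_positions)

    best, best_score = None, -math.inf
    for i in range(samples):
        a = 2 * math.pi * i / samples
        x = cx + radius * math.cos(a)
        y = cy + radius * math.sin(a)
        # Reject candidates that fall inside obstacle regions (non-free space).
        if any(math.hypot(x - ox, y - oy) < orad for ox, oy, orad in obstacles):
            continue
        # Prefer candidates far from existing members (even spacing).
        score = min(math.hypot(x - px, y - py) for px, py in member_positions)
        if score > best_score:
            best_score = score
            best = (x, y, math.atan2(cy - y, cx - x))
    return best
```

For example, with two members facing each other across the centroid and an obstacle blocking one side of the circle, the sketch returns the opposite free position, oriented back toward the group. A learned approach, by contrast, would replace the hand-coded circle and scoring rule with properties inferred from data.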
Award ID(s):
1924802
PAR ID:
10334242
Date Published:
Journal Name:
Frontiers in Robotics and AI
Volume:
8
ISSN:
2296-9144
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Existing multimodal human action recognition approaches are computationally intensive, limiting their deployment in real-time applications. In this work, we present a novel and efficient pose-driven attention-guided multimodal network (EPAM-Net) for action recognition in videos. Specifically, we propose eXpand temporal Shift (X-ShiftNet) convolutional architectures for the RGB and pose streams to capture spatio-temporal features from RGB videos and their skeleton sequences. X-ShiftNet tackles the high computational cost of 3D CNNs by integrating the Temporal Shift Module (TSM) into an efficient 2D CNN, enabling efficient spatio-temporal learning. Then, skeleton features are used to guide the visual stream, focusing on keyframes and their salient spatial regions using the proposed spatial-temporal attention block. Finally, the predictions of the two streams are fused for final classification. The experimental results show that our method, with a significant reduction in floating-point operations (FLOPs), outperforms or is competitive with state-of-the-art methods on the NTU RGB-D 60, NTU RGB-D 120, PKU-MMD, and Toyota SmartHome datasets. The proposed EPAM-Net provides up to a 72.8x reduction in FLOPs and up to a 48.6x reduction in the number of network parameters. The code will be available at https://github.com/ahmed-nady/Multimodal-ActionRecognition.
  2. The 3D world limits the human body pose, and the human body pose conveys information about the surrounding objects. Indeed, from a single image of a person placed in an indoor scene, we as humans are adept at resolving ambiguities of the human pose and room layout through our knowledge of the physical laws and prior perception of the plausible object and human poses. However, few computer vision models fully leverage this fact. In this work, we propose a holistically trainable model that perceives the 3D scene from a single RGB image, estimates the camera pose and the room layout, and reconstructs both human body and object meshes. By imposing a set of comprehensive and sophisticated losses on all aspects of the estimations, we show that our model outperforms existing human body mesh methods and indoor scene reconstruction methods. To the best of our knowledge, this is the first model that outputs both object and human predictions at the mesh level and performs joint optimization on the scene and human poses.
  3. Abstract: The coordinated motion of animal groups through fluids is thought to reduce the cost of locomotion for individuals in the group. However, the connection between the spatial patterns observed in collectively moving animals and the energetic benefits at each position within the group remains unclear. To address this knowledge gap, we study the spontaneous emergence of cohesive formations in groups of fish, modeled as flapping foils, all heading in the same direction. We show, in pairwise formations and with increasing group size, that (1) in side-by-side arrangements, the reciprocal nature of flow coupling results in an equal distribution of energy requirements among all members, with a reduction in the cost of locomotion for swimmers flapping in phase but an increase in cost for swimmers flapping antiphase, and (2) in inline arrangements, flow coupling is non-reciprocal for all flapping phases, with energetic savings in favor of trailing swimmers, but only up to a finite number of swimmers, beyond which school cohesion and energetic benefits are lost at once. We explain these findings mechanistically, and we provide efficient diagnostic tools for identifying locations in the wake of single and multiple swimmers that offer opportunities for hydrodynamic benefits to aspiring followers. Our results imply a connection between the resources generated by flow physics and social traits that influence greedy and cooperative group behavior.
  4. In this paper, we propose a new deep learning-based method for estimating room layout given a pair of 360° panoramas. Our system, called Position-aware Stereo Merging Network or PSMNet, is an end-to-end joint layout-pose estimator. PSMNet consists of a Stereo Pano Pose (SP2) transformer and a novel Cross-Perspective Projection (CP2) layer. The stereo-view SP2 transformer is used to implicitly infer correspondences between views and can handle noisy poses. The pose-aware CP2 layer is designed to render features from the adjacent view to the anchor (reference) view, in order to perform view fusion and estimate the visible layout. Our experiments and analysis validate our method, which significantly outperforms the state-of-the-art layout estimators, especially for large and complex room spaces.
  5. Due to sensitive layout-dependent effects and varied performance metrics, analog routing automation for performance-driven layout synthesis is difficult to generalize. Existing research has proposed a number of heuristic layout constraints targeting specific performance metrics. However, previous frameworks fail to automatically combine routing with human intelligence. This paper proposes a novel, fully automated, analog routing paradigm that leverages machine learning to provide routing guidance, mimicking the sophisticated manual layout approaches. Experiments show that the proposed methodology obtains significant improvements over existing techniques and achieves competitive performance to manual layouts while being capable of generalizing to circuits of different functionality. 