Title: EgoGlass: Egocentric-View Human Pose Estimation From an Eyeglass Frame
We present EgoGlass, a new approach to egocentric motion capture and human pose estimation. EgoGlass is a lightweight eyeglass frame with two cameras mounted on it. Our first contribution is a new egocentric motion-capture device that adds next to no extra burden on the user, together with a dataset of real people performing a diverse set of actions captured by EgoGlass. Second, we propose using body part information for human pose detection to help tackle the limited body coverage and self-occlusions caused by the egocentric viewpoint and the cameras’ proximity to the human body. We also propose the concept of a pseudo-limb mask as an alternative to a segmentation mask when ground-truth segmentation masks are unavailable for egocentric images of real subjects. We demonstrate that our method achieves better results on our dataset than its counterpart without body part information. We also test our method on two existing egocentric datasets, xR-EgoPose and EgoCap. Our method achieves state-of-the-art results on xR-EgoPose and is on par with the existing method on EgoCap, without requiring temporal information or personalization for each individual user.
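The abstract does not say how a pseudo-limb mask is built. As a rough illustration only, one plausible reading is a binary image obtained by rasterizing thick line segments between pairs of 2D joint locations. The sketch below follows that assumption; the function name `pseudo_limb_mask`, the `limbs` pairing, and the `thickness` parameter are hypothetical and are not taken from the paper.

```python
import numpy as np


def pseudo_limb_mask(joints_2d, limbs, height, width, thickness=3.0):
    """Rasterize limb segments between 2D joints into a binary mask.

    Assumption (not from the paper): a limb is approximated by every pixel
    whose distance to the segment joining two joints is <= thickness.
    joints_2d: array of shape (J, 2) holding (x, y) pixel coordinates.
    limbs: iterable of (joint_index_a, joint_index_b) pairs.
    """
    joints_2d = np.asarray(joints_2d, dtype=np.float64)
    ys, xs = np.mgrid[0:height, 0:width]
    pixels = np.stack([xs, ys], axis=-1).astype(np.float64)  # (H, W, 2) as (x, y)
    mask = np.zeros((height, width), dtype=bool)

    for a, b in limbs:
        p, q = joints_2d[a], joints_2d[b]
        seg = q - p
        denom = float(seg @ seg)
        if denom == 0.0:
            # Degenerate limb: both joints coincide, so use the point itself.
            t = np.zeros((height, width))
        else:
            # Parameter of the closest point on the segment, clamped to [0, 1].
            t = np.clip(((pixels - p) @ seg) / denom, 0.0, 1.0)
        closest = p + t[..., None] * seg
        dist = np.linalg.norm(pixels - closest, axis=-1)
        mask |= dist <= thickness

    return mask


# Toy example: a single horizontal "limb" from (2, 5) to (8, 5).
toy_joints = np.array([[2.0, 5.0], [8.0, 5.0]])
toy_mask = pseudo_limb_mask(toy_joints, [(0, 1)], height=10, width=12, thickness=1.0)
```

A mask of this kind needs only 2D joint annotations, which is why it could stand in for a segmentation mask when no ground-truth segmentation exists. How the authors actually define and use the mask should be checked against the paper itself.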
Award ID(s):
1840131
PAR ID:
10390887
Journal Name:
2021 International Conference on 3D Vision (3DV)
Page Range / eLocation ID:
32 to 41
Sponsoring Org:
National Science Foundation
More Like this
  1. We envision a convenient telepresence system available to users anywhere, anytime. Such a system requires displays and sensors embedded in commonly worn items such as eyeglasses, wristwatches, and shoes. To that end, we present a standalone real-time system for the dynamic 3D capture of a person, relying only on cameras embedded into a head-worn device, and on Inertial Measurement Units (IMUs) worn on the wrists and ankles. Our prototype system egocentrically reconstructs the wearer's motion via learning-based pose estimation, which fuses inputs from visual and inertial sensors that complement each other, overcoming challenges such as inconsistent limb visibility in head-worn views, as well as pose ambiguity from sparse IMUs. The estimated pose is continuously re-targeted to a prescanned surface model, resulting in a high-fidelity 3D reconstruction. We demonstrate our system by reconstructing various human body movements and show that our visual-inertial learning-based method, which runs in real time, outperforms both visual-only and inertial-only approaches. We captured an egocentric visual-inertial 3D human pose dataset publicly available at https://sites.google.com/site/youngwooncha/egovip for training and evaluating similar methods.
  2. Weakly supervised pose estimation can be used to assist unsupervised body part segmentation and concealed item detection. The accuracy of pose estimation is essential for precise body part segmentation and accurate concealed item detection. In this paper, we show how poses obtained from an RGB pretrained 2D pose detector can be modified for the backscatter image domain. The 2D poses are refined using RANSAC bundle adjustment to minimize the projection loss in 3D. Furthermore, we show how 2D poses can be optimized using a newly proposed 3D-to-2D pose correction network weakly supervised with pose prior regularizers and multi-view pose and posture consistency losses. The optimized 2D poses are used to segment human body parts. We then train a body-part-aware anomaly detection network to detect foreign (concealed threat) objects on segmented body parts. Our work is applied to the TSA passenger screening dataset containing millimeter wave scan images of airport travelers annotated with only binary labels that indicate whether a foreign object is concealed on a body part. Our proposed approach significantly improves on the detection accuracy that existing works report for TSA 2D backscatter images, with a state-of-the-art performance of 97% F1-score and 0.0559 log-loss on the TSA-PSD test set, and a 74% reduction in 2D pose error.
  3. First-person-view videos of hands interacting with tools are widely used in the computer vision industry. However, creating a dataset with pixel-wise segmentation of hands is challenging since most videos are captured with fingertips occluded by the hand dorsum and grasped tools. Current methods often rely on manually segmenting hands to create annotations, which is inefficient and costly. To address this challenge, we create a method that utilizes thermal information of hands for efficient pixel-wise hand segmentation to create a multi-modal activity video dataset. Our method is not affected by fingertip and joint occlusions and does not require hand pose ground truth. We show our method to be 24 times faster than the traditional polygon labeling method while maintaining high quality. With the segmentation method, we propose a multi-modal hand activity video dataset with 790 sequences and 401,765 frames of "hands using tools" videos captured by thermal and RGB-D cameras with hand segmentation data. We analyze multiple models for hand segmentation performance and benchmark four segmentation networks. We show that fusing Long-Wave InfraRed (LWIR) and RGB-D frames from our multi-modal dataset achieves 5% better hand IoU performance than using RGB frames.
  4. We present EgoRenderer, a system for rendering full-body neural avatars of a person captured by a wearable, egocentric fisheye camera that is mounted on a cap or a VR headset. Our system renders photorealistic novel views of the actor and her motion from arbitrary virtual camera locations. Rendering full-body avatars from such egocentric images comes with unique challenges due to the top-down view and large distortions. We tackle these challenges by decomposing the rendering process into several steps, including texture synthesis, pose construction, and neural image translation. For texture synthesis, we propose Ego-DPNet, a neural network that infers dense correspondences between the input fisheye images and an underlying parametric body model and extracts textures from egocentric inputs. In addition, to encode dynamic appearances, our approach also learns an implicit texture stack that captures detailed appearance variation across poses and viewpoints. For correct pose generation, we first estimate body pose from the egocentric view using a parametric model. We then synthesize an external free-viewpoint pose image by projecting the parametric model to the user-specified target viewpoint. We next combine the target pose image and the textures into a combined feature image, which is transformed into the output color image using a neural image translation network. Experimental evaluations show that EgoRenderer is capable of generating realistic free-viewpoint avatars of a person wearing an egocentric camera. Comparisons to several baselines demonstrate the advantages of our approach.
  5. The robotics community continually strives to create robots that are deployable in real-world environments. Often, robots are expected to interact with human groups. To achieve this goal, we introduce a new method, the Robot-Centric Group Estimation Model (RoboGEM), which enables robots to detect groups of people. Much of the work reported in the literature focuses on dyadic interactions, leaving a gap in our understanding of how to build robots that can effectively team with larger groups of people. Moreover, many current methods rely on exocentric vision, where cameras and sensors are placed externally in the environment rather than onboard the robot. Consequently, these methods are impractical for robots in unstructured, human-centric environments, which are novel and unpredictable. Furthermore, the majority of work on group perception is supervised, which can inhibit performance in real-world settings. RoboGEM addresses these gaps by predicting social groups solely from an egocentric perspective using color and depth (RGB-D) data. To achieve group predictions, RoboGEM leverages joint motion and proximity estimations. We evaluated RoboGEM on a challenging, egocentric, real-world dataset in which both pedestrians and the robot are in motion simultaneously, and show that RoboGEM outperformed two state-of-the-art supervised methods in detection accuracy by up to 30%, with a lower miss rate. Our work will be helpful to the robotics community and will serve as a milestone toward building unsupervised systems that enable robots to work with human groups in real-world environments.