

Title: Comparison of Human Skeleton Trackers Paired with A Novel Skeleton Fusion Algorithm
The onset of Industry 4.0 brings a greater demand for Human-Robot Collaboration (HRC) in manufacturing. This has created a critical need to bridge sensing and AI with the mechanical and physical necessities required to augment a robot's awareness and intelligence. In an HRC work cell, sensor options for detecting human joint locations vary greatly in complexity, usability, and cost. This paper explores the use of depth cameras, a relatively low-cost option that does not require users to wear extra sensing hardware. The Google MediaPipe (BlazePose) and OpenPose skeleton-tracking software packages are used to estimate the pixel coordinates of each human joint in images from the depth cameras, and the depth at each joint pixel is then used with the joint pixel coordinates to generate the 3D joint locations of the skeleton. In addition to comparing these skeleton trackers, this paper presents a novel method of fusing the skeletons that the trackers generate from each camera's data, using a quaternion/link-length representation of the skeleton. Results show that the overall mean and standard deviation of the position error between the fused skeleton and target locations were lower than for the skeletons obtained directly from each individual camera's data.
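To make the pixel-plus-depth step concrete, the following minimal Python/NumPy sketch back-projects tracked joint pixels into 3D camera coordinates using the standard pinhole model. The intrinsics (fx, fy, cx, cy), the depth frame, and the tracker call are placeholders for illustration, not values or APIs from the paper.

```python
import numpy as np

def backproject_joints(joints_px, depth_image, fx, fy, cx, cy):
    """Convert 2D joint pixel coordinates plus per-pixel depth into 3D camera
    coordinates using the standard pinhole model (intrinsics are placeholders)."""
    joints_3d = []
    for (u, v) in joints_px:
        z = depth_image[int(v), int(u)]   # depth (e.g., meters) at the joint pixel
        x = (u - cx) * z / fx             # back-project along the x axis
        y = (v - cy) * z / fy             # back-project along the y axis
        joints_3d.append((x, y, z))
    return np.array(joints_3d)

# Hypothetical usage with a depth frame and intrinsics from calibration:
# joints_px = blazepose_or_openpose(rgb_frame)   # (u, v) per joint, from either tracker
# skeleton_cam = backproject_joints(joints_px, depth_frame,
#                                   fx=525.0, fy=525.0, cx=319.5, cy=239.5)
```

A quaternion/link-length representation, as described in the abstract, would then encode each bone as a fixed link length plus an orientation, which is the form in which skeletons from multiple cameras can be combined.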
Award ID(s): 1830383
NSF-PAR ID: 10352741
Author(s) / Creator(s):
Date Published:
Journal Name: ASME Manufacturing Science and Engineering Conference
Page Range / eLocation ID: 10
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this
  1. Human skeleton data provides a compact, low-noise representation of relative joint locations that may be used in human identity and activity recognition. The Hierarchical Co-occurrence Network (HCN) has been used for human activity recognition because of its ability to model correlations between joints in its convolutional operations. HCN shows good identification accuracy but requires a large number of samples to train. Acquisition of such large-scale data can be time-consuming and expensive, motivating synthetic skeleton data generation for data augmentation in HCN. We propose a novel method that integrates an Auxiliary Classifier Generative Adversarial Network (AC-GAN) and HCN hybrid framework for Assessment and Augmented Identity Recognition for Skeletons (AAIRS). The proposed AAIRS method performs generation and evaluation of synthetic 3-dimensional motion-capture skeleton videos followed by human identity recognition. Synthetic skeleton data produced by the generator component of the AC-GAN is evaluated using an Inception Score-inspired realism metric computed from the HCN classifier outputs. We study the effect of increasing the percentage of synthetic samples in the training set on HCN performance. Before synthetic data augmentation, we achieve 74.49% HCN performance in 10-fold cross validation for 9-class human identification. With a synthetic-real mixture of 50%-50%, we achieve 78.22% mean accuracy, a significant improvement over the real-data-only baseline.
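The "Inception Score-inspired realism metric" mentioned above can be illustrated with a small sketch: it rewards synthetic samples whose class posteriors from the HCN classifier are confident while the marginal class distribution stays diverse. The array of softmax outputs is a placeholder, and this is a generic Inception-Score-style computation rather than the authors' exact formula.

```python
import numpy as np

def inception_style_score(class_probs, eps=1e-12):
    """Inception-Score-style realism metric: exp of the mean KL divergence between
    each sample's class posterior p(y|x) and the marginal class distribution p(y)."""
    p_y = class_probs.mean(axis=0)    # marginal class distribution over generated samples
    kl = np.sum(class_probs * (np.log(class_probs + eps) - np.log(p_y + eps)), axis=1)
    return float(np.exp(kl.mean()))

# Hypothetical usage: class_probs is an (N, 9) array of HCN softmax outputs
# for N synthetic skeleton videos over the 9 identity classes.
# score = inception_style_score(class_probs)
```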
  2. Recent developments in markerless tracking software such as DeepLabCut (DLC) allow estimation of skin landmark positions during behavioral studies. However, studies demanding highly accurate skeletal kinematics require estimating the 3D positions of subdermal landmarks such as joint centers of rotation or skeletal features. In many animals, significant slippage between the skin and underlying skeleton makes accurate tracking of skeletal configuration from skin landmarks difficult. While biplanar, high-speed X-ray acquisition cameras offer a way to measure accurate skeletal configuration using tantalum markers and XROMM, this technology is expensive, not widely available, and the manual annotation required is time-consuming. Here, we present an approach that utilizes DLC to estimate subdermal landmarks in a rat from video collected from two standard cameras. By simultaneously recording X-ray and live video of an animal, we train a DLC model to predict the skin locations representing the projected positions of subdermal landmarks obtained from X-ray data. Predicted skin locations from multiple camera views were triangulated to reconstruct depth-accurate positions of subdermal landmarks. We found that DLC was able to estimate skeletal landmarks with good 3D accuracy, suggesting that this approach may provide accurate estimates of skeletal configuration using standard live video.
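Triangulating the DLC-predicted landmark positions from two calibrated standard cameras can be sketched with a linear (DLT) triangulation. The example below uses OpenCV's cv2.triangulatePoints; the 3x4 projection matrices and 2xN landmark arrays are placeholders standing in for calibration results and DLC output, not anything specified in the abstract.

```python
import numpy as np
import cv2

def triangulate_landmarks(P1, P2, pts_cam1, pts_cam2):
    """Triangulate landmarks from two views.
    P1, P2: 3x4 camera projection matrices from calibration (placeholders here).
    pts_cam1, pts_cam2: 2xN arrays of predicted pixel coordinates from each view."""
    pts_h = cv2.triangulatePoints(P1, P2,
                                  pts_cam1.astype(np.float64),
                                  pts_cam2.astype(np.float64))   # 4xN homogeneous points
    return (pts_h[:3] / pts_h[3]).T                              # Nx3 Euclidean positions

# Hypothetical usage with DLC predictions from two views:
# xyz = triangulate_landmarks(P1, P2, dlc_view1, dlc_view2)
```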
  3. The PoseASL dataset consists of color and depth videos collected from ASL signers at the Linguistic and Assistive Technologies Laboratory under the direction of Matt Huenerfauth, as part of a collaborative research project with researchers at the Rochester Institute of Technology, Boston University, and the University of Pennsylvania.

Access: After becoming an authorized user of Databrary, please contact Matt Huenerfauth if you have difficulty accessing this volume.

We have collected a new dataset consisting of color and depth videos of fluent American Sign Language signers performing sequences of ASL signs and sentences. Given interest among sign-recognition and other computer-vision researchers in red-green-blue-depth (RGBD) video, we release this dataset for use by the research community. In addition to the video files, we share depth data files from a Kinect v2 sensor, as well as additional motion-tracking files produced through post-processing of this data.

Organization of the Dataset: The dataset is organized into sub-folders with codenames such as "P01" or "P16". These codenames refer to specific human signers who were recorded in this dataset. Please note that there was no participant P11 nor P14; those numbers were accidentally skipped during the process of making appointments to collect video stimuli.

Task: During the recording session, the participant was met by a member of our research team who was a native ASL signer. No other individuals were present during the data collection session. After signing the informed consent and video release document, participants responded to a demographic questionnaire. Next, the data-collection session consisted of English word stimuli and cartoon videos. The recording session began by showing participants slides that displayed English words and photos of items, and participants were asked to produce the sign for each (PDF included in the materials subfolder). Next, participants viewed three short animated cartoons, which they were asked to recount in ASL: Canary Row, Warner Brothers Merrie Melodies 1950 (the 7-minute video divided into seven parts); Mr. Koumal Flies Like a Bird, Studio Animovaneho Filmu 1969; and Mr. Koumal Battles his Conscience, Studio Animovaneho Filmu 1971. The word list and cartoons were selected because they are identical to the stimuli used in the collection of the Nicaraguan Sign Language video corpora; see Senghas, A. (1995). Children's Contribution to the Birth of Nicaraguan Sign Language. Doctoral dissertation, Department of Brain and Cognitive Sciences, MIT.

Demographics: All 14 of our participants were fluent ASL signers. As screening, we asked our participants: Did you use ASL at home growing up, or did you attend a school as a very young child where you used ASL? All participants responded affirmatively. A total of 14 DHH participants were recruited on the Rochester Institute of Technology campus, including 7 men and 7 women, aged 21 to 35 (median = 23.5). All of our participants reported that they began using ASL when they were 5 years old or younger, with 8 reporting ASL use since birth and 3 others reporting ASL use since age 18 months.

Filetypes: *.avi, *_dep.bin: The PoseASL dataset was captured using a Kinect 2.0 RGBD camera. The output of this camera system includes multiple channels: RGB, depth, skeleton joints (25 joints for every video frame), and HD face (1,347 points). The video resolution is 1920 x 1080 pixels for the RGB channel and 512 x 424 pixels for the depth channel. Due to limitations in the acceptable filetypes for sharing on Databrary, the binary *_dep.bin files produced by the Kinect v2 camera system could not be shared directly on the Databrary platform. If your research requires the original binary *_dep.bin files, please contact Matt Huenerfauth. *_face.txt, *_HDface.txt, *_skl.txt: To make it easier for future researchers to make use of this dataset, we have also performed some post-processing of the Kinect data. To extract the skeleton coordinates from the RGB videos, we used the OpenPose system, which is capable of detecting body, hand, facial, and foot keypoints of multiple people in single images in real time. The output of OpenPose includes an estimate of 70 keypoints for the face, including eyes, eyebrows, nose, mouth, and face contour. The software also estimates 21 keypoints for each of the hands (Simon et al., 2017), including 3 keypoints for each finger, as shown in Figure 2. Additionally, 25 keypoints are estimated for the body pose and feet (Cao et al., 2017; Wei et al., 2016).

Reporting Bugs or Errors: Please contact Matt Huenerfauth to report any bugs or errors that you identify in the corpus. We appreciate your help in improving the quality of the corpus over time.

Acknowledgement: This material is based upon work supported by the National Science Foundation under award 1749376: "Collaborative Research: Multimethod Investigation of Articulatory and Perceptual Constraints on Natural Language Evolution."
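As an illustration of how the folder organization described above might be consumed programmatically, here is a hedged Python sketch that walks the participant sub-folders (P01, P02, ...) and groups files by the listed suffixes. The root path and exact directory layout are assumptions based on the description; this is not an official loader for the dataset.

```python
from pathlib import Path
from collections import defaultdict

# File suffixes described in the dataset documentation above.
SUFFIXES = ("_dep.bin", "_face.txt", "_HDface.txt", "_skl.txt", ".avi")

def index_poseasl(root):
    """Build {participant: {suffix: [files]}} from a local copy of the dataset.
    Assumes participant folders are named P01, P02, ... (P11 and P14 do not exist)."""
    index = defaultdict(lambda: defaultdict(list))
    for participant_dir in sorted(Path(root).glob("P[0-9][0-9]")):
        for f in participant_dir.rglob("*"):
            for suffix in SUFFIXES:
                if f.name.endswith(suffix):
                    index[participant_dir.name][suffix].append(f)
    return index

# Hypothetical usage (root path is an assumption):
# files = index_poseasl("/data/PoseASL")
# print(files["P01"]["_skl.txt"])
```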
  4. Closed-loop state estimators that track the movements and behaviors of large-scale populations have significant potential to benefit emergency teams during the critical early stages of disaster response. Such population trackers could provide insight into the population even where few direct measurements are available. In concept, a population tracker might be realized using a Bayesian estimation framework to fuse agent-based models of human movement and behavior with sparse sensing, such as a small set of cameras providing population counts at specific locations. We describe a simple proof-of-concept for such an estimator by applying a particle filter to synthetic sensor data generated from a small simulated environment. An interesting result is that behavioral models embedded in the particle filter make it possible to distinguish among simulated agents, even when the only available sensor data are aggregate population counts at specific locations.
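A minimal sketch of this estimator concept: each particle carries a full simulated population state, the agent-based model propagates it forward, and the particle weight comes from comparing predicted location counts with camera counts. The random-walk motion model and Poisson count likelihood below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def step_agents(agents, n_cells):
    """Toy agent-based model: each agent does a lazy random walk over grid cells."""
    moves = rng.integers(-1, 2, size=agents.shape)
    return np.clip(agents + moves, 0, n_cells - 1)

def count_likelihood(agents, observed_counts, camera_cells):
    """Weight a particle by how well its predicted counts at camera locations match
    the observed counts (Poisson observation model, an illustrative assumption)."""
    pred = np.array([(agents == c).sum() for c in camera_cells])
    lam = np.maximum(pred, 1e-3)
    return np.exp(np.sum(observed_counts * np.log(lam) - lam))

def particle_filter_step(particles, observed_counts, camera_cells, n_cells):
    """One predict/update/resample cycle; each particle is a full population state."""
    particles = [step_agents(p, n_cells) for p in particles]                 # predict
    w = np.array([count_likelihood(p, observed_counts, camera_cells) for p in particles])
    w = w / w.sum() if w.sum() > 0 else np.full(len(particles), 1.0 / len(particles))
    idx = rng.choice(len(particles), size=len(particles), p=w)               # resample
    return [particles[i].copy() for i in idx]

# Hypothetical usage: 200 particles, each a state of 50 agents on a 20-cell grid,
# with cameras counting agents at cells 3 and 12.
# particles = [rng.integers(0, 20, size=50) for _ in range(200)]
# particles = particle_filter_step(particles, np.array([4, 7]), [3, 12], 20)
```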
  5.
    Lensless imaging is an emerging modality in which optical elements placed in front of the image sensor perform multiplexed imaging. Several recent papers reconstruct images from lensless imagers, including methods that utilize deep learning for state-of-the-art performance. However, many of these methods require explicit knowledge of the optical element, such as the point spread function, or learn the reconstruction mapping for a single fixed PSF. In this paper, we explore a neural network architecture that performs joint image reconstruction and PSF estimation to robustly recover images captured with multiple PSFs from different cameras. Using adversarial learning, this approach achieves improved reconstruction results that do not require explicit knowledge of the PSF at test time and shows an added improvement in the reconstruction model's ability to generalize to variations in the camera's PSF. This allows lensless cameras to be utilized in a wider range of applications that require multiple cameras, without the need to explicitly train a separate model for each new camera.
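For context, lensless multiplexed imaging is commonly modeled as the scene convolved with the optical element's point spread function. The sketch below simulates that forward model and a classical Wiener-filter reconstruction that requires the PSF to be known, which is exactly the assumption the learned, PSF-agnostic approach above removes. All arrays and the noise level are illustrative placeholders.

```python
import numpy as np

def lensless_forward(scene, psf, noise_sigma=0.01):
    """Simulate a lensless measurement: scene convolved with the PSF plus sensor noise
    (circular convolution via the FFT, an illustrative simplification)."""
    S, H = np.fft.fft2(scene), np.fft.fft2(psf, s=scene.shape)
    meas = np.real(np.fft.ifft2(S * H))
    return meas + noise_sigma * np.random.randn(*scene.shape)

def wiener_reconstruct(measurement, psf, snr=100.0):
    """Classical Wiener deconvolution; requires the PSF, unlike the learned approach."""
    H = np.fft.fft2(psf, s=measurement.shape)
    W = np.conj(H) / (np.abs(H) ** 2 + 1.0 / snr)
    return np.real(np.fft.ifft2(np.fft.fft2(measurement) * W))

# Hypothetical usage with a random scene and a random-mask PSF:
# scene = np.random.rand(128, 128)
# psf = (np.random.rand(128, 128) > 0.5).astype(float) / (128 * 128)
# recon = wiener_reconstruct(lensless_forward(scene, psf), psf)
```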