Title: Using DeepLabCut to Predict Locations of Subdermal Landmarks from Video
Recent developments in markerless tracking software such as DeepLabCut (DLC) allow estimation of skin landmark positions during behavioral studies. However, studies that require highly accurate skeletal kinematics need estimates of the 3D positions of subdermal landmarks such as joint centers of rotation or skeletal features. In many animals, significant slippage between the skin and the underlying skeleton makes it difficult to track skeletal configuration accurately from skin landmarks. While biplanar, high-speed X-ray cameras can measure skeletal configuration accurately using tantalum markers and XROMM, this technology is expensive, not widely available, and the manual annotation it requires is time-consuming. Here, we present an approach that uses DLC to estimate subdermal landmarks in a rat from video collected with two standard cameras. By simultaneously recording X-ray and live video of an animal, we train a DLC model to predict the skin locations corresponding to the projected positions of subdermal landmarks obtained from the X-ray data. Predicted skin locations from multiple camera views are then triangulated to reconstruct depth-accurate positions of the subdermal landmarks. We found that DLC estimated skeletal landmarks with good 3D accuracy, suggesting that this approach could provide accurate estimates of skeletal configuration from standard video alone.
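The record itself includes no code; as a rough sketch of the final triangulation step described above (2D DLC predictions from two calibrated standard cameras combined into one 3D landmark position), a standard linear (DLT) triangulation might look like the following. The function and variable names are illustrative assumptions, not taken from the authors' implementation, and the 3x4 projection matrices are assumed to come from a separate camera calibration.

import numpy as np

def triangulate_landmark(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one landmark from two views.

    P1, P2 : 3x4 camera projection matrices from calibration.
    x1, x2 : (u, v) pixel coordinates of the same landmark in each view,
             e.g. the DLC-predicted projection of a joint center.
    Returns the 3D landmark position in the calibration frame.
    """
    A = np.vstack([
        x1[0] * P1[2] - P1[0],   # u1 * p3^T - p1^T = 0
        x1[1] * P1[2] - P1[1],   # v1 * p3^T - p2^T = 0
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)   # least-squares solution: last right singular vector
    X = Vt[-1]
    return X[:3] / X[3]           # dehomogenize

Applied per frame and per landmark, this kind of triangulation yields the depth-accurate subdermal trajectories the abstract refers to, provided the 2D predictions and calibration are reasonable.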
Award ID(s):
2015317
PAR ID:
10424830
Author(s) / Creator(s):
Date Published:
Journal Name:
Biomimetic and Biohybrid Systems
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. This paper presents a method of tracking multiple ground targets from an unmanned aerial vehicle (UAV) in a 3D reference frame. The tracking method uses a monocular camera and makes no assumptions on the shape of the terrain or the target motion. The UAV runs two cascaded estimators. The first is an Extended Kalman Filter (EKF), which is responsible for tracking the UAV’s state, such as position and velocity relative to a fixed frame. The second estimator is an EKF that is responsible for estimating a fixed number of landmarks within the camera’s field of view. Landmarks are parameterized by a quaternion associated with bearing from the camera’s optical axis and an inverse distance parameter. The bearing quaternion allows for a minimal representation of each landmark’s direction and distance, a filter with no singularities, and a fast update rate due to few trigonometric functions. Three methods for estimating the ground target positions are demonstrated: the first uses the landmark estimator directly on the targets, the second computes the target depth with a weighted average of converged landmark depths, and the third extends the target’s measured bearing vector to intersect a ground plane approximated from the landmark estimates. Simulation results show that the third target estimation method yields the most accurate results. 
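As a minimal, illustrative sketch of the third (best-performing) target-estimation method described in this abstract, extending the measured bearing ray from the camera until it intersects the ground plane approximated from landmark estimates reduces to a ray-plane intersection; the names below are assumptions, and the camera position, bearing vector, and fitted plane are assumed to come from the upstream estimators.

import numpy as np

def target_on_ground_plane(cam_pos, bearing, plane_point, plane_normal):
    """Intersect a measured bearing ray with an estimated ground plane.

    cam_pos      : camera position in the fixed frame, shape (3,)
    bearing      : unit vector along the measured line of sight, shape (3,)
    plane_point  : any point on the plane fitted to converged landmarks
    plane_normal : unit normal of that fitted plane
    Returns the estimated 3D target position.
    """
    denom = float(np.dot(plane_normal, bearing))
    if abs(denom) < 1e-9:
        raise ValueError("bearing is (nearly) parallel to the ground plane")
    t = float(np.dot(plane_normal, plane_point - cam_pos)) / denom
    return cam_pos + t * bearing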
  2. The onset of Industry 4.0 brings a greater demand for Human-Robot Collaboration (HRC) in manufacturing. This has led to a critical need to bridge sensing and AI with the mechanical and physical requirements for successfully augmenting the robot's awareness and intelligence. In an HRC work cell, options for sensors to detect human joint locations vary greatly in complexity, usability, and cost. In this paper, the use of depth cameras is explored, since they are a relatively low-cost option that does not require users to wear extra sensing hardware. Herein, the Google MediaPipe (BlazePose) and OpenPose skeleton-tracking software packages are used to estimate the pixel coordinates of each human joint in images from depth cameras. The depth at each pixel is then used with the joint pixel coordinates to generate the 3D joint locations of the skeleton. In comparing these skeleton trackers, this paper also presents a novel method of fusing the skeletons that the trackers generate from each camera's data, using a quaternion/link-length representation of the skeleton. Results show that the overall mean and standard deviation of position error between the fused skeleton and target locations were lower than for the skeletons resulting directly from each camera's data.
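As a rough sketch of the 2D-to-3D lifting step this abstract describes (joint pixel coordinates from BlazePose/OpenPose combined with the depth image through a pinhole model), something like the following could be used; the intrinsics names, the metre-scaled depth map, and the function name are assumptions, and the quaternion/link-length fusion step is not shown.

import numpy as np

def joints_to_camera_frame(joints_uv, depth_m, fx, fy, cx, cy):
    """Back-project 2D joint detections to 3D camera coordinates.

    joints_uv : (N, 2) array of (u, v) pixel coordinates, one row per joint
    depth_m   : (H, W) depth image in metres from the depth camera
    fx, fy, cx, cy : pinhole intrinsics of the depth camera
    Returns an (N, 3) array of joint positions in the camera frame.
    """
    pts = []
    for u, v in joints_uv:
        z = depth_m[int(round(v)), int(round(u))]   # depth at the joint pixel
        x = (u - cx) * z / fx
        y = (v - cy) * z / fy
        pts.append((x, y, z))
    return np.asarray(pts)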
  3. Full-body motion capture is essential for the study of body movement. Video-based, markerless mocap systems are, in some cases, replacing marker-based systems, but hybrid systems are less explored. We develop methods for coregistration between 2D video and 3D marker positions when precise spatial relationships are not known a priori. We illustrate these methods on three-ball cascade juggling, in which marker-based tracking of the balls was not possible and the hands could not be tracked due to occlusion. Using recorded video and motion capture, we aimed to transform 2D ball coordinates into 3D body space as well as recover details of hand motion. We propose four linear coregistration methods that differ in how they optimize ball-motion constraints during hold and flight phases, using an initial estimate of hand position based on arm and wrist markers. We found that minimizing the error between the ball and the hand estimate was globally suboptimal, distorting ball flight trajectories. The best-performing method used gravitational constraints to transform vertical coordinates and ball-hold constraints to transform lateral coordinates. This method enabled an accurate description of ball flight as well as a reconstruction of wrist movements. We discuss these findings in the broader context of video/motion-capture coregistration.
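As a hedged illustration of only the gravitational constraint mentioned above, one way to recover a vertical metres-per-pixel scale from a single ball flight phase is to fit a parabola to the pixel trajectory and require its curvature to match gravity; the names, the up-positive pixel convention, and the assumption that flight phases are already segmented are illustrative, not taken from the paper.

import numpy as np

def vertical_scale_from_gravity(t, y_pix, g=9.81):
    """Estimate metres-per-pixel for the vertical axis from free flight.

    t     : (N,) frame times in seconds within one flight phase
    y_pix : (N,) vertical ball coordinates in pixels, up-positive
    During free flight the true vertical acceleration is -g, so if
    y_pix(t) ~ a*t^2 + b*t + c, the scale is g / (2*|a|) metres per pixel.
    """
    a, _, _ = np.polyfit(t, y_pix, 2)   # quadratic coefficient in pixels/s^2
    return g / (2.0 * abs(a))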
  4. S. Koyejo; S. Mohamed; A. Agarwal; D. Belgrave; K. Cho; A. Oh (Ed.)
    Labeling articulated objects in unconstrained settings has a wide variety of applications including entertainment, neuroscience, psychology, ethology, and many fields of medicine. Large offline labeled datasets do not exist for all but the most common articulated object categories (e.g., humans). Hand labeling these landmarks within a video sequence is a laborious task. Learned landmark detectors can help, but can be error-prone when trained from only a few examples. Multi-camera systems that train fine-grained detectors have shown significant promise in detecting such errors, allowing for self-supervised solutions that only need a small percentage of the video sequence to be hand-labeled. The approach, however, is based on calibrated cameras and rigid geometry, making it expensive, difficult to manage, and impractical in real-world scenarios. In this paper, we address these bottlenecks by combining a non-rigid 3D neural prior with deep flow to obtain high-fidelity landmark estimates from videos with only two or three uncalibrated, handheld cameras. With just a few annotations (representing 1–2% of the frames), we are able to produce 2D results comparable to state-of-the-art fully supervised methods, along with 3D reconstructions that are impossible with other existing approaches. Our Multi-view Bootstrapping in the Wild (MBW) approach demonstrates impressive results on standard human datasets, as well as tigers, cheetahs, fish, colobus monkeys, chimpanzees, and flamingos from videos captured casually in a zoo. We release the codebase for MBW as well as this challenging zoo dataset consisting of image frames of tail-end distribution categories with their corresponding 2D and 3D labels generated from minimal human intervention.
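MBW itself replaces calibrated geometry with a non-rigid 3D neural prior and deep flow, so no short snippet can represent it faithfully; purely as context for the calibrated, self-supervised multi-camera baseline this abstract contrasts against, the core multi-view consistency check can be sketched roughly as follows (illustrative names, calibrated 3x4 projection matrices assumed).

import numpy as np

def accept_pseudo_label(projections, detections, X, max_err_px=4.0):
    """Keep a detected landmark as a new training label only if the
    triangulated 3D point X reprojects close to the 2D detection in
    every calibrated view.

    projections : list of 3x4 camera projection matrices
    detections  : list of (u, v) detections of the same landmark, per view
    X           : triangulated 3D point, shape (3,)
    """
    Xh = np.append(X, 1.0)
    for P, x in zip(projections, detections):
        proj = P @ Xh
        uv = proj[:2] / proj[2]                    # reproject into this view
        if np.linalg.norm(uv - np.asarray(x, dtype=float)) > max_err_px:
            return False
    return True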