We propose a single-stage, category-level 6-DoF pose estimation algorithm that simultaneously detects and tracks instances of objects within a known category. Our method takes as input the previous and current frame from a monocular RGB video, as well as predictions from the previous frame, to predict the bounding cuboid and 6- DoF pose (up to scale). Internally, a deep network predicts distributions over object keypoints (vertices of the bounding cuboid) in image coordinates, after which a novel probabilistic filtering process integrates across estimates before computing the final pose using PnP. Our framework allows the system to take previous uncertainties into consideration when predicting the current frame, resulting in predictions that are more accurate and stable than single frame methods. Extensive experiments show that our method outperforms existing approaches on the challenging Objectron benchmark of annotated object videos. We also demonstrate the usability of our work in an augmented reality setting.
Single-Stage Keypoint- Based Category-Level Object Pose Estimation from an RGB Image
Prior work on 6-DoF object pose estimation has largely focused on instance-level processing, in which a textured CAD model is available for each object being detected. Category-level 6- DoF pose estimation represents an important step toward developing robotic vision systems that operate in unstructured, real-world scenarios. In this work, we propose a single-stage, keypoint-based approach for category-level object pose estimation that operates on unknown object instances within a known category using a single RGB image as input. The proposed network performs 2D object detection, detects 2D keypoints, estimates 6- DoF pose, and regresses relative bounding cuboid dimensions. These quantities are estimated in a sequential fashion, leveraging the recent idea of convGRU for propagating information from easier tasks to those that are more difficult. We favor simplicity in our design choices: generic cuboid vertex coordinates, single-stage network, and monocular RGB input. We conduct extensive experiments on the challenging Objectron benchmark, outperforming state-of-the-art methods on the 3D IoU metric (27.6% higher than the MobilePose single-stage approach and 7.1 % higher than the related two-stage approach).
- Award ID(s):
- Publication Date:
- NSF-PAR ID:
- Journal Name:
- International Conference on Robotics and Automation
- Page Range or eLocation-ID:
- 1547 to 1553
- Sponsoring Org:
- National Science Foundation
More Like this
We present an architecture for online, incremental scene modeling which combines a SLAM-based scene understanding framework with semantic segmentation and object pose estimation. The core of this approach comprises a probabilistic inference scheme that predicts semantic labels for object hypotheses at each new frame. From these hypotheses, recognized scene structures are incrementally constructed and tracked. Semantic labels are inferred using a multi-domain convolutional architecture which operates on the image time series and which enables efficient propagation of features as well as robust model registration. To evaluate this architecture, we introduce a large-scale RGB-D dataset JHUSEQ-25 as a new benchmark for the sequence-based scene understanding in complex and densely cluttered scenes. This dataset contains 25 RGB-D video sequences with 100,000 labeled frames in total. We validate our method on this dataset and demonstrate improved performance of semantic segmentation and 6-DoF object pose estimation compared with methods based on the single view.
Tensegrity robots, which are composed of compressive elements (rods) and flexible tensile elements (e.g., cables), have a variety of advantages, including flexibility, low weight, and resistance to mechanical impact. Nevertheless, the hybrid soft-rigid nature of these robots also complicates the ability to localize and track their state. This work aims to address what has been recognized as a grand challenge in this domain, i.e., the state estimation of tensegrity robots through a markerless, vision-based method, as well as novel, onboard sensors that can measure the length of the robot's cables. In particular, an iterative optimization process is proposed to track the 6-DoF pose of each rigid element of a tensegrity robot from an RGB-D video as well as endcap distance measurements from the cable sensors. To ensure that the pose estimates of rigid elements are physically feasible, i.e., they are not resulting in collisions between rods or with the environment, physical constraints are introduced during the optimization. Real-world experiments are performed with a 3-bar tensegrity robot, which performs locomotion gaits. Given ground truth data from a motion capture system, the proposed method achieves less than 1~cm translation error and 3 degrees rotation error, which significantly outperforms alternatives. At the samemore »
As autonomous robots interact and navigate around real-world environments such as homes, it is useful to reliably identify and manipulate articulated objects, such as doors and cabinets. Many prior works in object articulation identification require manipulation of the object, either by the robot or a human. While recent works have addressed predicting articulation types from visual observations alone, they often assume prior knowledge of category-level kinematic motion models or sequence of observations where the articulated parts are moving according to their kinematic constraints. In this work, we propose FormNet, a neural network that identifies the articulation mechanisms between pairs of object parts from a single frame of an RGB-D image and segmentation masks. The network is trained on 100k synthetic images of 149 articulated objects from 6 categories. Synthetic images are rendered via a photorealistic simulator with domain randomization. Our proposed model predicts motion residual flows of object parts, and these flows are used to determine the articulation type and parameters. The network achieves an articulation type classification accuracy of 82.5% on novel object instances in trained categories. Experiments also show how this method enables generalization to novel categories and can be applied to real-world images without fine-tuning.
Promising results have been achieved recently in category-level manipulation that generalizes across object instances. Nevertheless, it often requires expensive real-world data collection and manual specification of semantic keypoints for each object category and task. Additionally, coarse keypoint predictions and ignoring intermediate action sequences hinder adoption in complex manipulation tasks beyond pick-and-place. This work proposes a novel, category-level manipulation framework that leverages an object-centric, category-level representation and model-free 6 DoF motion tracking. The canonical object representation is learned solely in simulation and then used to parse a category-level, task trajectory from a single demonstration video. The demonstration is reprojected to a target trajectory tailored to a novel object via the canonical representation. During execution, the manipulation horizon is decomposed into longrange, collision-free motion and last-inch manipulation. For the latter part, a category-level behavior cloning (CatBC) method leverages motion tracking to perform closed-loop control. CatBC follows the target trajectory, projected from the demonstration and anchored to a dynamically selected category-level coordinate frame. The frame is automatically selected along the manipulation horizon by a local attention mechanism. This framework allows to teach different manipulation strategies by solely providing a single demonstration, without complicated manual programming. Extensive experiments demonstrate its efficacy in a rangemore »