skip to main content


The NSF Public Access Repository (NSF-PAR) system and access will be unavailable from 11:00 PM ET on Thursday, May 23 until 2:00 AM ET on Friday, May 24 due to maintenance. We apologize for the inconvenience.

Title: Incremental scene understanding on dense SLAM
We present an architecture for online, incremental scene modeling which combines a SLAM-based scene understanding framework with semantic segmentation and object pose estimation. The core of this approach comprises a probabilistic inference scheme that predicts semantic labels for object hypotheses at each new frame. From these hypotheses, recognized scene structures are incrementally constructed and tracked. Semantic labels are inferred using a multi-domain convolutional architecture which operates on the image time series and which enables efficient propagation of features as well as robust model registration. To evaluate this architecture, we introduce a large-scale RGB-D dataset JHUSEQ-25 as a new benchmark for the sequence-based scene understanding in complex and densely cluttered scenes. This dataset contains 25 RGB-D video sequences with 100,000 labeled frames in total. We validate our method on this dataset and demonstrate improved performance of semantic segmentation and 6-DoF object pose estimation compared with methods based on the single view.  more » « less
Award ID(s):
Author(s) / Creator(s):
; ; ; ; ;
Date Published:
Journal Name:
Page Range / eLocation ID:
574 to 581
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. This paper introduces key machine learning operations that allow the realization of robust, joint 6D pose estimation of multiple instances of objects either densely packed or in unstructured piles from RGB-D data. The first objective is to learn semantic and instance-boundary detectors without manual labeling. An adversarial training framework in conjunction with physics-based simulation is used to achieve detectors that behave similarly in synthetic and real data. Given the stochastic output of such detectors, candidates for object poses are sampled. The second objective is to automatically learn a single score for each pose candidate that represents its quality in terms of explaining the entire scene via a gradient boosted tree. The proposed method uses features derived from surface and boundary alignment between the observed scene and the object model placed at hypothesized poses. Scene-level, multi-instance pose estimation is then achieved by an integer linear programming process that selects hypotheses that maximize the sum of the learned individual scores, while respecting constraints, such as avoiding collisions. To evaluate this method, a dataset of densely packed objects with challenging setups for state-of-the-art approaches is collected. Experiments on this dataset and a public one show that the method significantly outperforms alternatives in terms of 6D pose accuracy while trained only with synthetic datasets. 
    more » « less
  2. We present a filtering-based method for semantic mapping to simultaneously detect objects and localize their 6 degree-of-freedom pose. For our method, called Contextual Temporal Mapping (or CT-Map), we represent the semantic map as a belief over object classes and poses across an observed scene. Inference for the semantic mapping problem is then modeled in the form of a Conditional Random Field (CRF). CT-Map is a CRF that considers two forms of relationship potentials to account for contextual relations between objects and temporal consistency of object poses, as well as a measurement potential on observations. A particle filtering algorithm is then proposed to perform inference in the CT-Map model. We demonstrate the efficacy of the CT-Map method with a Michigan Progress Fetch robot equipped with a RGB-D sensor. Our results demonstrate that the particle filtering based inference of CT-Map provides improved object detection and pose estimation with respect to baseline methods that treat observations as independent samples of a scene. 
    more » « less
  3. Weakly supervised pose estimation can be used to assist unsupervised body part segmentation and concealed item detection. The accuracy of pose estimation is essential for precise body part segmentation and accurate concealed item detection. In this paper, we show how poses obtained from an RGB pretrained 2D pose detector can be modified for the backscatter image domain. The 2D poses are refined using RANSAC bundle adjustment to minimize the projection loss in 3D. Furthermore, we show how 2D poses can be optimized using a newly proposed 3D-to-2D pose correction network weakly supervised with pose prior regularizers and multi-view pose and posture consistency losses. The optimized 2D poses are used to segment human body parts. We then train a body-part-aware anomaly detection network to detect foreign (concealed threat) objects on segmented body parts. Our work is applied to the TSA passenger screening dataset containing millimeter wave scan images of airport travelers annotated with only binary labels that indicate whether a foreign object is concealed on a body part. Our proposed approach significantly improves the detection accuracy of TSA 2D backscatter images in existing works with a state-of-the-art performance of 97% F1-score, 0.0559 log-loss on the TSA-PSD test-set, and a 74% reduction in 2D pose error. 
    more » « less
  4. Semantic segmentation for scene understanding is nowadays widely demanded, raising significant challenges for the algorithm efficiency, especially its applications on resource-limited platforms. Current segmentation models are trained and evaluated on massive high-resolution scene images (“data-level”) and suffer from the expensive computation arising from the required multi-scale aggregation (“network level”). In both folds, the computational and energy costs in training and inference are notable due to the often desired large input resolutions and heavy computational burden of segmentation models. To this end, we propose DANCE, general automated DA ta- N etwork C o-optimization for E fficient segmentation model training and inference . Distinct from existing efficient segmentation approaches that focus merely on light-weight network design, DANCE distinguishes itself as an automated simultaneous data-network co-optimization via both input data manipulation and network architecture slimming. Specifically, DANCE integrates automated data slimming which adaptively downsamples/drops input images and controls their corresponding contribution to the training loss guided by the images’ spatial complexity. Such a downsampling operation, in addition to slimming down the cost associated with the input size directly, also shrinks the dynamic range of input object and context scales, therefore motivating us to also adaptively slim the network to match the downsampled data. Extensive experiments and ablating studies (on four SOTA segmentation models with three popular segmentation datasets under two training settings) demonstrate that DANCE can achieve “all-win” towards efficient segmentation (reduced training cost, less expensive inference, and better mean Intersection-over-Union (mIoU)). Specifically, DANCE can reduce ↓25%–↓77% energy consumption in training, ↓31%–↓56% in inference, while boosting the mIoU by ↓0.71%–↑ 13.34%. 
    more » « less
  5. Many manipulation tasks, such as placement or within-hand manipulation, require the object’s pose relative to a robot hand. The task is difficult when the hand significantly occludes the object. It is especially hard for adaptive hands, for which it is not easy to detect the finger’s configuration. In addition, RGB-only approaches face issues with texture-less objects or when the hand and the object look similar. This paper presents a depth-based framework, which aims for robust pose estimation and short response times. The approach detects the adaptive hand’s state via efficient parallel search given the highest overlap between the hand’s model and the point cloud. The hand’s point cloud is pruned and robust global registration is performed to generate object pose hypotheses, which are clustered. False hypotheses are pruned via physical reasoning. The remaining poses’ quality is evaluated given agreement with observed data. Extensive evaluation on synthetic and real data demonstrates the accuracy and computational efficiency of the framework when applied on challenging, highly-occluded scenarios for different object types. An ablation study identifies how the framework’s components help in performance. This work also provides a dataset for in-hand 6D object pose esti- mation. Code and dataset are available at: https://github. com/wenbowen123/icra20-hand-object-pose 
    more » « less