Title: 3D integral imaging depth estimation of partially occluded objects using mutual information and Bayesian optimization
Integral imaging (InIm) is useful for passive ranging and 3D visualization of partially occluded objects. We consider 3D object localization within a scene and under occlusion. 2D localization can be achieved using machine learning and non-machine-learning-based techniques, which aim to provide a 2D bounding box around each object of interest. A recent study uses InIm for 3D reconstruction of the scene with occlusions and utilizes the mutual information (MI) between the bounding box in this 3D reconstructed scene and the corresponding bounding box in the central elemental image to achieve passive depth estimation of partially occluded objects. Here, we improve upon this InIm method by using Bayesian optimization to minimize the number of required 3D scene reconstructions. We evaluate the performance of the proposed approach by analyzing different kernel functions, acquisition functions, and parameter estimation algorithms for Bayesian-optimization-based inference for simultaneous depth estimation of objects and occlusions. In our optical experiments, mutual-information-based depth estimation with Bayesian optimization achieves depth estimation with only a handful of 3D reconstructions. To the best of our knowledge, this is the first report to use Bayesian optimization for mutual-information-based InIm depth estimation.
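As a rough illustration of the optimization loop described above, the sketch below runs Gaussian-process-based Bayesian optimization over candidate reconstruction depths, using an expected-improvement acquisition function. The objective, the Matern kernel, the depth bounds, the iteration budget, and the placeholder names `reconstruct_at_depth` and `mutual_information` are all assumptions made for illustration; they are not the paper's implementation or settings.

```python
# Hypothetical sketch: Bayesian optimization of reconstruction depth for
# MI-based InIm depth estimation. The kernel, acquisition function, and
# search bounds are illustrative choices, not the paper's settings.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def expected_improvement(z_candidates, gp, best_f, xi=0.01):
    """Expected-improvement acquisition evaluated on candidate depths."""
    mu, sigma = gp.predict(z_candidates.reshape(-1, 1), return_std=True)
    sigma = np.maximum(sigma, 1e-9)
    imp = mu - best_f - xi
    u = imp / sigma
    return imp * norm.cdf(u) + sigma * norm.pdf(u)

def bayes_opt_depth(objective, z_min, z_max, n_init=3, n_iter=10, seed=0):
    """Maximize objective(z) (e.g., MI at reconstruction depth z) with few evaluations."""
    rng = np.random.default_rng(seed)
    Z = rng.uniform(z_min, z_max, size=n_init)   # initial reconstruction depths
    F = np.array([objective(z) for z in Z])      # MI values at those depths
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    for _ in range(n_iter):
        gp.fit(Z.reshape(-1, 1), F)
        z_grid = np.linspace(z_min, z_max, 500)
        ei = expected_improvement(z_grid, gp, F.max())
        z_next = z_grid[np.argmax(ei)]           # depth with highest expected improvement
        Z = np.append(Z, z_next)
        F = np.append(F, objective(z_next))      # one additional 3D reconstruction
    return Z[np.argmax(F)], F.max()
```

In use, the objective would reconstruct the scene at depth z and return the MI between the bounding box in that reconstruction and the one in the central elemental image, e.g., bayes_opt_depth(lambda z: mutual_information(reconstruct_at_depth(z), central_box), z_min, z_max), where all three names are hypothetical placeholders.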
Award ID(s):
2141473
PAR ID:
10425180
Author(s) / Creator(s):
Publisher / Repository:
Optical Society of America
Date Published:
Journal Name:
Optics Express
Volume:
31
Issue:
14
ISSN:
1094-4087; OPEXFF
Format(s):
Medium: X
Size(s):
Article No. 22863
Sponsoring Org:
National Science Foundation
More Like this
  1. Integral imaging has proven useful for three-dimensional (3D) object visualization in adverse environmental conditions such as partial occlusion and low light. This paper considers the problem of 3D object tracking. Two-dimensional (2D) object tracking within a scene is an active research area. Several recent algorithms use object detection methods to obtain 2D bounding boxes around objects of interest in each frame. Then, one bounding box can be selected out of many for each object of interest using motion prediction algorithms. Many of these algorithms rely on images obtained using traditional 2D imaging systems. A growing literature demonstrates the advantage of using 3D integral imaging instead of traditional 2D imaging for object detection and visualization in adverse environmental conditions. Integral imaging's depth-sectioning ability has also proven beneficial for object detection and visualization. Integral imaging captures an object's depth in addition to its 2D spatial position in each frame. A recent study uses integral imaging for the 3D reconstruction of the scene for object classification and utilizes the mutual information between the object's bounding box in this 3D reconstructed scene and the 2D central perspective to achieve passive depth estimation. We build on this method by using Bayesian optimization to track the object's depth in as few 3D reconstructions as possible. We study the performance of our approach on laboratory scenes with occluded objects moving in 3D and show that the proposed approach outperforms 2D object tracking. In our experimental setup, mutual-information-based depth estimation with Bayesian optimization achieves depth tracking with as few as two 3D reconstructions per frame, which corresponds to the theoretical minimum number of 3D reconstructions required for depth estimation. To the best of our knowledge, this is the first report on 3D object tracking using the proposed approach.
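For reference, the mutual information between the bounding box in the central perspective and the same region in a 3D reconstruction at a candidate depth can be estimated from a joint intensity histogram. The snippet below is a generic sketch under that assumption; the bin count and normalization are arbitrary choices, not taken from the papers.

```python
# Hypothetical sketch: histogram-based mutual information between two
# same-shape grayscale patches (e.g., a bounding box in the central elemental
# image and the same box in a reconstruction at a candidate depth).
import numpy as np

def mutual_information(patch_a, patch_b, bins=64):
    """MI estimate from the joint intensity histogram of two patches."""
    joint, _, _ = np.histogram2d(patch_a.ravel(), patch_b.ravel(), bins=bins)
    pxy = joint / joint.sum()                    # joint probability
    px = pxy.sum(axis=1, keepdims=True)          # marginal of patch_a
    py = pxy.sum(axis=0, keepdims=True)          # marginal of patch_b
    nz = pxy > 0                                 # avoid log(0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))
```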
  2.
    We propose PSSNet, a network architecture for generating diverse plausible 3D reconstructions from a single 2.5D depth image. Existing methods tend to produce only small variations on a single shape, even when multiple shapes are consistent with an observation. To obtain diversity, we alter a variational autoencoder by providing a learned shape bounding box feature as side information during training. Since these features are known during training, we are able to add a supervised loss to the encoder and noiseless values to the decoder. To evaluate, we sample a set of completions from the network, construct a set of plausible shape matches for each test observation, and compare using our plausible diversity metric defined over sets of shapes. We perform experiments using ShapeNet mugs and partially occluded YCB objects and find that our method performs comparably on datasets with little ambiguity and outperforms existing methods when many shapes plausibly fit an observed depth image. We demonstrate one use for PSSNet on a physical robot grasping objects in occlusion and clutter.
  3. We present a method to map 2D image observations of a scene to a persistent 3D scene representation, enabling novel view synthesis and disentangled representation of the movable and immovable components of the scene. Motivated by the bird's-eye-view (BEV) representation commonly used in vision and robotics, we propose conditional neural groundplans, ground-aligned 2D feature grids, as persistent and memory-efficient scene representations. Our method is trained in a self-supervised manner from unlabeled multi-view observations using differentiable rendering, and it learns to complete the geometry and appearance of occluded regions. In addition, we show that we can leverage multi-view videos at training time to learn to separately reconstruct static and movable components of the scene from a single image at test time. The ability to separately reconstruct movable objects enables a variety of downstream tasks using simple heuristics, such as extraction of object-centric 3D representations, novel view synthesis, instance-level segmentation, 3D bounding box prediction, and scene editing. This highlights the value of neural groundplans as a backbone for efficient 3D scene understanding models.
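As a minimal, generic illustration of a ground-aligned 2D feature grid (not the paper's architecture), the sketch below max-pools per-point features into BEV-style grid cells indexed by their ground-plane coordinates; the grid extent, resolution, and pooling rule are assumptions made for illustration.

```python
# Hypothetical sketch: scatter per-point features into a ground-aligned 2D grid
# (a BEV-style "groundplan"). Extent, resolution, and max-pooling are illustrative.
import numpy as np

def pool_to_groundplan(points_xyz, feats, extent=10.0, resolution=0.1):
    """points_xyz: (N, 3) world coordinates; feats: (N, C) per-point features."""
    n_cells = int(2 * extent / resolution)
    grid = np.zeros((n_cells, n_cells, feats.shape[1]), dtype=feats.dtype)
    # Map ground-plane (x, y) coordinates to integer cell indices; height is ignored.
    ij = np.floor((points_xyz[:, :2] + extent) / resolution).astype(int)
    valid = np.all((ij >= 0) & (ij < n_cells), axis=1)
    for (i, j), f in zip(ij[valid], feats[valid]):
        grid[i, j] = np.maximum(grid[i, j], f)   # max-pool features per cell
    return grid
```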
  4. Carbon nanotube (CNT) forests are imaged using scanning electron microscopes (SEMs) that project their multilayered 3D structure into a single 2D image. Image analytics, particularly instance segmentation, are needed to quantify structural characteristics and to predict correlations between structural morphology and physical properties. The inherent complexity of individual CNT structures is further increased in CNT forests by the density of CNTs, interactions between CNTs, occlusions, and the lack of 3D information to resolve correspondences when multiple CNTs from different depths appear to cross in 2D. In this paper, we propose CNT-NeRF, a generative adversarial network (GAN) for simultaneous depth-layer decomposition and segmentation of CNT forests in SEM images. The proposed network is trained using a multi-layer, photo-realistic synthetic dataset obtained by transferring the style of real CNT images to physics-based simulation data. Experiments show promising depth-layer decomposition and accurate CNT segmentation results not only for the front layer but also for the partially occluded middle and back layers. This achievement is a significant step toward automated, image-based CNT forest structure characterization and physical property prediction.
  5. The main challenge of monocular 3D object detection is the accurate localization of the 3D center. Motivated by a new and strong observation that this challenge can be remedied by a 3D-space local-grid search scheme in an ideal case, we propose a stage-wise approach that combines the information flow from 2D to 3D (3D bounding box proposal generation from a single 2D image) and from 3D to 2D (proposal verification by denoising with 3D-to-2D contexts) in a top-down manner. Specifically, we first obtain initial proposals from off-the-shelf backbone monocular 3D detectors. Then, we generate a 3D anchor space by local-grid sampling from the initial proposals. Finally, we perform 3D bounding box denoising at the 3D-to-2D proposal verification stage. To effectively learn discriminative features for denoising highly overlapped proposals, this paper presents a method of using the Perceiver I/O model [20] to fuse the 3D-to-2D geometric information and the 2D appearance information. With the encoded latent representation of a proposal, the verification head is implemented with a self-attention module. Our method, named MonoXiver, is generic and can be easily adapted to any backbone monocular 3D detector. Experimental results on the well-established KITTI dataset and the challenging large-scale Waymo dataset show that MonoXiver consistently achieves improvement with limited computation overhead.
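The local-grid sampling step can be pictured with the short, hypothetical sketch below, which builds a regular 3D anchor grid around an initial proposal center; the number of steps and the step size are illustrative assumptions, not MonoXiver's settings.

```python
# Hypothetical sketch: generate a local 3D anchor grid around an initial
# proposal center ("3D anchor space by local-grid sampling"). The offsets
# and step size are illustrative, not the values used by MonoXiver.
import numpy as np

def local_grid_anchors(center_xyz, steps=1, delta=0.5):
    """Return (K, 3) anchor centers on a regular grid around center_xyz."""
    offsets = np.arange(-steps, steps + 1) * delta
    dx, dy, dz = np.meshgrid(offsets, offsets, offsets, indexing="ij")
    grid = np.stack([dx, dy, dz], axis=-1).reshape(-1, 3)
    return np.asarray(center_xyz) + grid         # shift the grid to the proposal center
```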