skip to main content

Attention:

The NSF Public Access Repository (NSF-PAR) system and access will be unavailable from 11:00 PM ET on Thursday, October 10 until 2:00 AM ET on Friday, October 11 due to maintenance. We apologize for the inconvenience.


Title: Deep Supervision with Shape Concepts for Occlusion-Aware 3D Object Parsing
Monocular 3D object parsing is highly desirable in various scenarios including occlusion reasoning and holistic scene interpretation. We present a deep convolutional neural network (CNN) architecture to localize semantic parts in 2D image and 3D space while inferring their visibility states, given a single RGB image. Our key insight is to exploit domain knowledge to regularize the network by deeply supervising its hidden layers, in order to sequentially infer intermediate concepts associated with the final task. To acquire training data in desired quantities with ground truth 3D shape and relevant concepts, we render 3D object CAD models to generate large-scale synthetic data and simulate challenging occlusion configurations between objects. We train the network only on synthetic data and demonstrate state-of-the-art performances on real image benchmarks including an extended version of KITTI, PASCAL VOC, PASCAL3D+ and IKEA for 2D and 3D keypoint localization and instance segmentation. The empirical results substantiate the utility of our deep supervision scheme by demonstrating effective transfer of knowledge from synthetic data to real images, resulting in less overfitting compared to standard end-to-end training.  more » « less
Award ID(s):
1637949
NSF-PAR ID:
10047463
Author(s) / Creator(s):
; ; ; ; ;
Date Published:
Journal Name:
CVPR
Page Range / eLocation ID:
388 to 397
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. null (Ed.)
    We introduce an interactive Soft Shadow Network (SSN) to generates controllable soft shadows for image composit- ing. SSN takes a 2D object mask as input and thus is ag- nostic to image types such as painting and vector art. An environment light map is used to control the shadow’s char- acteristics, such as angle and softness. SSN employs an Ambient Occlusion Prediction module to predict an inter- mediate ambient occlusion map, which can be further re- fined by the user to provides geometric cues to modulate the shadow generation. To train our model, we design an efficient pipeline to produce diverse soft shadow training data using 3D object models. In addition, we propose an inverse shadow map representation to improve model train- ing. We demonstrate that our model produces realistic soft shadows in real-time. Our user studies show that the gen- erated shadows are often indistinguishable from shadows calculated by a physics-based renderer and users can eas- ily use SSN through an interactive application to generate specific shadow effects in minutes. 
    more » « less
  2. We present a new weakly supervised learning-based method for generating novel category-specific 3D shapes from unoccluded image collections. Our method is weakly supervised and only requires silhouette annotations from unoccluded, category-specific objects. Our method does not require access to the object's 3D shape, multiple observations per object from different views, intra-image pixel correspondences, or any view annotations. Key to our method is a novel multi-projection generative adversarial network (MP-GAN) that trains a 3D shape generator to be consistent with multiple 2D projections of the 3D shapes, and without direct access to these 3D shapes. This is achieved through multiple discriminators that encode the distribution of 2D projections of the 3D shapes seen from a different views. Additionally, to determine the view information for each silhouette image, we also train a view prediction network on visualizations of 3D shapes synthesized by the generator. We iteratively alternate between training the generator and training the view prediction network. We validate our multi-projection GAN on both synthetic and real image datasets. Furthermore, we also show that multi-projection GANs can aid in learning other high-dimensional distributions from lower dimensional training datasets, such as material-class specific spatially varying reflectance properties from images. 
    more » « less
  3. This paper introduces a deep neural network based method, i.e., DeepOrganNet, to generate and visualize fully high-fidelity 3D / 4D organ geometric models from single-view medical images with complicated background in real time. Traditional 3D / 4D medical image reconstruction requires near hundreds of projections, which cost insufferable computational time and deliver undesirable high imaging / radiation dose to human subjects. Moreover, it always needs further notorious processes to segment or extract the accurate 3D organ models subsequently. The computational time and imaging dose can be reduced by decreasing the number of projections, but the reconstructed image quality is degraded accordingly. To our knowledge, there is no method directly and explicitly reconstructing multiple 3D organ meshes from a single 2D medical grayscale image on the fly. Given single-view 2D medical images, e.g., 3D / 4D-CT projections or X-ray images, our end-to-end DeepOrganNet framework can efficiently and effectively reconstruct 3D / 4D lung models with a variety of geometric shapes by learning the smooth deformation fields from multiple templates based on a trivariate tensor-product deformation technique, leveraging an informative latent descriptor extracted from input 2D images. The proposed method can guarantee to generate high-quality and high-fidelity manifold meshes for 3D / 4D lung models; while, all current deep learning based approaches on the shape reconstruction from a single image cannot. The major contributions of this work are to accurately reconstruct the 3D organ shapes from 2D single-view projection, significantly improve the procedure time to allow on-the-fly visualization, and dramatically reduce the imaging dose for human subjects. Experimental results are evaluated and compared with the traditional reconstruction method and the state-of-the-art in deep learning, by using extensive 3D and 4D examples, including both synthetic phantom and real patient datasets. The efficiency of the proposed method shows that it only needs several milliseconds to generate organ meshes with 10K vertices, which has great potential to be used in real-time image guided radiation therapy (IGRT). 
    more » « less
  4. null (Ed.)
    Material and biological sciences frequently generate large amounts of microscope data that require 3D object level segmentation. Often, the objects of interest have a common geometry, for example spherical, ellipsoidal, or cylindrical shapes. Neural networks have became a popular approach for object detection but they are often limited by their training dataset and have difficulties adapting to new data. In this paper, we propose a volumetric object detection approach for microscopy volumes comprised of fibrous structures by using deep centroid regression and geometric regularization. To this end, we train encoder-decoder networks for segmentation and centroid regression. We use the regression information combined with prior system knowledge to propose cylindrical objects and enforce geometric regularization in the segmentation. We train our networks on synthetic data and then test the trained networks in several experimental datasets. Our approach shows competitive results against other 3D segmentation methods when tested on the synthetic data and outperforms those other methods across different datasets. 
    more » « less
  5. Occlusion is a critical problem in the Autonomous Driving System. Solving this problem requires robust collaboration among autonomous vehicles traveling on the same roads. However, transferring the entirety of raw sensors' data among autonomous vehicles is expensive and can cause a delay in communication. This paper proposes a method called Realtime Collaborative Vehicular Communication based on Bird's-Eye-View (BEV) map. The BEV map holds the accurate depth information from the point cloud image while its 2D representation enables the method to use a novel and well-trained image-based backbone network. Most importantly, we encode the object detection results into the BEV representation to reduce the volume of data transmission and make real-time collaboration between autonomous vehicles possible. The output of this process, the BEV map, can also be used as direct input to most route planning modules. Numerical results show that this novel method can increase the accuracy of object detection by cross-verifying the results from multiple points of view. Thus, in the process, this new method also reduces the object detection challenges that stem from occlusion and partial occlusion. Additionally, different from many existing methods, this new method significantly reduces the data needed for transfer between vehicles, achieving a speed of 21.92 Hz for both the object detection process and the data transmission process, which is sufficiently fast for a real-time system. 
    more » « less