Title: Deep Supervision with Shape Concepts for Occlusion-Aware 3D Object Parsing
Monocular 3D object parsing is highly desirable in various scenarios, including occlusion reasoning and holistic scene interpretation. We present a deep convolutional neural network (CNN) architecture that localizes semantic parts in the 2D image and in 3D space while inferring their visibility states, given a single RGB image. Our key insight is to exploit domain knowledge to regularize the network by deeply supervising its hidden layers, in order to sequentially infer intermediate concepts associated with the final task. To acquire training data in the desired quantities with ground-truth 3D shape and the relevant concepts, we render 3D object CAD models to generate large-scale synthetic data and simulate challenging occlusion configurations between objects. We train the network only on synthetic data and demonstrate state-of-the-art performance on real image benchmarks, including an extended version of KITTI, PASCAL VOC, PASCAL3D+ and IKEA, for 2D and 3D keypoint localization and instance segmentation. The empirical results substantiate the utility of our deep supervision scheme by demonstrating effective transfer of knowledge from synthetic data to real images, resulting in less overfitting compared to standard end-to-end training.
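As a rough illustration of the deep supervision scheme described above, the following minimal PyTorch sketch attaches an auxiliary prediction head to each hidden stage so that intermediate concepts (e.g., 2D keypoint heatmaps, visibility, 3D keypoints) each receive their own loss alongside the final task. The stage sizes, head shapes, and loss weights here are hypothetical for illustration, not the paper's architecture:

```python
import torch
import torch.nn as nn

class DeeplySupervisedNet(nn.Module):
    """Minimal sketch: a backbone whose hidden stages each predict an
    intermediate concept, supervised by its own auxiliary loss."""
    def __init__(self):
        super().__init__()
        self.stage1 = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU())
        self.stage2 = nn.Sequential(nn.Conv2d(32, 64, 3, padding=1), nn.ReLU())
        self.stage3 = nn.Sequential(nn.Conv2d(64, 64, 3, padding=1), nn.ReLU())
        # Hypothetical concept heads: 2D keypoints, visibility, 3D keypoints.
        self.head_kp2d = nn.Conv2d(32, 16, 1)   # 16 keypoint heatmaps
        self.head_vis  = nn.Conv2d(64, 16, 1)   # per-keypoint visibility
        self.head_kp3d = nn.Conv2d(64, 48, 1)   # 16 keypoints x (x, y, z)

    def forward(self, x):
        h1 = self.stage1(x)
        h2 = self.stage2(h1)
        h3 = self.stage3(h2)
        return self.head_kp2d(h1), self.head_vis(h2), self.head_kp3d(h3)

def deep_supervision_loss(preds, targets, weights=(1.0, 1.0, 1.0)):
    # Sum of per-concept losses: each hidden stage receives its own
    # gradient signal instead of relying solely on the final-task loss.
    return sum(w * nn.functional.mse_loss(p, t)
               for w, p, t in zip(weights, preds, targets))
```

The point of the extra heads is regularization: gradients from easier intermediate concepts shape the hidden representations before the hardest (final) prediction is made.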
Award ID(s):
1637949
NSF-PAR ID:
10047463
Author(s) / Creator(s):
Date Published:
Journal Name:
CVPR
Page Range / eLocation ID:
388 to 397
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. This paper introduces a deep neural network based method, DeepOrganNet, to generate and visualize high-fidelity 3D / 4D organ geometric models in real time from single-view medical images with complicated backgrounds. Traditional 3D / 4D medical image reconstruction requires hundreds of projections, which incurs prohibitive computational time and delivers an undesirably high imaging / radiation dose to human subjects. Moreover, it typically requires further laborious processing to segment or extract accurate 3D organ models afterwards. The computational time and imaging dose can be reduced by decreasing the number of projections, but the reconstructed image quality degrades accordingly. To our knowledge, no prior method directly and explicitly reconstructs multiple 3D organ meshes from a single 2D medical grayscale image on the fly. Given single-view 2D medical images, e.g., 3D / 4D-CT projections or X-ray images, our end-to-end DeepOrganNet framework can efficiently and effectively reconstruct 3D / 4D lung models with a variety of geometric shapes by learning smooth deformation fields from multiple templates based on a trivariate tensor-product deformation technique, leveraging an informative latent descriptor extracted from the input 2D images. The proposed method is guaranteed to generate high-quality, high-fidelity manifold meshes for 3D / 4D lung models, which current deep learning based approaches to shape reconstruction from a single image cannot. The major contributions of this work are to accurately reconstruct 3D organ shapes from a single 2D projection, to significantly shorten the processing time to allow on-the-fly visualization, and to dramatically reduce the imaging dose for human subjects. Experimental results are evaluated and compared against a traditional reconstruction method and the state of the art in deep learning, using extensive 3D and 4D examples, including both synthetic phantom and real patient datasets. The proposed method needs only several milliseconds to generate organ meshes with 10K vertices, showing great potential for use in real-time image-guided radiation therapy (IGRT).
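The "trivariate tensor-product deformation technique" mentioned above is consistent with classic free-form deformation (FFD), where a tensor product of Bernstein bases maps a control-point lattice to a smooth volumetric warp. The NumPy sketch below is a generic FFD, offered as a plausible reading of the technique rather than DeepOrganNet's exact implementation:

```python
import numpy as np
from math import comb

def bernstein(n, i, t):
    """Bernstein basis polynomial B_i^n(t)."""
    return comb(n, i) * (t ** i) * ((1 - t) ** (n - i))

def ffd(points, lattice):
    """Trivariate tensor-product (free-form) deformation of points in [0,1]^3.
    lattice: (l+1, m+1, n+1, 3) control points; points: (N, 3)."""
    l, m, n = lattice.shape[0] - 1, lattice.shape[1] - 1, lattice.shape[2] - 1
    out = np.zeros_like(points)
    for idx, (s, t, u) in enumerate(points):
        for i in range(l + 1):
            for j in range(m + 1):
                for k in range(n + 1):
                    w = bernstein(l, i, s) * bernstein(m, j, t) * bernstein(n, k, u)
                    out[idx] += w * lattice[i, j, k]
    return out

# Sanity check: control points on a regular grid reproduce the input points,
# so learned offsets to the lattice encode a smooth deformation field.
grid = np.stack(np.meshgrid(*[np.linspace(0, 1, 3)] * 3, indexing="ij"), axis=-1)
verts = np.random.rand(10, 3)
assert np.allclose(ffd(verts, grid), verts, atol=1e-6)
```

Because the warp is smooth in the control points, a network that regresses small lattice offsets from the image descriptor can deform a template mesh without breaking its manifold structure.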
  2. We introduce an interactive Soft Shadow Network (SSN) that generates controllable soft shadows for image compositing. SSN takes a 2D object mask as input and is thus agnostic to image type, such as painting or vector art. An environment light map is used to control the shadow's characteristics, such as angle and softness. SSN employs an Ambient Occlusion Prediction module to predict an intermediate ambient occlusion map, which can be further refined by the user to provide geometric cues that modulate the shadow generation. To train our model, we design an efficient pipeline that produces diverse soft shadow training data using 3D object models. In addition, we propose an inverse shadow map representation to improve model training. We demonstrate that our model produces realistic soft shadows in real time. Our user studies show that the generated shadows are often indistinguishable from shadows calculated by a physics-based renderer, and that users can easily use SSN through an interactive application to generate specific shadow effects in minutes.
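A minimal PyTorch skeleton of the two-stage design described here (mask to intermediate ambient-occlusion map, then a shadow generator conditioned on the light map) might look as follows; the layer sizes, concatenation-based conditioning, and user-edit hook are assumptions for illustration, not the published SSN architecture:

```python
import torch
import torch.nn as nn

class SSNSketch(nn.Module):
    """Hypothetical two-stage sketch: an ambient-occlusion (AO) prediction
    module followed by a shadow generator conditioned on a light map."""
    def __init__(self):
        super().__init__()
        self.ao_net = nn.Sequential(            # mask -> intermediate AO map
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 3, padding=1), nn.Sigmoid())
        self.shadow_net = nn.Sequential(        # (mask, AO, light) -> shadow
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 3, padding=1), nn.Sigmoid())

    def forward(self, mask, light_map, ao_edit=None):
        ao = self.ao_net(mask)
        if ao_edit is not None:    # optional user refinement of the AO map
            ao = ao_edit
        return self.shadow_net(torch.cat([mask, ao, light_map], dim=1))
```

Exposing the AO map as an explicit intermediate is what makes the pipeline interactively editable: the user can overwrite it before the shadow stage runs.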
  3. We present a new weakly supervised learning-based method for generating novel category-specific 3D shapes from unoccluded image collections. Our method is weakly supervised in that it only requires silhouette annotations from unoccluded, category-specific objects; it does not require access to the object's 3D shape, multiple observations per object from different views, intra-image pixel correspondences, or any view annotations. Key to our method is a novel multi-projection generative adversarial network (MP-GAN) that trains a 3D shape generator to be consistent with multiple 2D projections of the 3D shapes, without direct access to those 3D shapes. This is achieved through multiple discriminators that encode the distribution of 2D projections of the 3D shapes seen from different views. Additionally, to determine the view information for each silhouette image, we train a view prediction network on visualizations of 3D shapes synthesized by the generator, and we iteratively alternate between training the generator and training the view prediction network. We validate our multi-projection GAN on both synthetic and real image datasets. Furthermore, we show that multi-projection GANs can aid in learning other high-dimensional distributions from lower-dimensional training data, such as material-class-specific spatially varying reflectance properties from images.
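To make the multi-discriminator idea concrete, here is a hedged PyTorch sketch: a generator emits a voxel occupancy grid, a simple max-projection stands in for the differentiable view projection, and one discriminator per view scores only 2D silhouettes. The module shapes, the projection operator, and the loss form are illustrative assumptions, not MP-GAN's actual components:

```python
import torch
import torch.nn as nn

def project_silhouette(voxels, dim):
    """Max-project occupancy along one axis: a crude stand-in for
    rendering a 3D shape to a 2D silhouette from a fixed view."""
    return voxels.max(dim=dim).values

class VoxelGenerator(nn.Module):
    def __init__(self, z_dim=64, res=16):
        super().__init__()
        self.res = res
        self.net = nn.Sequential(nn.Linear(z_dim, res ** 3), nn.Sigmoid())
    def forward(self, z):
        return self.net(z).view(-1, self.res, self.res, self.res)

class SilhouetteDiscriminator(nn.Module):
    def __init__(self, res=16):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(res ** 2, 1))
    def forward(self, sil):
        return self.net(sil)

# One discriminator per projection axis; each sees only 2D silhouettes,
# so the generator is never shown a ground-truth 3D shape.
G = VoxelGenerator()
Ds = [SilhouetteDiscriminator() for _ in range(3)]
z = torch.randn(4, 64)
vox = G(z)
g_loss = sum(-D(project_silhouette(vox, dim=d + 1)).mean()
             for d, D in enumerate(Ds))
```

The 3D structure emerges only because all per-view discriminators must be fooled simultaneously by projections of the same generated shape.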
  4. Recently, point clouds (PCs) have gained popularity for modeling various 3D objects (both synthetic and real-world) and have been extensively utilized in a wide range of applications such as AR/VR, 3D reconstruction, and autonomous driving. For such applications, it is critical to analyze and understand the surrounding scene properly. To achieve this, deep learning based methods (e.g., convolutional neural networks (CNNs)) have been widely employed for higher accuracy. Unlike deep learning on conventional 2D images/videos, where feature computation (matrix multiplication) is the major bottleneck, in point cloud-based CNNs the sample and neighbor search stages are the primary bottlenecks, collectively contributing 54% (up to 80%) of the overall execution latency on a typical edge device. While prior efforts have attempted to address this issue by designing custom ASICs or by pipelining the neighbor search with other stages, to our knowledge none has tried to "structurize" the unstructured PC data to improve computational efficiency. In this paper, we first explore the opportunity to structurize PC data using Morton codes (originally designed to map data from a high-dimensional space to one dimension while preserving spatial locality) and observe that there is substantial scope to "skip" the sample and neighbor search computation by operating on the structurized PC data. Based on this, we propose two approximation techniques for the sampling and neighbor search stages. We implemented our proposals on an NVIDIA Jetson AGX Xavier edge GPU board. Evaluation results collected on six different workloads show that our design accelerates the sample and neighbor search stages by 3.68× (up to 5.21×) with minimal impact on inference accuracy. This acceleration in turn yields a 1.55× speedup in end-to-end execution latency and saves 33% of the energy expenditure.
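For reference, a standard 3D Morton encoding (bit interleaving via the usual "part-1-by-2" magic-mask sequence) looks as follows; sorting points by this code is what exposes the spatial locality the paper exploits. The quantization resolution below is an arbitrary choice for illustration:

```python
def part1by2(x):
    """Spread the bits of a 10-bit integer so they occupy every third bit."""
    x &= 0x3FF
    x = (x | (x << 16)) & 0x030000FF
    x = (x | (x << 8))  & 0x0300F00F
    x = (x | (x << 4))  & 0x030C30C3
    x = (x | (x << 2))  & 0x09249249
    return x

def morton3d(x, y, z):
    """Interleave the bits of quantized (x, y, z) into one Morton code."""
    return part1by2(x) | (part1by2(y) << 1) | (part1by2(z) << 2)

# Quantize points to a 1024^3 grid, encode, and sort: nearby points in 3D
# tend to land near each other in the sorted 1D order, which is what lets
# sampling / neighbor search operate on contiguous ranges of the array.
points = [(0.1, 0.2, 0.3), (0.11, 0.21, 0.29), (0.9, 0.8, 0.7)]
codes = sorted(morton3d(*(int(c * 1023) for c in p)) for p in points)
```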
  5. Medical image segmentation is one of the most challenging tasks in medical image analysis and is widely developed for many clinical applications. While deep learning-based approaches have achieved impressive performance in semantic segmentation, their pixel-wise formulation struggles with class-imbalanced data and weak object boundaries in medical images. In this paper, we tackle those limitations by developing a new two-branch deep network architecture that takes both higher-level and lower-level features into account. The first branch extracts higher-level features as region information with a common encoder-decoder network structure such as Unet or FCN, whereas the second branch focuses on lower-level features as support information around the boundary and runs in parallel to the first branch. Our key contribution is the second branch, named the Narrow Band Active Contour (NB-AC) attention model, which treats the object contour as a hyperplane and all data inside a narrow band as support information that influences the position and orientation of the hyperplane. Our proposed NB-AC attention model combines the contour length with a region energy involving a fixed-width band around the curve or surface. The proposed network loss contains two fitting terms: (i) a higher-level feature (i.e., region) fitting term from the first branch; and (ii) a lower-level feature (i.e., contour) fitting term from the second branch, comprising (ii-a) the length of the object contour and (ii-b) a regional energy functional formed by the homogeneity criterion of both the inner and outer bands neighboring the evolving curve or surface. The proposed NB-AC loss can be incorporated into both 2D and 3D deep network architectures. The proposed network has been evaluated on several challenging medical image datasets, including DRIVE, iSeg17, MRBrainS18 and Brats18. The experimental results show that the proposed NB-AC loss outperforms other mainstream loss functions (Cross Entropy, Dice, Focal) on two common segmentation frameworks, Unet and FCN. Our 3D network, built upon the proposed NB-AC loss and the 3D Unet framework, achieved state-of-the-art results on multiple volumetric datasets.
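The abstract does not spell out the NB-AC loss itself; the following is a Chan-Vese-flavored PyTorch sketch of a narrow-band active-contour loss matching the description (a contour-length term plus a band-restricted region-homogeneity term), with the band built from morphological dilation/erosion approximated via max-pooling. The paper's exact formulation may differ:

```python
import torch
import torch.nn.functional as F

def nb_ac_loss_sketch(prob, img, band=3, lam=1.0):
    """Hedged sketch of a narrow-band active-contour style loss.
    prob: (B,1,H,W) soft foreground mask; img: (B,1,H,W) intensities."""
    # (i) Contour-length term: total variation of the soft mask.
    dx = prob[:, :, 1:, :] - prob[:, :, :-1, :]
    dy = prob[:, :, :, 1:] - prob[:, :, :, :-1]
    length = dx.abs().mean() + dy.abs().mean()

    # Narrow band around the implicit contour: dilation minus erosion
    # of the hard mask, approximated with max-pooling.
    hard = (prob > 0.5).float()
    dil = F.max_pool2d(hard, 2 * band + 1, stride=1, padding=band)
    ero = 1 - F.max_pool2d(1 - hard, 2 * band + 1, stride=1, padding=band)
    nb = dil - ero  # 1 inside the band, 0 elsewhere

    # (ii) Region energy inside the band: homogeneity of the inner and
    # outer sides w.r.t. their mean intensities (Chan-Vese style).
    eps = 1e-6
    c_in  = (img * prob * nb).sum() / ((prob * nb).sum() + eps)
    c_out = (img * (1 - prob) * nb).sum() / (((1 - prob) * nb).sum() + eps)
    region = ((img - c_in) ** 2 * prob * nb).mean() \
           + ((img - c_out) ** 2 * (1 - prob) * nb).mean()
    return length + lam * region
```

A loss of this shape plugs into any 2D segmentation network as an auxiliary term alongside a region loss such as cross entropy, which mirrors the two-branch design described above.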