skip to main content


Title: Image Synthesis From Reconfigurable Layout and Style
Despite remarkable recent progress on both unconditional and conditional image synthesis, it remains a long-standing problem to learn generative models that are capable of synthesizing realistic and sharp images from reconfigurable spatial layout (i.e., bounding boxes + class labels in an image lattice) and style (i.e., structural and appearance variations encoded by latent vectors), especially at high resolution. By reconfigurable, it means that a model can preserve the intrinsic one-to-many mapping from a given layout to multiple plausible images with different styles, and is adaptive with respect to perturbations of a layout and style latent code. In this paper, we present a layout- and style-based architecture for generative adversarial networks (termed LostGANs) that can be trained end-to-end to generate images from reconfigurable layout and style. Inspired by the vanilla StyleGAN, the proposed LostGAN consists of two new components: (i) learning fine-grained mask maps in a weakly-supervised manner to bridge the gap between layouts and images, and (ii) learning object instance-specific layout-aware feature normalization (ISLA-Norm) in the generator to realize multi-object style generation. In experiments, the proposed method is tested on the COCO-Stuff dataset and the Visual Genome dataset with state-of-the-art performance obtained. The code and pretrained models are available at https://github.com/iVMCL/LostGANs  more » « less
Award ID(s):
1909644
NSF-PAR ID:
10122812
Author(s) / Creator(s):
;
Date Published:
Journal Name:
IEEE International Conference on Computer Vision
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. This paper introduces a novel generative encoder (GE) framework for generative imaging and image processing tasks like image reconstruction, compression, denoising, inpainting, deblurring, and super-resolution. GE unifies the generative capacity of GANs and the stability of AEs in an optimization framework instead of stacking GANs and AEs into a single network or combining their loss functions as in existing literature. GE provides a novel approach to visualizing relationships between latent spaces and the data space. The GE framework is made up of a pre-training phase and a solving phase. In the former, a GAN with generator \begin{document}$ G $\end{document} capturing the data distribution of a given image set, and an AE network with encoder \begin{document}$ E $\end{document} that compresses images following the estimated distribution by \begin{document}$ G $\end{document} are trained separately, resulting in two latent representations of the data, denoted as the generative and encoding latent space respectively. In the solving phase, given noisy image \begin{document}$ x = \mathcal{P}(x^*) $\end{document}, where \begin{document}$ x^* $\end{document} is the target unknown image, \begin{document}$ \mathcal{P} $\end{document} is an operator adding an addictive, or multiplicative, or convolutional noise, or equivalently given such an image \begin{document}$ x $\end{document} in the compressed domain, i.e., given \begin{document}$ m = E(x) $\end{document}, the two latent spaces are unified via solving the optimization problem

    and the image \begin{document}$ x^* $\end{document} is recovered in a generative way via \begin{document}$ \hat{x}: = G(z^*)\approx x^* $\end{document}, where \begin{document}$ \lambda>0 $\end{document} is a hyperparameter. The unification of the two spaces allows improved performance against corresponding GAN and AE networks while visualizing interesting properties in each latent space.

     
    more » « less
  2. null (Ed.)
    Adversarial training is a popular defense strategy against attack threat models with bounded Lp norms. However, it often degrades the model performance on normal images and the defense does not generalize well to novel attacks. Given the success of deep generative models such as GANs and VAEs in characterizing the underlying manifold of images, we investigate whether or not the aforementioned problems can be remedied by exploiting the underlying manifold information. To this end, we construct an "On-Manifold ImageNet" (OM-ImageNet) dataset by projecting the ImageNet samples onto the manifold learned by StyleGSN. For this dataset, the underlying manifold information is exact. Using OM-ImageNet, we first show that adversarial training in the latent space of images improves both standard accuracy and robustness to on-manifold attacks. However, since no out-of-manifold perturbations are realized, the defense can be broken by Lp adversarial attacks. We further propose Dual Manifold Adversarial Training (DMAT) where adversarial perturbations in both latent and image spaces are used in robustifying the model. Our DMAT improves performance on normal images, and achieves comparable robustness to the standard adversarial training against Lp attacks. In addition, we observe that models defended by DMAT achieve improved robustness against novel attacks which manipulate images by global color shifts or various types of image filtering. Interestingly, similar improvements are also achieved when the defended models are tested on out-of-manifold natural images. These results demonstrate the potential benefits of using manifold information in enhancing robustness of deep learning models against various types of novel adversarial attacks. 
    more » « less
  3. This paper introduces a deep neural network based method, i.e., DeepOrganNet, to generate and visualize fully high-fidelity 3D / 4D organ geometric models from single-view medical images with complicated background in real time. Traditional 3D / 4D medical image reconstruction requires near hundreds of projections, which cost insufferable computational time and deliver undesirable high imaging / radiation dose to human subjects. Moreover, it always needs further notorious processes to segment or extract the accurate 3D organ models subsequently. The computational time and imaging dose can be reduced by decreasing the number of projections, but the reconstructed image quality is degraded accordingly. To our knowledge, there is no method directly and explicitly reconstructing multiple 3D organ meshes from a single 2D medical grayscale image on the fly. Given single-view 2D medical images, e.g., 3D / 4D-CT projections or X-ray images, our end-to-end DeepOrganNet framework can efficiently and effectively reconstruct 3D / 4D lung models with a variety of geometric shapes by learning the smooth deformation fields from multiple templates based on a trivariate tensor-product deformation technique, leveraging an informative latent descriptor extracted from input 2D images. The proposed method can guarantee to generate high-quality and high-fidelity manifold meshes for 3D / 4D lung models; while, all current deep learning based approaches on the shape reconstruction from a single image cannot. The major contributions of this work are to accurately reconstruct the 3D organ shapes from 2D single-view projection, significantly improve the procedure time to allow on-the-fly visualization, and dramatically reduce the imaging dose for human subjects. Experimental results are evaluated and compared with the traditional reconstruction method and the state-of-the-art in deep learning, by using extensive 3D and 4D examples, including both synthetic phantom and real patient datasets. The efficiency of the proposed method shows that it only needs several milliseconds to generate organ meshes with 10K vertices, which has great potential to be used in real-time image guided radiation therapy (IGRT). 
    more » « less
  4. We present MixNMatch, a conditional generative model that learns to disentangle and encode background, object pose, shape, and texture from real images with minimal supervision, for mix-and-match image generation. We build upon FineGAN, an unconditional generative model, to learn the desired disentanglement and image generator, and leverage adversarial joint image-code distribution matching to learn the latent factor encoders. MixNMatch requires bounding boxes during training to model background, but requires no other supervision. Through extensive experiments, we demonstrate MixNMatch's ability to accurately disentangle, encode, and combine multiple factors for mix-and-match image generation, including sketch2color, cartoon2img, and img2gif applications. Our code/models/demo can be found at https://github.com/Yuheng-Li/MixNMatch 
    more » « less
  5. Existing image-to-image transformation approaches primarily focus on synthesizing visually pleasing data. Generating images with correct identity labels is challenging yet much less explored. It is even more challenging to deal with image transformation tasks with large deformation in poses, viewpoints, or scales while preserving the identity, such as face rotation and object viewpoint morphing. In this paper, we aim at transforming an image with a fine-grained category to synthesize new images that preserve the identity of the input image, which can thereby benefit the subsequent fine-grained image recognition and few-shot learning tasks. The generated images, transformed with large geometric deformation, do not necessarily need to be of high visual quality but are required to maintain as much identity information as possible. To this end, we adopt a model based on generative adversarial networks to disentangle the identity related and unrelated factors of an image. In order to preserve the fine-grained contextual details of the input image during the deformable transformation, a constrained nonalignment connection method is proposed to construct learnable highways between intermediate convolution blocks in the generator. Moreover, an adaptive identity modulation mechanism is proposed to transfer the identity information into the output image effectively. Extensive experiments on the CompCars and Multi-PIE datasets demonstrate that our model preserves the identity of the generated images much better than the state-of-the-art image-to-image transformation models, and as a result significantly boosts the visual recognition performance in fine-grained few-shot learning. 
    more » « less