Title: Image synthesis: a review of methods, datasets, evaluation metrics, and future outlook
Image synthesis is the process of converting input text, a sketch, or another source, i.e., another image or a mask, into an image. It is an important problem in computer vision, and the research community has worked to solve it well enough to generate photorealistic images. Different techniques and strategies have been employed for this purpose. This paper therefore aims to provide a comprehensive review of image synthesis models covering several aspects. First, the concept of image synthesis is introduced. We then review image synthesis methods in three categories: image generation from text, from sketches, and from other inputs. Each sub-category is introduced under its category based on the general framework, to give a broad view of existing image synthesis methods. Next, the benchmark datasets used in image synthesis are briefly described, along with the image synthesis models that use them. For evaluation, we summarize the metrics used to assess image synthesis models and provide a detailed analysis of the results of the reviewed methods based on those metrics. Finally, we discuss some existing challenges and suggest possible future research directions.
Award ID(s):
2025234
NSF-PAR ID:
10421011
Author(s) / Creator(s):
; ;
Date Published:
Journal Name:
Artificial Intelligence Review
ISSN:
0269-2821
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. We present Prompt Diffusion, a framework for enabling in-context learning in diffusion-based generative models. Given a pair of task-specific example images, such as depth from/to image and scribble from/to image, and text guidance, our model automatically understands the underlying task and performs the same task on a new query image following the text guidance. To achieve this, we propose a vision-language prompt that can model a wide range of vision-language tasks and a diffusion model that takes it as input. The diffusion model is trained jointly on six different tasks using these prompts. The resulting Prompt Diffusion model becomes the first diffusion-based vision-language foundation model capable of in-context learning. It demonstrates high-quality in-context generation for the trained tasks and effectively generalizes to new, unseen vision tasks using their respective prompts. Our model also shows compelling text-guided image editing results. Our framework aims to facilitate research into in-context learning for computer vision. We share our code and pre-trained models at https://github.com/Zhendong-Wang/Prompt-Diffusion.
  2. Sketch2Prototype is an AI-based framework that transforms a hand-drawn sketch into a diverse set of 2D images and 3D prototypes through sketch-to-text, text-to-image, and image-to-3D stages. This framework, demonstrated across various sketches, rapidly generates text, image, and 3D modalities for enhanced early-stage design exploration. We show that using text as an intermediate modality outperforms direct sketch-to-3D baselines for generating diverse and manufacturable 3D models. We find limitations in current image-to-3D techniques, while noting the value of the text modality for user feedback.
  3. Human communication often combines imagery and text into integrated presentations, especially online. In this paper, we show how image–text coherence relations can be used to model the pragmatics of image–text presentations in AI systems. In contrast to alternative frameworks that characterize image–text presentations in terms of the priority, relevance, or overlap of information across modalities, coherence theory postulates that each unit of a discourse stands in specific pragmatic relations to other parts of the discourse, with each relation involving its own information goals and inferential connections. Text accompanying an image may, for example, characterize what's visible in the image, explain how the image was obtained, offer the author's appraisal of or reaction to the depicted situation, and so forth. The advantage of coherence theory is that it provides a simple, robust, and effective abstraction of communicative goals for practical applications. To argue this, we review case studies describing coherence in image–text data sets, predicting coherence from few-shot annotations, and coherence models of image–text tasks such as caption generation and caption evaluation.

     
  4. Face sketch-photo synthesis is a critical application in law enforcement and the digital entertainment industry. Despite significant improvements in sketch-to-photo synthesis techniques, existing methods still have serious limitations in practice, such as the need for paired data in the training phase or the lack of control over facial attributes in the synthesized image. In this work, we present a new framework, a conditional version of CycleGAN, conditioned on facial attributes. The proposed network forces facial attributes, such as skin and hair color, on the synthesized photo and does not need a set of aligned face-sketch pairs during training. We evaluate the proposed network by training on two sketch datasets, one real and one synthetic. The hand-sketch images of the FERET dataset and the color face images from the WVU Multi-modal dataset are used as unpaired input to the proposed conditional CycleGAN, with skin color as the controlled face attribute. For further attribute-guided evaluation, a synthetic sketch dataset is created from the CelebA dataset and used to evaluate the performance of the network by forcing several desired facial attributes on the synthesized faces.
  5. Anomaly analysis is an important component of any surveillance system. In recent years, it has drawn the attention of the computer vision and machine learning communities. In this article, our overarching goal is to provide a coherent and systematic review of state-of-the-art techniques and a comprehensive review of research in anomaly analysis. We provide a broad view of computational models, datasets, metrics, extensive experiments, and what anomaly analysis can do in images and videos. Covering nearly 200 publications, we review (i) anomaly-related surveys, (ii) taxonomies for anomaly problems, (iii) the computational models, (iv) the benchmark datasets for studying abnormalities in images and videos, and (v) the performance of state-of-the-art methods on this research problem. In addition, we provide insightful discussions and pave the way for future work.