Title: Holistic Evaluation for Interleaved Text-and-Image Generation
Interleaved text-and-image generation is an intriguing research direction in which models are required to generate both images and text pieces in an arbitrary order. Despite emerging advances in interleaved generation, progress on its evaluation still lags significantly behind. Existing evaluation benchmarks do not support arbitrarily interleaved images and text for both inputs and outputs, and they cover only a limited number of domains and use cases. Moreover, current works predominantly use similarity-based metrics, which fall short of assessing quality in open-ended scenarios. To this end, we introduce InterleavedBench, the first benchmark carefully curated for the evaluation of interleaved text-and-image generation. InterleavedBench features a rich array of tasks covering diverse real-world use cases. In addition, we present InterleavedEval, a strong reference-free metric powered by GPT-4o that delivers accurate and explainable evaluation. We carefully define five essential evaluation aspects for InterleavedEval, including text quality, perceptual quality, image coherence, text-image coherence, and helpfulness, to ensure a comprehensive and fine-grained assessment. Through extensive experiments and rigorous human evaluation, we show that our benchmark and metric can effectively evaluate existing models, achieving a stronger correlation with human judgments than previous reference-based metrics. We also provide substantial findings and insights to foster future research in interleaved generation and its evaluation.
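The abstract does not describe the evaluator's implementation. The sketch below illustrates one plausible way a reference-free, aspect-based judge like InterleavedEval could prompt GPT-4o through the OpenAI chat completions API: the five aspect names come from the abstract, while the prompt wording, the 1-5 scale, and the function names are illustrative assumptions rather than the authors' actual code.

```python
# Illustrative sketch of reference-free, aspect-based scoring with GPT-4o.
# The five aspects are taken from the abstract; the prompt text, 1-5 scale,
# and helper names are assumptions, not the authors' implementation.
from openai import OpenAI

ASPECTS = [
    "text quality",
    "perceptual quality",
    "image coherence",
    "text-image coherence",
    "helpfulness",
]

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def score_interleaved_output(instruction: str, text_pieces: list[str],
                             image_urls: list[str]) -> dict[str, str]:
    """Ask GPT-4o to rate one interleaved output on each aspect (1-5 plus rationale)."""
    scores = {}
    for aspect in ASPECTS:
        content = [{
            "type": "text",
            "text": (f"Task instruction: {instruction}\n"
                     f"Generated text: {' '.join(text_pieces)}\n"
                     f"Rate the output's {aspect} from 1 (poor) to 5 (excellent) "
                     "and briefly explain your rating."),
        }]
        # Attach the generated images so the judge can assess the visual aspects.
        content += [{"type": "image_url", "image_url": {"url": u}} for u in image_urls]
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": content}],
        )
        scores[aspect] = response.choices[0].message.content
    return scores
```

Per-aspect scores of this kind could then be aggregated across a benchmark split and correlated with human ratings, which is the comparison the abstract reports against reference-based metrics.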
Award ID(s):
2238940
PAR ID:
10629360
Author(s) / Creator(s):
Publisher / Repository:
Association for Computational Linguistics
Date Published:
Page Range / eLocation ID:
22002 to 22016
Format(s):
Medium: X
Location:
Miami, Florida, USA
Sponsoring Org:
National Science Foundation
More Like this
1. Text-to-image generative models have achieved unprecedented success in generating high-quality images based on natural language descriptions. However, it is shown that these models tend to favor specific social groups when prompted with neutral text descriptions (e.g., 'a photo of a lawyer'). Following Zhao et al. (2021), we study the effect on the diversity of the generated images when adding an ethical intervention that supports equitable judgment (e.g., 'if all individuals can be a lawyer irrespective of their gender') in the input prompts. To this end, we introduce an Ethical NaTural Language Interventions in Text-to-Image GENeration (ENTIGEN) benchmark dataset to evaluate the change in image generations conditional on ethical interventions across three social axes – gender, skin color, and culture. Through CLIP-based and human evaluation on minDALL.E, DALL.E-mini and Stable Diffusion, we find that the model generations cover diverse social groups while preserving the image quality. In some cases, the generations would be anti-stereotypical (e.g., models tend to create images with individuals that are perceived as men when fed with prompts about makeup) in the presence of an ethical intervention. Preliminary studies indicate that a large change in the model predictions is triggered by certain phrases such as 'irrespective of gender' in the context of gender bias in the ethical interventions. We release code and annotated data at https://github.com/Hritikbansal/entigen_emnlp.
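As a rough illustration of the setup this abstract describes, the sketch below appends an ethical intervention to a neutral prompt and uses CLIP to label the perceived attribute of each generated image, so that the balance of labels serves as a simple diversity check. The prompt template, the two attribute labels, and the CLIP checkpoint are assumptions made for illustration, not the ENTIGEN authors' exact protocol.

```python
# Illustrative sketch of an ENTIGEN-style setup: append an ethical intervention
# to a neutral prompt, then use CLIP to label the perceived attribute of each
# generated image and check how balanced the labels are. The prompt template,
# attribute labels, and model choice are assumptions for illustration.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

neutral_prompt = "a photo of a lawyer"
intervention = " if all individuals can be a lawyer irrespective of their gender"
prompt_with_intervention = neutral_prompt + intervention  # fed to the T2I model

attribute_labels = ["a photo of a man", "a photo of a woman"]


def classify_perceived_attribute(image: Image.Image) -> int:
    """Return the index of the CLIP-preferred attribute label for one image."""
    inputs = processor(text=attribute_labels, images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        probs = model(**inputs).logits_per_image.softmax(dim=-1)
    return int(probs.argmax())


def label_distribution(images: list[Image.Image]) -> list[float]:
    """Fraction of images per label; a balanced split suggests diverse generations."""
    counts = [0] * len(attribute_labels)
    for img in images:
        counts[classify_perceived_attribute(img)] += 1
    total = max(len(images), 1)
    return [c / total for c in counts]
```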
2. Adversarial examples are carefully constructed modifications to an input that completely change the output of a classifier but are imperceptible to humans. Despite these successful attacks for continuous data (such as image and audio samples), generating adversarial examples for discrete structures such as text has proven significantly more challenging. In this paper we formulate attacks with discrete input on a set function as an optimization task. We prove that this set function is submodular for some popular neural network text classifiers under a simplifying assumption. This finding guarantees a 1−1/e approximation factor for attacks that use the greedy algorithm. Meanwhile, we show how to use the gradient of the attacked classifier to guide the greedy search. Empirical studies with our proposed optimization scheme show significantly improved attack ability and efficiency on three different text classification tasks over various baselines. We also use a joint sentence and word paraphrasing technique to maintain the original semantics and syntax of the text. This is validated by a human-subject evaluation of the quality and semantic coherence of our generated adversarial text.
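The sketch below gives a minimal greedy word-substitution attack of the kind this abstract describes. The `predict_proba` and `candidate_substitutions` hooks are hypothetical stand-ins for the attacked classifier and a synonym source, and the gradient-guided ordering of the search that the paper uses to steer the greedy step is omitted here for brevity.

```python
# Illustrative sketch of a greedy word-substitution attack. The two callables
# are hypothetical hooks, not part of any specific library; the real method
# additionally uses the classifier's gradient to guide the greedy search.
from typing import Callable


def greedy_attack(words: list[str],
                  true_label: int,
                  predict_proba: Callable[[list[str]], list[float]],
                  candidate_substitutions: Callable[[str], list[str]],
                  max_changes: int = 5) -> list[str]:
    """Greedily replace one word at a time to minimize the true-label probability."""
    adversarial = list(words)
    for _ in range(max_changes):
        base = predict_proba(adversarial)[true_label]
        best_drop, best_edit = 0.0, None
        for i, word in enumerate(adversarial):
            for sub in candidate_substitutions(word):
                trial = adversarial[:i] + [sub] + adversarial[i + 1:]
                drop = base - predict_proba(trial)[true_label]
                if drop > best_drop:
                    best_drop, best_edit = drop, (i, sub)
        if best_edit is None:
            break  # no substitution lowers the true-label probability; stop early
        i, sub = best_edit
        adversarial[i] = sub
        # Stop as soon as the classifier's decision flips.
        probs = predict_proba(adversarial)
        if max(range(len(probs)), key=probs.__getitem__) != true_label:
            break
    return adversarial
```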
3. The popularization of Text-to-Image (T2I) diffusion models enables the generation of high-quality images from text descriptions. However, generating diverse customized images with reference visual attributes remains challenging. This work focuses on personalizing T2I diffusion models at a more abstract concept or category level, adapting commonalities from a set of reference images while creating new instances with sufficient variations. We introduce a solution that allows a pretrained T2I diffusion model to learn a set of soft prompts, enabling the generation of novel images by sampling prompts from the learned distribution. These prompts offer text-guided editing capabilities and additional flexibility in controlling variation and mixing between multiple distributions. We also show the adaptability of the learned prompt distribution to other tasks, such as text-to-3D. Finally, we demonstrate the effectiveness of our approach through quantitative analysis, including automatic evaluation and human assessment.
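One simple way to realize the "learned prompt distribution" idea described above is to treat the soft prompts as samples from a learnable Gaussian in the text encoder's embedding space, as in the PyTorch sketch below. The token count, embedding dimension, and reparameterized sampling are illustrative assumptions, not the paper's exact formulation.

```python
# Illustrative sketch: soft prompts modeled as a learnable diagonal Gaussian
# in the text-encoder embedding space. Dimensions, parameter names, and the
# reparameterized sampling are assumptions made for illustration.
import torch
import torch.nn as nn


class SoftPromptDistribution(nn.Module):
    """Learnable Gaussian over a sequence of soft prompt token embeddings."""

    def __init__(self, num_tokens: int = 4, embed_dim: int = 768):
        super().__init__()
        self.mean = nn.Parameter(torch.zeros(num_tokens, embed_dim))
        self.log_std = nn.Parameter(torch.zeros(num_tokens, embed_dim))

    def sample(self, batch_size: int) -> torch.Tensor:
        """Draw soft prompts via the reparameterization trick (keeps gradients)."""
        eps = torch.randn(batch_size, *self.mean.shape)
        return self.mean + self.log_std.exp() * eps


# Usage: prepend sampled soft prompts to the frozen text encoder's token
# embeddings before the diffusion model's conditioning pass.
prompts = SoftPromptDistribution()
soft_tokens = prompts.sample(batch_size=2)  # shape (2, 4, 768)
```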
4. The generalized contrast-to-noise ratio (gCNR) is a relatively new image quality metric designed to assess the probability of lesion detectability in ultrasound images. Although gCNR was initially demonstrated with ultrasound images, the metric is theoretically applicable to multiple types of medical images. In this paper, the applicability of gCNR to photoacoustic images is investigated. The gCNR was computed for both simulated and experimental photoacoustic images generated by amplitude-based (i.e., delay-and-sum) and coherence-based (i.e., short-lag spatial coherence) beamformers. These gCNR measurements were compared to three more traditional image quality metrics (i.e., contrast, contrast-to-noise ratio, and signal-to-noise ratio) applied to the same datasets. An increase in qualitative target visibility generally corresponded with increased gCNR. In addition, gCNR magnitude was more directly related to the separability of photoacoustic signals from their background, which degraded with the presence of limited bandwidth artifacts and increased levels of channel noise. At high gCNR values (i.e., 0.95-1), contrast, contrast-to-noise ratio, and signal-to-noise ratio varied by up to 23.7-56.2 dB, 2.0-3.4, and 26.5-7.6×10^20, respectively, for simulated, experimental phantom, and in vivo data. Therefore, these traditional metrics can experience large variations when a target is fully detectable, and additional increases in these values would have no impact on photoacoustic target detectability. In addition, gCNR is robust to changes in traditional metrics introduced by applying a minimum threshold to image amplitudes. In tandem with other photoacoustic image quality metrics and with a defined range of 0 to 1, gCNR has promising potential to provide additional insight, particularly when designing new beamformers and image formation techniques and when reporting quantitative performance without an opportunity to qualitatively assess corresponding images (e.g., in text-only abstracts).
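For reference, gCNR is commonly computed as one minus the overlap between the amplitude histograms of a target region and a background region, which is what bounds it to the 0-1 range mentioned above. The NumPy sketch below follows that standard definition; the bin count is an implementation choice.

```python
# Illustrative sketch of the gCNR computation: one minus the overlap of the
# target and background amplitude histograms, giving a value between 0 and 1.
import numpy as np


def gcnr(target: np.ndarray, background: np.ndarray, bins: int = 256) -> float:
    """Generalized contrast-to-noise ratio between two regions of interest."""
    lo = min(target.min(), background.min())
    hi = max(target.max(), background.max())
    p_t, _ = np.histogram(target, bins=bins, range=(lo, hi), density=True)
    p_b, edges = np.histogram(background, bins=bins, range=(lo, hi), density=True)
    bin_width = edges[1] - edges[0]
    overlap = np.sum(np.minimum(p_t, p_b)) * bin_width  # shared area of the two densities
    return float(1.0 - overlap)
```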