The Visual Dialog task requires a model to exploit both image and conversational context information to generate the next response to the dialogue. However, via manual analysis, we find that a large number of conversational questions can be answered by only looking at the image without any access to the context history, while others still need the conversation context to predict the correct answers. We demonstrate that due to this reason, previous joint-modality (history and image) models over-rely on and are more prone to memorizing the dialogue history (e.g., by extracting certain keywords or patterns in the context information), whereas image-only models are more generalizable (because they cannot memorize or extract keywords from history) and perform substantially better at the primary normalized discounted cumulative gain (NDCG) task metric which allows multiple correct answers. Hence, this observation encourages us to explicitly maintain two models, i.e., an image-only model and an image-history joint model, and combine their complementary abilities for a more balanced multimodal model. We present multiple methods for this integration of the two models, via ensemble and consensus dropout fusion with shared parameters. Empirically, our models achieve strong results on the Visual Dialog challenge 2019 (rank 3 on NDCG and high balance across metrics), and substantially outperform the winner of the Visual Dialog challenge 2018 on most metrics.
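The abstract above combines an image-only model and an image-history joint model via ensembling and consensus dropout fusion. The following is a rough, hedged sketch of that general idea only, not the paper's exact architecture: the model classes, dropout rate, and averaging scheme here are assumptions.

```python
import torch
import torch.nn.functional as F

def consensus_dropout_fusion(img_logits, joint_logits, p_drop=0.3, training=True):
    """Fuse answer logits from an image-only model and an image+history joint model.

    With probability p_drop (per instance, at training time) the joint model's
    contribution is zeroed out, discouraging over-reliance on dialogue history.
    Illustrative sketch only, not the paper's exact formulation.
    """
    if training:
        # Instance-level dropout mask: shape (batch, 1), broadcast over answer candidates.
        keep = (torch.rand(joint_logits.size(0), 1) > p_drop).float()
        joint_logits = joint_logits * keep
    return (img_logits + joint_logits) / 2.0

def simple_ensemble(img_logits, joint_logits):
    """Plain ensemble: average the two models' answer distributions."""
    return (F.softmax(img_logits, dim=-1) + F.softmax(joint_logits, dim=-1)) / 2.0

# Toy usage: 4 dialog instances, 100 candidate answers each.
img_logits = torch.randn(4, 100)
joint_logits = torch.randn(4, 100)
fused = consensus_dropout_fusion(img_logits, joint_logits)
ranked = fused.argsort(dim=-1, descending=True)  # candidate ranking, as used for NDCG
```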
Generating Together: Lessons Learned from Developing an Educational Visual Novel with AI Collaboration
Visual novels are a popular game genre for educational games. However, they often feature pre-authored plot structures that cannot dynamically adjust to the player’s progression through learning objectives. Employing procedural storytelling techniques boosts plot dynamism, but this comes at the cost of needing a larger repository of content (dialogue and images) to support different learning progressions and objectives. In this paper, we present postmortem-style case studies describing the lessons we learned from attempting to integrate large language models (LLMs) and text-to-image models into the development of an educational visual novel about responsible conduct of research. Specifically, we discuss our experiences employing generative AI in our dialogue, character sprite, and background image creation processes.
- Award ID(s):
- 2202521
- PAR ID:
- 10539226
- Publisher / Repository:
- IEEE
- Date Published:
- ISBN:
- 979-8-3503-5067-8
- Page Range / eLocation ID:
- 1 to 8
- Subject(s) / Keyword(s):
- Procedural Narratives; Interactive Narratives; Visual Novels; Educational Games; Game Production; Generative AI
- Format(s):
- Medium: X
- Location:
- Milan, Italy
- Sponsoring Org:
- National Science Foundation
More Like this
- Current image-based reinforcement learning (RL) algorithms typically operate on the whole image without performing object-level reasoning. This leads to inefficient goal sampling and ineffective reward functions. In this paper, we improve upon previous visual self-supervised RL by incorporating object-level reasoning and occlusion reasoning. Specifically, we use unknown object segmentation to ignore distractors in the scene for better reward computation and goal generation; we further enable occlusion reasoning by employing a novel auxiliary loss and training scheme. We demonstrate that our proposed algorithm, ROLL (Reinforcement learning with Object Level Learning), learns dramatically faster and achieves better final performance compared with previous methods in several simulated visual control tasks. (An illustrative sketch of the masked-reward idea follows this list.)
- Diffusion-based Text-to-Image (T2I) models have achieved impressive success in generating high-quality images from textual prompts. While large language models (LLMs) effectively leverage Direct Preference Optimization (DPO) for fine-tuning on human preference data without the need for reward models, diffusion models have not been extensively explored in this area. Current preference learning methods applied to T2I diffusion models directly adapt existing techniques from LLMs. However, this direct adaptation introduces an estimated loss specific to T2I diffusion models, and our empirical results show that this estimation can lead to suboptimal performance. In this work, we propose Direct Score Preference Optimization (DSPO), a novel algorithm that aligns the pretraining and fine-tuning objectives of diffusion models by leveraging score matching, the same objective used during pretraining. It introduces a new perspective on preference learning for diffusion models. Specifically, DSPO distills the score function of human-preferred image distributions into pretrained diffusion models, fine-tuning the model to generate outputs that align with human preferences. We theoretically show that DSPO shares the same optimization direction as reinforcement learning algorithms in diffusion models under certain conditions. Our experimental results demonstrate that DSPO outperforms preference learning baselines for T2I diffusion models in human preference evaluation tasks and enhances both visual appeal and prompt alignment of generated images.
- Simulationist interactive narrative systems allow game makers to craft reactive stories driven by simulated characters and their social dynamics. These systems produce narrative experiences that feel more emergent but may lack a coherent plot structure. We explored how to combine the emergent possibilities of social simulation with a procedural narrative system that affords writers strong authorial control over the plot. We did this by developing a Unity extension called Anansi that helps people create social simulation-driven visual novels. It enables users to inject simulation data into their story dialogue using logical queries and parameterized storylets written using Ink. The paper provides an overview of our extension and describes how we empower writers to drive narrative progression using cascading social effects from player choices. (An illustrative storylet sketch follows this list.)
- Visual explanation (attention)-guided learning uses not only labels but also explanations to guide the model's reasoning process. While visual attention-guided learning has shown promising results, it requires a large number of explanation annotations that are time-consuming to prepare. However, in many real-world situations, it is often desirable to prompt the model with visual attention without model retraining. For example, when doing AI-assisted cancer classification on a medical image, users (e.g., clinicians) can provide the AI model with visual attention prompts indicating which areas are indispensable and which are precluded. Despite its promising objectives, achieving visual attention-prompted prediction presents several major challenges: 1) How can the visual prompt be effectively integrated into the model's reasoning process? 2) How should the model handle samples that lack visual prompts? 3) What is the impact on the model's performance when a visual prompt is imperfect? This paper introduces a novel framework for visual attention-prompted prediction and learning, utilizing visual prompts to steer the model's reasoning process. To improve performance in non-prompted situations and align it with prompted scenarios, we propose a co-training approach for both non-prompted and prompted models, ensuring they share similar parameters and activations. Additionally, for instances where the visual prompt does not encompass the entire input image, we have developed innovative attention prompt refinement methods. These methods interpolate the incomplete prompts while maintaining alignment with the model's explanations. Extensive experiments on four datasets demonstrate the effectiveness of our proposed framework in enhancing predictions for samples both with and without prompts. (An illustrative sketch of prompt-modulated prediction follows this list.)
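The ROLL entry in the list above uses unknown-object segmentation to ignore distractors when computing rewards and sampling goals. Below is a minimal, hedged sketch of the masked-reward idea only; the segmentation source, encoder, and distance metric are assumptions, and ROLL's occlusion-reasoning auxiliary loss is not shown.

```python
import numpy as np

def masked_latent(image, object_mask, encoder):
    """Zero out distractor pixels, then encode only the object of interest."""
    masked = image * object_mask[..., None]   # (H, W, 3) * (H, W, 1)
    return encoder(masked)

def goal_reward(obs, obs_mask, goal, goal_mask, encoder):
    """Reward = negative latent distance between masked observation and masked goal.

    Illustrative only: the actual reward and goal-generation pipeline (and its
    occlusion-aware training) is more involved than this sketch.
    """
    z_obs = masked_latent(obs, obs_mask, encoder)
    z_goal = masked_latent(goal, goal_mask, encoder)
    return -float(np.linalg.norm(z_obs - z_goal))

# Toy usage with a stand-in "encoder" (channel-wise mean); a real system would use
# a learned encoder and masks produced by an unknown-object segmentation module.
toy_encoder = lambda img: img.reshape(-1, 3).mean(axis=0)
obs = np.random.rand(64, 64, 3)
goal = np.random.rand(64, 64, 3)
mask = np.zeros((64, 64)); mask[20:40, 20:40] = 1.0   # pretend object region
print(goal_reward(obs, mask, goal, mask, toy_encoder))
```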
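The Anansi entry above injects social-simulation state into dialogue via logical queries over parameterized storylets (authored in Ink, inside Unity). The Python sketch below illustrates only the storylet idea, with invented state fields and precondition syntax; the real system's query language, data model, and Ink integration differ.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Toy social-simulation state: invented fields for illustration.
sim_state = {"friendship(alice, bob)": 0.8, "rivalry(alice, cara)": 0.6}

@dataclass
class Storylet:
    """A parameterized piece of dialogue gated by a precondition over simulation state."""
    name: str
    precondition: Callable[[Dict[str, float]], bool]
    dialogue: str  # template with placeholders filled from simulation data

    def render(self, **params) -> str:
        return self.dialogue.format(**params)

storylets = [
    Storylet(
        name="friendly_greeting",
        precondition=lambda s: s.get("friendship(alice, bob)", 0) > 0.5,
        dialogue="{speaker}: Good to see you again, {listener}!",
    ),
    Storylet(
        name="tense_exchange",
        precondition=lambda s: s.get("rivalry(alice, cara)", 0) > 0.5,
        dialogue="{speaker}: Oh. It's you, {listener}.",
    ),
]

# Select storylets whose logical preconditions hold in the current simulation state.
eligible = [st for st in storylets if st.precondition(sim_state)]
for st in eligible:
    print(st.render(speaker="Alice", listener="Bob"))
```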
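The last entry above steers a classifier with user-provided visual attention prompts and co-trains prompted and non-prompted paths. The sketch below shows only the simplest piece, modulating backbone features with a (possibly absent) prompt mask; the co-training objective and prompt-refinement methods from the abstract are not shown, and all layer choices are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptedClassifier(nn.Module):
    """Toy classifier whose features can be modulated by a visual attention prompt."""

    def __init__(self, num_classes=2):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
        )
        self.head = nn.Linear(32, num_classes)

    def forward(self, x, prompt_mask=None):
        feats = self.backbone(x)                      # (B, 32, H, W)
        if prompt_mask is not None:
            # Resize the user-drawn prompt (B, 1, h, w) to the feature map and use it
            # to up-weight "indispensable" regions while suppressing the rest.
            m = F.interpolate(prompt_mask, size=feats.shape[-2:], mode="bilinear",
                              align_corners=False)
            feats = feats * m
        pooled = feats.mean(dim=(-2, -1))             # global average pool
        return self.head(pooled)

# Toy usage: the same network handles prompted and non-prompted samples,
# which is what makes co-training the two settings possible.
model = PromptedClassifier()
image = torch.randn(1, 3, 64, 64)
prompt = torch.zeros(1, 1, 64, 64); prompt[..., 16:48, 16:48] = 1.0
logits_prompted = model(image, prompt)
logits_plain = model(image)
```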