This content will become publicly available on January 6, 2026

Title: Preference Tuning with Human Feedback on Language, Speech, and Vision Tasks: A Survey
Preference tuning is a crucial process for aligning deep generative models with human preferences. This survey offers a thorough overview of recent advancements in preference tuning and the integration of human feedback. The paper is organized into three main sections: 1) introduction and preliminaries: reinforcement learning frameworks, preference tuning tasks, models, and datasets across the language, speech, and vision modalities, as well as different policy approaches; 2) in-depth exploration of each preference tuning approach: a detailed analysis of the methods used in preference tuning; and 3) applications, discussion, and future directions: an exploration of the applications of preference tuning in downstream tasks, including evaluation methods for different modalities, and an outlook on future research directions. Our objective is to present the latest methodologies in preference tuning and model alignment and to deepen researchers' and practitioners' understanding of this field. We hope to encourage further engagement and innovation in this area. Additionally, we maintain an accompanying GitHub repository at https://github.com/hanyang1999/Preference-Tuning-with-Human-Feedback.
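For orientation, two canonical objectives that surveys of preference tuning typically cover are the KL-regularized RLHF objective and the DPO loss. The standard forms from the literature (not excerpted from this paper) are shown below, where r_phi is a learned reward model, pi_ref a frozen reference policy, beta a regularization strength, and (y_w, y_l) a preferred/dispreferred response pair:

\[
\max_{\pi_\theta}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\big[r_\phi(x, y)\big] \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\!\big[\pi_\theta(y \mid x)\,\|\,\pi_{\mathrm{ref}}(y \mid x)\big]
\]

\[
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}}) \;=\; -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} \;-\; \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]
\]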
Award ID(s):
2206038
PAR ID:
10612825
Author(s) / Creator(s):
; ; ; ; ; ;
Publisher / Repository:
AI Access Foundation
Date Published:
Journal Name:
Journal of Artificial Intelligence Research
Volume:
82
ISSN:
1076-9757
Page Range / eLocation ID:
2595 to 2661
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Diffusion-based Text-to-Image (T2I) models have achieved impressive success in generating high-quality images from textual prompts. While large language models (LLMs) effectively leverage Direct Preference Optimization (DPO) for fine-tuning on human preference data without the need for reward models, diffusion models have not been extensively explored in this area. Current preference learning methods applied to T2I diffusion models directly adapt existing techniques from LLMs. However, this direct adaptation introduces a loss that must be estimated for T2I diffusion models, and our empirical results show that this estimation can lead to suboptimal performance. In this work, we propose Direct Score Preference Optimization (DSPO), a novel algorithm that aligns the pretraining and fine-tuning objectives of diffusion models by leveraging score matching, the same objective used during pretraining, and thereby introduces a new perspective on preference learning for diffusion models. Specifically, DSPO distills the score function of human-preferred image distributions into pretrained diffusion models, fine-tuning the model to generate outputs that align with human preferences. We theoretically show that DSPO shares the same optimization direction as reinforcement learning algorithms for diffusion models under certain conditions. Our experimental results demonstrate that DSPO outperforms preference learning baselines for T2I diffusion models in human preference evaluation tasks and enhances both the visual appeal and prompt alignment of generated images.
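For reference, here is a minimal PyTorch sketch of the standard DPO pairwise loss that such diffusion-model adaptations build on; DSPO itself replaces this with a score-matching objective whose exact form is not given in the abstract, and the function and argument names below are illustrative, not the authors' code.

```python
import torch
import torch.nn.functional as F

def dpo_pairwise_loss(logp_w: torch.Tensor,
                      logp_l: torch.Tensor,
                      ref_logp_w: torch.Tensor,
                      ref_logp_l: torch.Tensor,
                      beta: float = 0.1) -> torch.Tensor:
    """Standard DPO pairwise loss on precomputed sequence log-likelihoods.

    logp_w / logp_l: policy log-likelihoods of the preferred / dispreferred
    sample; ref_logp_w / ref_logp_l: the same under a frozen reference model.
    """
    # Implicit reward margin between the preferred and dispreferred samples.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Maximize the probability that the preferred sample is ranked higher.
    return -F.logsigmoid(margin).mean()
```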
  2. Exploration and reward specification are fundamental and intertwined challenges for reinforcement learning. Solving sequential decision-making tasks that demand expansive exploration requires either carefully designed reward functions or novelty-seeking exploration bonuses. Human supervisors can provide effective in-the-loop guidance to direct the exploration process, but prior methods for leveraging this guidance require constant, synchronous, high-quality human feedback, which is expensive and impractical to obtain. In this work, we present a technique called Human Guided Exploration (HuGE), which uses low-quality feedback from non-expert users that may be sporadic, asynchronous, and noisy. HuGE guides exploration for reinforcement learning not only in simulation but also in the real world, all without meticulous reward specification. The key idea is to decouple human feedback from policy learning: human feedback steers exploration, while self-supervised learning on the exploration data yields unbiased policies. This procedure can leverage noisy, asynchronous human feedback to learn policies with no hand-crafted reward design or exploration bonuses. HuGE learns a variety of challenging multi-stage robotic navigation and manipulation tasks in simulation using crowdsourced feedback from non-expert users. Moreover, this paradigm scales to learning directly on real-world robots, using occasional, asynchronous feedback from human supervisors.
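A schematic sketch of the decoupling described above, reconstructed from the abstract only; all function names and signatures are hypothetical placeholders, not the authors' API.

```python
from typing import Callable, Iterable, List, Tuple

def huge_style_round(
    sample_goal: Callable[[], object],                              # goal chosen via the human-trained selector
    rollout: Callable[[object], List[object]],                      # roll the current policy toward that goal
    pending_comparisons: Iterable[Tuple[object, object, int]],      # asynchronous, possibly noisy pairwise labels
    update_goal_selector: Callable[[Iterable[Tuple[object, object, int]]], None],
    self_supervised_policy_update: Callable[[List[object]], None],  # e.g. hindsight-relabeled imitation
) -> None:
    """One round of the decoupled loop: human feedback steers exploration only."""
    goal = sample_goal()                            # 1) exploration guided by human preferences
    trajectory = rollout(goal)                      #    collect new experience toward that goal
    update_goal_selector(pending_comparisons)       # 2) feedback updates the goal selector, never the policy
    self_supervised_policy_update(trajectory)       # 3) the policy learns from the exploration data alone
```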
  3. Large Vision-Language Models (LVLMs) have made substantial progress by integrating pre-trained large language models (LLMs) and vision models through instruction tuning. Despite these advancements, LVLMs often exhibit the hallucination phenomenon, where generated text responses appear linguistically plausible but contradict the input image, indicating a misalignment between image and text pairs. This misalignment arises because the model tends to prioritize textual information over visual input, even when both the language model and the visual representations are of high quality. Existing methods leverage additional models or human annotations to curate preference data and enhance modality alignment through preference optimization. These approaches are resource-intensive and may not effectively reflect the target LVLM's preferences, making the curated preference data easily distinguishable. Our work addresses these challenges with the Calibrated Self-Rewarding (CSR) approach, which enables the model to self-improve by iteratively generating candidate responses, evaluating the reward for each response, and curating preference data for fine-tuning. In the reward modeling, we employ a step-wise strategy and incorporate visual constraints into the self-rewarding process to place greater emphasis on visual input. Empirical results demonstrate that CSR significantly enhances performance and reduces hallucinations across twelve benchmarks and tasks, improving over existing methods by 7.62%. These empirical results are further supported by a rigorous theoretical analysis that, under mild assumptions, verifies the effectiveness of introducing visual constraints into the self-rewarding paradigm. Additionally, CSR is compatible with different vision-language models and can incrementally improve performance through iterative fine-tuning.
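A schematic sketch of one calibrated self-rewarding round as described above; the names and signatures are hypothetical, and the step-wise reward with its visual constraint is abstracted into a single scoring callable.

```python
from typing import Callable, List, Tuple

def csr_style_iteration(
    generate_candidates: Callable[[object, str], List[str]],        # (image, prompt) -> candidate responses
    self_reward: Callable[[object, str, str], float],               # step-wise, visually constrained score
    preference_finetune: Callable[[List[Tuple[str, str]]], None],   # DPO-style update on (chosen, rejected) pairs
    batch: List[Tuple[object, str]],                                # (image, prompt) pairs
) -> None:
    """One self-rewarding round: generate, score, curate, fine-tune."""
    pairs: List[Tuple[str, str]] = []
    for image, prompt in batch:
        candidates = generate_candidates(image, prompt)
        ranked = sorted(candidates, key=lambda r: self_reward(image, prompt, r))
        pairs.append((ranked[-1], ranked[0]))       # highest- vs lowest-reward response
    preference_finetune(pairs)                      # curated preferences drive the next fine-tuning step
```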
  4. Conceptual design is the foundational stage of a design process, translating ill-defined design problems into low-fidelity design concepts and prototypes. While deep learning approaches are widely applied in later design stages for design automation, we see fewer attempts in conceptual design, for three reasons: 1) the data in this stage exhibit multiple modalities (natural language, sketches, and 3D shapes) that are challenging to represent in deep learning methods; 2) it requires knowledge from a larger source of inspiration rather than a focus on a single design task; and 3) it requires translating designers' intent and feedback, and hence needs more interaction with designers and/or users. With recent advances in deep learning of cross-modal tasks (DLCMT) and the availability of large cross-modal datasets, we see opportunities to apply these learning methods to the conceptual design of product shapes. In this paper, we review 30 recent journal articles and conference papers across the computer graphics, computer vision, and engineering design fields that involve DLCMT of three modalities: natural language, sketches, and 3D shapes. Based on the review, we identify the challenges and opportunities of utilizing DLCMT in 3D shape concept generation, from which we propose a list of research questions pointing to future research directions.
  5. This research paper delves into the evolving landscape of fine-tuning large language models (LLMs) to align with human users, extending beyond basic alignment to propose "personality alignment" for language models in organizational settings. Acknowledging that training methods shape otherwise undefined personality traits in AI models, the study draws parallels with the personality tests used to fit humans to organizational roles. Through an original case study, we demonstrate the necessity of personality fine-tuning for AIs and raise intriguing questions about applying human-designed tests to AIs, engineering specialized AI personality tests, and shaping AI personalities to suit organizational roles. The paper serves as a starting point for discussions and developments in the burgeoning field of AI personality alignment, offering a foundational anchor for future exploration in human-machine teaming and co-existence.