NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

DSPO: Direct Score Preference Optimization for Diffusion Model Alignment.

Zhu, Huaisheng; Xiao, Teng; Honavar, Vasant G (July 2025, International Conference on Learning Representations (ICLR 2025))

Diffusion-based Text-to-Image (T2I) models have achieved impressive success in generating high-quality images from textual prompts. While large language models (LLMs) effectively leverage Direct Preference Optimization (DPO) for fine-tuning on human preference data without the need for reward models, diffusion models have not been extensively explored in this area. Current preference learning methods applied to T2I diffusion models immediately adapt existing techniques from LLMs. However, this direct adaptation introduces an estimated loss specific to T2I diffusion models. This estimation can potentially lead to suboptimal performance through our empirical results. In this work, we propose Direct Score Preference Optimization (DSPO), a novel algorithm that aligns the pretraining and fine-tuning objectives of diffusion models by leveraging score matching, the same objective used during pretraining. It introduces a new perspective on preference learning for diffusion models. Specifically, DSPO distills the score function of human-preferred image distributions into pretrained diffusion models, fine-tuning the model to generate outputs that align with human preferences. We theoretically show that DSPO shares the same optimization direction as reinforcement learning algorithms in diffusion models under certain conditions. Our experimental results demonstrate that DSPO outperforms preference learning baselines for T2I diffusion models in human preference evaluation tasks and enhances both visual appeal and prompt alignment of generated images.
more » « less
Full Text Available
SimPER: A Minimalist Approach to Preference Alignment without Hyperparameters},

Xiao, Teng; Yuan, Yige; Chen, Zhengyu; Li, Mingxiao; Liang, Shangsong; Ren, Zhaochun; Honavar, Vasant G (July 2025, Proceedings of the International Conference on Learning Representations (ICLR 2025))

Existing preference optimization objectives for language model alignment require additional hyperparameters that must be extensively tuned to achieve optimal performance, increasing both the complexity and time required for fine-tuning large language models. In this paper, we propose a simple yet effective hyperparameter-free preference optimization algorithm for alignment. We observe that promising performance can be achieved simply by optimizing inverse perplexity, which is calculated as the inverse of the exponentiated average log-likelihood of the chosen and rejected responses in the preference dataset. The resulting simple learning objective, SimPER, is easy to implement and eliminates the need for expensive hyperparameter tuning and a reference model, making it both computationally and memory efficient. Extensive experiments on widely used real-world benchmarks, including MT-Bench, AlpacaEval 2, and 10 key benchmarks of the Open LLM Leaderboard with 5 base models, demonstrate that SimPER consistently and significantly outperforms existing approaches—even without any hyperparameters or a reference model. For example, despite its simplicity, SimPER outperforms state-of-the-art methods by up to 5.7 points on AlpacaEval 2 and achieves the highest average ranking across 10 benchmarks on the Open LLM Leaderboard. The source code for SimPER is publicly available at: https://github.com/tengxiao1/SimPER.
more » « less
Full Text Available
On a Connection Between Imitation Learning and RLHF

Xiao, Teng; Yuan, Yige; Li, Mingxiao; Chen, Zhengyu; Honavar, Vasant G (April 2025, International Conference on Representation Learning 2025 (ICLR 2025))

This work studies the alignment of large language models with preference data from an imitation learning perspective. We establish a close theoretical connection between reinforcement learning from human feedback (RLHF) and imitation learning (IL), revealing that RLHF implicitly performs imitation learning on the preference data distribution. Building on this connection, we propose DIL, a principled framework that directly optimizes the imitation learning objective. DIL provides a unified imitation learning perspective on alignment, encompassing existing alignment algorithms as special cases while naturally introducing new variants. By bridging IL and RLHF, DIL offers new insights into alignment with RLHF. Extensive experiments demonstrate that DIL outperforms existing methods on various challenging benchmarks. The code for DIL is available at https://github.com/tengxiao1/DIL.
more » « less
Full Text Available
Checking Consistency of CP-Theory Preferences in Polynomial Time

https://doi.org/10.1609/aaai.v39i14.33659

Rauer, Erik; Basu, Samik; Honavar, Vasant (April 2025, Proceedings of the AAAI Conference on Artificial Intelligence)

We investigate the problem of checking the consistency of qualitative preferences expressed in CP-theory. This problem is PSPACE-Complete even when the preferences are locally consistent or the preference variables have binary domain. We present a new sufficient condition for consistency of preferences and show that the condition can be checked in polynomial time in settings of practical relevance (locally consistent or binary domain preference variables). We further show how the resulting sufficient condition can be used to efficiently identify a subset of outcomes that are non-dominated with respect to a set of qualitative preferences.
more » « less
Full Text Available
Cal-DPO: Calibrated Direct Preference Optimization for Language Model Alignment

Xiao, Teng; Yuan, Yige; Zhu, Huaisheng; Li, Mingxiao; Honavar, Vasant G (December 2024, 38th Conference on Neural Information Processing Systems (NeurIPS 2024).)

We study the problem of aligning large language models (LLMs) with human preference data. Contrastive preference optimization has shown promising results in aligning LLMs with available preference data by optimizing the implicit reward associated with the policy. However, the contrastive objective focuses mainly on the relative values of implicit rewards associated with two responses while ignoring their actual values, resulting in suboptimal alignment with human preferences. To address this limitation, we propose calibrated direct preference optimization (Cal-DPO), a simple yet effective algorithm. We show that substantial improvement in alignment with the given preferences can be achieved simply by calibrating the implicit reward to ensure that the learned implicit rewards are comparable in scale to the ground-truth rewards. We demonstrate the theoretical advantages of Cal-DPO over existing approaches. The results of our experiments on a variety of standard benchmarks show that Cal-DPO remarkably improves off-the-shelf methods.
more » « less
Full Text Available
Representing and Reasoning with Multi-Stakeholder Qualitative Preference Queries

https://doi.org/10.3233/FAIA230272

Basu, Samik; Honavar, Vasant; Santhanam, Ganesh R; Tao, Jia (September 2023, Frontiers in Artificial Intelligence and Applications, Proceedings of ECAI)

Many decision-making scenarios, e.g., public policy, healthcare, business, and disaster response, require accommodating the preferences of multiple stakeholders. We offer the first formal treatment of reasoning with multi-stakeholder qualitative preferences in a setting where stakeholders express their preferences in a qualitative preference language, e.g., CP-net, CI-net, TCP-net, CP-Theory. We introduce a query language for expressing queries against such preferences over sets of outcomes that satisfy specified criteria, e.g., ψ1PAψ2 (read loosely as the set of outcomes satisfying ψ1 that are preferred over outcomes satisfying ψ2 by a set of stakeholders A). Motivated by practical application scenarios, we introduce and analyze several alternative semantics for such queries, and examine their interrelationships. We provide a provably correct algorithm for answering multi-stakeholder qualitative preference queries using model checking in alternation-free μ-calculus. We present experimental results that demonstrate the feasibility of our approach.
more » « less
Full Text Available

Search for: All records