This content will become publicly available on July 24, 2026

Title: PIPA: Preference Alignment as Prior-Informed Statistical Estimation
Offline preference alignment for language models, such as Direct Preference Optimization (DPO), is favored for its effectiveness and simplicity, eliminating the need for costly reinforcement learning. Various offline algorithms have been developed for different data settings, yet they lack a unified understanding. In this study, we introduce Prior-Informed Preference Alignment (PIPA), a unified, RL-free probabilistic framework that formulates language model preference alignment as a Maximum Likelihood Estimation (MLE) problem with prior constraints. This method effectively accommodates both paired and unpaired data, as well as answer- and step-level annotations. We show that DPO and KTO are special cases with different prior constraints within our framework. By integrating different types of prior information, we develop two variants of PIPA: PIPA-M and PIPA-N. Both algorithms demonstrate a 3∼10% performance improvement on the GSM8K and MATH benchmarks across all configurations, achieving these gains without additional training or computational costs compared to existing algorithms.
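The full text is embargoed until the release date above, so PIPA's exact objective is not reproduced here. As a reference point, the sketch below implements the standard DPO loss that the abstract identifies as a special case of the framework; all tensor names and the `beta` hyperparameter are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss (Rafailov et al., 2023).

    Each argument is a tensor of summed token log-probabilities for a
    batch of responses; `beta` controls deviation from the reference
    model. The abstract describes PIPA as recovering this objective
    under a particular prior constraint.
    """
    # Implicit rewards: beta * log(pi / pi_ref) for each response.
    chosen_reward = beta * (policy_logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (policy_logp_rejected - ref_logp_rejected)
    # Maximize the margin between chosen and rejected implicit rewards.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```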
Award ID(s):
2505865
PAR ID:
10631494
Publisher / Repository:
https://doi.org/10.48550/arXiv.2502.05773
arXiv ID:
2502.05773
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like This
  1. Preferences within a group of people are not uniform but follow a distribution. While existing alignment methods like Direct Preference Optimization (DPO) attempt to steer models to reflect human preferences, they struggle to capture the distributional pluralistic preferences within a group. These methods often skew toward dominant preferences, overlooking the diversity of opinions, especially when conflicting preferences arise. To address this issue, we propose Group Distributional Preference Optimization (GDPO), a novel framework that aligns language models with the distribution of preferences within a group by incorporating the concept of beliefs that shape individual preferences. GDPO calibrates a language model using statistical estimation of the group's belief distribution and aligns the model with belief-conditioned preferences, offering a more inclusive alignment framework than traditional methods. In experiments using both synthetic controllable opinion generation and real-world movie review datasets, we show that DPO fails to align with the targeted belief distributions, while GDPO consistently reduces this alignment gap during training. Additionally, our evaluation metrics demonstrate that GDPO outperforms existing approaches in aligning with group distributional preferences, marking a significant advance in pluralistic alignment. 
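The GDPO abstract sketches the mechanism (estimate the group's belief distribution statistically, then align belief-conditioned preferences) without formulas. Below is a hedged reading under simple assumptions: beliefs form a small discrete set, the group distribution is estimated from annotation counts, and a calibration cross-entropy term is combined with a belief-conditioned DPO-style term. Every name, shape, and weighting here is illustrative, not the paper's code.

```python
import torch
import torch.nn.functional as F

def gdpo_loss(belief_logits, group_belief_counts,
              margin_per_belief, lam=1.0, beta=0.1):
    """Hedged sketch of a group-distributional preference loss.

    belief_logits: (batch, K) model scores over K discrete beliefs.
    group_belief_counts: (K,) counts of each belief among annotators,
        a statistical estimate of the group's belief distribution.
    margin_per_belief: (batch, K) implicit-reward margins between the
        chosen and rejected responses, conditioned on each belief.
    """
    # Calibrate the model's belief distribution to the group's.
    group_dist = group_belief_counts / group_belief_counts.sum()
    calibration = F.cross_entropy(
        belief_logits, group_dist.expand(belief_logits.shape[0], -1))
    # Belief-conditioned preference term, averaged over beliefs.
    preference = -F.logsigmoid(beta * margin_per_belief).mean()
    return calibration + lam * preference
```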
  2. We study the problem of aligning large language models (LLMs) with human preference data. Contrastive preference optimization has shown promising results in aligning LLMs with available preference data by optimizing the implicit reward associated with the policy. However, the contrastive objective focuses mainly on the relative values of implicit rewards associated with two responses while ignoring their actual values, resulting in suboptimal alignment with human preferences. To address this limitation, we propose calibrated direct preference optimization (Cal-DPO), a simple yet effective algorithm. We show that substantial improvement in alignment with the given preferences can be achieved simply by calibrating the implicit reward to ensure that the learned implicit rewards are comparable in scale to the ground-truth rewards. We demonstrate the theoretical advantages of Cal-DPO over existing approaches. The results of our experiments on a variety of standard benchmarks show that Cal-DPO remarkably improves off-the-shelf methods. 
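The Cal-DPO abstract does not spell out the calibration targets, so the following is a hedged sketch of one plausible reading: keep the DPO contrastive term and add a squared-error term that pulls each implicit reward toward a ground-truth-scale target. The values of `target_chosen` and `target_rejected` are hypothetical placeholders, not the paper's.

```python
import torch
import torch.nn.functional as F

def cal_dpo_loss(policy_logp_c, policy_logp_r, ref_logp_c, ref_logp_r,
                 beta=0.1, target_chosen=1.0, target_rejected=-1.0):
    """Hedged sketch of a calibrated DPO-style loss.

    Beyond the usual contrastive margin, the implicit rewards
    beta * log(pi / pi_ref) are regressed toward fixed targets so that
    their absolute scale, not just their difference, is constrained.
    The targets here are illustrative, not the paper's values.
    """
    r_c = beta * (policy_logp_c - ref_logp_c)
    r_r = beta * (policy_logp_r - ref_logp_r)
    contrastive = -F.logsigmoid(r_c - r_r).mean()
    calibration = ((r_c - target_chosen) ** 2
                   + (r_r - target_rejected) ** 2).mean()
    return contrastive + calibration
```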
  3. This work studies the alignment of large language models with preference data from an imitation learning perspective. We establish a close theoretical connection between reinforcement learning from human feedback (RLHF) and imitation learning (IL), revealing that RLHF implicitly performs imitation learning on the preference data distribution. Building on this connection, we propose DIL, a principled framework that directly optimizes the imitation learning objective. DIL provides a unified imitation learning perspective on alignment, encompassing existing alignment algorithms as special cases while naturally introducing new variants. By bridging IL and RLHF, DIL offers new insights into alignment with RLHF. Extensive experiments demonstrate that DIL outperforms existing methods on various challenging benchmarks. The code for DIL is available at https://github.com/tengxiao1/DIL. 
  4. Aligning text-to-image (T2I) diffusion models with preferences has been gaining increasing research attention. While prior work directly optimizes T2I models with preference data, these methods are developed under a bandit assumption of a latent reward on the entire diffusion reverse chain, ignoring the sequential nature of the generation process, which the literature suggests can harm both the efficacy and the efficiency of alignment. In this paper, we take a finer, dense-reward perspective and derive a tractable alignment objective that emphasizes the initial steps of the T2I reverse chain. In particular, we introduce temporal discounting into the DPO-style, explicit-reward-free loss to break its temporal symmetry and suit the T2I generation hierarchy. In experiments on single- and multiple-prompt generation, our method is competitive with strong relevant baselines, both quantitatively and qualitatively. Further studies illustrate the insight behind our approach. 
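This abstract describes weighting a DPO-style loss to emphasize early reverse-diffusion steps via temporal discounting, but gives no formulas. Below is a hedged sketch of that idea under simple assumptions: per-timestep implicit-reward margins are already available, and a discount `gamma < 1` applied from the start of the reverse chain down-weights later steps. All names are illustrative.

```python
import torch
import torch.nn.functional as F

def discounted_t2i_dpo_loss(step_margins, gamma=0.95):
    """Hedged sketch: temporally discounted DPO-style diffusion loss.

    step_margins: (batch, T) per-timestep implicit-reward margins
        between the preferred and dispreferred generations, ordered
        from the first reverse step (pure noise) to the last.
    Discounting by gamma**t emphasizes the initial steps of the chain,
    breaking the temporal symmetry of a uniform sum over steps.
    """
    T = step_margins.shape[1]
    # gamma^0 on the first reverse step, gamma^(T-1) on the last.
    discounts = gamma ** torch.arange(T, dtype=step_margins.dtype)
    weighted = (discounts * step_margins).sum(dim=1)
    return -F.logsigmoid(weighted).mean()
```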
  5. Preference learning algorithms (e.g., RLHF and DPO) are frequently used to steer LLMs to produce generations that are more preferred by humans, but our understanding of their inner workings is still limited. In this work, we study the conventional wisdom that preference learning trains models to assign higher likelihoods to more preferred outputs than less preferred outputs, measured via ranking accuracy. Surprisingly, we find that most state-of-the-art preference-tuned models achieve a ranking accuracy of less than 60% on common preference datasets. We furthermore derive the idealized ranking accuracy that a preference-tuned LLM would achieve if it optimized the DPO or RLHF objective perfectly. We demonstrate that existing models exhibit a significant alignment gap -- i.e., a gap between the observed and idealized ranking accuracies. We attribute this discrepancy to the DPO objective, which is empirically and theoretically ill-suited to fix even mild ranking errors in the reference model, and derive a simple and efficient formula for quantifying the difficulty of learning a given preference datapoint. Finally, we demonstrate that ranking accuracy strongly correlates with the empirically popular win rate metric when the model is close to the reference model used in the objective, shedding further light on the differences between on-policy (e.g., RLHF) and off-policy (e.g., DPO) preference learning algorithms. 
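Ranking accuracy, as defined in this abstract, has a direct implementation: the fraction of preference pairs on which the model assigns a higher summed log-likelihood to the chosen response than to the rejected one. A minimal sketch with illustrative tensor names:

```python
import torch

def ranking_accuracy(logp_chosen, logp_rejected):
    """Fraction of pairs where the chosen response is ranked higher.

    logp_chosen, logp_rejected: tensors of summed token
    log-probabilities under the tuned model, one entry per preference
    pair. The abstract reports that most state-of-the-art
    preference-tuned models score below 60% on this metric.
    """
    return (logp_chosen > logp_rejected).float().mean().item()
```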