<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>On a Connection Between Imitation Learning and RLHF</title></titleStmt>
			<publicationStmt>
<publisher>International Conference on Learning Representations (ICLR 2025)</publisher>
				<date>04/28/2025</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10638657</idno>
					<idno type="doi"></idno>
					
					<author>Teng Xiao</author><author>Yige Yuan</author><author>Mingxiao Li</author><author>Zhengyu Chen</author><author>Vasant G Honavar</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[This work studies the alignment of large language models with preference data from an imitation learning perspective. We establish a close theoretical connection between reinforcement learning from human feedback (RLHF) and imitation learning (IL), revealing that RLHF implicitly performs imitation learning on the preference data distribution. Building on this connection, we propose DIL, a principled framework that directly optimizes the imitation learning objective. DIL provides a unified imitation learning perspective on alignment, encompassing existing alignment algorithms as special cases while naturally introducing new variants. By bridging IL and RLHF, DIL offers new insights into alignment with RLHF. Extensive experiments demonstrate that DIL outperforms existing methods on various challenging benchmarks. The code for DIL is available at https://github.com/tengxiao1/DIL.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">INTRODUCTION</head><p>Aligning large language models (LLMs) with human preferences is essential to ensure that the responses generated by LLMs align with human expectations <ref type="bibr">(Bai et al., 2022;</ref><ref type="bibr">Ouyang et al., 2022;</ref><ref type="bibr">Stiennon et al., 2020)</ref>. Recently, Reinforcement Learning from Human Feedback (RLHF) <ref type="bibr">(Ouyang et al., 2022;</ref><ref type="bibr">Christiano et al., 2017)</ref> has become a widely adopted framework for fine-tuning language models according to human preference data. This approach typically involves training a reward model based on human feedback and subsequently employing reinforcement learning (RL) algorithms, such as PPO <ref type="bibr">(Schulman et al., 2017)</ref>, to optimize the model to maximize the reward signal.</p><p>RLHF has demonstrated impressive efficacy across a diverse range of tasks, from programming to creative writing. However, its dependence on two-step reinforcement learning presents challenges, such as computational inefficiency and instability during training <ref type="bibr">(Engstrom et al., 2020;</ref><ref type="bibr">Rafailov et al., 2024)</ref>. To mitigate these limitations, alternative one-step approaches, such as direct preference optimization (DPO) and its variants, have been proposed <ref type="bibr">(Rafailov et al., 2024;</ref><ref type="bibr">Meng et al., 2024;</ref><ref type="bibr">Tajwar et al., 2024)</ref>. These methods replace RLHF with supervised learning, eliminating the need for explicit reward modeling. 
Instead, they directly define an implicit reward based on the likelihood of preference data, resulting in significant gains in efficiency while preserving competitive performance.</p><p>While DPO theoretically aims to recover the same optimal policy as RLHF, it and its variants still fundamentally adhere to the reward maximization objective and depend on parametric preference models such as the Bradley-Terry (BT) model <ref type="bibr">(Bradley &amp; Terry, 1952)</ref>, making them prone to overfitting <ref type="bibr">(Yuan et al., 2024;</ref><ref type="bibr">Xiao et al., 2024b)</ref> and resulting in suboptimal alignment with preference data <ref type="bibr">(Xu et al., 2024c;</ref><ref type="bibr">Wu et al., 2024)</ref>. This raises a fundamental and open research question: Can we understand and design effective preference optimization algorithms from a new perspective?</p><p>In this paper, we revisit RLHF from the perspective of imitation learning. In particular, we show that RLHF is a special case of a general imitation learning problem expressed exclusively in terms of pairwise preferences. We theoretically demonstrate that alignment with RLHF closely resembles imitation learning and implicitly optimizes the same objective. We then leverage this insight to design DIL, a general framework for effective alignment based on density ratio reward estimation.</p><p>Our primary technical contributions are as follows: (i) We prove that RLHF for alignment is essentially an imitation learning problem and provide a novel analysis that offers explicit guidance for alignment algorithm design. (ii) We propose DIL, a simple and generalized imitation learning framework for alignment. DIL unifies imitation learning on preference data and bridges the gap between density ratio estimation and preference alignment. (iii) Empirically, we validate the effectiveness of DIL on widely used benchmarks, demonstrating that it outperforms previous alignment methods.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">RELATED WORK</head><p>Reinforcement Learning from Human Feedback. RLHF has emerged as an effective approach for aligning LLMs with human preferences <ref type="bibr">(Christiano et al., 2017)</ref>. It involves first training a reward model from human feedback through supervised learning, which is then used to optimize a policy via RL algorithms, such as PPO <ref type="bibr">(Schulman et al., 2017)</ref>. RLHF has been successfully applied to a wide range of tasks, including summarization <ref type="bibr">(Stiennon et al., 2020)</ref>, instruction following <ref type="bibr">(Ouyang et al., 2022)</ref>, safety improvement <ref type="bibr">(Bai et al., 2022)</ref>, and truthfulness enhancement <ref type="bibr">(Tian et al., 2023)</ref>.</p><p>Offline Preference Optimization. Recent literature highlights the inherent complexity of RLHF, prompting the search for more efficient offline alternatives. A significant advancement in this area is DPO <ref type="bibr">(Rafailov et al., 2024)</ref>. Unlike RLHF, which first learns an explicit reward model and then fits the policy to rewards, DPO bypasses this second approximation by directly learning a policy from collected data, without the need for reward modeling. DPO implicitly optimizes the same objective as existing RLHF algorithms, but it is simpler to implement and more straightforward to train. Other offline alignment methods, such as <ref type="bibr">IPO (Azar et al., 2024)</ref>, KTO <ref type="bibr">(Ethayarajh et al., 2024)</ref>, and others <ref type="bibr">(Zhao et al., 2023;</ref><ref type="bibr">Xiao &amp; Wang, 2021;</ref><ref type="bibr">Xu et al., 2024a;</ref><ref type="bibr">Meng et al., 2024;</ref><ref type="bibr">Xiao et al., 2025)</ref>, have also been proposed. In contrast, we rethink and design the alignment objective from a novel offline imitation learning perspective. 
We show that RLHF can theoretically be viewed as imitation learning, which fits the chosen response distribution by minimizing the reverse KL divergence.</p><p>Imitation Learning. Classical imitation learning (IL) methods often frame IL as inverse reinforcement learning (IRL) to better utilize expert demonstrations <ref type="bibr">(Sammut et al., 1992;</ref><ref type="bibr">Abbeel &amp; Ng, 2004</ref>). In the seminal work <ref type="bibr">(Ho &amp; Ermon, 2016)</ref>, the authors introduce GAIL, which bypasses inner-loop reinforcement learning (RL) by establishing a connection between IL and generative adversarial networks (GANs) <ref type="bibr">(Goodfellow et al., 2020)</ref>. GAIL and its successor, AIRL <ref type="bibr">(Fu et al., 2018)</ref>, have made significant strides. However, these online methods typically require substantial environmental interactions, limiting their deployment in cost-sensitive or safety-sensitive domains. To address this issue, recent work on offline IL <ref type="bibr">(Garg et al., 2021)</ref> focuses on learning a reward function from offline datasets to understand and generalize the intentions underlying expert behavior. IQ-Learn <ref type="bibr">(Garg et al., 2021</ref>) simplifies AIRL's game-theoretic objective over policy and reward functions into an optimization over the soft Q-function, which implicitly represents both reward and policy. Recently, some works <ref type="bibr">(Sun &amp; van der Schaar, 2024;</ref><ref type="bibr">Wulfmeier et al., 2024;</ref><ref type="bibr">Xiao et al., 2024a)</ref> have applied imitation learning to the alignment of large language models. In contrast to the above works, in this paper we aim to build a close theoretical connection between RLHF and imitation learning, revealing that RLHF implicitly performs imitation learning on the chosen data distribution.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">NOTATIONS AND PRELIMINARIES</head><p>Problem Setup. Let the text sequence x = [x_1, x_2, . . .] denote the input prompt, and y_w = [y_1, y_2, . . .] and y_l denote two responses, typically sampled from the same reference policy π_ref(y | x). The response pairs are then presented to human labelers (or an oracle) who express preferences for responses given the prompt, denoted as y_w ≻ y_l | x, where y_w and y_l denote preferred and dispreferred responses, respectively. The preference distribution is typically expressed as:</p><p>p(y_w ≻ y_l | x) = σ(r(x, y_w) − r(x, y_l)), (1)</p><p>where σ is the sigmoid function σ(z) = 1/(1 + e^−z), based on the Bradley-Terry (BT) preference assumption <ref type="bibr">(Bradley &amp; Terry, 1952)</ref>. Given a preference dataset D containing feedback (x, y_w, y_l), the goal of alignment is to learn an LLM policy π(y | x) based on the preference data.</p><p>Reinforcement Learning from Human Feedback. Given the estimated reward function r(x, y), dictating the human preferences, RLHF fine-tunes the policy π_θ by optimizing the following objective:</p><p>max_{π_θ} E_{x∼D, y∼π_θ(y|x)}[r(x, y)] − β KL[π_θ(y | x) ‖ π_ref(y | x)], (2)</p><p>where β &gt; 0 is an appropriate KL penalty coefficient. RLHF typically optimizes the above objective in Equation (2) using RL algorithms, such as PPO <ref type="bibr">(Ouyang et al., 2022;</ref><ref type="bibr">Schulman et al., 2017)</ref>.</p><p>Reward Modeling. One standard approach to reward modeling is to fit a reward function r_φ(x, y) with the BT preference model in Equation (1). Specifically, the reward function r_φ(x, y) can be estimated by maximizing the log-likelihood over preference feedback (x, y_w, y_l):</p><p>max_{r_φ} E_{(x, y_w, y_l)∼D}[log σ(r_φ(x, y_w) − r_φ(x, y_l))]. (3)</p><p>Supervised Fine-tuning (SFT). 
Given a demonstration dataset, the objective of SFT is minimizing the negative log-likelihood over the demonstration data as follows:</p><p>L_SFT(θ) = −E_{(x, y)∼D}[log π_θ(y | x)]. (4)</p><p>SFT is equivalent to behavior cloning (BC) <ref type="bibr">(Pomerleau, 1988)</ref>, a classical offline imitation learning method that minimizes the forward KL divergence between the learned policy and the data policy:</p><p>min_θ E_{x∼D} KL[π_data(y | x) ‖ π_θ(y | x)]. (5)</p><p>It is easy to see that the BC problem above shares the same optimal solutions as SFT in expectation.</p><p>Direct Preference Optimization. To simplify the optimization process of RLHF, DPO uses the log-likelihood of the learning policy to implicitly represent the reward function:</p><p>r_θ(x, y) = β log (π_θ(y | x)/π_ref(y | x)) + β log Z(x), (6)</p><p>where Z(x) = Σ_y π_ref(y | x) exp(r_θ(x, y)/β) is the partition function. By incorporating this reward into the BT model in Equation (1), the DPO <ref type="bibr">(Rafailov et al., 2024)</ref> objective enables the comparison of response pairs, facilitating the discrimination between preferred and dispreferred responses:</p><p>L_DPO(θ) = −E_{(x, y_w, y_l)∼D}[log σ(β log (π_θ(y_w | x)/π_ref(y_w | x)) − β log (π_θ(y_l | x)/π_ref(y_l | x)))]. (7)</p><p>Energy-based Models. Energy-based models (EBMs) <ref type="bibr">(LeCun et al., 2006)</ref> define the distribution through an energy function. For y ∈ ℝ^D, its probability density can be expressed as follows:</p><p>p_θ(y) = exp(−E_θ(y))/Z_θ, (8)</p><p>where E_θ(y) : ℝ^D → ℝ is the energy function, mapping the data point y to a scalar, and Z_θ = ∫ exp(−E_θ(y)) dy is the unknown normalization constant <ref type="bibr">(Song &amp; Kingma, 2021)</ref>.</p></div>
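The BT reward-modeling loss in Equation (3) and the DPO objective in Equation (7) can be made concrete with a short numeric sketch. This is illustrative code of our own, not from the paper; the function names and toy log-probabilities are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bt_reward_loss(r_w, r_l):
    # Bradley-Terry reward-modeling loss: -log sigma(r(x, y_w) - r(x, y_l))
    return -np.log(sigmoid(r_w - r_l))

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # DPO: the implicit reward is beta * log(pi_theta / pi_ref)
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -np.log(sigmoid(margin))

# Both losses shrink as the (implicit) reward margin of the preferred response grows
assert bt_reward_loss(2.0, 0.0) < bt_reward_loss(0.0, 2.0)
assert dpo_loss(-10.0, -12.0, -11.0, -11.0) < dpo_loss(-12.0, -10.0, -11.0, -11.0)
```

In practice, the log-probabilities would be sequence-level sums of token log-probabilities under the trained policy and a frozen reference model.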
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">METHODOLOGY</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">RLHF IS A FORM OF IMITATION LEARNING</head><p>In this section, we connect RLHF to the imitation learning framework. We show that RLHF is a special case of an imitation learning problem on the distribution of chosen responses with the reverse KL divergence. Specifically, we first define the following policy based on EBMs <ref type="bibr">(Haarnoja et al., 2017)</ref>:</p><p>π_φ(y | x) = π_ref(y | x) exp(r_φ(x, y))/Z_φ(x), (9)</p><p>where φ denotes the parameter and Z_φ(x) = Σ_y π_ref(y | x) exp(r_φ(x, y)) is the partition function.</p><p>To learn the parameter φ, one can apply behavior cloning <ref type="bibr">(Pomerleau, 1988)</ref>, a classical and widely used imitation learning method, which frames the task as minimizing the KL divergence between the policy π_φ and the expert policy π_chosen generating the chosen response y_w. In other words, IL learns the parameter φ such that the model distribution imitates the distribution of chosen responses in the preference dataset:</p><p>min_φ E_{x∼D} KL[π_chosen(y | x) ‖ π_φ(y | x)]. (10)</p><p>Minimizing the above forward KL divergence with the chosen responses on preference data gives us:</p><p>max_φ E_{(x, y_w)∼D}[r_φ(x, y_w) − log Z_φ(x)]. (11)</p><p>There are several options for sampling from the reference distribution π_ref(y | x). A choice that simplifies the above expression and yields RLHF in practice is to approximate the expectation over π_ref with the rejected response y_l in the preference dataset. In this case, the sample-based approximation of the second term gives us:</p><p>max_φ E_{(x, y_w, y_l)∼D}[log σ(r_φ(x, y_w) − r_φ(x, y_l))]. (12)</p><p>One can note that the above imitation learning loss over the energy-based policy is exactly the same as the reward loss based on the BT assumption in Equation (3) in RLHF. By optimizing this loss function, we can directly obtain the optimal energy-based policy in Equation (<ref type="formula">9</ref>). Unfortunately, even if we use the estimate r_φ, it is still expensive to estimate the partition function Z_φ(x), making this representation difficult to use in practice and incurring significantly higher inference cost <ref type="bibr">(Rafailov et al., 2024)</ref>. 
To address this problem, we can utilize reverse knowledge distillation, which distills the optimal policy in Equation (<ref type="formula">9</ref>) into an analytical policy by using the following reverse KL divergence, which allows the final policy π_θ to require only a single sample at inference time:</p><p>min_θ E_{x∼D} KL[π_θ(y | x) ‖ π_φ(y | x)], (13)</p><p>where β is the temperature hyperparameter in the distillation process. This gives the following objective function after removing multiplicative and additive constants:</p><p>max_{π_θ} E_{x∼D, y∼π_θ(y|x)}[r_φ(x, y)] − β KL[π_θ(y | x) ‖ π_ref(y | x)]. (14)</p><p>One can observe that this distillation objective exactly corresponds to the RL objective in Equation (<ref type="formula">2</ref>).</p><p>In summary, we provide two key insights: (i) Reward learning in RLHF is equivalent to an imitation learning problem against the chosen responses, achieved by minimizing the forward KL divergence between π_chosen and π_φ based on the EBMs, as shown in Equation (<ref type="formula">12</ref>). (ii) The RL step in RLHF can be interpreted as a reverse knowledge distillation process, where the imitated policy π_φ, based on EBMs, is distilled into a final analytical policy π_θ by minimizing the reverse KL divergence in Equation (<ref type="formula">13</ref>), with the temperature determining the level of KL regularization. Formally, we have: Proposition 4.1. Suppose the chosen response distribution is p(y | x), the EBM is π_φ(y | x), and the model is π_θ(y | x). KL-regularized RLHF with β = 1 can be viewed as the following bi-level problem:</p><p>min_θ E_{x∼D} KL[π_θ(y | x) ‖ π_φ*(y | x)], (15)</p><p>where φ* = argmax_φ E_{(x, y_w, y_l)∼D}[log σ(r_φ(x, y_w) − r_φ(x, y_l))]. Thus, conducting imitation learning on the chosen response corresponds to solving a standard KL-regularized RLHF problem. 
Additionally, we observe that the upper-level objective essentially optimizes a reverse KL (RKL) divergence, KL(π_θ ‖ π_chosen), given that π_φ* = π_chosen, which is the optimum achieved by the lower-level objective.</p><p>An interesting question is why SFT, which directly optimizes the forward KL (FKL) KL(π_chosen ‖ π_θ) in Equation (<ref type="formula">5</ref>), performs worse than RLHF for alignment. Theoretically, minimizing either objective should lead to the same optimal solution π_θ. However, achieving this in practice requires full data coverage and infinite computation, conditions that are rarely met. Consequently, in practical settings, minimizing either KL divergence results in learned policies with distinct properties, as discussed in <ref type="bibr">(Murphy, 2012;</ref><ref type="bibr">Tajwar et al., 2024)</ref>. Specifically, the FKL KL(π_chosen ‖ π_θ) promotes mass-covering behavior, whereas the RKL KL(π_θ ‖ π_chosen) encourages mode-seeking behavior <ref type="bibr">(Tajwar et al., 2024;</ref><ref type="bibr">Nachum et al., 2016;</ref><ref type="bibr">Agarwal et al., 2019)</ref>. Mass-covering encourages assigning equal probability to all responses in the dataset, leading to an overestimation of the long tail of the target distribution, while mode-seeking concentrates the probability mass on specific high-reward regions. Thus, alignment focuses on generating a certain subset of high-reward responses, which is more effectively achieved by minimizing the reverse KL, as theoretically shown by <ref type="bibr">(Tajwar et al., 2024;</ref><ref type="bibr">Ji et al., 2024)</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">DIRECT IMITATION LEARNING</head><p>In the previous section, we revisited RLHF from the perspective of imitation learning. Our analysis explicitly suggests that RLHF is essentially optimized to align closely with the distribution of the chosen responses. The sample-based approximation of EBMs in RLHF results in a reward loss similar to the BT model, as shown in Equation (<ref type="formula">12</ref>). However, the BT assumption may not always hold true, as discussed in <ref type="bibr">(Azar et al., 2024;</ref><ref type="bibr">Munos et al., 2023;</ref><ref type="bibr">Sun &amp; van der Schaar, 2024)</ref>. Based on these insights, we propose a novel alignment method, DIL, without the BT assumption. We directly formulate the objective of imitation learning as minimizing the reverse KL divergence between π_θ and the unknown distribution of the chosen responses π_chosen <ref type="bibr">(Kostrikov et al., 2019;</ref><ref type="bibr">Fu et al., 2018)</ref>:</p><p>min_θ E_{x∼D} KL[π_θ(y | x) ‖ π_chosen(y | x)], (16)</p><p>where we minimize the RKL divergence, rather than the FKL divergence as in SFT, as shown in Equation (<ref type="formula">5</ref>).</p><p>Table 1: Summary of the variants of DIL with different h-functions for the Bregman divergence. BCE <ref type="bibr">(Hastie et al., 2009)</ref>: h(r) = r log r − (r + 1) log(r + 1), with loss terms log(1 + e^{−f_θ}) on chosen and log(1 + e^{f_θ}) on rejected responses; UKL <ref type="bibr">(Nguyen et al., 2010)</ref>: h(r) = r log r − r, with loss terms −f_θ on chosen and e^{f_θ} on rejected responses.</p><p>However, mode-seeking with the reverse KL divergence is generally challenging. Directly optimizing Equation (<ref type="formula">16</ref>) does not effectively leverage chosen preference data, particularly because the data policy π_chosen is unknown. In the RL literature, these challenges have been addressed through adversarial training <ref type="bibr">(Ho &amp; Ermon, 2016;</ref><ref type="bibr">Fu et al., 2018)</ref>. 
However, these methods require learning a reward function through complex and unstable adversarial training, which is impractical for large models. In this paper, we propose a straightforward alternative that leverages preference data without learning a reward function via adversarial training. We reformulate the DIL objective as follows:</p><p>max_{π_θ} E_{x∼D, y∼π_θ(y|x)}[log r(x, y)] − KL[π_θ(y | x) ‖ π_ref(y | x)], (17)</p><p>where r(x, y) ≜ π_chosen(y | x)/π_ref(y | x) can be viewed as an auxiliary reward function. Equations (<ref type="formula">16</ref>) and (<ref type="formula">17</ref>) are equivalent by adding and subtracting the same term log π_ref(y | x) in the expectation.</p><p>Figure <ref type="figure">1</ref>: The illustration of different losses (LSIF, BCE, and UKL), as shown in Table <ref type="table">1</ref>.</p><p>Interestingly, we find that even when only preference data is available, this objective takes a form similar to the RLHF objective in Equation (2). The primary difference lies in the reward being the estimated log density ratio, which is often not readily accessible in real-world applications. Optimizing this objective, which involves the density ratio r(x, y), is not straightforward. In the next section, we demonstrate how to efficiently optimize it by effectively utilizing offline human preference data.</p></div>
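The equivalence between the two objectives above rests on the identity E_π[log(π_chosen/π_ref)] − KL(π ‖ π_ref) = −KL(π ‖ π_chosen), which can be checked exactly on a toy discrete distribution. The random distributions below are stand-ins of our own.

```python
import numpy as np

rng = np.random.default_rng(0)

def rand_dist(n):
    v = rng.random(n) + 1e-3
    return v / v.sum()

pi, pi_chosen, pi_ref = rand_dist(6), rand_dist(6), rand_dist(6)

def kl(a, b):
    return float(np.sum(a * np.log(a / b)))

# Reformulated objective: expected log density-ratio reward minus a KL penalty to pi_ref ...
reward = np.log(pi_chosen / pi_ref)  # log r(x, y)
objective = float(np.sum(pi * reward)) - kl(pi, pi_ref)

# ... equals the negated reverse KL to pi_chosen, term for term
assert abs(objective - (-kl(pi, pi_chosen))) < 1e-9
```

The log π_ref terms cancel inside the expectation, which is exactly the add-and-subtract argument in the text.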
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">DENSITY RATIO REWARD ESTIMATION</head><p>Before delving into the problem in Equation (<ref type="formula">17</ref>), we first describe how to calculate the auxiliary reward function in terms of the density ratio. In the tabular setting, we can directly compute π_ref(y | x) and π_chosen(y | x). However, in a high-dimensional language domain, estimating the densities separately and then calculating their ratio hardly works well due to error accumulation. In this paper, we choose to directly estimate the density ratio π_chosen(y | x)/π_ref(y | x) based on the Bregman divergence <ref type="bibr">(Sugiyama et al., 2012)</ref>. Suppose r*(x, y) = π_chosen(y | x)/π_ref(y | x) is the target density ratio to be estimated with a parameterized discriminator r_φ. Then, we have:</p><p>B_h(r* ‖ r_φ) = E_{π_ref(y|x)}[h(r*(x, y)) − h(r_φ(x, y)) − ∂h(r_φ(x, y))(r*(x, y) − r_φ(x, y))], (18)</p><p>where B_h is the data-level Bregman divergence. For a twice continuously differentiable convex function h with a bounded derivative ∂h, this divergence quantifies the discrepancy between two density ratios. Subtracting the constant E_{π_ref(y|x)}[h(r*(x, y))], we obtain (up to a constant):</p><p>L_h(φ) = E_{π_ref(y|x)}[∂h(r_φ(x, y)) r_φ(x, y) − h(r_φ(x, y))] − E_{π_chosen(y|x)}[∂h(r_φ(x, y))]. (19)</p><p>A few non-exhaustive examples of the Bregman divergence are Least-Squares Importance Fitting (LSIF) <ref type="bibr">(Kanamori et al., 2009)</ref>, Binary Cross-Entropy (BCE) <ref type="bibr">(Hastie et al., 2009)</ref>, and the unbounded Kullback-Leibler (UKL) <ref type="bibr">(Nguyen et al., 2010)</ref>. 
For example, LSIF defines h_LSIF(r) = (r − 1)²/2, which results in the following instance of the Bregman divergence on the density ratio:</p><p>B_LSIF(r* ‖ r_φ) = E_{π_ref(y|x)}[(r*(x, y) − r_φ(x, y))²]/2. (20)</p><p>In this case, a sample-based approximation of Equation (<ref type="formula">20</ref>) leads to the following loss function:</p><p>L_LSIF(φ) = E_{(x, y_w, y_l)∼D}[r_φ(x, y_l)²/2 − r_φ(x, y_w)]. (21)</p><p>Here, we use the set of rejected responses y_l ∼ π_ref(y | x) to approximate the expectations under π_ref(y | x). It is acceptable to use the set of rejected responses y_l from the preference dataset D to approximate these expectations. We could even make use of both chosen and rejected responses to approximate them. However, since our goal is to decrease the likelihood of rejected responses, we choose to use the rejected responses, and we find that this works well empirically. Intuitively, the first term pushes the model to decrease the density ratio of the rejected response, while the second term increases the density ratio of the chosen response. In addition, this direct estimation approach with the Bregman divergence suggests a divergence family for density ratio estimation, as shown in Table <ref type="table">1</ref>; see Appendix A for further discussion of other h-functions such as BCE <ref type="bibr">(Hastie et al., 2009)</ref> and UKL <ref type="bibr">(Nguyen et al., 2010)</ref>. We also empirically analyze the effect of using different objectives with h-functions in Section 6.3.</p></div>
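To illustrate LSIF concretely, here is a tabular sketch of our own: on a toy discrete space with known densities, gradient descent on the LSIF loss recovers the true density ratio.

```python
import numpy as np

# Toy discrete "response" space with known densities
p_chosen = np.array([0.5, 0.3, 0.1, 0.1])
p_ref = np.array([0.25, 0.25, 0.25, 0.25])
true_ratio = p_chosen / p_ref

def lsif_loss(r):
    # h(r) = (r - 1)^2 / 2  =>  E_ref[r^2]/2 - E_chosen[r]  (up to a constant)
    return 0.5 * float(np.sum(p_ref * r**2)) - float(np.sum(p_chosen * r))

# Gradient descent on a tabular ratio estimate
r = np.ones(4)
for _ in range(2000):
    grad = p_ref * r - p_chosen  # gradient of the LSIF loss
    r -= 0.5 * grad

assert np.allclose(r, true_ratio, atol=1e-6)  # recovers p_chosen / p_ref
```

The minimizer of the LSIF loss is exactly the density ratio, which is why the sample-based version over (y_w, y_l) pairs serves as a density-ratio reward estimator.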
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4">OPTIMIZATION</head><p>So far, we have observed that the RL-style objective in Equation (<ref type="formula">17</ref>), combined with density ratio estimation in Equation (<ref type="formula">21</ref>), can effectively leverage the preference dataset for imitation learning. However, this two-step process is complex and unstable, requiring first the fitting of a reward model to estimate the density ratio and then fine-tuning the language model policy using the RL-style objective in Equation (17). To address these challenges, we introduce a simpler approach that directly optimizes the imitation learning objective, bypassing the need for RL training and density ratio estimation. The core idea lies in a specialized parameterization of the density ratio reward, which allows for the direct extraction of the optimal policy, eliminating the need for an RL loop <ref type="bibr">(Rafailov et al., 2024)</ref>. Notably, the optimal policy in Equation (<ref type="formula">17</ref>) has a closed-form solution, as shown by <ref type="bibr">(Rafailov et al., 2024)</ref>:</p><p>π*(y | x) = π_ref(y | x) exp(log r*(x, y))/Z(x), (22)</p><p>where Z(x) = Σ_y π_ref(y | x) exp(log r*(x, y)) = Σ_y π_chosen(y | x) = 1, meaning that the optimal π*(y | x) is forced to be self-normalized! This characteristic, determined by the reward definition in Equation (<ref type="formula">17</ref>), is particularly beneficial, as it allows our imitation learning to theoretically generalize to a broader class of loss functions beyond the pairwise BT preference model used in DPO. Taking the logarithm of both sides of Equation (<ref type="formula">22</ref>) and applying some algebra, we obtain the following:</p><p>r*(x, y) = π*(y | x)/π_ref(y | x), (23)</p><p>where r*(x, y) is the density ratio estimated by Equation (<ref type="formula">21</ref>) on the preference dataset. 
Since the optimal density ratio is now represented in terms of the optimal policy, rather than the discriminator model, we can explicitly derive the following maximum likelihood objective for a parameterized policy over the preference dataset <ref type="bibr">(Rafailov et al., 2024)</ref>. Analogous to the approach used for density ratio estimation and utilizing a change of variables, we can formalize our DIL objective as follows:</p><p>L_DIL(θ) = E_{(x, y_w, y_l)∼D}[(π_θ(y_l | x)/π_ref(y_l | x))²/2 − π_θ(y_w | x)/π_ref(y_w | x)], (24)</p><p>where we directly fit an implicit density ratio in Equation (<ref type="formula">21</ref>) using an alternative parameterization in Equation (<ref type="formula">23</ref>). Interestingly, there are no hyperparameters in our loss, yet it achieves promising performance, as demonstrated in our experiments. Since our procedure is equivalent to fitting a reparameterized density ratio estimation model, it theoretically conducts imitation learning by minimizing the RKL divergence against the unknown distribution of the chosen responses. Table <ref type="table">1</ref> shows a family of objectives that meet the definition of the Bregman divergence.</p></div>
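Under the LSIF instance, the resulting per-example loss can be sketched directly from sequence-level log-probabilities. This is a minimal sketch under our reading of the reparameterization (the density ratio replaced by π_θ/π_ref); names are our own, and a practical implementation would need to guard the exponentials against overflow for long sequences.

```python
import numpy as np

def dil_lsif_loss(logp_w, logp_l, ref_logp_w, ref_logp_l):
    # Density ratio reparameterized by the policy: r = pi_theta / pi_ref
    r_w = np.exp(logp_w - ref_logp_w)  # ratio on the chosen response
    r_l = np.exp(logp_l - ref_logp_l)  # ratio on the rejected response
    return 0.5 * r_l**2 - r_w          # LSIF: E_ref[r^2]/2 - E_chosen[r]

# Raising the chosen likelihood, or lowering the rejected one, reduces the loss
assert dil_lsif_loss(-9.0, -12.0, -10.0, -10.0) < dil_lsif_loss(-11.0, -12.0, -10.0, -10.0)
assert dil_lsif_loss(-10.0, -13.0, -10.0, -10.0) < dil_lsif_loss(-10.0, -9.0, -10.0, -10.0)
```

Note the asymmetry that matches the discussion above: the chosen-response term enters linearly (rewarding higher likelihood), while the rejected-response term is squared (penalizing any increase in its ratio).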
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.5">DISCUSSION: DPO IS A SPECIAL CASE OF DIL</head><p>In this section, we show that DPO can also be viewed as a special case of our framework by using contrastive predictive coding (CPC), also known as InfoNCE <ref type="bibr">(Oord et al., 2018)</ref>, for density ratio estimation. Given the prompt distribution p(x) and the conditional distribution of the chosen response π_chosen(y | x), we sample x ∼ p(x), y_w ∼ π_chosen(y | x), and y_l ∼ π_ref(y | x). CPC optimizes:</p><p>max_f E[log (e^{f(x, y_w)}/(e^{f(x, y_w)} + e^{f(x, y_l)}))], (25)</p><p>where f(x, y) is a parameterized critic function. The optimal critic for this CPC objective with one negative sample satisfies the following <ref type="bibr">(Zheng et al., 2024;</ref><ref type="bibr">Ma &amp; Collins, 2018;</ref><ref type="bibr">Oord et al., 2018)</ref>:</p><p>f*(x, y) = log (π_chosen(y | x)/π_ref(y | x)) + c(x), (26)</p><p>where c(x) is a function <ref type="bibr">(Oord et al., 2018;</ref><ref type="bibr">Zheng et al., 2024)</ref> that depends on x but not on y. Thus, CPC also estimates the density ratio reward in the IL objective in Equation (<ref type="formula">17</ref>). Similar to Section 4.4, by using the closed-form optimal policy in Equation (<ref type="formula">22</ref>) and a change of variables, we have:</p><p>L(θ) = −E_{(x, y_w, y_l)∼D}[log σ(log (π_θ(y_w | x)/π_ref(y_w | x)) − log (π_θ(y_l | x)/π_ref(y_l | x)))], (27)</p><p>which is exactly the same objective as the well-known DPO. Thus, our framework enables us to reinterpret DPO. Specifically, we demonstrate that DPO also falls under the imitation learning objective in Equation (<ref type="formula">16</ref>) and essentially employs the CPC method for density ratio reward estimation.</p></div>
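The equivalence between CPC with a single negative and the DPO-style loss is the one-line identity −log(e^{f⁺}/(e^{f⁺} + e^{f⁻})) = −log σ(f⁺ − f⁻), which the sketch below (our own) checks numerically.

```python
import numpy as np

def cpc_loss(f_pos, f_neg):
    # InfoNCE / CPC with a single negative sample
    m = max(f_pos, f_neg)  # log-sum-exp stabilization
    return -(f_pos - (m + np.log(np.exp(f_pos - m) + np.exp(f_neg - m))))

def dpo_style_loss(f_pos, f_neg):
    # -log sigma(f+ - f-), with f playing the role of the critic log(pi/pi_ref)
    return -np.log(1.0 / (1.0 + np.exp(-(f_pos - f_neg))))

# The two losses coincide for any critic values
for a, b in [(0.3, -0.2), (2.0, 1.5), (-1.0, 4.0)]:
    assert abs(cpc_loss(a, b) - dpo_style_loss(a, b)) < 1e-9
```

With the critic parameterized through the policy's log-ratio, minimizing the CPC loss is therefore the same computation as minimizing the DPO loss.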
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">EXPERIMENTS</head><p>Datasets. We evaluate DIL on widely used datasets: the UltraFeedback Binarized dataset <ref type="bibr">(Cui et al., 2023;</ref><ref type="bibr">Tunstall et al., 2023)</ref>, the Reddit TL;DR summarization dataset <ref type="bibr">(V&#246;lske et al., 2017)</ref>, and the Anthropic-HH dataset <ref type="bibr">(Bai et al., 2022)</ref>. The details of these datasets are provided in Appendix B.1.</p><p>Tasks and Evaluation. Following previous work <ref type="bibr">(Rafailov et al., 2024;</ref><ref type="bibr">Tunstall et al., 2023)</ref>, we evaluate methods fine-tuned on the UltraFeedback Binarized dataset across tasks on the Open LLM Leaderboard <ref type="bibr">(Gao et al., 2023)</ref>. The Anthropic HH dataset is used for dialogue generation to produce helpful and harmless responses <ref type="bibr">(Rafailov et al., 2024)</ref>. For summarization, we use the Reddit TL;DR dataset. For these tasks, we use GPT-4 for zero-shot pairwise evaluation (see prompts in Appendix B.2). The task and evaluation details are also provided in Appendix B.2.</p><p>Models. For summarization and dialogue generation tasks, we use Pythia-2.8b <ref type="bibr">(Biderman et al., 2023)</ref> as our base model, with the model after SFT serving as a reference model, following <ref type="bibr">(Rafailov et al., 2024)</ref>. For fine-tuning on the UltraFeedback Binarized dataset, we use Mistral-7B-Base <ref type="bibr">(Tunstall et al., 2023)</ref> and Llama3-8b-SFT used in <ref type="bibr">(Meng et al., 2024)</ref> as our base models.</p><p>Baselines and Implementation. 
We compare DIL with the following state-of-the-art baselines: DPO <ref type="bibr">(Rafailov et al., 2024)</ref>, f-DPO <ref type="bibr">(Wang et al., 2024a)</ref>, <ref type="bibr">IPO (Azar et al., 2024)</ref>, SLiC <ref type="bibr">(Zhao et al., 2023)</ref>, CPO <ref type="bibr">(Xu et al., 2024b)</ref>, and SimPO <ref type="bibr">(Meng et al., 2024)</ref>. We thoroughly tuned the hyperparameters for each baseline and report the best performance. The details of the baselines and the hyperparameter search space can be found in Appendix B.3. The density ratio in Section 4.3 is estimated by minimizing the Bregman divergence. A variety of functions meet the requirements of h, but in all experiments, we choose the widely used LSIF as the default objective. The effect of using different density ratio estimation objectives is empirically analyzed in Section 6.3. In this section, as shown in Table 2, we compare the performance of DIL against other alignment methods on the Open LLM Leaderboard. Our results show that DIL exhibits remarkable effectiveness in improving performance. Overall, DIL consistently outperforms the state-of-the-art SimPO and DPO across various benchmarks. For instance, on Llama3, the improvements are notable on the math benchmarks, with relative gains exceeding 7.5% over SimPO. Notably, we observe that DPO and SimPO hurt overall performance on most reasoning-heavy tasks such as GSM8K. This indicates that both SimPO and DPO might not be suitable for improving reasoning abilities, which is consistent with findings in concurrent works <ref type="bibr">(Pal et al., 2024;</ref><ref type="bibr">Meng et al., 2024)</ref>. 
In contrast, DIL shows clear improvements on both the Mistral and Llama3 models. These findings underscore the effectiveness of DIL. These improvements can be attributed to avoiding the BT assumption and preventing the likelihood decrease of chosen responses.</p><p>Figure <ref type="figure">2</ref>: The training dynamics of DIL variants, DPO, and SimPO on Mistral show that DIL exhibits the smallest decline in chosen likelihoods, while still increasing the likelihood margins between rejected and chosen responses, compared to SimPO and DPO. In contrast, SimPO and DPO progressively focus on unlearning the chosen responses, leading to poor performance on reasoning tasks.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.2">PERFORMANCE COMPARISON WITH HUMAN PREFERENCES</head><p>We also explore learning from real human preferences, focusing on summarization and dialogue generation tasks. Specifically, we utilize the Reddit TL;DR dataset for summarization and the Anthropic-HH dataset for dialogue generation. We employ Pythia-2.8B <ref type="bibr">(Biderman et al., 2023)</ref> as the base model and fine-tune it on the chosen completions to train a reference model, ensuring that the completions remain within the model's distribution. Table <ref type="table">3</ref> presents the GPT-4 evaluation results, indicating that DIL outperforms the baselines when compared against both the SFT responses and the chosen responses. Notably, DIL aligns better with human preferences than the baselines, achieving a win rate of at least 60% against the chosen responses. This highlights the strong potential of DIL for aligning with human preferences. Furthermore, GPT-4 consistently favors DIL over both the baselines and the chosen responses, demonstrating improvements of DIL over the baselines in both helpfulness and harmlessness.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.3">FURTHER ANALYSIS</head><p>Generalization to other objectives. As mentioned in Section 4.3, our approach to conducting imitation learning from preference data generalizes in a straightforward manner to other density ratio estimation objectives, including UKL and BCE. Table <ref type="table">4</ref> shows the comparison on UltraFeedback. We observe that different variants of the h-function can lead to general improvements across various benchmarks. Specifically, UKL performs best on BBH, achieving the highest scores on both the Mistral and LLama3 models. BCE achieves a significant improvement on MUSR, with a notable 7.27% increase. These results indicate that an appropriate variant can further enhance performance.</p><p>Training Dynamics. We also analyze the likelihood patterns during the training process of DIL.</p><p>Figure <ref type="figure">2</ref> illustrates the likelihood trends of SimPO and DIL on UltraFeedback. For DIL, we observe that the likelihood of rejected responses continues to decrease, while the margin between chosen and rejected responses steadily increases. For DPO and SimPO, however, the likelihood of chosen responses drops below zero and continues to decline. These results validate our motivation and demonstrate the effectiveness of DIL in preventing the likelihood of chosen responses from decreasing. This also explains why DIL generally enhances downstream task performance, particularly on reasoning-heavy tasks such as math, as shown in Table <ref type="table">2</ref>.</p></div>
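The length-normalized log-likelihoods tracked in this training-dynamics analysis (and used for the normalization described in Appendix B.3) can be sketched as follows. This is an illustrative NumPy sketch under our own naming, not the DIL codebase; shapes follow the usual causal-LM convention.

```python
import numpy as np

def avg_token_logps(logits, labels, pad_id=-100):
    """Average per-token log-likelihood of each response (length-normalized).

    logits: (B, T, V) next-token logits; labels: (B, T) target token ids,
    with pad_id marking positions to ignore.
    """
    # Numerically stable log-softmax over the vocabulary dimension.
    z = logits - logits.max(axis=-1, keepdims=True)
    logps = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    mask = labels != pad_id
    safe = np.where(mask, labels, 0)
    # Gather the log-probability assigned to each target token.
    tok = np.take_along_axis(logps, safe[..., None], axis=-1)[..., 0]
    return (tok * mask).sum(-1) / mask.sum(-1)

def likelihood_margin(chosen_lp, rejected_lp):
    # The margin plotted in Figure 2: it should grow during training,
    # ideally without the chosen log-likelihood itself collapsing.
    return chosen_lp - rejected_lp
```

Monitoring both the margin and the absolute chosen log-likelihood separates methods that widen the margin by improving chosen responses from those that widen it by unlearning them.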
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7">CONCLUSION</head><p>We consider the problem of aligning large language models with preference data. We provide a novel imitation learning perspective on the alignment problem, and demonstrate that RLHF and DPO essentially conduct imitation learning on the distribution of chosen responses with a reverse KL divergence.</p><p>Building upon this connection, we propose DIL, which directly optimizes the imitation learning objective based on the Bregman divergence and more effectively preserves reasoning abilities than the baselines while aligning with human preferences. Empirical results show that DIL achieves superior performance on a comprehensive set of benchmarks and across different families of language models. We hope that our work will inspire future research on preference alignment with imitation learning.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A DETAILS OF METHODS FOR DENSITY RATIO ESTIMATION</head><p>We overview examples of density ratio estimation methods under the Bregman divergence framework.</p><p>Least Squares Importance Fitting (LSIF). LSIF <ref type="bibr">(Kanamori et al., 2009)</ref> minimizes the squared error between a density ratio model r and the true density ratio r*, defined as follows:</p><p>where the first term in the above equation is constant w.r.t. the model. This empirical risk minimization is equal to minimizing the empirical BD defined in Equation (<ref type="formula">18</ref>) with h(r) = (r − 1)²/2.</p><p>KL Importance Estimation Procedure (KLIEP). KLIEP is derived from the unnormalized Kullback-Leibler (UKL) divergence objective <ref type="bibr">(Sugiyama et al., 2008;</ref><ref type="bibr">Nguyen et al., 2010)</ref>, which uses h(r) = r log(r) − r. Ignoring terms irrelevant to the optimization, we obtain (up to a constant):</p><p>KLIEP can also be viewed as solving a Lagrangian of the constrained problem obtained by further imposing the constraint that the ratio model r(x, y) is non-negative and normalized as follows:</p><p>s.t. E_{π_ref}[r(x, y)] = 1 and r(x, y) ≥ 0 for all (x, y).</p><p>Binary Cross Entropy. By using h(r) = r log(r) − (1 + r) log(1 + r), we obtain the Bregman divergence known as the binary cross entropy (BCE) divergence:</p><p>This Bregman divergence is derived from a formulation of logistic regression <ref type="bibr">(Sugiyama et al., 2012)</ref>.</p></div>
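As a concrete illustration, the three h-functions above induce the following sample-based objectives after dropping constants. Here r_p denotes ratio-model values on samples from the numerator (chosen-data) distribution and r_q on samples from the reference distribution; this is a hedged NumPy sketch under our own naming, not the paper's implementation.

```python
import numpy as np

def lsif_loss(r_p, r_q):
    # h(r) = (r - 1)^2 / 2  =>  0.5 * E_q[r^2] - E_p[r]  (+ const)
    return 0.5 * np.mean(r_q ** 2) - np.mean(r_p)

def ukl_loss(r_p, r_q):
    # h(r) = r log r - r  =>  E_q[r] - E_p[log r]  (KLIEP / UKL)
    return np.mean(r_q) - np.mean(np.log(r_p))

def bce_loss(r_p, r_q):
    # h(r) = r log r - (1+r) log(1+r)
    # =>  -E_p[log(r / (1+r))] + E_q[log(1+r)]
    # i.e. logistic regression with classifier sigma = r / (1+r).
    return -np.mean(np.log(r_p / (1 + r_p))) + np.mean(np.log1p(r_q))
```

Each objective is minimized (in expectation) when the ratio model equals the true density ratio, so any of the three can serve as the h-function variant compared in Section 6.3.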
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B EXPERIMENTAL DETAILS B.1 THE DETAILS OF DATASETS</head><p>UltraFeedback Binarized <ref type="bibr">(Cui et al., 2023;</ref><ref type="bibr">Tunstall et al., 2023)</ref>: This dataset<ref type="foot">foot_0</ref> consists of 64k prompts, each paired with four completions generated by various open-source and proprietary models. GPT-4 assigns scores to these completions based on helpfulness, honesty, and other criteria. Binary preferences are then constructed by selecting the completion with the highest score as the chosen response, while one of the remaining three completions is randomly selected as the rejected response.</p><p>Anthropic-HH <ref type="bibr">(Bai et al., 2022)</ref>: The Anthropic Helpful and Harmless dialogue dataset<ref type="foot">foot_1</ref> comprises 170k dialogues between humans and LLM assistants, designed for evaluating single-turn dialogue tasks. Each dialogue consists of a human query and two model responses rated for helpfulness and harmlessness. Following DPO <ref type="bibr">(Rafailov et al., 2024)</ref>, the chosen responses from this dataset were used during the supervised fine-tuning (SFT) phase.</p><p>Reddit TL;DR Summarization <ref type="bibr">(Völske et al., 2017)</ref>: This dataset<ref type="foot">foot_2</ref> comprises Reddit forum posts curated for summarization tasks with associated preference labels. 
Following prior work <ref type="bibr">(Stiennon et al., 2020)</ref>, we use a filtered version of this dataset to train the SFT policy and leverage its preference labels during the subsequent alignment phase.</p><p>Table 5: The hyperparameter search space for the baselines.</p><p>DPO/f-DPO: β ∈ [0.01, 0.05, 0.1];
IPO: τ ∈ [0.01, 0.1, 0.5, 1.0];
CPO: α = 1.0, β ∈ [0.01, 0.05, 0.1];
SLiC: λ ∈ [0.1, 0.5, 1.0, 10.0], δ ∈ [0.1, 0.5, 1.0, 2.0];
KTO: λ_l = λ_w = 1.0, β ∈ [0.01, 0.05, 0.1];
SimPO: β ∈ [2.0, 2.5], γ ∈ [0.3, 0.5, 1.0, 1.2, 1.4, 1.6].</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B.2 THE DETAILS OF TASKS AND EVALUATION</head><p>This section introduces the benchmarks for model evaluation. The model fine-tuned on the UltraFeedback Binarized dataset is evaluated following previous works <ref type="bibr">(Rafailov et al., 2024;</ref><ref type="bibr">Tunstall et al., 2023)</ref> using the HuggingFace Open LLM Leaderboard v1<ref type="foot">foot_3</ref> and v2<ref type="foot">foot_4</ref>  <ref type="bibr">(Beeching et al., 2023;</ref><ref type="bibr">Fourrier et al., 2024)</ref>. The evaluation includes MMLU-PRO, BBH, MUSR, MATH, GSM8K, and ARC. Models fine-tuned on the Anthropic-HH dataset follow the evaluation protocol outlined by <ref type="bibr">(Rafailov et al., 2024)</ref>, leveraging GPT-4 for zero-shot pairwise evaluation.</p><p>MMLU-PRO <ref type="bibr">(Wang et al., 2024b</ref>): Shorthand for Massive Multitask Language Understanding Professional. This dataset builds on MMLU by incorporating more complex multiple-choice questions and undergoing rigorous expert review to enhance quality and difficulty and to reduce data biases.</p><p>BBH <ref type="bibr">(Suzgun et al., 2022)</ref>: Shorthand for Big Bench Hard. This benchmark includes 23 tasks selected from BigBench, testing capabilities in arithmetic, comprehension, and general knowledge.</p><p>MUSR <ref type="bibr">(Sprague et al., 2024)</ref>: Shorthand for Multistep Soft Reasoning. This benchmark contains complex scenarios that assess the capacity to integrate information and reason across long contexts.</p><p>MATH <ref type="bibr">(Hendrycks et al., 2021)</ref>: This collection comprises high-school competition math problems, consistently presented using LaTeX and Asymptote to ensure clear and precise formatting.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>GSM8K</head><p>(5-shot) <ref type="bibr">(Cobbe et al., 2021)</ref>: This benchmark consists of grade school math problems that test the model's capability to navigate and solve complex, multi-step mathematical challenges.</p><p>ARC (25-shot) <ref type="bibr">(Clark et al., 2018)</ref>: Shorthand for AI2 Reasoning Challenge. This science-focused benchmark includes questions from a grade-school curriculum to test factual and logical reasoning.</p><p>GPT-4 Evaluation <ref type="bibr">(Rafailov et al., 2024)</ref>: The safety of models trained on Anthropic-HH is assessed using the Anthropic-HH test set, with the preferred responses serving as benchmarks. GPT-4's evaluations align well with human judgments, ensuring reliable safety assessments. The model version used for these evaluations is gpt-4-0314, with the specific prompt detailed in Table <ref type="table">6</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B.3 IMPLEMENTATION DETAILS</head><p>Training. For the general hyperparameters, we closely followed the configurations used in SimPO. Specifically, during the SFT stage, we applied a learning rate of 2 × 10⁻⁵. For both the SFT and preference optimization stages, we used a batch size of 128, a maximum sequence length of 2048, and a cosine learning rate schedule with 10% warmup steps for a single epoch, using the Adam optimizer (Kingma, 2014). These settings were maintained consistently across all experiments to ensure uniformity and comparability. For method-specific hyperparameters, we also adhered to the search strategy outlined in SimPO. The search space for each baseline method's hyperparameters is detailed in Table <ref type="table">5</ref>. Learning rates for each method were individually searched within the range [3e-7, 5e-7, 6e-7, 1e-6]. For our methods, we also normalize the likelihood of a response, calculated as the average log probability of all tokens in the response under the policy model, to avoid length bias, as in SimPO. All training experiments described in this paper were conducted on four NVIDIA A100 80GB GPUs with a batch size of 128, using the codebase from the alignment-handbook repository.<ref type="foot">foot_5</ref> </p><p>Evaluation. For the evaluation on the Open LLM Leaderboard v1, we use a 5-shot setting for GSM8K and a 25-shot setting for ARC. For the evaluation on the Open LLM Leaderboard v2, all tasks use the default settings of the leaderboard.</p><p>For the following dialogue history to a chatbot, which response is more helpful and harmless?</p><p>Dialogue history: {dialogue history} Response 1: {Response 1} Response 2: {Response 2} FIRST provide a one-sentence comparison of the two responses and explain which you feel is more helpful and harmless. 
SECOND, on a new line, state only "1" or "2" to indicate which response is more helpful and harmless. Your response should use the format:</p><p>Comparison: &lt;one-sentence comparison and explanation&gt; More helpful: &lt;"1" or "2"&gt; Table <ref type="table">6</ref>: GPT-4 prompt <ref type="bibr">(Rafailov et al., 2024)</ref> for the safety evaluation on the Anthropic-HH dataset.</p></div>
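A minimal sketch of how such pairwise judgments might be tallied into a win rate, assuming the judge's reply follows the format above; the parsing logic is our own illustration, not the paper's evaluation code.

```python
import re

def parse_choice(judge_reply: str):
    """Extract the '1' or '2' from a reply in the 'More helpful: ...' format."""
    m = re.search(r'More helpful:\s*"?([12])"?', judge_reply)
    return int(m.group(1)) if m else None  # None for malformed replies

def win_rate(judge_replies, ours_position=1):
    """Fraction of well-formed comparisons in which our response won."""
    picks = [parse_choice(r) for r in judge_replies]
    valid = [p for p in picks if p is not None]
    return sum(p == ours_position for p in valid) / len(valid)
```

In practice, the position of the evaluated model's response is typically randomized between Response 1 and Response 2 to control for the judge's order bias.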
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C FUTURE WORK</head><p>This paper presents several exciting directions for future work. First, we aim to develop a deeper theoretical understanding of which density-ratio estimation techniques are most effective for alignment. Our current analysis is limited to the offline setting and does not account for on-policy learning, where the policy interacts with the reward model during training. Exploring DIL in an on-policy learning scenario would be particularly interesting. Moreover, the relationship between DIL and DPO structurally resembles the connection between Bregman divergence and contrastive predictive coding, suggesting that further exploration of this link could be fruitful. While this paper primarily focuses on three density-ratio estimation loss functions, investigating DIL with additional loss functions would also be a compelling direction for future research.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0"><p>https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1"><p>https://huggingface.co/datasets/Anthropic/hh-rlhf</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2"><p>https://huggingface.co/datasets/openai/summarize_from_feedback</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_3"><p>https://huggingface.co/spaces/open-llm-leaderboard-old/open_llm_leaderboard</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_4"><p>https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_5"><p>https://github.com/huggingface/alignment-handbook</p></note>
		</body>
		</text>
</TEI>
