<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Align-Pro: A Principled Approach to Prompt Optimization for LLM Alignment</title></titleStmt>
			<publicationStmt>
				<publisher>AAAI</publisher>
				<date>04/11/2025</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10591590</idno>
					<idno type="doi">10.1609/aaai.v39i26.34979</idno>
					<title level='j'>Proceedings of the AAAI Conference on Artificial Intelligence</title>
<idno>2159-5399</idno>
<biblScope unit="volume">39</biblScope>
<biblScope unit="issue">26</biblScope>					

					<author>Prashant Trivedi</author><author>Souradip Chakraborty</author><author>Avinash Reddy</author><author>Vaneet Aggarwal</author><author>Amrit Singh Bedi</author><author>George K Atia</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[<p>The alignment of large language models (LLMs) with human values is critical as these models become increasingly integrated into various societal and decision-making processes. Traditional methods, such as reinforcement learning from human feedback (RLHF), achieve alignment by fine-tuning model parameters, but these approaches are often computationally expensive and impractical when models are frozen or inaccessible for parameter modification. In contrast, prompt optimization is a viable alternative to RLHF for LLM alignment. While the existing literature has shown empirical promise of prompt optimization, its theoretical underpinning remains under-explored. We address this gap by formulating prompt optimization as an optimization problem and try to provide theoretical insights into the optimality of such a framework. To analyze the performance of the prompt optimization, we study theoretical suboptimality bounds and provide insights in terms of how prompt optimization depends upon the given prompter and target model. We also provide empirical validation through experiments on various datasets, demonstrating that prompt optimization can effectively align LLMs, even when parameter fine-tuning is not feasible.</p>]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Introduction</head><p>The quest to align large language models (LLMs) with human values is not just an academic pursuit but a practical necessity <ref type="bibr">(Wang et al. 2024;</ref><ref type="bibr">Kaufmann et al. 2023)</ref>. As these AI models (e.g., <ref type="bibr">ChatGPT,</ref><ref type="bibr">Llama2,</ref><ref type="bibr">etc.)</ref> increasingly become an essential part of various aspects of daily life and decision-making processes, ensuring their outputs reflect ethical considerations and societal norms becomes crucial <ref type="bibr">(Li, Krishna, and Lakkaraju 2024;</ref><ref type="bibr">Dai et al. 2023)</ref>. The standard approach to aligning LLMs has been through fine-tuning parameters via reinforcement learning from human feedback (RLHF) <ref type="bibr">(Zhu, Jiao, and Jordan 2023;</ref><ref type="bibr">Azar et al. 2023;</ref><ref type="bibr">Ziegler et al. 2019a)</ref>, which involves three main steps: Supervised Fine-Tuning (SFT), reward learning, and RL fine-tuning. However, this process can be resource-intensive, as it necessitates updating model parameters <ref type="bibr">(Casper et al. 2023;</ref><ref type="bibr">Ouyang et al. 2022)</ref>. A further complication to alignment arises when models are either 'frozen' or operate as a 'black box,' where direct access to tweak parameters is restricted <ref type="bibr">(Diao et al. 2023;</ref><ref type="bibr">Shin et al. 2020)</ref>. These scenarios pose a critical question: How can we ensure LLM alignment when parameter updates are not allowed or possible?</p><p>Figure <ref type="figure">1</ref>: A basic overview of the prompt optimization framework. A prompter modifies the prompt before passing it through the target frozen LLM.</p><p>One promising solution lies in the concept of prompt optimization <ref type="bibr">(Lin et al. 2024;</ref><ref type="bibr">Li and Liang 2018;</ref><ref type="bibr">Lester, Al-Rfou, and Constant 2021)</ref>. 
This technique leverages the idea that the output of an LLM is a function of the input prompt-thereby turning the prompt into a powerful tool to elicit desired responses to align with specific rewards (cf. Figure <ref type="figure">1</ref>). Various empirical studies in the literature have shown the significant benefits of prompt optimization techniques for LLM alignment <ref type="bibr">(Shin et al. 2020;</ref><ref type="bibr">Kong et al. 2024;</ref><ref type="bibr">Wang et al. 2023b</ref>). However, theoretical insights about the working of prompt optimization have not been well studied. This raises an important question about the optimality of prompt optimization compared to traditional fine-tuning: Can prompt optimization for LLM alignment achieve performance comparable to fine-tuning?</p><p>In this work, we try to investigate and answer the above question. To the best of our knowledge, there is a notable absence of literature focusing on a theoretical formulation of prompt optimization specifically for LLM alignment. This paper aims to fill this gap by developing a unified optimization framework (called Align-Pro) to analyze prompt optimization for LLM alignment. We explore its theoretical performance, particularly in terms of suboptimality bounds, which measure how close the responses generated via the prompt optimization are to the outcomes obtained through fine-tuned models. Furthermore, we provide proof of concept empirical evidence to support the theoretical insights. We summarize our main contributions as follows.</p><p>&#8226; An optimization framework to prompt optimization for LLM alignment. We propose Align-Pro: a prompt optimization framework where we motivate the optimization objective, which would help reduce the suboptimality gap in the alignment. The optimization problem considered allows us to theoretically study the prompt optimization for LLM alignment. 
Following the standard analysis of LLM alignment, we derive a closed-form expression for the optimal prompt distribution. &#8226; We study the suboptimality of prompt optimization with respect to the fine-tuning method. We establish theoretical bounds on the difference between the expected rewards obtained from the fine-tuned policy, which represents the benchmark for model performance, and the optimal policy derived from our prompt optimization approach. &#8226; Experimental results. We conduct a series of experiments on three datasets to support the insights we obtain from the theoretical analysis. Align-Pro demonstrates better performance in terms of the mean rewards and win rate over the baseline without fine-tuning, showcasing its effectiveness across three datasets and diverse model configurations.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Related Work</head><p>RLHF and LLM fine-tuning: RLHF has become the most widely used method for aligning LLM responses with human values <ref type="bibr">(Dubois et al. 2024;</ref><ref type="bibr">Ouyang et al. 2022;</ref><ref type="bibr">Ziegler et al. 2019b)</ref>. For a more comprehensive discussion on RLHF, refer to some recent surveys <ref type="bibr">(Casper et al. 2023;</ref><ref type="bibr">Chaudhari et al. 2024)</ref>. Recently, some methods have been developed to bypass the need for RL, directly utilizing a preference dataset for alignment, including direct preference optimization (DPO) <ref type="bibr">(Rafailov et al. 2024)</ref>, SLiC <ref type="bibr">(Zhao et al. 2023)</ref>, and other extensions <ref type="bibr">(Amini, Vieira, and Cotterell 2024;</ref><ref type="bibr">Azar et al. 2024;</ref><ref type="bibr">Gou and Nguyen 2024;</ref><ref type="bibr">Liu et al. 2024;</ref><ref type="bibr">Morimura et al. 2024;</ref><ref type="bibr">Tang et al. 2024;</ref><ref type="bibr">Wang et al. 2023a)</ref>. The recent work of <ref type="bibr">(Dwaracherla et al. 2024</ref>) has demonstrated the potential of efficient exploration methods to improve LLM responses based on human preference feedback. Moreover, methods such as ORPO <ref type="bibr">(Hong, Lee, and Thorne 2024)</ref> align the model without using a reference model. Furthermore, intuitive fine-tuning (IFT) conducts alignment solely relying on positive samples and a single policy, starting from a pre-trained base model <ref type="bibr">(Hua et al. 2024)</ref>. However, all of these approaches are focused on alignment via parameter fine-tuning.</p><p>Prompt optimization for alignment: Prompt optimization has seen significant growth in recent years. Early efforts focused on white-box LLMs, such as AutoPrompt <ref type="bibr">(Shin et al. 2020)</ref> and FluentPrompt (Shi et al. 2023), which used gradient-based methods to generate prompts from labeled data. 
Soft prompt methods (Lester, Al-Rfou, and Constant 2021; Li and Liang 2021; Zhong, Friedman, and Chen 2021) also gained traction. Recently, the focus has shifted to optimizing prompts for black-box LLMs. Techniques like clip-tuning (Chai et al. 2022), BBT (Sun et al. 2022b), and BBTv2 (Sun et al. 2022a) optimize prompts by leveraging input embeddings and output logits, often using low-dimensional subspace optimization. Some approaches use RL ideas for prompt optimization for alignment, including BDPL (Diao et al. 2023), PRewrite (Kong et al. 2024), and MultiPrompter (Kim et al. 2023), which iteratively update prompts. Planning-based approaches, such as PromptAgent (Wang et al. 2023b), have also gained attention. Additionally, APOHF (Lin et al. 2024) leverages dueling bandits theory to refine prompts using preference feedback. However, the theoretical connection between prompt optimization and fine-tuning, in terms of their comparative performance, has not been studied in detail.</p><p>Other works with similar formulations: Beyond prompt optimization and fine-tuning, other areas share similar theoretical formulations. For instance, <ref type="bibr">(Hong et al. 2024;</ref><ref type="bibr">Perez et al. 2022;</ref><ref type="bibr">Wichers, Denison, and Beirami 2024;</ref><ref type="bibr">Lee et al. 2024;</ref><ref type="bibr">Beetham et al. 2024</ref>) explore automated red teaming by training a red team LLM with reinforcement learning to generate test cases that provoke undesirable responses from a target LLM. While the context differs, the red team model's training objective aligns closely with our prompt optimization objective. In contrast, in this work, we motivate the selection of objectives for prompt optimization and focus on understanding the suboptimality of prompt optimization with respect to fine-tuned models.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Preliminaries and Background</head><p>This section provides the essential background and foundational concepts relevant to alignment. We start by defining the notation, followed by a quick overview of the RLHF framework, which involves three key steps: (i) supervised fine-tuning (SFT), (ii) reward learning, and (iii) fine-tuning with RL.</p><p>Language Models. We start by defining the language model mathematically. Let us denote the vocabulary set by V, and we denote the language model by &#960;(y|x), which takes in the sequence of tokens x := {x 1 , x 2 , &#8230; , x N } (with each x i &#8712; V) as an input, and generates a response y := {y 1 , y 2 , &#8230; , y M } (with each y i &#8712; V) as the output. At instant t, each output token y t &#8764; &#960;(&#183;|x t ).</p><p>Supervised Fine-Tuning (SFT). SFT is the initial step in the RLHF process. It involves fine-tuning a pre-trained LLM on a vast dataset of human-generated text in a supervised manner.</p><p>Reward Learning. This stage involves learning the reward model by gathering preferences from experts/human feedback or an oracle based on outputs generated by the SFT model denoted by &#960; sft . The optimization is generally performed under the Bradley-Terry model for pairwise comparison <ref type="bibr">(Bradley and Terry 1952)</ref>, which seeks to minimize the loss formulated as:</p><formula notation="TeX">\mathcal{L}(r, \mathcal{D}_r) = -\mathbb{E}_{(x, y_u, y_v) \sim \mathcal{D}_r}\Big[\log \sigma\big(r(x, y_u) - r(x, y_v)\big)\Big] \tag{1}</formula><p>where D r denotes the dataset of response pairs (y u , y v ), with y u and y v representing the winning and the losing responses, respectively, which are generated by the policy &#960; sft optimized under the reward r(x, y), and evaluated by human experts or an oracle function p * (&#183;|y u , y v , x), and &#963;(&#183;) is the sigmoid function.</p><p>Fine-tuning with RL. 
In this step, we obtain the aligned model which maximizes the reward model r(x, y) (trained in the previous step) by solving a KL-regularized optimization problem:</p><formula notation="TeX">\max_{\pi} \; \mathbb{E}_{x \sim \mathcal{P},\, y \sim \pi(\cdot \mid x)}\big[r(x, y)\big] - \beta\, \mathbb{D}_{\mathrm{KL}}\big(\pi(\cdot \mid x) \,\|\, \pi_{\mathrm{sft}}(\cdot \mid x)\big) \tag{2}</formula><p>where &#946; &gt; 0 is a parameter that controls the deviation from the baseline policy &#960; sft . This iterative process alternates between updating the policy and reward models until convergence, as detailed in previous works <ref type="bibr">(Kaufmann et al. 2023;</ref><ref type="bibr">Zhu, Jiao, and Jordan 2023)</ref>.</p></div>
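<div xmlns="http://www.tei-c.org/ns/1.0"><p>To make the two objectives of this section concrete, the following is a minimal sketch on hand-picked toy numbers. It assumes the standard Bradley-Terry negative log-likelihood and the standard KL-regularized objective for a single fixed prompt; the function names, the toy reward table, and the toy policies are illustrative assumptions and are not taken from the paper's implementation.</p>
<p xml:space="preserve">
```python
import math


def sigmoid(z):
    """Logistic function sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def bradley_terry_loss(reward, pairs):
    """Average negative log-likelihood of the observed preferences.

    reward: dict mapping (prompt, response) to a scalar score (toy reward model).
    pairs:  list of (prompt, winning_response, losing_response) triples.
    """
    total = 0.0
    for x, y_win, y_lose in pairs:
        margin = reward[(x, y_win)] - reward[(x, y_lose)]
        total -= math.log(sigmoid(margin))
    return total / len(pairs)


def kl_regularized_objective(policy, pi_sft, rewards, beta):
    """E_{y ~ policy}[r] - beta * KL(policy || pi_sft) for one fixed prompt."""
    expected_reward = sum(p * r for p, r in zip(policy, rewards))
    kl = sum(p * math.log(p / q) for p, q in zip(policy, pi_sft) if p > 0)
    return expected_reward - beta * kl


# Hypothetical toy data: one prompt "x" with three candidate responses.
toy_reward = {("x", "a"): 1.0, ("x", "b"): 0.0, ("x", "c"): -1.0}
toy_pairs = [("x", "a", "b"), ("x", "b", "c"), ("x", "a", "c")]
loss = bradley_terry_loss(toy_reward, toy_pairs)

pi_sft = [0.5, 0.3, 0.2]
rewards = [1.0, 0.0, -1.0]
candidate = [0.7, 0.2, 0.1]  # shifts mass toward the high-reward response
value_sft = kl_regularized_objective(pi_sft, pi_sft, rewards, beta=0.5)
value_candidate = kl_regularized_objective(candidate, pi_sft, rewards, beta=0.5)
```
</p><p>In this toy setting the candidate policy attains a higher regularized value than &#960; sft itself, because its reward gain outweighs the KL penalty at &#946; = 0.5; a larger &#946; would shrink that advantage.</p></div>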
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Prompt Optimization Framework for LLM Alignment</head><p>In this section, we provide a mathematical formulation for the framework of prompt optimization for LLM alignment.</p><p>In traditional LLM alignment, as described in (2), the model parameters are fine-tuned to adjust the response distributions in a way that maximizes the reward function. However, in our setting, we operate under a different regime, starting with a pre-trained language model, denoted by &#960; F , whose parameters remain frozen. In this case, direct modification of the model to align with a reward function is not allowed. Therefore, an alternative and widely adopted approach in the literature is to optimize the input prompt itself to yield better-aligned responses <ref type="bibr">(Kong et al. 2024;</ref><ref type="bibr">Shin et al. 2020;</ref><ref type="bibr">Zhou et al. 2023)</ref>. Typically, this process involves iterative prompt refinement, where the model outputs are evaluated and compared to human preferences, and the prompts are adjusted accordingly. However, such iterative fine-tuning can be computationally expensive and time-intensive.</p><p>Interestingly, although we cannot fine-tune the frozen model &#960; F , we can fine-tune the prompter model &#961; in any desired manner. However, a fundamental challenge arises: what should be the objective for optimizing the prompter? While substantial empirical evidence in the literature demonstrates that prompt optimization can significantly enhance response generation and improve alignment <ref type="bibr">(Shin et al. 2020;</ref><ref type="bibr">Kong et al. 2024;</ref><ref type="bibr">Zhou et al. 2023)</ref>, there is no specific emphasis on developing a mathematical framework to guide this process. We start by addressing this gap as follows.</p><p>Optimization Objective for Prompter Design. First, we revisit the basics of LLM alignment. 
For a given prompt x, the probability of generating a response y from the frozen model is represented by &#960; F (y|x). After introducing the prompter model &#961;, the probability of generating response y given input x (denoted by &#960; &#961; ) can be expressed as:</p><formula notation="TeX">\pi_{\rho}(y \mid x) = \sum_{x'} \rho(x' \mid x)\, \pi_F(y \mid x') \tag{3}</formula><p>which captures the probability of generating the response y for a given x under the influence of the prompter &#961;. Let us consider the ideal scenario: if we were able to fine-tune the language model &#960; F , we would solve the optimization problem in (2) and obtain the RLHF optimal solution &#960; * , which is given by <ref type="bibr">(Peng et al. 2019;</ref><ref type="bibr">Peters and Schaal 2007</ref>)</p><formula notation="TeX">\pi^{*}(y \mid x) = \frac{1}{Z^{*}(x)}\, \pi_F(y \mid x) \exp\big(r^{*}(x, y)/\beta\big) \tag{4}</formula><p>where Z * (x) = &#8721; y &#960; F (y|x) exp(r * (x, y)/&#946;) is the normalizing constant, &#946; is the alignment tuning parameter, and the reward r * is obtained by solving (1). We emphasize that if we have a prompter &#961; that performs as well as the RLHF-optimal policy &#960; * , it should be a sufficient indicator of a good prompter. With this understanding, we consider the following prompter suboptimality gap given by</p><formula notation="TeX">J(\pi^{*}) - J(\pi_{\rho}) \tag{5}</formula><p>which captures how well our prompter is doing with respect to the fine-tuned optimal policy &#960; * . Mathematically, it holds that</p><formula notation="TeX">J(\pi^{*}) - J(\pi_{\rho}) = \mathbb{E}_{x \sim \mathcal{P}}\Big[\mathbb{E}_{y \sim \pi^{*}(\cdot \mid x)}\big[r^{*}(x, y)\big] - \mathbb{E}_{y \sim \pi_{\rho}(\cdot \mid x)}\big[r^{*}(x, y)\big]\Big] \tag{6}</formula><p>Equation (<ref type="formula">6</ref>) evaluates the difference in expected return between the optimal RLHF policy &#960; * and our prompt optimization policy &#960; &#961; , indicating how much better (or worse) &#960; * performs compared to &#960; &#961; . We highlight that this performance gap is clearly influenced by the choice of the prompt distribution &#961;; a non-optimal &#961; can result in a significant gap. This leads us to the following questions:</p><p>&#8226; Q1: Can we design an optimal prompter &#961; * that closes the suboptimality gap between the fine-tuned policy &#960; * and the prompt optimization policy &#960; &#961; * , as mentioned in Equation (<ref type="formula">6</ref>)? 
&#8226; Q2: If such a &#961; * exists, then can &#960; &#961; * outperform the fine-tuned optimal policy &#960; * ? We address these questions in the next section.</p></div>
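<div xmlns="http://www.tei-c.org/ns/1.0"><p>The quantities in this section can be sketched numerically for a single prompt x with a finite set of rewritten prompts and responses. The sketch below assumes that the prompted policy is the mixture of frozen-model response distributions weighted by the prompter, and that the RLHF-optimal policy is the exponentially tilted frozen policy; all probabilities, rewards, and function names are hypothetical toy choices.</p>
<p xml:space="preserve">
```python
import math


def prompted_policy(rho, pi_frozen):
    """Mixture pi_rho(y|x) = sum over x' of rho(x'|x) * pi_F(y|x').

    rho:       probabilities over rewritten prompts x' for one fixed x.
    pi_frozen: pi_frozen[j][k] = pi_F(y_k | x'_j).
    """
    n_responses = len(pi_frozen[0])
    return [
        sum(rho[j] * pi_frozen[j][k] for j in range(len(rho)))
        for k in range(n_responses)
    ]


def rlhf_optimal_policy(pi_ref, rewards, beta):
    """Exponential tilting: pi*(y|x) proportional to pi_ref(y|x) * exp(r / beta)."""
    weights = [p * math.exp(r / beta) for p, r in zip(pi_ref, rewards)]
    z = sum(weights)
    return [w / z for w in weights]


def expected_reward(policy, rewards):
    return sum(p * r for p, r in zip(policy, rewards))


def suboptimality_gap(pi_star, pi_rho, rewards):
    """Expected reward under pi* minus expected reward under pi_rho."""
    return expected_reward(pi_star, rewards) - expected_reward(pi_rho, rewards)


# Hypothetical toy setup: x'_0 is the original prompt x, x'_1 is a rewrite.
pi_frozen = [[0.7, 0.2, 0.1],   # pi_F(. | x'_0)
             [0.2, 0.3, 0.5]]   # pi_F(. | x'_1)
rewards = [0.0, 0.5, 1.0]       # r*(x, y_k) for the three responses
pi_star = rlhf_optimal_policy(pi_frozen[0], rewards, beta=1.0)

identity_policy = prompted_policy([1.0, 0.0], pi_frozen)  # prompter keeps x
rewrite_policy = prompted_policy([0.0, 1.0], pi_frozen)   # prompter always rewrites
gap_identity = suboptimality_gap(pi_star, identity_policy, rewards)
gap_rewrite = suboptimality_gap(pi_star, rewrite_policy, rewards)
```
</p><p>With a prompter that leaves the prompt unchanged, the prompted policy coincides with the frozen model and the gap is positive. With these hand-picked numbers the rewrite lowers the gap below zero, which is an artifact of the toy values rather than a general claim about prompt optimization.</p></div>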
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Proposed Approach: Align-Pro</head><p>Let us start by addressing Q1 and develop a general prompt optimization framework to design an optimal prompter &#961; * . But then the first question arises: in what sense is &#961; * optimal? To see this, let us reconsider J(&#960; * ) - J(&#960; &#961; ). Adding and subtracting E y&#8764;&#960; F (&#183;|x) [r * (x, y)] on the right-hand side of Equation (<ref type="formula">6</ref>), we get</p><formula notation="TeX">J(\pi^{*}) - J(\pi_{\rho}) = \Delta_1 + \Delta_2 \tag{7}</formula><p>where &#8710; 1 and &#8710; 2 are defined as</p><formula notation="TeX">\Delta_1 := \mathbb{E}_{x \sim \mathcal{P}}\Big[\mathbb{E}_{y \sim \pi^{*}(\cdot \mid x)}\big[r^{*}(x, y)\big] - \mathbb{E}_{y \sim \pi_F(\cdot \mid x)}\big[r^{*}(x, y)\big]\Big]</formula><formula notation="TeX">\Delta_2 := \mathbb{E}_{x \sim \mathcal{P}}\Big[\mathbb{E}_{y \sim \pi_F(\cdot \mid x)}\big[r^{*}(x, y)\big] - \mathbb{E}_{y \sim \pi_{\rho}(\cdot \mid x)}\big[r^{*}(x, y)\big]\Big]</formula><p>We remark that in (7), &#8710; 1 is the suboptimality gap between the optimal fine-tuned policy and the frozen model &#960; F . Thus, it captures the effectiveness of the optimal RLHF policy with respect to the frozen model. In other words, it quantifies how good or bad our frozen model is with respect to the optimally aligned model. We note that &#8710; 1 is constant for a given &#960; F and does not depend upon the prompter &#961;; hence we cannot improve this part with the prompter. Another insight is that since &#960; * is the optimal RLHF policy, &#8710; 1 &#8805; 0, i.e., it is always non-negative. On the other hand, the second term, &#8710; 2 , depends upon our prompter &#961; and can be controlled by designing a prompter. This observation leads to the formulation of an optimization problem for the prompter as follows.</p></div>
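<div xmlns="http://www.tei-c.org/ns/1.0"><p>The decomposition of the gap into &#8710; 1 and &#8710; 2 can be checked on a toy example. The sketch below assumes a single prompt, three responses, and the exponentially tilted form of &#960; * ; the two prompted policies are hypothetical stand-ins for a prompter that hurts and one that helps.</p>
<p xml:space="preserve">
```python
import math


def expect(policy, rewards):
    """Expected reward of a finite response distribution."""
    return sum(p * r for p, r in zip(policy, rewards))


def tilt(reference, rewards, beta):
    """pi*(y|x) proportional to reference(y|x) * exp(r / beta)."""
    weights = [p * math.exp(r / beta) for p, r in zip(reference, rewards)]
    total = sum(weights)
    return [w / total for w in weights]


def decompose_gap(pi_star, pi_frozen_x, pi_rho, rewards):
    """Return (Delta_1, Delta_2), whose sum is the gap between pi* and pi_rho."""
    delta_1 = expect(pi_star, rewards) - expect(pi_frozen_x, rewards)
    delta_2 = expect(pi_frozen_x, rewards) - expect(pi_rho, rewards)
    return delta_1, delta_2


rewards = [0.0, 0.5, 1.0]
pi_frozen_x = [0.7, 0.2, 0.1]             # frozen model on the original prompt
pi_star = tilt(pi_frozen_x, rewards, beta=1.0)
weak_prompted = [0.8, 0.15, 0.05]         # hypothetical pi_rho that lowers reward
strong_prompted = [0.3, 0.3, 0.4]         # hypothetical pi_rho that raises reward

d1_weak, d2_weak = decompose_gap(pi_star, pi_frozen_x, weak_prompted, rewards)
d1_strong, d2_strong = decompose_gap(pi_star, pi_frozen_x, strong_prompted, rewards)
total_gap_strong = expect(pi_star, rewards) - expect(strong_prompted, rewards)
```
</p><p>In both cases &#8710; 1 takes the same non-negative value, since it does not involve the prompter, while &#8710; 2 changes sign depending on whether the prompted policy earns less or more reward than the frozen model on the original prompt.</p></div>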
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Optimization Problem for Prompter</head><p>We recall from the definition of &#8710; 2 that we need to learn a &#961; such that &#8710; 2 is minimized. To achieve that, we recognize that the only term involving the prompter &#961; is</p><formula notation="TeX">\mathbb{E}_{x \sim \mathcal{P}}\, \mathbb{E}_{y \sim \pi_{\rho}(\cdot \mid x)}\big[r^{*}(x, y)\big] = \mathbb{E}_{x \sim \mathcal{P}}\, \mathbb{E}_{x' \sim \rho(\cdot \mid x)}\, \mathbb{E}_{y \sim \pi_F(\cdot \mid x')}\big[r^{*}(x, y)\big]</formula><p>and, to minimize &#8710; 2 , we need to solve the following optimization problem</p><formula notation="TeX">\max_{\rho} \; \mathbb{E}_{x' \sim \rho(\cdot \mid x),\, y \sim \pi_F(\cdot \mid x')}\big[r^{*}(x, y)\big] \tag{8}</formula><p>However, since our prompter is itself a language model, we already have access to a baseline supervised fine-tuned prompter &#961; sft , and we want to ensure that our prompter &#961; * does not deviate significantly from &#961; sft . Thus, we solve the following optimization problem:</p><formula notation="TeX">\max_{\rho} \; \mathbb{E}_{x' \sim \rho(\cdot \mid x),\, y \sim \pi_F(\cdot \mid x')}\big[r^{*}(x, y)\big] - \lambda\, \mathbb{D}_{\mathrm{KL}}\big(\rho(\cdot \mid x) \,\|\, \rho_{\mathrm{sft}}(\cdot \mid x)\big) \tag{9}</formula><p>We have introduced a KL-divergence-based regularizer above between the prompter &#961; and a reference supervised fine-tuned prompter &#961; sft . This yields a well-posed optimization problem with a closed-form solution and enables control over proximity to the initial prompter &#961; sft through the tuning parameter &#955;. We note that the formulation in (9) has also appeared in the red teaming literature for learning an attacker prompter <ref type="bibr">(Hong et al. 2024;</ref><ref type="bibr">Perez et al. 2022;</ref><ref type="bibr">Wichers, Denison, and Beirami 2024;</ref><ref type="bibr">Lee et al. 2024;</ref><ref type="bibr">Beetham et al. 2024)</ref>.</p><p>Interpretation of &#955;. Another interesting interpretation of &#955; is that it controls the extent of prompt optimization we want to introduce into the pipeline; hence we also refer to it as the prompt tuning parameter. For instance, &#955; &#8594; &#8734; means no prompt optimization, while &#955; &#8594; 0 drives the optimization toward maximizing the prompter reward, albeit at the cost of deviating from &#961; sft , which might be important in certain cases. 
Therefore, &#955; provides a meaningful trade-off, and its effects will be further elucidated in the following section.</p><p>The following Lemma 1 provides the optimal solution to the optimization problem (9).</p><p>Lemma 1. Let &#955; &gt; 0 be the prompter tuning parameter. The optimal prompt distribution &#961; * that maximizes the objective function of the optimization problem (9) is given by:</p><formula notation="TeX">\rho^{*}(x' \mid x) = \frac{1}{Z(x)}\, \rho_{\mathrm{sft}}(x' \mid x) \exp\Big(\frac{1}{\lambda}\, \mathbb{E}_{y \sim \pi_F(\cdot \mid x')}\big[r^{*}(x, y)\big]\Big) \tag{10}</formula><p>where Z(x) is the partition function given by</p><formula notation="TeX">Z(x) = \sum_{x'} \rho_{\mathrm{sft}}(x' \mid x) \exp\Big(\frac{1}{\lambda}\, \mathbb{E}_{y \sim \pi_F(\cdot \mid x')}\big[r^{*}(x, y)\big]\Big)</formula><p>The proof is available in Appendix A <ref type="bibr">(Trivedi et al. 2025)</ref> and follows from the derivations in <ref type="bibr">Rafailov et al. (2023)</ref>. Next, we move on to answer Q2, in which we utilize the optimal prompter &#961; * (x &#8242; |x) to obtain a bound on the suboptimality gap. Notably, the integration of this optimal prompter with the frozen model leads to the refined performance expressed in terms of the modified optimal policy &#960; * &#961; (y|x) = &#8721; x &#8242; &#961; * (x &#8242; |x)&#960; F (y|x &#8242; ). This will capture the effectiveness of the prompt optimization process and offer insights into how closely the modified policy &#960; &#961; * approximates the true optimal policy &#960; * .</p></div>
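<div xmlns="http://www.tei-c.org/ns/1.0"><p>The role of &#955; in the KL-regularized prompter problem can be illustrated on a finite set of rewritten prompts. The sketch below assumes the standard closed form for such objectives, in which the optimal prompter is &#961; sft reweighted by the exponential of the expected reward divided by &#955;. The per-prompt expected rewards (called prompt_rewards here, a hypothetical name) and the reference prompter are toy values, not results from the paper.</p>
<p xml:space="preserve">
```python
import math


def optimal_prompter(rho_sft, prompt_rewards, lam):
    """rho*(x'|x) proportional to rho_sft(x'|x) * exp(R(x, x') / lam).

    prompt_rewards[j] stands for R(x, x'_j), the expected reward of the frozen
    model's response when it is fed the rewritten prompt x'_j (toy values).
    """
    weights = [p * math.exp(r / lam) for p, r in zip(rho_sft, prompt_rewards)]
    z = sum(weights)
    return [w / z for w in weights]


def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def regularized_objective(rho, rho_sft, prompt_rewards, lam):
    """Expected prompt reward minus lam * KL(rho || rho_sft)."""
    reward_term = sum(p * r for p, r in zip(rho, prompt_rewards))
    return reward_term - lam * kl_divergence(rho, rho_sft)


rho_sft = [0.6, 0.3, 0.1]          # hypothetical reference prompter
prompt_rewards = [0.2, 0.5, 0.9]   # hypothetical R(x, x') for three rewrites
lam = 0.5

rho_star = optimal_prompter(rho_sft, prompt_rewards, lam)
value_star = regularized_objective(rho_star, rho_sft, prompt_rewards, lam)
value_sft = regularized_objective(rho_sft, rho_sft, prompt_rewards, lam)

# Limiting behaviour of the tuning parameter.
nearly_frozen = optimal_prompter(rho_sft, prompt_rewards, lam=1000.0)
nearly_greedy = optimal_prompter(rho_sft, prompt_rewards, lam=0.01)
```
</p><p>For large &#955; the result stays close to &#961; sft , and for small &#955; nearly all mass moves to the rewrite with the highest expected reward, which matches the interpretation of &#955; given above. In this sketch the optimal value also equals &#955; times the logarithm of the normalizing constant, a standard property of this family of objectives.</p></div>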
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Theoretical Insights w.r.t Fine-Tuning</head><p>We begin by establishing a bound on the suboptimality gap for the optimal prompter. The following theorem bounds the suboptimality gap J(&#960; * ) -J( &#960; &#961; * ) when the optimal prompter &#961; * as obtained in Lemma 1 is used. We present our result in Theorem 1 as follows. The detailed proof is available in Appendix B <ref type="bibr">(Trivedi et al. 2025)</ref>. Theorem 1. Let the optimal prompter &#961; * (x &#8242; |x) be given as in Equation (10). Then, the suboptimality gap is bounded as</p><p>(11) where P denotes the prompt distribution, &#955; is the prompter tuning parameter, and d T V is the total variation distance.</p><p>Theorem 1 provides an upper bound on the suboptimality gap between an optimal RLHF policy &#960; * and the optimal policy obtained by the prompt optimization approach &#960; &#961; * . We now provide the interpretations to each term of the suboptimality gap given in Theorem 1.</p><p>&#8226; Significance of first term in RHS of (11): The first term in Equation ( <ref type="formula">11</ref>) is always non-negative. It captures the intrinsic difficulty of obtaining the optimal RLHF policy via a prompt optimization setup when the frozen model is not fully aligned. We note that when &#960; F = &#960; * , the first term in Theorem 1 becomes zero. However, this scenario is not relevant to our prompt optimization framework, as it necessitates fine-tuning the frozen LLM. &#8226; Significance of second term in RHS of (11): This term measures how much the response distribution the frozen policy &#960; F changes when its input changes from x to x &#8242; under &#961; sft . 
When &#961; sft is a delta distribution (placing all mass on the original prompt x), this term is zero; the term thus captures the variation in the prompts (which should be minimal) due to the introduction of &#961; sft into the formulation.</p><p>&#8226; Significance of third term in RHS of (11): The third term captures the KL divergence between the optimal prompter &#961; * and the given prompter &#961; sft . This term is important because it shows that we can reduce the suboptimality bound via prompt optimization, that is, by moving &#961; * away from &#961; sft to an extent controlled by the parameter &#955;.</p><p>Another interesting insight is that the upper bound on the suboptimality remains non-negative for</p><p>]. This suggests that, in practice, a budget of (&#1013;1 + &#1013;2)/&#955; for the prompter optimization can be sufficient to achieve performance similar to RLHF-based fine-tuning. This further highlights that we do not need to choose an optimal prompter arbitrarily far from the base prompt distribution, thereby preventing a significant loss in the quality (e.g., perplexity) of the generated outputs.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Experimental Evaluations</head><p>In this section, we present proof of concept experiments to validate the theoretical insights of our proposed prompt optimization framework, which we named Align-Pro. We outline our experimental setup, including the dataset, model architecture, and evaluation metrics. Following this, we present our results and provide a detailed analysis of our findings. Additional details on the experimental setup and evaluations can be found in Appendix C <ref type="bibr">(Trivedi et al. 2025)</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Experimental Setup</head><p>We evaluate the performance of our Align-Pro using two distinct prompter models, denoted as P1 (Phi-3.5-Instruct) and P2 (Qwen-2.5-1.5B-Instruct), which modify and update the original prompt. Additionally, we use two frozen models, denoted as F1 (Llama-3.1-8B-Instruct) and F2 (Qwen-2.5-7B-Instruct), to generate the final responses. This setup results in four unique model architectures, each representing a combination of the prompter and frozen models. For each architecture, we assess performance for the following three different configurations.</p><p>&#8226; No Fine-Tuning: In this configuration, the prompter is not used, and only the frozen model is used to generate responses without any fine-tuning or prompt modifications. &#8226; Align-Pro: In this setup, a fine-tuned prompter is placed before a frozen model. The prompter refines the input prompt, and the frozen model generates the response based on the optimized prompt. &#8226; RLHF: In this configuration, the frozen model undergoes fine-tuning through RLHF, and the response is generated directly from this fine-tuned model. Datasets: To capture the diversity in our experimental evaluations, we evaluate the performance over different datasets:</p><p>&#8226; UltraFeedback <ref type="bibr">(Cui et al. 2024</ref>) : A large-scale, high-quality, and diversified AI feedback dataset which contains feedback from user-assistant conversations from various aspects. This dataset evaluates the coherence of the prompt-response pairs. &#8226; HelpSteer <ref type="bibr">(Wang et al. 2023c</ref>): A multi-attribute helpfulness dataset annotated for correctness, coherence, complexity, and verbosity in addition to overall helpfulness of responses. &#8226; Orca <ref type="bibr">(Mukherjee et al. 
2023</ref>): This dataset features responses with detailed explanations for each prompt, promoting thinking and effective instruction-following capabilities in the models.</p><p>Evaluation Criteria. The primary objective of our experiments is to optimize the input prompt so that it effectively guides the frozen LLM toward the desired response. To achieve this, we fine-tune the prompter using proximal policy optimization (PPO) within the RLHF framework. The reward signal for this fine-tuning process is derived from the quality of the enhanced prompt and the output generated by the frozen LLM. We assess the performance of Align-Pro based on three key metrics: mean reward, variance, and win-rate comparison against the no-fine-tuning baseline.</p><p>Computational Resources. Since we do not alter the parameters of the frozen model, our experiments require relatively few computational resources. Consequently, we were able to conduct all our experiments using a machine equipped with an INTEL(R) XEON(R) GOLD 6526Y processor with a Nvidia H100 GPU. We used Python 3.11 to execute the experiments, and the PPOTrainer variant from the Hugging Face TRL library to run the RLHF and prompt optimization pipeline experiments. Hyper-parameters. All of our experiments use the open-access TRL library, which is publicly available. The library can be accessed using the link<ref type="foot">foot_1</ref> . For our experiments, we do not perform any extra hyper-parameter tuning; rather, we use the learning rate of 1.41e-5 given in the above-mentioned link. Moreover, we use the following generation configuration to generate the responses for evaluation in all experiments: temperature = 1.5, top-p = 0.6, and top-k = 20.</p></div>
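<div xmlns="http://www.tei-c.org/ns/1.0"><p>To illustrate what the generation configuration above does to a next-token distribution, the following is a simplified, library-independent sketch. It assumes that temperature scaling is applied first, then top-k truncation, then top-p (nucleus) truncation on the renormalized remainder; actual decoding libraries may differ in ordering and tie handling, and the toy logits are hypothetical.</p>
<p xml:space="preserve">
```python
import math


def filter_next_token_distribution(logits, temperature=1.5, top_k=20, top_p=0.6):
    """Return {token_id: probability} after temperature, top-k, and top-p filtering."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    probs = [value / total for value in exps]

    # Top-k: keep the k most likely token ids.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept = ranked[:top_k]
    kept_mass = sum(probs[i] for i in kept)

    # Top-p: keep the shortest prefix whose renormalized mass reaches top_p.
    nucleus = []
    cumulative = 0.0
    for i in kept:
        nucleus.append(i)
        cumulative += probs[i] / kept_mass
        if cumulative >= top_p:
            break

    nucleus_mass = sum(probs[i] for i in nucleus)
    return {i: probs[i] / nucleus_mass for i in nucleus}


toy_logits = [2.0, 1.0, 0.0, -1.0]  # hypothetical four-token vocabulary
sharp = filter_next_token_distribution(toy_logits, temperature=1.0, top_k=3, top_p=0.6)
broad = filter_next_token_distribution(toy_logits, temperature=1.0, top_k=3, top_p=0.9)
reported = filter_next_token_distribution(toy_logits)  # temperature 1.5, top-k 20, top-p 0.6
```
</p><p>On this toy vocabulary, the higher temperature flattens the distribution enough that two tokens survive the top-p = 0.6 cut, whereas at temperature 1.0 only the most likely token would remain.</p></div>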
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Results</head><p>Mean reward and variance comparison: We calculate mean rewards and variances to assess the quality of preferred response generation and the diversity of the language model for all configurations and different model architectures. To associate the reward to each response, we use the available reward model<ref type="foot">foot_2</ref> , which scores the response. This reward model is trained to assign higher scores to the responses that comply with the off-target attributes.</p><p>We also compared Align-Pro with an oracle model, where the LLM is fine-tuned using RLHF. Figure <ref type="figure">2</ref> presents the mean rewards across all three datasets for each model configuration, while Figure <ref type="figure">3</ref> shows the corresponding reward variances. Interestingly, Align-Pro consistently outperforms the baseline (no fine-tuning) in terms of mean reward, demonstrating its ability to generate more preferred and stable responses by leveraging prompt optimization, and it gets close to the performance of the fine-tuned model (denoted by oracle). Moreover, the variance in reward for Align-Pro is the lowest, indicating that it produces more reliable and stable outputs. In each figure, we employ two prompters, denoted as P1 (Phi-3.5-Instruct) and P2 (Qwen-2.5-1.5B-Instruct), along with two frozen LLMs, denoted as F1 (Llama-3.1-8B-Instruct) and F2 (Qwen-2.5-7B-Instruct).</p><p>Win rate comparison: We evaluate the performance of our Align-Pro method by comparing it to the no fine-tuning configuration using win rate as the primary performance metric. We rely on GPT-4 as an external, impartial judge to ensure unbiased evaluation. The evaluation criteria focus on critical aspects of the response: helpfulness, harmlessness, relevance, accuracy, depth, creativity, and level of detail. To update the prompt, we use a standardized system prompt template. 
Table <ref type="table">1</ref> presents the win rates for Align-Pro (denoted by A) against the no fine-tuning baseline (denoted by B). The results clearly show that, on average, Align-Pro significantly outperforms the no fine-tuning approach across all model architectures and datasets. These findings demonstrate the effectiveness of Align-Pro framework, which enhances performance by optimizing the input prompt while keeping the LLM frozen. Summary: Our experiments confirm that using a prompter alongside a frozen LLM significantly enhances alignment. Moreover, the expected reward and the win-rate differences are affected by the degree to which the prompter and frozen model align with human preferences. These experimental results, therefore, support our theoretical insights. We include several examples using the full prompt rewriting, illustrating the original prompt, the re-written prompt, and the corresponding final response in the Appendix C <ref type="bibr">(Trivedi et al. 2025)</ref>.</p></div>
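<div xmlns="http://www.tei-c.org/ns/1.0"><head>Model Architectures</head><table>
<row role="label"><cell>Prompter, Frozen LLM</cell><cell cols="2">UltraFeedback (win rate)</cell><cell cols="2">HelpSteer (win rate)</cell><cell cols="2">Orca (win rate)</cell></row>
<row role="label"><cell/><cell>A</cell><cell>B</cell><cell>A</cell><cell>B</cell><cell>A</cell><cell>B</cell></row>
<row><cell>Phi-3.5-Instruct, Llama-3.1-8B-Instruct</cell><cell><hi rend="bold">60</hi></cell><cell>24</cell><cell><hi rend="bold">46</hi></cell><cell>37</cell><cell><hi rend="bold">63</hi></cell><cell>26</cell></row>
<row><cell>Qwen-2.5-1.5B-Instruct, Llama-3.1-8B-Instruct</cell><cell><hi rend="bold">65</hi></cell><cell>23</cell><cell><hi rend="bold">67</hi></cell><cell>23</cell><cell><hi rend="bold">63</hi></cell><cell>30</cell></row>
<row><cell>Phi-3.5-Instruct, Qwen-2.5-7B-Instruct</cell><cell><hi rend="bold">59</hi></cell><cell>27</cell><cell><hi rend="bold">58</hi></cell><cell>27</cell><cell>46</cell><cell>46</cell></row>
<row><cell>Qwen-2.5-1.5B-Instruct, Qwen-2.5-7B-Instruct</cell><cell><hi rend="bold">56</hi></cell><cell>30</cell><cell><hi rend="bold">59</hi></cell><cell>25</cell><cell><hi rend="bold">59</hi></cell><cell>27</cell></row>
</table><p>Table <ref type="table">1</ref>: The table presents the win rates (for 100 samples) of our Align-Pro method, denoted by A, compared to the baseline no fine-tuning method, denoted by B. A higher win rate indicates superior performance. Bolded numbers highlight the higher win rates. In all but one case (a 46-46 tie for Phi-3.5-Instruct with Qwen-2.5-7B-Instruct on Orca), Align-Pro outperforms the no fine-tuning baseline, demonstrating its effectiveness in improving response quality.</p><p>Remark 1. Our aim is not to present the best prompt optimizer but to develop an optimization framework for prompt optimization, which can help develop some theoretical insights into the performance of the prompt optimization approach. We seek to understand its theoretical performance relative to RLHF and fine-tuning methods; hence we did not compare our approach with other existing prompt optimization methods in the literature.</p></div>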
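<div xmlns="http://www.tei-c.org/ns/1.0"><p>As a quick arithmetic check on Table 1, the win rates can be aggregated as sketched below. The numbers are transcribed from the table; the dictionary layout and the function name are illustrative choices, and the unweighted average over the twelve cells is an assumption about how "on average" might be read.</p>
<p xml:space="preserve">
```python
# Win rates (percent, 100 samples each) transcribed from Table 1:
# (prompter, frozen LLM) -> dataset -> (Align-Pro wins A, baseline wins B)
WIN_RATES = {
    ("Phi-3.5-Instruct", "Llama-3.1-8B-Instruct"): {
        "UltraFeedback": (60, 24), "HelpSteer": (46, 37), "Orca": (63, 26)},
    ("Qwen-2.5-1.5B-Instruct", "Llama-3.1-8B-Instruct"): {
        "UltraFeedback": (65, 23), "HelpSteer": (67, 23), "Orca": (63, 30)},
    ("Phi-3.5-Instruct", "Qwen-2.5-7B-Instruct"): {
        "UltraFeedback": (59, 27), "HelpSteer": (58, 27), "Orca": (46, 46)},
    ("Qwen-2.5-1.5B-Instruct", "Qwen-2.5-7B-Instruct"): {
        "UltraFeedback": (56, 30), "HelpSteer": (59, 25), "Orca": (59, 27)},
}


def summarize(table):
    """Return the mean A and B win rates and the cells where A does not beat B."""
    a_scores, b_scores, not_won = [], [], []
    for models, per_dataset in table.items():
        for dataset, (a, b) in per_dataset.items():
            a_scores.append(a)
            b_scores.append(b)
            if not a > b:
                not_won.append((models, dataset))
    mean_a = sum(a_scores) / len(a_scores)
    mean_b = sum(b_scores) / len(b_scores)
    return mean_a, mean_b, not_won


mean_a, mean_b, not_won = summarize(WIN_RATES)
```
</p><p>The only cell in which Align-Pro does not strictly beat the baseline is the tie on Orca for the Phi-3.5-Instruct prompter with the Qwen-2.5-7B-Instruct frozen model.</p></div>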
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Conclusion, Limitations and Future Work</head><p>This work introduces an optimization framework for prompt optimization by utilizing a smaller, trainable model to generate optimized prompts for a frozen large language model (LLM). This approach reduces computational costs while preserving the LLM's pre-trained capabilities. We provide a closed-form expression for the optimal prompter and use it to establish an upper bound on the suboptimality gap that compares the optimized prompt policy with the standard RLHF policy. We demonstrate the effectiveness of our method on three datasets and various model configurations. In each scenario, we observe that Align-Pro is better in terms of the mean rewards and win rate compared to the baseline with no fine-tuning.</p><p>Limitations and future work: Our framework is inherently limited by the capabilities of the frozen language model. Another limitation includes the sensitivity of the prompt to the final response; a slight change in the prompt can lead to profound changes in the final responses. Theoretically, it would also be interesting to develop lower bounds on suboptimality and to develop further insights into the performance of prompt optimization. We will consider some of these issues as part of our future work. Some other potential future directions of our work include analyzing the robustness of the optimal prompter in the presence of noise in the frozen model and exploring the use of multiple prompters in sequence before inputting them into the frozen model.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_0"><p>The Thirty-Ninth AAAI Conference on Artificial Intelligence </p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_1"><p>https://github.com/huggingface/trl/blob/main/examples/notebooks/gpt2-sentiment.ipynb</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_2"><p>https://huggingface.co/weqweasdas/RM-Gemma-2B</p></note>
		</body>
		</text>
</TEI>
