An Off-Policy Learning Approach for Steering Sentence Generation towards Personalization

Kiyohara, Haruka (ORCID:0009000063784365); Cao, Daniel Yiming (ORCID:0009000652945542); Saito, Yuta (ORCID:0000000343575835); Joachims, Thorsten (ORCID:0000000336543683)

doi:10.1145/3705328.3748088


This content will become publicly available on September 7, 2026


We study the problem of personalizing the output of a large language model (LLM) by training on logged bandit feedback (e.g., personalizing movie descriptions based on likes). While one may naively treat this as a standard off-policy contextual bandit problem, the large action space and the large parameter space make naive applications of off-policy learning (OPL) infeasible. We overcome this challenge by learning a prompt policy for a frozen LLM that has only a modest number of parameters. The proposed Direct Sentence Off-policy gradient (DSO) effectively propagates the gradient to the prompt policy space by leveraging the smoothness and overlap in the sentence space. Consequently, DSO substantially reduces variance while also suppressing bias. Empirical results on our newly established suite of benchmarks, called OfflinePrompts, demonstrate the effectiveness of the proposed approach in generating personalized descriptions for movie recommendations, particularly when the number of candidate prompts and reward noise are large.
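
For readers unfamiliar with the "naive" off-policy baseline the abstract contrasts against, the following is a minimal sketch of a standard importance-weighted policy gradient over a discrete set of candidate prompts. It is not the DSO estimator described in the paper. The linear-softmax prompt policy, the function name `ips_policy_gradient`, and all array shapes are assumptions chosen for illustration only.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax with the usual max-subtraction for numerical stability."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def ips_policy_gradient(theta, contexts, actions, rewards, logging_probs):
    """Importance-weighted gradient estimate for a linear-softmax prompt policy.

    Hypothetical setup (not taken from the paper):
      theta         -- (d, K) policy parameters, one column per candidate prompt
      contexts      -- (n, d) logged user/context features
      actions       -- (n,)   index of the prompt chosen by the logging policy
      rewards       -- (n,)   observed bandit feedback (e.g., like = 1.0)
      logging_probs -- (n,)   probability the logging policy gave that prompt
    """
    n, num_prompts = contexts.shape[0], theta.shape[1]
    probs = softmax(contexts @ theta)                    # (n, K)
    chosen_probs = probs[np.arange(n), actions]          # pi_theta(a_i | x_i)
    weights = chosen_probs / logging_probs               # importance weights
    # Gradient of log pi_theta(a | x) w.r.t. the logits is onehot(a) - probs.
    onehot = np.eye(num_prompts)[actions]
    grad_logits = (onehot - probs) * (weights * rewards)[:, None]
    return contexts.T @ grad_logits / n                  # (d, K)


# Tiny usage example with one logged interaction and two candidate prompts.
theta = np.zeros((1, 2))
grad = ips_policy_gradient(
    theta,
    contexts=np.array([[1.0]]),
    actions=np.array([0]),
    rewards=np.array([1.0]),
    logging_probs=np.array([0.5]),
)
```

In this sketch each sample is scaled by the ratio of the new policy's probability to the logging policy's probability, so rarely logged prompts receive large weights. That is one generic reason such estimators tend to become noisy as the number of candidate prompts grows, which is the regime the abstract highlights.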

Award ID(s): 2312865

PAR ID: 10660108

Author(s) / Creator(s): Kiyohara, Haruka; Cao, Daniel Yiming; Saito, Yuta; Joachims, Thorsten

Publisher / Repository: ACM

Date Published: 2025-09-07

Edition / Version: 2025

Page Range / eLocation ID: 41 - 50

Format(s): Medium: X

Sponsoring Org: National Science Foundation

Free Publicly Accessible Full Text
Conference Paper:
https://doi.org/10.1145/3705328.3748088
