NSF PAR Search | NSF Public Access Repository

Confronting Reward Model Overoptimization with Constrained RLHF

Moskovitz, T; Singh, A; Strouse, DJ; Sandholm, T; Salakhutdinov, R; Dragan, A; McAleer, S (May 2024, ICLR)

Large language models are typically aligned with human preferences by optimizing reward models (RMs) fitted to human feedback. However, human preferences are multi-faceted, and it is increasingly common to derive reward from a composition of simpler reward models which each capture a different aspect of language quality. This itself presents a challenge, as it is difficult to appropriately weight these component RMs when combining them. Compounding this difficulty, because any RM is only a proxy for human evaluation, this process is vulnerable to overoptimization, wherein past a certain point, accumulating higher reward is associated with worse human ratings. In this paper, we perform, to our knowledge, the first study on overoptimization in composite RMs, showing that correlation between component RMs has a significant effect on the locations of these points. We then introduce an approach to solve this issue using constrained reinforcement learning as a means of preventing the agent from exceeding each RM’s threshold of usefulness. Our method addresses the problem of weighting component RMs by learning dynamic weights, naturally expressed by Lagrange multipliers. As a result, each RM stays within the range at which it is an effective proxy, improving evaluation performance. Finally, we introduce an adaptive method using gradient-free optimization to identify and optimize towards these points during a single run.
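The idea of dynamic weights expressed by Lagrange multipliers can be illustrated with a minimal sketch. This is a generic projected dual-ascent update, not the authors' implementation. It assumes the constrained problem "maximize the sum of component RM values subject to each value staying at or below its threshold"; the function names, the thresholds, and the learning rate are all hypothetical.

```python
def update_multipliers(lambdas, rm_values, thresholds, lr=0.1):
    """Projected dual ascent (illustrative): a multiplier grows while its RM
    value exceeds the threshold and is clipped at zero otherwise."""
    return [
        max(0.0, lam + lr * (value - threshold))
        for lam, value, threshold in zip(lambdas, rm_values, thresholds)
    ]


def combined_reward(rm_values, lambdas, thresholds):
    """Lagrangian of: maximize sum_i r_i subject to r_i <= threshold_i.
    Each RM's effective weight is (1 - lambda_i), so it shrinks as the
    corresponding constraint is violated."""
    return sum(
        value - lam * (value - threshold)
        for value, lam, threshold in zip(rm_values, lambdas, thresholds)
    )


# Hypothetical numbers: the first RM is above its threshold, the second below.
lambdas = update_multipliers([0.0, 0.0], [1.0, 0.2], [0.5, 0.5], lr=0.1)
reward = combined_reward([1.0, 0.2], lambdas, [0.5, 0.5])
```

In this sketch only the first multiplier becomes positive, so only the RM that has passed its threshold is down-weighted in the combined reward.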
Steering No-Regret Learners to a Desired Equilibrium

Zhang, B; Farina, G; Anagnostides, I; Cacciamani, F; McAleer, S; Haupt, A; Celli, A; Gatti, N; Conitzer, V; Sandholm, T (July 2024, EC24)

A mediator observes no-regret learners playing an extensive-form game repeatedly across T rounds. The mediator attempts to steer players toward some desirable predetermined equilibrium by giving (nonnegative) payments to players. We call this the steering problem. The steering problem captures several problems of interest, among them equilibrium selection and information design (persuasion). If the mediator’s budget is unbounded, steering is trivial because the mediator can simply pay the players to play desirable actions. We study two bounds on the mediator’s payments: a total budget and a per-round budget. If the mediator’s total budget does not grow with T, we show that steering is impossible. However, we show that it is enough for the total budget to grow sublinearly with T, that is, for the average payment to vanish. When players’ full strategies are observed at each round, we show that constant per-round budgets permit steering. In the more challenging setting where only trajectories through the game tree are observable, we show that steering is impossible with constant per-round budgets in general extensive-form games, but possible in normal-form games or if the per-round budget may itself depend on T. We also show how our results can be generalized to the case when the equilibrium is being computed online while steering is happening. We supplement our theoretical positive results with experiments highlighting the efficacy of steering in large games.
Automated Design of Affine Maximizer Mechanisms In Dynamic Settings

Curry, M; Thoma, V; Chakrabarti, D; McAleer, S; Kroer, C; Sandholm, T; He, N; Seuken, S (February 2024, AAAI)

Dynamic mechanism design is a challenging extension to ordinary mechanism design in which the mechanism designer must make a sequence of decisions over time in the face of possibly untruthful reports of participating agents. Optimizing dynamic mechanisms for welfare is relatively well understood. However, there has been less work on optimizing for other goals (e.g. revenue), and without restrictive assumptions on valuations, it is remarkably challenging to characterize good mechanisms. Instead, we turn to automated mechanism design to find mechanisms with good performance in specific problem instances. We extend the class of affine maximizer mechanisms to MDPs where agents may untruthfully report their rewards. This extension results in a challenging bilevel optimization problem in which the upper problem involves choosing optimal mechanism parameters, and the lower problem involves solving the resulting MDP. Our approach can find truthful dynamic mechanisms that achieve strong performance on goals other than welfare, and can be applied to essentially any problem setting (without restrictions on valuations) for which RL can learn optimal policies.
Team-PSRO for Learning Approximate TMECor in Large Team Games via Cooperative Reinforcement Learning

McAleer, S; Farina, G; Zhou, G; Wang, M; Yang, Y; Sandholm, T (December 2023, NeurIPS)

Recent algorithms have achieved superhuman performance at a number of two-player zero-sum games such as poker and Go. However, many real-world situations are multi-player games. Zero-sum two-team games, such as bridge and football, involve two teams where each member of the team shares the same reward with every other member of that team, and each team has the negative of the reward of the other team. A popular solution concept in this setting, called TMECor, assumes that teams can jointly correlate their strategies before play, but are not able to communicate during play. This setting is harder than two-player zero-sum games because each player on a team has different information and must use their public actions to signal to other members of the team. Prior works either have game-theoretic guarantees but only work in very small games, or are able to scale to large games but do not have game-theoretic guarantees. In this paper we introduce two algorithms: Team-PSRO, an extension of PSRO from two-player games to team games, and Team-PSRO Mix-and-Match, which improves upon Team-PSRO by better using population policies. In Team-PSRO, in every iteration both teams learn a joint best response to the opponent’s meta-strategy via reinforcement learning. As the reinforcement learning joint best response approaches the optimal best response, Team-PSRO is guaranteed to converge to a TMECor. In experiments on Kuhn poker and Liar’s Dice, we show that a tabular version of Team-PSRO converges to TMECor, and a version of Team-PSRO using deep cooperative reinforcement learning beats self-play reinforcement learning in the large game of Google Research Football.
Computing Optimal Equilibria and Mechanisms via Learning in Zero-Sum Extensive-Form Games

Zhang, B; Farina, G; Anagnostides, I; Cacciamani, F; McAleer, S; Haupt, A; Celli, A; Gatti, N; Conitzer, V; Sandholm, T (December 2023, NeurIPS)

We introduce a new approach for computing optimal equilibria and mechanisms via learning in games. It applies to extensive-form settings with any number of players, including mechanism design, information design, and solution concepts such as correlated, communication, and certification equilibria. We observe that optimal equilibria are minimax equilibrium strategies of a player in an extensive-form zero-sum game. This reformulation allows us to apply techniques for learning in zero-sum games, yielding the first learning dynamics that converge to optimal equilibria, not only in empirical averages, but also in iterates. We demonstrate the practical scalability and flexibility of our approach by attaining state-of-the-art performance in benchmark tabular games, and by computing an optimal mechanism for a sequential auction design problem using deep reinforcement learning.
Solving the Rubik's Cube with Approximate Policy Iteration

McAleer, S.; Agostinelli, F.; Shmakov, A. K.; Baldi, P. (January 2019, International Conference on Learning Representations)

Recently, Approximate Policy Iteration (API) algorithms have achieved superhuman proficiency in two-player zero-sum games such as Go, Chess, and Shogi without human data. These API algorithms iterate between two policies: a slow policy (tree search), and a fast policy (a neural network). In these two-player games, a reward is always received at the end of the game. However, the Rubik’s Cube has only a single solved state, and episodes are not guaranteed to terminate. This poses a major problem for these API algorithms since they rely on the reward received at the end of the game. We introduce Autodidactic Iteration: an API algorithm that overcomes the problem of sparse rewards by training on a distribution of states that allows the reward to propagate from the goal state to states farther away. Autodidactic Iteration is able to learn how to solve the Rubik’s Cube without relying on human data. Our algorithm is able to solve 100% of randomly scrambled cubes while achieving a median solve length of 30 moves — less than or equal to solvers that employ human domain knowledge.
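How reward can propagate outward from a single goal state can be shown on a toy problem. The sketch below is not the Autodidactic Iteration implementation. It assumes training states are sampled by applying random moves starting from the goal, uses a table in place of the neural network, and replaces the Rubik's Cube with a hypothetical ring puzzle of 8 positions; every name and constant is illustrative.

```python
import random

N = 8            # toy puzzle (assumption): positions on a ring, goal at 0
MOVES = (-1, 1)  # rotate one step in either direction


def step(state, move):
    return (state + move) % N


def reward(state):
    # +1 for reaching the goal, -1 for any other state
    return 1.0 if state == 0 else -1.0


def train(iterations=200, scramble_depth=6, seed=0):
    rng = random.Random(seed)
    value = [0.0] * N  # tabular stand-in for a value network
    for _ in range(iterations):
        state = 0
        for _ in range(scramble_depth):
            # Sample training states by scrambling away from the goal.
            state = step(state, rng.choice(MOVES))
            if state == 0:
                continue  # the goal is terminal; its value stays 0
            children = [step(state, m) for m in MOVES]
            # One-step lookahead target: best child reward plus child value.
            value[state] = max(reward(c) + value[c] for c in children)
    return value


values = train()
```

States adjacent to the goal receive their targets first, and states deeper in the scramble inherit them through the lookahead, which is the propagation effect the abstract describes.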