NSF PAR Search | NSF Public Access Repository

MixUCB: Enhancing Safe Exploration in Contextual Bandits with Human Oversight [openreview]

Su, Jinyan; Banerjee, Rohan; Sun, Jiankai; Sun, Wen; Dean, Sarah (August 2025, Reinforcement Learning Journal 2025)

Free, publicly-accessible full text available August 11, 2026
MixUCB: Enhancing Safe Exploration in Contextual Bandits with Human Oversight

Su, Jinyan; Banerjee, Rohan; Sun, Jiankai; Sun, Wen; Dean, Sarah (August 2025, Reinforcement Learning Journal)

Free, publicly-accessible full text available August 1, 2026
On Speeding Up Language Model Evaluation

Zhou, Jin Peng; Belardi, Christian K; Wu, Ruihan; Zhang, Travis; Gomes, Carla P; Sun, Wen; Weinberger, Kilian Q (June 2025, International Conference on Learning Representations)

Developing prompt-based methods with Large Language Models (LLMs) requires making numerous decisions, which give rise to a combinatorial search problem over hyper-parameters. Exhaustively evaluating this space can be time-consuming and costly. In this paper, we propose an adaptive approach to explore it. We exploit the fact that often only a few samples are needed to identify clearly superior or inferior settings, and that many evaluation tests are highly correlated. We use multi-armed bandits to sequentially identify the next (method, validation sample) pair to evaluate and low-rank matrix factorization to fill in missing evaluations. We carefully assess the efficacy of our approach on several competitive benchmark problems and show that it can identify the top-performing method using only 5-15% of the typical resources, resulting in 85-95% LLM cost savings. Our code is available at https://github.com/kilian-group/banditeval.
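The two ingredients named in this abstract can be illustrated with a minimal sketch. This is not the authors' implementation (their code is at the repository linked above). The function names, the UCB-style bonus, the iterated truncated SVD used for the low-rank fill, and the toy rank-1 score matrix are all assumptions made for illustration.

```python
import numpy as np

def low_rank_fill(scores, observed, rank=1, iters=200):
    """Fill unobserved entries by iterating a truncated SVD (hard-impute style).

    Assumption: the score matrix (methods x validation samples) is roughly low rank.
    """
    # Start from the mean of the observed scores in every missing cell.
    filled = np.where(observed, scores, scores[observed].mean())
    for _ in range(iters):
        u, s, vt = np.linalg.svd(filled, full_matrices=False)
        approx = (u[:, :rank] * s[:rank]) @ vt[:rank]
        # Keep observed scores fixed; only missing cells take the low-rank guess.
        filled = np.where(observed, scores, approx)
    return filled

def pick_next_pair(scores, observed, bonus=1.0):
    """UCB-style choice of the next (method, sample) pair to evaluate.

    Assumes at least one entry is still unobserved.
    """
    counts = observed.sum(axis=1)
    total = max(int(observed.sum()), 1)
    means = np.where(observed, scores, 0.0).sum(axis=1) / np.maximum(counts, 1)
    ucb = means + bonus * np.sqrt(np.log(total + 1) / np.maximum(counts, 1))
    ucb[counts == scores.shape[1]] = -np.inf  # fully evaluated methods are skipped
    method = int(np.argmax(ucb))
    sample = int(np.flatnonzero(~observed[method])[0])
    return method, sample

# Toy data (hypothetical): 3 methods x 4 samples, exactly rank 1, two cells unevaluated.
truth = np.outer([0.2, 0.5, 0.9], [1.0, 0.8, 0.6, 0.4])
observed = np.ones_like(truth, dtype=bool)
observed[0, 1] = False
observed[2, 3] = False
scores = np.where(observed, truth, np.nan)

pair = pick_next_pair(scores, observed)
filled = low_rank_fill(scores, observed, rank=1)
```

In this sketch the bandit step decides where to spend the next evaluation, while the low-rank fill estimates the scores that were never computed, so a method can be ranked without evaluating it on every sample.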
Free, publicly-accessible full text available June 11, 2026
REBEL: Reinforcement Learning via Regressing Relative Rewards

Gao, Zhaolin; Chang, Jonathan; Zhan, Wenhao; Oertell, Owen; Swamy, Gokul; Brantley, Kianté; Joachims, Thorsten; Bagnell, J Andrew; Lee, Jason; Sun, Wen (December 2024, 38th Conference on Neural Information Processing Systems (NeurIPS 2024))

Free, publicly-accessible full text available December 15, 2025
More Benefits of Being Distributional: Second-Order Bounds for Reinforcement Learning

Wang, Kaiwen; Oertell, Owen; Agarwal, Alekh; Kallus, Nathan; Sun, Wen (July 2024, Proceedings of the 41st International Conference on Machine Learning)

In this paper, we prove that Distributional Reinforcement Learning (DistRL), which learns the return distribution, can obtain second-order bounds in both online and offline RL in general settings with function approximation. Second-order bounds are instance-dependent bounds that scale with the variance of return, which we prove are tighter than the previously known small-loss bounds of distributional RL. To the best of our knowledge, our results are the first second-order bounds for low-rank MDPs and for offline RL. When specializing to contextual bandits (one-step RL problem), we show that a distributional learning based optimism algorithm achieves a second-order worst-case regret bound, and a second-order gap dependent bound, simultaneously. We also empirically demonstrate the benefit of DistRL in contextual bandits on real-world datasets. We highlight that our analysis with DistRL is relatively simple, follows the general framework of optimism in the face of uncertainty and does not require weighted regression. Our results suggest that DistRL is a promising framework for obtaining second-order bounds in general RL settings, thus further reinforcing the benefits of DistRL.
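An elementary inequality gives some intuition for why a variance-dependent quantity cannot exceed a small-loss (mean-dependent) one. It is not the paper's proof. It assumes the return is a cost $Z$ normalized to $[0,1]$, a normalization chosen here for illustration and not stated in the abstract.

```latex
% Assumption: Z is a random cost taking values in [0, 1].
\[
  \operatorname{Var}(Z)
  \;=\; \mathbb{E}[Z^2] - \bigl(\mathbb{E}[Z]\bigr)^2
  \;\le\; \mathbb{E}[Z^2]
  \;\le\; \mathbb{E}[Z],
\]
% The last step uses Z^2 <= Z whenever 0 <= Z <= 1.
```

Under this assumption the variance can be far smaller than the mean: a deterministic cost has $\operatorname{Var}(Z) = 0$ even when $\mathbb{E}[Z]$ is close to 1.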
Full Text Available
JoinGym: An Efficient Join Order Selection Environment

Wang, Junxiong; Wang, Kaiwen; Li, Yueying; Kallus, Nathan; Trummer, Immanuel; Sun, Wen (August 2024, Reinforcement Learning Journal)

Full Text Available
RL for Consistency Models: Faster Reward Guided Text-to-Image Generation

Oertell, Owen; Chang, Jonathan; Zhang, Yiyi; Brantley, Kianté; Sun, Wen (July 2024, Reinforcement Learning Conference)

Full Text Available
More Benefits of Being Distributional: Second-Order Bounds for Reinforcement Learning

Wang, Kaiwen; Oertell, Owen; Agarwal, Alekh; Kallus, Nathan; Sun, Wen (July 2024, Proceedings of Machine Learning Research)

Full Text Available
JoinGym: An Efficient Join Order Selection Environment

Wang, Junxiong; Wang, Kaiwen; Li, Yueying; Kallus, Nathan; Trummer, Immanuel; Sun, Wen (July 2024, Reinforcement Learning Conference)

Full Text Available
