NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

More Benefits of Being Distributional: Second-Order Bounds for Reinforcement Learning

Wang, Kaiwen; Oertell, Owen; Agarwal, Alekh; Kallus, Nathan; Sun, Wen (July 2024, Proceedings of the 41st International Conference on Machine Learning)

In this paper, we prove that Distributional Re- inforcement Learning (DistRL), which learns the return distribution, can obtain second-order bounds in both online and offline RL in general settings with function approximation. Second- order bounds are instance-dependent bounds that scale with the variance of return, which we prove are tighter than the previously known small-loss bounds of distributional RL. To the best of our knowledge, our results are the first second-order bounds for low-rank MDPs and for offline RL. When specializing to contextual bandits (one-step RL problem), we show that a distributional learn- ing based optimism algorithm achieves a second- order worst-case regret bound, and a second-order gap dependent bound, simultaneously. We also empirically demonstrate the benefit of DistRL in contextual bandits on real-world datasets. We highlight that our analysis with DistRL is rela- tively simple, follows the general framework of optimism in the face of uncertainty and does not require weighted regression. Our results suggest that DistRL is a promising framework for obtain- ing second-order bounds in general RL settings, thus further reinforcing the benefits of DistRL.
more » « less
Full Text Available
More Benefits of Being Distributional: Second-Order Bounds for Reinforcement Learning

Wang, Kaiwen; Oertell, Owen; Agarwal, Alekh; Kallus, Nathan; Sun, Wen (July 2024, Proceedings of the 41st International Conference on Machine Learning)

In this paper, we prove that Distributional Reinforcement Learning (DistRL), which learns the return distribution, can obtain second-order bounds in both online and offline RL in general settings with function approximation. Second-order bounds are instance-dependent bounds that scale with the variance of return, which we prove are tighter than the previously known small-loss bounds of distributional RL. To the best of our knowledge, our results are the first second-order bounds for low-rank MDPs and for offline RL. When specializing to contextual bandits (one-step RL problem), we show that a distributional learning based optimism algorithm achieves a second-order worst-case regret bound, and a second-order gap dependent bound, simultaneously. We also empirically demonstrate the benefit of DistRL in contextual bandits on real-world datasets. We highlight that our analysis with DistRL is relatively simple, follows the general framework of optimism in the face of uncertainty and does not require weighted regression. Our results suggest that DistRL is a promising framework for obtaining second-order bounds in general RL settings, thus further reinforcing the benefits of DistRL.
more » « less
Full Text Available
The Benefits of Being Distributional: Small-Loss Bounds for Reinforcement Learning

Wang, Kaiwen; Zhou, Kevin; Wu, Runzhe; Kallus, Nathan; Sun, Wen (December 2023, 37th Conference on Neural Information Processing Systems (NeurIPS 2023))

While distributional reinforcement learning (DistRL) has been empirically effective, the question of when and why it is better than vanilla, non-distributional RL has remained unanswered. This paper explains the benefits of DistRL through the lens of small-loss bounds, which are instance-dependent bounds that scale with optimal achievable cost. Particularly, our bounds converge much faster than those from non-distributional approaches if the optimal cost is small. As warmup, we propose a distributional contextual bandit (DistCB) algorithm, which we show enjoys small-loss regret bounds and empirically outperforms the state-of-the-art on three real-world tasks. In online RL, we propose a DistRL algorithm that constructs confidence sets using maximum likelihood estimation. We prove that our algorithm enjoys novel small-loss PAC bounds in low-rank MDPs. As part of our analysis, we introduce the l1 distributional eluder dimension which may be of independent interest. Then, in offline RL, we show that pessimistic DistRL enjoys small-loss PAC bounds that are novel to the offline setting and are more robust to bad single-policy coverage.
more » « less
Full Text Available
Future-Dependent Value-Based Off-Policy Evaluation in POMDPs

Uehara, Masatoshi; Kiyohara, Haruka; Bennett, Andrew; Chernozhukov, Victor; Jiang, Nan; Kallus, Nathan; Shi, Chengchun; Sun, Wen (December 2023, Neural Information Processing Systems (NeurIPS 2023))

We study off-policy evaluation (OPE) for partially observable MDPs (POMDPs) with general function approximation. Existing methods such as sequential im- portance sampling estimators suffer from the curse of horizon in POMDPs. To circumvent this problem, we develop a novel model-free OPE method by introduc- ing future-dependent value functions that take future proxies as inputs and perform a similar role to that of classical value functions in fully-observable MDPs. We derive a new off-policy Bellman equation for future-dependent value functions as conditional moment equations that use history proxies as instrumental variables. We further propose a minimax learning method to learn future-dependent value functions using the new Bellman equation. We obtain the PAC result, which implies our OPE estimator is close to the true policy value under Bellman completeness, as long as futures and histories contain sufficient information about latent states.
more » « less
Full Text Available
Future-Dependent Value-Based Off-Policy Evaluation in POMDPs

Uehara, Masatoshi; Kiyohara, Haruka; Bennett, Andrew; Chernozhukov, Victor; Jiang, Nan; Kallus, Nathan; Shi, Chengchun; Sun, Wen (December 2023, 37th Conference on Neural Information Processing Systems (NeurIPS 2023))

Full Text Available
The Benefits of Being Distributional: Small-Loss Bounds for Reinforcement Learning

Wang, Kaiwen; Zhou, Kevin; Wu, Runzhe; Kallus, Nathan; Sun, Wen (December 2023, 37th Conference on Neural Information Processing Systems (NeurIPS 2023))

Full Text Available
Multi-task Representation Learning for Pure Exploration in Linear Bandits

Du, Yihan; Huang, Longbo; Sun, Wen (July 2023, International Conference on Machine Learning)

Full Text Available
Provable Benefits of Representational Transfer in Reinforcement Learning

Agarwal, Alekh; Song, Yuda; Sun, Wen; Wang, Kaiwen; Wang, Mengdi; Zhang, Xuezhou (July 2023, The Conference on Learning Theory)

Full Text Available
HYBRID RL: USING BOTH OFFLINE AND ONLINE DATA CAN MAKE RL EFFICIENT

Song, Yuda; Zhou, Yifei; Sekhari, Ayush; Bagnell, J.Andrew; Krishnamurthy, Akshay; Sun, Wen (May 2023, International Conference on Representation Learning)

Full Text Available

Search for: All records