NSF PAR Search | NSF Public Access Repository

Softmax policy gradient methods can take exponential time to converge

https://doi.org/10.1007/s10107-022-01920-6

Li, Gen; Wei, Yuting; Chi, Yuejie; Chen, Yuxin (January 2023, Mathematical Programming)

The softmax policy gradient (PG) method, which performs gradient ascent under softmax policy parameterization, is arguably one of the de facto implementations of policy optimization in modern reinforcement learning. For $\gamma$-discounted infinite-horizon tabular Markov decision processes (MDPs), remarkable progress has recently been achieved towards establishing global convergence of softmax PG methods in finding a near-optimal policy. However, prior results fall short of delineating clear dependencies of convergence rates on salient parameters such as the cardinality of the state space $\mathcal{S}$ and the effective horizon $\frac{1}{1-\gamma}$, both of which could be excessively large. In this paper, we deliver a pessimistic message regarding the iteration complexity of softmax PG methods, despite assuming access to exact gradient computation. Specifically, we demonstrate that the softmax PG method with stepsize $\eta$ can take $$\frac{1}{\eta} |\mathcal{S}|^{2^{\Omega\big(\frac{1}{1-\gamma}\big)}} ~\text{iterations}$$ to converge, even in the presence of a benign policy initialization and an initial state distribution amenable to exploration (so that the distribution mismatch coefficient is not exceedingly large). This is accomplished by characterizing the algorithmic dynamics over a carefully-constructed MDP containing only three actions. Our exponential lower bound hints at the necessity of carefully adjusting update rules or enforcing proper regularization in accelerating PG methods.
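As a rough illustration of the method this abstract studies, the following Python sketch runs softmax PG with exact gradients on a small tabular MDP. It is a minimal sketch under stated assumptions, not the paper's construction: the function names (`softmax_policy`, `evaluate`, `softmax_pg`), the array layout (`P` of shape states × actions × states, `r` of shape states × actions), and the zero initialization of the parameters are all hypothetical choices made for illustration. The gradient uses the standard policy gradient identity for the softmax parameterization, $\frac{\partial V^{\pi}(\rho)}{\partial \theta(s,a)} = \frac{1}{1-\gamma}\, d_{\rho}^{\pi}(s)\, \pi(a \mid s)\, A^{\pi}(s,a)$.

```python
import numpy as np

def softmax_policy(theta):
    """Row-wise softmax: theta has shape (S, A), one row of logits per state."""
    z = theta - theta.max(axis=1, keepdims=True)  # shift for numerical stability
    p = np.exp(z)
    return p / p.sum(axis=1, keepdims=True)

def evaluate(P, r, pi, gamma):
    """Exact policy evaluation: returns V (S,), Q (S, A) and P_pi (S, S)."""
    S = r.shape[0]
    P_pi = np.einsum("sa,sat->st", pi, P)   # state-to-state kernel under pi
    r_pi = (pi * r).sum(axis=1)             # expected one-step reward under pi
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    Q = r + gamma * (P @ V)                 # (S, A, S) @ (S,) -> (S, A)
    return V, Q, P_pi

def softmax_pg(P, r, rho, gamma, eta, num_iters):
    """Gradient ascent on V^pi(rho) over softmax logits, using exact gradients."""
    S, A = r.shape
    theta = np.zeros((S, A))                # assumption: uniform initial policy
    values = []
    for _ in range(num_iters):
        pi = softmax_policy(theta)
        V, Q, P_pi = evaluate(P, r, pi, gamma)
        values.append(rho @ V)              # value of the current policy
        # Discounted state visitation: d = (1 - gamma) * (I - gamma P_pi)^{-T} rho
        d = (1 - gamma) * np.linalg.solve((np.eye(S) - gamma * P_pi).T, rho)
        advantage = Q - V[:, None]
        grad = d[:, None] * pi * advantage / (1 - gamma)
        theta = theta + eta * grad
    return softmax_policy(theta), np.array(values)
```

On a benign toy MDP this loop improves the value quickly; the abstract's point is that carefully constructed instances exist on which the same update needs a number of iterations that grows exponentially in the effective horizon.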
Settling the Sample Complexity of Online Reinforcement Learning

https://doi.org/10.1145/3733592

Zhang, Zihan; Chen, Yuxin; Lee, Jason; Du, Simon S (May 2025, Journal of the ACM)

A central issue lying at the heart of online reinforcement learning (RL) is data efficiency. While a number of recent works achieved asymptotically minimal regret in online RL, the optimality of these results is only guaranteed in a “large-sample” regime, imposing enormous burn-in cost in order for their algorithms to operate optimally. How to achieve minimax-optimal regret without incurring any burn-in cost has been an open problem in RL theory. We settle this problem for finite-horizon inhomogeneous Markov decision processes. Specifically, we prove that a modified version of MVP (Monotonic Value Propagation), an optimistic model-based algorithm proposed by Zhang et al. [82], achieves a regret on the order of (modulo log factors) $$\min\big\{ \sqrt{SAH^3K}, \, HK \big\},$$ where $S$ is the number of states, $A$ is the number of actions, $H$ is the horizon length, and $K$ is the total number of episodes. This regret matches the minimax lower bound for the entire range of sample size $K \ge 1$, essentially eliminating any burn-in requirement. It also translates to a PAC sample complexity (i.e., the number of episodes needed to yield $\varepsilon$-accuracy) of $\frac{SAH^3}{\varepsilon^2}$ up to log factors, which is minimax-optimal for the full $\varepsilon$-range. Further, we extend our theory to unveil the influences of problem-dependent quantities like the optimal value/cost and certain variances. The key technical innovation lies in a novel analysis paradigm (based on a new concept called “profiles”) to decouple complicated statistical dependency across the sample trajectories, a long-standing challenge facing the analysis of online RL in the sample-starved regime.
Free, publicly-accessible full text available May 2, 2026
Settling the Sample Complexity of Online Reinforcement Learning

Zhang, Zihan; Chen, Yuxin; Lee, Jason; Du, Simon (July 2024, Conference on Learning Theory)
Optimal Multi-Distribution Learning

Zhang, Zihan; Zhan, Wenhao; Chen, Yuxin; Du, Simon; Lee, Jason (July 2024, Conference on Learning Theory)
Minimax-Optimal Reward-Agnostic Exploration in Reinforcement Learning

Li, Gen; Yan, Yuling; Chen, Yuxin; Fan, Jianqing (July 2024, Conference on Learning Theory)
Settling the sample complexity of model-based offline reinforcement learning

https://doi.org/10.1214/23-AOS2342

Li, Gen; Shi, Laixi; Chen, Yuxin; Chi, Yuejie; Wei, Yuting (February 2024, The Annals of Statistics)

Full Text Available
Breaking the Sample Size Barrier in Model-Based Reinforcement Learning with a Generative Model

https://doi.org/10.1287/opre.2023.2451

Li, Gen; Wei, Yuting; Chi, Yuejie; Chen, Yuxin (January 2024, Operations Research)

This paper studies a central issue in modern reinforcement learning, the sample efficiency, and makes progress toward solving an idealistic scenario that assumes access to a generative model or a simulator. Despite a large number of prior works tackling this problem, a complete picture of the trade-offs between sample complexity and statistical accuracy has yet to be determined. In particular, all prior results suffer from a severe sample size barrier in the sense that their claimed statistical guarantees hold only when the sample size exceeds some enormous threshold. The current paper overcomes this barrier and fully settles this problem; more specifically, we establish the minimax optimality of the model-based approach for any given target accuracy level. To the best of our knowledge, this work delivers the first minimax-optimal guarantees that accommodate the entire range of sample sizes (beyond which finding a meaningful policy is information theoretically infeasible).
Full Text Available
Is Q-Learning Minimax Optimal? A Tight Sample Complexity Analysis

https://doi.org/10.1287/opre.2023.2450

Li, Gen; Cai, Changxiao; Chen, Yuxin; Wei, Yuting; Chi, Yuejie (January 2024, Operations Research)

This paper investigates a model-free algorithm of broad interest in reinforcement learning, namely, Q-learning. Whereas substantial progress had been made toward understanding the sample efficiency of Q-learning in recent years, it remained largely unclear whether Q-learning is sample-optimal and how to sharpen the sample complexity analysis of Q-learning. In this paper, we settle these questions: (1) When there is only a single action, we show that Q-learning (or, equivalently, TD learning) is provably minimax optimal. (2) When there are at least two actions, our theory unveils the strict suboptimality of Q-learning and rigorizes the negative impact of overestimation in Q-learning. Our theory accommodates both the synchronous case (i.e., the case in which independent samples are drawn) and the asynchronous case (i.e., the case in which one only has access to a single Markovian trajectory).
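For readers unfamiliar with the algorithm analyzed in this abstract, the following Python sketch shows Q-learning in the synchronous setting it describes, where one independent next-state sample is drawn for every state-action pair at each iteration. It is a minimal sketch under stated assumptions: the function name `synchronous_q_learning`, the array layout (`P` of shape states × actions × states, `r` of shape states × actions), the zero initialization, and the constant learning rate `eta` are hypothetical illustration choices, not the paper's exact setup or learning-rate schedule.

```python
import numpy as np

def synchronous_q_learning(P, r, gamma, eta, num_iters, seed=0):
    """Synchronous Q-learning: one fresh sample per (state, action) per iteration."""
    rng = np.random.default_rng(seed)
    S, A = r.shape
    Q = np.zeros((S, A))                      # assumption: zero initialization
    for _ in range(num_iters):
        V = Q.max(axis=1)                     # greedy value of the current estimate
        Q_next = np.empty_like(Q)
        for s in range(S):
            for a in range(A):
                # Draw s' ~ P(. | s, a), as a generative model would.
                s_next = rng.choice(S, p=P[s, a])
                target = r[s, a] + gamma * V[s_next]
                # Move the old estimate toward the sampled Bellman target.
                Q_next[s, a] = (1 - eta) * Q[s, a] + eta * target
        Q = Q_next
    return Q
```

The `max` over actions inside the target is the source of the overestimation effect mentioned in the abstract; with a single action the `max` is trivial and the update reduces to TD learning.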
Full Text Available
Reward-Agnostic Fine-Tuning: Provable Statistical Benefits of Hybrid Reinforcement Learning

Li, Gen; Zhan, Wenhao; Lee, Jason; Chi, Yuejie; Chen, Yuxin (December 2023, Neural Information Processing Systems)
The Efficacy of Pessimism in Asynchronous Q-Learning

https://doi.org/10.1109/TIT.2023.3299840

Yan, Yuling; Li, Gen; Chen, Yuxin; Fan, Jianqing (November 2023, IEEE Transactions on Information Theory)
