The softmax policy gradient (PG) method, which performs gradient ascent under softmax policy parameterization, is arguably one of the de facto implementations of policy optimization in modern reinforcement learning. For
Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher.
Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?
Some links on this page may take you to nonfederal websites. Their policies may differ from this site.

Abstract discounted infinitehorizon tabular Markov decision processes (MDPs), remarkable progress has recently been achieved towards establishing global convergence of softmax PG methods in finding a nearoptimal policy. However, prior results fall short of delineating clear dependencies of convergence rates on salient parameters such as the cardinality of the state space$$\gamma $$ $\gamma $ and the effective horizon$${\mathcal {S}}$$ $S$ , both of which could be excessively large. In this paper, we deliver a pessimistic message regarding the iteration complexity of softmax PG methods, despite assuming access to exact gradient computation. Specifically, we demonstrate that the softmax PG method with stepsize$$\frac{1}{1\gamma }$$ $\frac{1}{1\gamma}$ can take$$\eta $$ $\eta $ to converge, even in the presence of a benign policy initialization and an initial state distribution amenable to exploration (so that the distribution mismatch coefficient is not exceedingly large). This is accomplished by characterizing the algorithmic dynamics over a carefullyconstructed MDP containing only three actions. Our exponential lower bound hints at the necessity of carefully adjusting update rules or enforcing proper regularization in accelerating PG methods.$$\begin{aligned} \frac{1}{\eta } {\mathcal {S}}^{2^{\Omega \big (\frac{1}{1\gamma }\big )}} ~\text {iterations} \end{aligned}$$ $\begin{array}{c}\frac{1}{\eta}{\leftS\right}^{{2}^{\Omega (\frac{1}{1\gamma})}}\phantom{\rule{0ex}{0ex}}\text{iterations}\end{array}$ 
Free, publiclyaccessible full text available May 7, 2025

This paper investigates a modelfree algorithm of broad interest in reinforcement learning, namely, Qlearning. Whereas substantial progress had been made toward understanding the sample efficiency of Qlearning in recent years, it remained largely unclear whether Qlearning is sampleoptimal and how to sharpen the sample complexity analysis of Qlearning. In this paper, we settle these questions: (1) When there is only a single action, we show that Qlearning (or, equivalently, TD learning) is provably minimax optimal. (2) When there are at least two actions, our theory unveils the strict suboptimality of Qlearning and rigorizes the negative impact of overestimation in Qlearning. Our theory accommodates both the synchronous case (i.e., the case in which independent samples are drawn) and the asynchronous case (i.e., the case in which one only has access to a single Markovian trajectory).
Free, publiclyaccessible full text available January 1, 2025 
This paper studies a central issue in modern reinforcement learning, the sample efficiency, and makes progress toward solving an idealistic scenario that assumes access to a generative model or a simulator. Despite a large number of prior works tackling this problem, a complete picture of the tradeoffs between sample complexity and statistical accuracy has yet to be determined. In particular, all prior results suffer from a severe sample size barrier in the sense that their claimed statistical guarantees hold only when the sample size exceeds some enormous threshold. The current paper overcomes this barrier and fully settles this problem; more specifically, we establish the minimax optimality of the modelbased approach for any given target accuracy level. To the best of our knowledge, this work delivers the first minimaxoptimal guarantees that accommodate the entire range of sample sizes (beyond which finding a meaningful policy is information theoretically infeasible).
Free, publiclyaccessible full text available January 1, 2025 
Free, publiclyaccessible full text available January 1, 2025

Free, publiclyaccessible full text available June 30, 2024

Abstract Achieving sample efficiency in online episodic reinforcement learning (RL) requires optimally balancing exploration and exploitation. When it comes to a finitehorizon episodic Markov decision process with $S$ states, $A$ actions and horizon length $H$, substantial progress has been achieved toward characterizing the minimaxoptimal regret, which scales on the order of $\sqrt{H^2SAT}$ (modulo log factors) with $T$ the total number of samples. While several competing solution paradigms have been proposed to minimize regret, they are either memoryinefficient, or fall short of optimality unless the sample size exceeds an enormous threshold (e.g. $S^6A^4 \,\mathrm{poly}(H)$ for existing modelfree methods). To overcome such a large sample size barrier to efficient RL, we design a novel modelfree algorithm, with space complexity $O(SAH)$, that achieves nearoptimal regret as soon as the sample size exceeds the order of $SA\,\mathrm{poly}(H)$. In terms of this sample size requirement (also referred to the initial burnin cost), our method improves—by at least a factor of $S^5A^3$—upon any prior memoryefficient algorithm that is asymptotically regretoptimal. Leveraging the recently introduced variance reduction strategy (also called referenceadvantage decomposition), the proposed algorithm employs an earlysettled reference update rule, with the aid of two Qlearning sequences with upper and lower confidence bounds. The design principle of our earlysettled variance reduction method might be of independent interest to other RL settings that involve intricate exploration–exploitation tradeoffs.more » « less