The softmax policy gradient (PG) method, which performs gradient ascent under softmax policy parameterization, is arguably one of the de facto implementations of policy optimization in modern reinforcement learning. For
Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher.
Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?
Some links on this page may take you to nonfederal websites. Their policies may differ from this site.

Abstract discounted infinitehorizon tabular Markov decision processes (MDPs), remarkable progress has recently been achieved towards establishing global convergence of softmax PG methods in finding a nearoptimal policy. However, prior results fall short of delineating clear dependencies of convergence rates on salient parameters such as the cardinality of the state space$$\gamma $$ $\gamma $ and the effective horizon$${\mathcal {S}}$$ $S$ , both of which could be excessively large. In this paper, we deliver a pessimistic message regarding the iteration complexity of softmax PG methods, despite assuming access to exact gradient computation. Specifically, we demonstrate that the softmax PG method with stepsize$$\frac{1}{1\gamma }$$ $\frac{1}{1\gamma}$ can take$$\eta $$ $\eta $ to converge, even in the presence of a benign policy initialization and an initial state distribution amenable to exploration (so that the distribution mismatch coefficient is not exceedingly large). This is accomplished by characterizing the algorithmic dynamics over a carefullyconstructed MDP containing only three actions. Our exponential lower bound hints at the necessity of carefully adjusting update rules or enforcing proper regularization in accelerating PG methods.$$\begin{aligned} \frac{1}{\eta } {\mathcal {S}}^{2^{\Omega \big (\frac{1}{1\gamma }\big )}} ~\text {iterations} \end{aligned}$$ $\begin{array}{c}\frac{1}{\eta}{\leftS\right}^{{2}^{\Omega (\frac{1}{1\gamma})}}\phantom{\rule{0ex}{0ex}}\text{iterations}\end{array}$ 
Free, publiclyaccessible full text available July 1, 2025

Free, publiclyaccessible full text available April 1, 2025

Free, publiclyaccessible full text available February 1, 2025

This paper investigates a modelfree algorithm of broad interest in reinforcement learning, namely, Qlearning. Whereas substantial progress had been made toward understanding the sample efficiency of Qlearning in recent years, it remained largely unclear whether Qlearning is sampleoptimal and how to sharpen the sample complexity analysis of Qlearning. In this paper, we settle these questions: (1) When there is only a single action, we show that Qlearning (or, equivalently, TD learning) is provably minimax optimal. (2) When there are at least two actions, our theory unveils the strict suboptimality of Qlearning and rigorizes the negative impact of overestimation in Qlearning. Our theory accommodates both the synchronous case (i.e., the case in which independent samples are drawn) and the asynchronous case (i.e., the case in which one only has access to a single Markovian trajectory).
Free, publiclyaccessible full text available January 1, 2025 
We study a noisy tensor completion problem of broad practical interest, namely, the reconstruction of a lowrank tensor from highly incomplete and randomly corrupted observations of its entries. Whereas a variety of prior work has been dedicated to this problem, prior algorithms either are computationally too expensive for largescale applications or come with suboptimal statistical guarantees. Focusing on “incoherent” and wellconditioned tensors of a constant canonical polyadic rank, we propose a twostage nonconvex algorithm—(vanilla) gradient descent following a rough initialization—that achieves the best of both worlds. Specifically, the proposed nonconvex algorithm faithfully completes the tensor and retrieves all individual tensor factors within nearly linear time, while at the same time enjoying nearoptimal statistical guarantees (i.e., minimal sample complexity and optimal estimation accuracy). The estimation errors are evenly spread out across all entries, thus achieving optimal [Formula: see text] statistical accuracy. We also discuss how to extend our approach to accommodate asymmetric tensors. The insight conveyed through our analysis of nonconvex optimization might have implications for other tensor estimation problems.more » « less