NSF PAR Search | NSF Public Access Repository

Softmax policy gradient methods can take exponential time to converge

https://doi.org/10.1007/s10107-022-01920-6

Li, Gen; Wei, Yuting; Chi, Yuejie; Chen, Yuxin (January 2023, Mathematical Programming)

Abstract The softmax policy gradient (PG) method, which performs gradient ascent under softmax policy parameterization, is arguably one of the de facto implementations of policy optimization in modern reinforcement learning. For $\gamma$-discounted infinite-horizon tabular Markov decision processes (MDPs), remarkable progress has recently been achieved towards establishing global convergence of softmax PG methods in finding a near-optimal policy. However, prior results fall short of delineating clear dependencies of convergence rates on salient parameters such as the cardinality of the state space $\mathcal{S}$ and the effective horizon $\frac{1}{1-\gamma}$, both of which could be excessively large. In this paper, we deliver a pessimistic message regarding the iteration complexity of softmax PG methods, despite assuming access to exact gradient computation. Specifically, we demonstrate that the softmax PG method with stepsize $\eta$ can take $$\frac{1}{\eta} |\mathcal{S}|^{2^{\Omega\big(\frac{1}{1-\gamma}\big)}} ~\text{iterations}$$ to converge, even in the presence of a benign policy initialization and an initial state distribution amenable to exploration (so that the distribution mismatch coefficient is not exceedingly large). This is accomplished by characterizing the algorithmic dynamics over a carefully-constructed MDP containing only three actions. Our exponential lower bound hints at the necessity of carefully adjusting update rules or enforcing proper regularization in accelerating PG methods.
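For readers unfamiliar with the method named in this abstract, the following is a minimal sketch of softmax PG with exact gradients on a tabular MDP. It uses the standard policy gradient identity $\partial V^{\pi}(\rho)/\partial\theta(s,a) = \frac{1}{1-\gamma} d_{\rho}^{\pi}(s)\,\pi(a|s)\,A^{\pi}(s,a)$. The function names (`softmax`, `policy_eval`, `softmax_pg`) and the two-state toy MDP are hypothetical choices made for illustration; this is not the paper's three-action construction and it does not reproduce the lower bound.

```python
import numpy as np

def softmax(theta):
    # Row-wise softmax; subtracting the row max keeps exp() numerically stable.
    z = theta - theta.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def policy_eval(P, r, pi, gamma):
    # P: (S, A, S) transition kernel, r: (S, A) rewards, pi: (S, A) policy.
    S = r.shape[0]
    P_pi = np.einsum("sa,sat->st", pi, P)   # state-to-state kernel under pi
    r_pi = (pi * r).sum(axis=1)             # expected one-step reward under pi
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    Q = r + gamma * (P @ V)                 # (S, A, S) @ (S,) -> (S, A)
    return V, Q, P_pi

def softmax_pg(P, r, rho, gamma, eta, iters):
    # Gradient ascent on V^pi(rho) over softmax logits theta, exact gradients.
    S, A = r.shape
    theta = np.zeros((S, A))                # uniform initial policy (assumption)
    for _ in range(iters):
        pi = softmax(theta)
        V, Q, P_pi = policy_eval(P, r, pi, gamma)
        # Discounted visitation: d^T = (1 - gamma) * rho^T (I - gamma P_pi)^{-1}
        d = (1 - gamma) * np.linalg.solve((np.eye(S) - gamma * P_pi).T, rho)
        adv = Q - V[:, None]
        grad = d[:, None] * pi * adv / (1 - gamma)
        theta = theta + eta * grad
    return softmax(theta)

# Hypothetical toy MDP: action 0 stays, action 1 switches state;
# only staying in state 1 yields reward 1.
P = np.zeros((2, 2, 2))
P[0, 0, 0] = 1.0
P[0, 1, 1] = 1.0
P[1, 0, 1] = 1.0
P[1, 1, 0] = 1.0
r = np.array([[0.0, 0.0], [1.0, 0.0]])
rho = np.array([0.5, 0.5])
gamma, eta = 0.5, 0.01

pi_init = softmax(np.zeros((2, 2)))
v_init = rho @ policy_eval(P, r, pi_init, gamma)[0]
pi_final = softmax_pg(P, r, rho, gamma, eta, iters=2000)
v_final = rho @ policy_eval(P, r, pi_final, gamma)[0]
```

The stepsize here is chosen below $(1-\gamma)^3/8$, a range in which the value is known to improve monotonically; the abstract's point is that the number of such iterations can still be enormous on adversarially designed instances.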
Settling the sample complexity of model-based offline reinforcement learning

https://doi.org/10.1214/23-AOS2342

Li, Gen; Shi, Laixi; Chen, Yuxin; Chi, Yuejie; Wei, Yuting (February 2024, The Annals of Statistics)

Full Text Available
Is Q-Learning Minimax Optimal? A Tight Sample Complexity Analysis

https://doi.org/10.1287/opre.2023.2450

Li, Gen; Cai, Changxiao; Chen, Yuxin; Wei, Yuting; Chi, Yuejie (January 2024, Operations Research)

This paper investigates a model-free algorithm of broad interest in reinforcement learning, namely, Q-learning. Whereas substantial progress had been made toward understanding the sample efficiency of Q-learning in recent years, it remained largely unclear whether Q-learning is sample-optimal and how to sharpen the sample complexity analysis of Q-learning. In this paper, we settle these questions: (1) When there is only a single action, we show that Q-learning (or, equivalently, TD learning) is provably minimax optimal. (2) When there are at least two actions, our theory unveils the strict suboptimality of Q-learning and rigorizes the negative impact of overestimation in Q-learning. Our theory accommodates both the synchronous case (i.e., the case in which independent samples are drawn) and the asynchronous case (i.e., the case in which one only has access to a single Markovian trajectory).
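As a rough illustration of the synchronous setting described in this abstract, the sketch below draws one independent next-state sample for every state-action pair in each iteration and applies the classical Q-learning update $Q(s,a) \leftarrow (1-\eta)\,Q(s,a) + \eta\,\big(r(s,a) + \gamma \max_{a'} Q(s',a')\big)$. The function name `sync_q_learning`, the constant stepsize, and the toy MDP are assumptions made for illustration; the paper's stepsize schedules and analysis are not reproduced here.

```python
import numpy as np

def sync_q_learning(P, r, gamma, eta, iters, seed=0):
    # P: (S, A, S) transition kernel, r: (S, A) rewards in [0, 1].
    rng = np.random.default_rng(seed)
    S, A = r.shape
    Q = np.zeros((S, A))
    for _ in range(iters):
        V = Q.max(axis=1)  # greedy value estimate per state
        # One independent sample s' ~ P(. | s, a) for every (s, a) pair.
        next_s = np.array(
            [[rng.choice(S, p=P[s, a]) for a in range(A)] for s in range(S)]
        )
        # Constant-stepsize update (an assumption of this sketch).
        Q = (1 - eta) * Q + eta * (r + gamma * V[next_s])
    return Q

# Hypothetical deterministic toy MDP: action 0 stays, action 1 switches state;
# only staying in state 1 yields reward 1.
P = np.zeros((2, 2, 2))
P[0, 0, 0] = 1.0
P[0, 1, 1] = 1.0
P[1, 0, 1] = 1.0
P[1, 1, 0] = 1.0
r = np.array([[0.0, 0.0], [1.0, 0.0]])

Q_hat = sync_q_learning(P, r, gamma=0.5, eta=0.5, iters=200)
```

Because the toy transitions are deterministic, the sampling noise vanishes and the iterates contract toward the optimal Q-function; with stochastic transitions, the max over noisy estimates is where the overestimation discussed in the abstract enters.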
Full Text Available
Policy Mirror Descent for Regularized Reinforcement Learning: A Generalized Framework with Linear Convergence

https://doi.org/10.1137/21M1456789

Zhan, Wenhao; Cen, Shicong; Huang, Baihe; Chen, Yuxin; Lee, Jason D.; Chi, Yuejie (June 2023, SIAM Journal on Optimization)

Full Text Available
Breaking the sample complexity barrier to regret-optimal model-free reinforcement learning

https://doi.org/10.1093/imaiai/iaac034

Li, Gen; Shi, Laixi; Chen, Yuxin; Chi, Yuejie (February 2023, Information and Inference: A Journal of the IMA)

Abstract Achieving sample efficiency in online episodic reinforcement learning (RL) requires optimally balancing exploration and exploitation. When it comes to a finite-horizon episodic Markov decision process with $S$ states, $A$ actions and horizon length $H$, substantial progress has been achieved toward characterizing the minimax-optimal regret, which scales on the order of $\sqrt{H^2SAT}$ (modulo log factors) with $T$ the total number of samples. While several competing solution paradigms have been proposed to minimize regret, they are either memory-inefficient, or fall short of optimality unless the sample size exceeds an enormous threshold (e.g. $S^6A^4\,\mathrm{poly}(H)$ for existing model-free methods). To overcome such a large sample size barrier to efficient RL, we design a novel model-free algorithm, with space complexity $O(SAH)$, that achieves near-optimal regret as soon as the sample size exceeds the order of $SA\,\mathrm{poly}(H)$. In terms of this sample size requirement (also referred to as the initial burn-in cost), our method improves, by at least a factor of $S^5A^3$, upon any prior memory-efficient algorithm that is asymptotically regret-optimal. Leveraging the recently introduced variance reduction strategy (also called reference-advantage decomposition), the proposed algorithm employs an early-settled reference update rule, with the aid of two Q-learning sequences with upper and lower confidence bounds. The design principle of our early-settled variance reduction method might be of independent interest to other RL settings that involve intricate exploration–exploitation trade-offs.
Full Text Available
Sample Complexity of Asynchronous Q-Learning: Sharper Analysis and Variance Reduction

https://doi.org/10.1109/TIT.2021.3120096

Li, Gen; Wei, Yuting; Chi, Yuejie; Gu, Yuantao; Chen, Yuxin (January 2022, IEEE Transactions on Information Theory)

Full Text Available
Tackling Small Eigen-Gaps: Fine-Grained Eigenvector Estimation and Inference Under Heteroscedastic Noise

https://doi.org/10.1109/TIT.2021.3111828

Cheng, Chen; Wei, Yuting; Chen, Yuxin (November 2021, IEEE Transactions on Information Theory)
Full Text Available
