skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Softmax policy gradient methods can take exponential time to converge
Abstract The softmax policy gradient (PG) method, which performs gradient ascent under softmax policy parameterization, is arguably one of the de facto implementations of policy optimization in modern reinforcement learning. For$$\gamma $$ γ -discounted infinite-horizon tabular Markov decision processes (MDPs), remarkable progress has recently been achieved towards establishing global convergence of softmax PG methods in finding a near-optimal policy. However, prior results fall short of delineating clear dependencies of convergence rates on salient parameters such as the cardinality of the state space$${\mathcal {S}}$$ S and the effective horizon$$\frac{1}{1-\gamma }$$ 1 1 - γ , both of which could be excessively large. In this paper, we deliver a pessimistic message regarding the iteration complexity of softmax PG methods, despite assuming access to exact gradient computation. Specifically, we demonstrate that the softmax PG method with stepsize$$\eta $$ η can take$$\begin{aligned} \frac{1}{\eta } |{\mathcal {S}}|^{2^{\Omega \big (\frac{1}{1-\gamma }\big )}} ~\text {iterations} \end{aligned}$$ 1 η | S | 2 Ω ( 1 1 - γ ) iterations to converge, even in the presence of a benign policy initialization and an initial state distribution amenable to exploration (so that the distribution mismatch coefficient is not exceedingly large). This is accomplished by characterizing the algorithmic dynamics over a carefully-constructed MDP containing only three actions. Our exponential lower bound hints at the necessity of carefully adjusting update rules or enforcing proper regularization in accelerating PG methods.  more » « less
Award ID(s):
2143215 2106778 2106739 1907661 2218773 2221009 2218713 1900140 2100158 2014279
PAR ID:
10392524
Author(s) / Creator(s):
; ; ;
Publisher / Repository:
Springer Science + Business Media
Date Published:
Journal Name:
Mathematical Programming
Volume:
201
Issue:
1-2
ISSN:
0025-5610
Format(s):
Medium: X Size: p. 707-802
Size(s):
p. 707-802
Sponsoring Org:
National Science Foundation
More Like this
  1. A<sc>bstract</sc> A search for the fully reconstructed$$ {B}_s^0 $$ B s 0 → μ+μγdecay is performed at the LHCb experiment using proton-proton collisions at$$ \sqrt{s} $$ s = 13 TeV corresponding to an integrated luminosity of 5.4 fb−1. No significant signal is found and upper limits on the branching fraction in intervals of the dimuon mass are set$$ {\displaystyle \begin{array}{cc}\mathcal{B}\left({B}_s^0\to {\mu}^{+}{\mu}^{-}\gamma \right)<4.2\times {10}^{-8},& m\left({\mu}^{+}{\mu}^{-}\right)\in \left[2{m}_{\mu },1.70\right]\textrm{GeV}/{c}^2,\\ {}\mathcal{B}\left({B}_s^0\to {\mu}^{+}{\mu}^{-}\gamma \right)<7.7\times {10}^{-8},&\ m\left({\mu}^{+}{\mu}^{-}\right)\in \left[\textrm{1.70,2.88}\right]\textrm{GeV}/{c}^2,\\ {}\mathcal{B}\left({B}_s^0\to {\mu}^{+}{\mu}^{-}\gamma \right)<4.2\times {10}^{-8},& m\left({\mu}^{+}{\mu}^{-}\right)\in \left[3.92,{m}_{B_s^0}\right]\textrm{GeV}/{c}^2,\end{array}} $$ B B s 0 μ + μ γ < 4.2 × 10 8 , m μ + μ 2 m μ 1.70 GeV / c 2 , B B s 0 μ + μ γ < 7.7 × 10 8 , m μ + μ 1.70, 2.88 GeV / c 2 , B B s 0 μ + μ γ < 4.2 × 10 8 , m μ + μ 3.92 m B s 0 GeV / c 2 , at 95% confidence level. Additionally, upper limits are set on the branching fraction in the [2mμ,1.70] GeV/c2dimuon mass region excluding the contribution from the intermediateϕ(1020) meson, and in the region combining all dimuon-mass intervals. 
    more » « less
  2. Abstract We consider the following stochastic heat equation$$\begin{aligned} \partial _t u(t,x) = \tfrac{1}{2} \partial ^2_x u(t,x) + b(u(t,x)) + \sigma (u(t,x)) {\dot{W}}(t,x), \end{aligned}$$ t u ( t , x ) = 1 2 x 2 u ( t , x ) + b ( u ( t , x ) ) + σ ( u ( t , x ) ) W ˙ ( t , x ) , defined for$$(t,x)\in (0,\infty )\times {\mathbb {R}}$$ ( t , x ) ( 0 , ) × R , where$${\dot{W}}$$ W ˙ denotes space-time white noise. The function$$\sigma $$ σ is assumed to be positive, bounded, globally Lipschitz, and bounded uniformly away from the origin, and the functionbis assumed to be positive, locally Lipschitz and nondecreasing. We prove that the Osgood condition$$\begin{aligned} \int _1^\infty \frac{\textrm{d}y}{b(y)}<\infty \end{aligned}$$ 1 d y b ( y ) < implies that the solution almost surely blows up everywhere and instantaneously, In other words, the Osgood condition ensures that$$\textrm{P}\{ u(t,x)=\infty \quad \hbox { for all } t>0 \hbox { and } x\in {\mathbb {R}}\}=1.$$ P { u ( t , x ) = for all t > 0 and x R } = 1 . The main ingredients of the proof involve a hitting-time bound for a class of differential inequalities (Remark 3.3), and the study of the spatial growth of stochastic convolutions using techniques from the Malliavin calculus and the Poincaré inequalities that were developed in Chen et al. (Electron J Probab 26:1–37, 2021, J Funct Anal 282(2):109290, 2022). 
    more » « less
  3. Abstract The free multiplicative Brownian motion$$b_{t}$$ b t is the large-Nlimit of the Brownian motion on$$\mathsf {GL}(N;\mathbb {C}),$$ GL ( N ; C ) , in the sense of$$*$$ -distributions. The natural candidate for the large-Nlimit of the empirical distribution of eigenvalues is thus the Brown measure of$$b_{t}$$ b t . In previous work, the second and third authors showed that this Brown measure is supported in the closure of a region$$\Sigma _{t}$$ Σ t that appeared in the work of Biane. In the present paper, we compute the Brown measure completely. It has a continuous density$$W_{t}$$ W t on$$\overline{\Sigma }_{t},$$ Σ ¯ t , which is strictly positive and real analytic on$$\Sigma _{t}$$ Σ t . This density has a simple form in polar coordinates:$$\begin{aligned} W_{t}(r,\theta )=\frac{1}{r^{2}}w_{t}(\theta ), \end{aligned}$$ W t ( r , θ ) = 1 r 2 w t ( θ ) , where$$w_{t}$$ w t is an analytic function determined by the geometry of the region$$\Sigma _{t}$$ Σ t . We show also that the spectral measure of free unitary Brownian motion$$u_{t}$$ u t is a “shadow” of the Brown measure of$$b_{t}$$ b t , precisely mirroring the relationship between the circular and semicircular laws. We develop several new methods, based on stochastic differential equations and PDE, to prove these results. 
    more » « less
  4. Abstract Let$$(h_I)$$ ( h I ) denote the standard Haar system on [0, 1], indexed by$$I\in \mathcal {D}$$ I D , the set of dyadic intervals and$$h_I\otimes h_J$$ h I h J denote the tensor product$$(s,t)\mapsto h_I(s) h_J(t)$$ ( s , t ) h I ( s ) h J ( t ) ,$$I,J\in \mathcal {D}$$ I , J D . We consider a class of two-parameter function spaces which are completions of the linear span$$\mathcal {V}(\delta ^2)$$ V ( δ 2 ) of$$h_I\otimes h_J$$ h I h J ,$$I,J\in \mathcal {D}$$ I , J D . This class contains all the spaces of the formX(Y), whereXandYare either the Lebesgue spaces$$L^p[0,1]$$ L p [ 0 , 1 ] or the Hardy spaces$$H^p[0,1]$$ H p [ 0 , 1 ] ,$$1\le p < \infty $$ 1 p < . We say that$$D:X(Y)\rightarrow X(Y)$$ D : X ( Y ) X ( Y ) is a Haar multiplier if$$D(h_I\otimes h_J) = d_{I,J} h_I\otimes h_J$$ D ( h I h J ) = d I , J h I h J , where$$d_{I,J}\in \mathbb {R}$$ d I , J R , and ask which more elementary operators factor throughD. A decisive role is played by theCapon projection$$\mathcal {C}:\mathcal {V}(\delta ^2)\rightarrow \mathcal {V}(\delta ^2)$$ C : V ( δ 2 ) V ( δ 2 ) given by$$\mathcal {C} h_I\otimes h_J = h_I\otimes h_J$$ C h I h J = h I h J if$$|I|\le |J|$$ | I | | J | , and$$\mathcal {C} h_I\otimes h_J = 0$$ C h I h J = 0 if$$|I| > |J|$$ | I | > | J | , as our main result highlights: Given any bounded Haar multiplier$$D:X(Y)\rightarrow X(Y)$$ D : X ( Y ) X ( Y ) , there exist$$\lambda ,\mu \in \mathbb {R}$$ λ , μ R such that$$\begin{aligned} \lambda \mathcal {C} + \mu ({{\,\textrm{Id}\,}}-\mathcal {C})\text { approximately 1-projectionally factors through }D, \end{aligned}$$ λ C + μ ( Id - C ) approximately 1-projectionally factors through D , i.e., for all$$\eta > 0$$ η > 0 , there exist bounded operatorsA, Bso thatABis the identity operator$${{\,\textrm{Id}\,}}$$ Id ,$$\Vert A\Vert \cdot \Vert B\Vert = 1$$ A · B = 1 and$$\Vert \lambda \mathcal {C} + \mu ({{\,\textrm{Id}\,}}-\mathcal {C}) - ADB\Vert < \eta $$ λ C + μ ( Id - C ) - A D B < η . Additionally, if$$\mathcal {C}$$ C is unbounded onX(Y), then$$\lambda = \mu $$ λ = μ and then$${{\,\textrm{Id}\,}}$$ Id either factors throughDor$${{\,\textrm{Id}\,}}-D$$ Id - D
    more » « less
  5. Abstract We consider integral area-minimizing 2-dimensional currents$$T$$ T in$$U\subset \mathbf {R}^{2+n}$$ U R 2 + n with$$\partial T = Q\left [\!\![{\Gamma }\right ]\!\!]$$ T = Q Γ , where$$Q\in \mathbf {N} \setminus \{0\}$$ Q N { 0 } and$$\Gamma $$ Γ is sufficiently smooth. We prove that, if$$q\in \Gamma $$ q Γ is a point where the density of$$T$$ T is strictly below$$\frac{Q+1}{2}$$ Q + 1 2 , then the current is regular at$$q$$ q . The regularity is understood in the following sense: there is a neighborhood of$$q$$ q in which$$T$$ T consists of a finite number of regular minimal submanifolds meeting transversally at$$\Gamma $$ Γ (and counted with the appropriate integer multiplicity). In view of well-known examples, our result is optimal, and it is the first nontrivial generalization of a classical theorem of Allard for$$Q=1$$ Q = 1 . As a corollary, if$$\Omega \subset \mathbf {R}^{2+n}$$ Ω R 2 + n is a bounded uniformly convex set and$$\Gamma \subset \partial \Omega $$ Γ Ω a smooth 1-dimensional closed submanifold, then any area-minimizing current$$T$$ T with$$\partial T = Q \left [\!\![{\Gamma }\right ]\!\!]$$ T = Q Γ is regular in a neighborhood of $$\Gamma $$ Γ
    more » « less