
Title: Softmax policy gradient methods can take exponential time to converge

The softmax policy gradient (PG) method, which performs gradient ascent under softmax policy parameterization, is arguably one of the de facto implementations of policy optimization in modern reinforcement learning. For $$\gamma $$-discounted infinite-horizon tabular Markov decision processes (MDPs), remarkable progress has recently been achieved towards establishing global convergence of softmax PG methods in finding a near-optimal policy. However, prior results fall short of delineating clear dependencies of convergence rates on salient parameters such as the cardinality of the state space $${\mathcal {S}}$$ and the effective horizon $$\frac{1}{1-\gamma }$$, both of which could be excessively large. In this paper, we deliver a pessimistic message regarding the iteration complexity of softmax PG methods, despite assuming access to exact gradient computation. Specifically, we demonstrate that the softmax PG method with stepsize $$\eta $$ can take $$\frac{1}{\eta } |{\mathcal {S}}|^{2^{\Omega \big (\frac{1}{1-\gamma }\big )}}$$ iterations to converge, even in the presence of a benign policy initialization and an initial state distribution amenable to exploration (so that the distribution mismatch coefficient is not exceedingly large). This is accomplished by characterizing the algorithmic dynamics over a carefully constructed MDP containing only three actions. Our exponential lower bound hints at the necessity of carefully adjusting update rules or enforcing proper regularization in accelerating PG methods.
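The update rule the abstract refers to can be sketched as follows. This is a minimal illustration of softmax PG with exact gradients on a tabular MDP, not the paper's hard instance: the two-state MDP in `toy_mdp`, the function names, and the chosen stepsize are all assumptions made for the example. It uses the standard gradient identity for the softmax parameterization, in which the partial derivative of the value with respect to $$\theta (s,a)$$ is the (unnormalized) discounted visitation of $$s$$ times $$\pi (a|s)$$ times the advantage of $$(s,a)$$.

```python
import numpy as np

def softmax_policy(theta):
    """Row-wise softmax: theta has shape (S, A); returns pi of the same shape."""
    z = theta - theta.max(axis=1, keepdims=True)  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def evaluate(P, r, gamma, pi):
    """Exact policy evaluation. P: (S, A, S) transitions, r: (S, A) rewards."""
    S = P.shape[0]
    P_pi = np.einsum("sa,sat->st", pi, P)       # state-to-state kernel under pi
    r_pi = (pi * r).sum(axis=1)                 # expected one-step reward under pi
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    Q = r + gamma * (P @ V)                     # shape (S, A)
    return V, Q, P_pi

def softmax_pg(P, r, gamma, rho, eta, num_iters):
    """Gradient ascent on V^pi(rho) over softmax logits, using exact gradients."""
    S, A = r.shape
    theta = np.zeros((S, A))                    # uniform initial policy
    values = []
    for _ in range(num_iters):
        pi = softmax_policy(theta)
        V, Q, P_pi = evaluate(P, r, gamma, pi)
        values.append(rho @ V)
        # Unnormalized discounted visitation: d^T = rho^T (I - gamma * P_pi)^{-1}
        d = np.linalg.solve((np.eye(S) - gamma * P_pi).T, rho)
        advantage = Q - V[:, None]
        theta = theta + eta * d[:, None] * pi * advantage
    return theta, np.array(values)

def toy_mdp():
    """Hypothetical 2-state, 2-action deterministic MDP, for illustration only."""
    P = np.zeros((2, 2, 2))
    P[0, 0, 0] = 1.0   # state 0, action 0: stay in state 0
    P[0, 1, 1] = 1.0   # state 0, action 1: move to state 1
    P[1, 0, 1] = 1.0   # state 1, action 0: stay in state 1
    P[1, 1, 0] = 1.0   # state 1, action 1: move to state 0
    r = np.array([[0.0, 0.0],
                  [1.0, 0.0]])  # only (state 1, action 0) is rewarded
    return P, r
```

Calling `softmax_pg(*toy_mdp(), gamma=0.5, rho=np.array([0.5, 0.5]), eta=0.01, num_iters=200)` returns the final logits and the sequence of values $$V^{\pi }(\rho )$$. The stepsize here is chosen small relative to $$(1-\gamma )^3/8$$, the usual smoothness-based choice for rewards in $$[0,1]$$; the toy problem is benign and says nothing about the exponential lower bound itself.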

Journal Name:
Mathematical Programming
Springer Science + Business Media
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    It has been recently established in David and Mayboroda (Approximation of green functions and domains with uniformly rectifiable boundaries of all dimensions. arXiv:2010.09793) that on uniformly rectifiable sets the Green function is almost affine in the weak sense, and moreover, in some scenarios such Green function estimates are equivalent to the uniform rectifiability of a set. The present paper tackles a strong analogue of these results, starting with the “flagship” degenerate operators on sets with lower dimensional boundaries. We consider the elliptic operators $$L_{\beta ,\gamma } = - {\text {div}} D^{d+1+\gamma -n} \nabla $$ associated to a domain $$\Omega \subset {\mathbb {R}}^n$$ with a uniformly rectifiable boundary $$\Gamma $$ of dimension $$d < n-1$$, the now usual distance to the boundary $$D = D_\beta $$ given by $$D_\beta (X)^{-\beta } = \int _{\Gamma } |X-y|^{-d-\beta } d\sigma (y)$$ for $$X \in \Omega $$, where $$\beta >0$$ and $$\gamma \in (-1,1)$$. In this paper we show that the Green function $$G$$ for $$L_{\beta ,\gamma }$$, with pole at infinity, is well approximated by multiples of $$D^{1-\gamma }$$, in the sense that the function $$\big | D\nabla \big (\ln \big ( \frac{G}{D^{1-\gamma }} \big )\big )\big |^2$$ satisfies a Carleson measure estimate on $$\Omega $$. We underline that the strong and the weak results are different in nature and, of course, at the level of the proofs: the latter extensively used compactness arguments, while the present paper relies on some intricate integration by parts and the properties of the “magical” distance function from David et al. (Duke Math J, to appear).

  2. Abstract

    The free multiplicative Brownian motion $$b_{t}$$ is the large-$$N$$ limit of the Brownian motion on $$\mathsf {GL}(N;\mathbb {C})$$, in the sense of $$*$$-distributions. The natural candidate for the large-$$N$$ limit of the empirical distribution of eigenvalues is thus the Brown measure of $$b_{t}$$. In previous work, the second and third authors showed that this Brown measure is supported in the closure of a region $$\Sigma _{t}$$ that appeared in the work of Biane. In the present paper, we compute the Brown measure completely. It has a continuous density $$W_{t}$$ on $$\overline{\Sigma }_{t}$$, which is strictly positive and real analytic on $$\Sigma _{t}$$. This density has a simple form in polar coordinates: $$W_{t}(r,\theta )=\frac{1}{r^{2}}w_{t}(\theta )$$, where $$w_{t}$$ is an analytic function determined by the geometry of the region $$\Sigma _{t}$$. We show also that the spectral measure of free unitary Brownian motion $$u_{t}$$ is a “shadow” of the Brown measure of $$b_{t}$$, precisely mirroring the relationship between the circular and semicircular laws. We develop several new methods, based on stochastic differential equations and PDE, to prove these results.

  3. Abstract

    Based on the recent development of the framework of Volterra rough paths (Harang and Tindel in Stoch Process Appl 142:34–78, 2021), we consider here the probabilistic construction of the Volterra rough path associated to the fractional Brownian motion with $$H>\frac{1}{2}$$ and for the standard Brownian motion. The Volterra kernel $$k(t,s)$$ is allowed to be singular and to behave similarly to $$|t-s|^{-\gamma }$$ for some $$\gamma \ge 0$$. The construction is done in both the Stratonovich and Itô senses. It is based on a modified Garsia–Rodemich–Rumsey lemma, which is of interest in its own right, as well as tools from Malliavin calculus. A discussion of challenges and potential extensions is provided.

  4. Abstract

    We study the structure of the Liouville quantum gravity (LQG) surfaces that are cut out as one explores a conformal loop-ensemble $$\hbox {CLE}_{\kappa '}$$ for $$\kappa '$$ in (4, 8) that is drawn on an independent $$\gamma $$-LQG surface for $$\gamma ^2=16/\kappa '$$. The results are similar in flavor to the ones from our companion paper dealing with $$\hbox {CLE}_{\kappa }$$ for $$\kappa $$ in (8/3, 4), where the loops of the CLE are disjoint and simple. In particular, we encode the combined structure of the LQG surface and the $$\hbox {CLE}_{\kappa '}$$ in terms of stable growth-fragmentation trees or their variants, which also appear in the asymptotic study of peeling processes on decorated planar maps. This has consequences for questions that do a priori not involve LQG surfaces: in our paper entitled “CLE Percolations”, we described the law of interfaces obtained when coloring the loops of a $$\hbox {CLE}_{\kappa '}$$ independently into two colors with respective probabilities $$p$$ and $$1-p$$. This description was complete up to one missing parameter $$\rho $$. The results of the present paper about CLE on LQG allow us to determine its value in terms of $$p$$ and $$\kappa '$$. It shows in particular that $$\hbox {CLE}_{\kappa '}$$ and $$\hbox {CLE}_{16/\kappa '}$$ are related via a continuum analog of the Edwards-Sokal coupling between $$\hbox {FK}_q$$ percolation and the $$q$$-state Potts model (which makes sense even for non-integer $$q$$ between 1 and 4) if and only if $$q=4\cos ^2(4\pi / \kappa ')$$. This provides further evidence for the long-standing belief that $$\hbox {CLE}_{\kappa '}$$ and $$\hbox {CLE}_{16/\kappa '}$$ represent the scaling limits of $$\hbox {FK}_q$$ percolation and the $$q$$-Potts model when $$q$$ and $$\kappa '$$ are related in this way. Another consequence of the formula for $$\rho (p,\kappa ')$$ is the value of half-plane arm exponents for such divide-and-color models (a.k.a. fuzzy Potts models) that turn out to take a somewhat different form than the usual critical exponents for two-dimensional models.

  5. Abstract

    A long-standing problem in mathematical physics is the rigorous derivation of the incompressible Euler equation from Newtonian mechanics. Recently, Han-Kwan and Iacobelli (Proc Am Math Soc 149:3045–3061, 2021) showed that in the monokinetic regime, one can directly obtain the Euler equation from a system of $$N$$ particles interacting in $${\mathbb {T}}^d$$, $$d\ge 2$$, via Newton’s second law through a supercritical mean-field limit. Namely, the coupling constant $$\lambda $$ in front of the pair potential, which is Coulombic, scales like $$N^{-\theta }$$ for some $$\theta \in (0,1)$$, in contrast to the usual mean-field scaling $$\lambda \sim N^{-1}$$. Assuming $$\theta \in (1-\frac{2}{d(d+1)},1)$$, they showed that the empirical measure of the system is effectively described by the solution to the Euler equation as $$N\rightarrow \infty $$. Han-Kwan and Iacobelli asked if their range for $$\theta $$ was optimal. We answer this question in the negative by showing the validity of the incompressible Euler equation in the limit $$N\rightarrow \infty $$ for $$\theta \in (1-\frac{2}{d},1)$$. Our proof is based on Serfaty’s modulated-energy method, but compared to that of Han-Kwan and Iacobelli, crucially uses an improved “renormalized commutator” estimate to obtain the larger range for $$\theta $$. Additionally, we show that for $$\theta \le 1-\frac{2}{d}$$, one cannot, in general, expect convergence in the modulated energy notion of distance.