Title: Finite Sample Analysis of Two-Time-Scale Natural Actor-Critic Algorithm
Actor-critic style two-time-scale algorithms are among the most popular methods in reinforcement learning and have seen great empirical success. However, their performance is not completely understood theoretically. In this paper, we characterize the global convergence of an online natural actor-critic algorithm in the tabular setting using a single trajectory of samples. Our analysis applies to very general settings, as we only assume ergodicity of the underlying Markov decision process. In order to ensure enough exploration, we employ an ϵ-greedy sampling of the trajectory. For a fixed and small enough exploration parameter ϵ, we show that the two-time-scale natural actor-critic algorithm has a rate of convergence of $\tilde{O}(1/T^{1/4})$, where $T$ is the number of samples, and this leads to a sample complexity of $\tilde{O}(1/\delta^{8})$ samples to find a policy that is within an error of $\delta$ from the global optimum. Moreover, by carefully decreasing the exploration parameter ϵ as the iterations proceed, we present an improved sample complexity of $\tilde{O}(1/\delta^{6})$ for convergence to the global optimum.
Award ID(s):
2144316 1944993
Journal Name:
IEEE Transactions on Automatic Control
Page Range / eLocation ID:
1 to 16
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. We propose a new simple and natural algorithm for learning the optimal Q-value function of a discounted-cost Markov decision process (MDP) when the transition kernels are unknown. Unlike the classical learning algorithms for MDPs, such as Q-learning and actor-critic algorithms, this algorithm does not depend on a stochastic approximation-based method. We show that our algorithm, which we call the empirical Q-value iteration algorithm, converges to the optimal Q-value function. We also give a rate of convergence or a nonasymptotic sample complexity bound and show that an asynchronous (or online) version of the algorithm will also work. Preliminary experimental results suggest a faster rate of convergence to a ballpark estimate for our algorithm compared with stochastic approximation-based algorithms.
  2. We propose a novel policy gradient method for multi-agent reinforcement learning, which leverages two different variance-reduction techniques and does not require large batches over iterations. Specifically, we propose a momentum-based decentralized policy gradient tracking (MDPGT) where a new momentum-based variance reduction technique is used to approximate the local policy gradient surrogate with importance sampling, and an intermediate parameter is adopted to track two consecutive policy gradient surrogates. MDPGT provably achieves the best available sample complexity of $O(N^{-1}\epsilon^{-3})$ for converging to an $\epsilon$-stationary point of the global average of $N$ local performance functions (possibly nonconcave). This outperforms the state-of-the-art sample complexity in decentralized model-free reinforcement learning and when initialized with a single trajectory, the sample complexity matches those obtained by the existing decentralized policy gradient methods. We further validate the theoretical claim for the Gaussian policy function. When the required error tolerance $\epsilon$ is small enough, MDPGT leads to a linear speed up, which has been previously established in decentralized stochastic optimization, but not for reinforcement learning. Lastly, we provide empirical results on a multi-agent reinforcement learning benchmark environment to support our theoretical findings.
  3. I. Farkaš et al. (Ed.)
    We propose a new deep deterministic actor-critic algorithm with an integrated network architecture and an integrated objective function. We address stabilization of the learning procedure via a novel adaptive objective that roughly ensures keeping the actor unchanged while the critic makes large errors. We reduce the number of network parameters and propose an improved exploration strategy over bounded action spaces. Moreover, we incorporate some recent advances in deep learning to our algorithm. Experiments illustrate that our algorithm speeds up the learning process and reduces the sample complexity considerably over the state-of-the-art algorithms including TD3, SAC, PPO, and A2C in continuous control tasks.
  4. This work studies an approach for computing provably robust control laws for robotic systems operating in uncertain environments. We develop an actor-critic style policy search algorithm based on the idea of minimizing an upper confidence bound on the negative expected advantage of a control policy at each policy update iteration. This new algorithm is a reformulation of Probably-Approximately-Correct Robust Policy Search (PROPS) and, unlike PROPS, allows for both step-based evaluation and step-based sampling strategies in policy parameter space, enabled by the use of Generalized Advantage Estimation and Generalized Exploration. As a result, the new algorithm is more data efficient and is expected to compute higher quality policies faster. We empirically evaluate the algorithm in simulation on a challenging robot navigation task using a high-fidelity deep stochastic model of an agile ground vehicle and compare its performance to the original trajectory-based PROPS.
  5. We present a fast, differentially private algorithm for high-dimensional covariance-aware mean estimation with nearly optimal sample complexity. Only exponential-time estimators were previously known to achieve this guarantee. Given $n$ samples from a (sub-)Gaussian distribution with unknown mean $\mu$ and covariance $\Sigma$, our $(\epsilon,\delta)$-differentially private estimator produces $\tilde{\mu}$ such that $\|\mu - \tilde{\mu}\|_{\Sigma} \leq \alpha$ as long as $n \gtrsim \frac{d}{\alpha^{2}} + \frac{d\sqrt{\log(1/\delta)}}{\alpha\epsilon} + \frac{d\log(1/\delta)}{\epsilon}$. The Mahalanobis error metric $\|\mu - \hat{\mu}\|_{\Sigma}$ measures the distance between $\hat{\mu}$ and $\mu$ relative to $\Sigma$; it characterizes the error of the sample mean. Our algorithm runs in time $\tilde{O}(nd^{\omega-1} + nd/\epsilon)$, where $\omega < 2.38$ is the matrix multiplication exponent. We adapt an exponential-time approach of Brown, Gaboardi, Smith, Ullman, and Zakynthinou (2021), giving efficient variants of stable mean and covariance estimation subroutines that also improve the sample complexity to the nearly optimal bound above. Our stable covariance estimator can be turned to private covariance estimation for unrestricted subgaussian distributions. With $n \gtrsim d^{3/2}$ samples, our estimate is accurate in spectral norm. This is the first such algorithm using $n = o(d^{2})$ samples, answering an open question posed by Alabi et al. (2022). With $n \gtrsim d^{2}$ samples, our estimate is accurate in Frobenius norm. This leads to a fast, nearly optimal algorithm for private learning of unrestricted Gaussian distributions in TV distance. Duchi, Haque, and Kuditipudi (2023) obtained similar results independently and concurrently.