Title: Learning the model-free linear quadratic regulator via random search
Model-free reinforcement learning attempts to find an optimal control action for an unknown dynamical system by directly searching over the parameter space of controllers. The convergence behavior and statistical properties of these approaches are often poorly understood because of the nonconvex nature of the underlying optimization problems as well as the lack of exact gradient computation. In this paper, we examine the standard infinite-horizon linear quadratic regulator problem for continuous-time systems with unknown state-space parameters. We provide theoretical bounds on the convergence rate and sample complexity of a random search method. Our results demonstrate that the required simulation time for achieving πœ–-accuracy in a model-free setup and the total number of function evaluations are both 𝑂(log(1/πœ–)).
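To make the approach concrete, the following is a minimal Python sketch of the kind of two-point random search update the abstract refers to. The system matrices A and B, the cost weights Q and R, the horizon, the step sizes, and the simulate_cost/random_search_step names are all illustrative assumptions rather than the authors' implementation; the finite-horizon simulated cost is only a proxy for the infinite-horizon LQR objective analyzed in the paper.

import numpy as np

def simulate_cost(K, A, B, Q, R, x0, dt=0.01, T=10.0):
    """Estimate the quadratic cost of the state feedback u = -K x by forward-Euler
    simulation over a finite horizon (a proxy for the infinite-horizon LQR cost)."""
    x, cost = x0.copy(), 0.0
    for _ in range(int(T / dt)):
        u = -K @ x
        cost += (x @ Q @ x + u @ R @ u) * dt
        x = x + (A @ x + B @ u) * dt        # forward-Euler step of dx/dt = Ax + Bu
    return cost

def random_search_step(K, cost_fn, r=0.05, alpha=1e-3, num_dirs=10):
    """One two-point update: only cost evaluations (a zeroth-order oracle) are used
    to form a finite-difference gradient estimate over random directions."""
    grad = np.zeros_like(K)
    for _ in range(num_dirs):
        U = np.random.randn(*K.shape)
        U /= np.linalg.norm(U)
        grad += (cost_fn(K + r * U) - cost_fn(K - r * U)) / (2 * r) * U
    return K - alpha * grad / num_dirs

# Toy double-integrator example (purely illustrative).
A = np.array([[0.0, 1.0], [0.0, 0.0]])
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)
x0 = np.array([1.0, 0.0])
K = np.array([[1.0, 1.0]])                  # a stabilizing initial gain
cost = lambda gain: simulate_cost(gain, A, B, Q, R, x0)
for _ in range(50):
    K = random_search_step(K, cost)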
Award ID(s): 1846369; 1813877
PAR ID: 10200051
Journal Name: Proceedings of the 2nd Conference on Learning for Dynamics and Control
Volume: 120
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this
  1. Many emerging applications involve control of systems with unknown dynamics. As a result, model-free random search techniques that directly search over the space of controller parameters have become popular. These algorithms often exhibit a sample complexity competitive with state-of-the-art techniques. However, due to the nonconvex nature of the underlying optimization problems, the convergence behavior and statistical properties of these approaches are poorly understood. In this paper, we examine the standard linear quadratic regulator problem for continuous-time systems with unknown state-space parameters. We establish theoretical bounds on the sample complexity and prove the linear convergence rate of the random search method.
  2. The authors provide the first tight sample complexity bounds for shadow tomography and classical shadows in the regime where the target error is below some sufficiently small inverse polynomial in the dimension of the Hilbert space. Specifically, they present a protocol that, given any m ∈ β„• and Ο΅ ≀ O(d^(βˆ’1/2)), measures O(log(m)/ϡ²) copies of an unknown mixed state ρ ∈ β„‚^(dΓ—d) and outputs a classical description of ρ. This description can then be used to estimate any collection of m observables to within additive accuracy Ο΅. Previously, even for the simpler case of shadow tomography where the observables are known in advance, the best known rates either scaled benignly but suboptimally in all of m, d, Ο΅, or scaled optimally in Ο΅, m but included additional polynomial factors in d. Interestingly, the authors also show via dimensionality reduction that one can rescale Ο΅ and d to reduce to the regime where Ο΅ ≀ O(d^(βˆ’1/2)). Their algorithm draws on representation-theoretic tools developed in the context of full state tomography.
  3. We consider model-free reinforcement learning for infinite-horizon discounted Markov Decision Processes (MDPs) with a continuous state space and unknown transition kernel, when only a single sample path under an arbitrary policy of the system is available. We consider the Nearest Neighbor Q-Learning (NNQL) algorithm, which learns the optimal Q function using nearest neighbor regression (a minimal illustrative sketch appears after this list). As the main contribution, we provide a tight finite-sample analysis of the convergence rate. In particular, for MDPs with a d-dimensional state space and a discount factor in (0, 1), given an arbitrary sample path with β€œcovering time” L, we establish that the algorithm is guaranteed to output an Ξ΅-accurate estimate of the optimal Q-function with nearly optimal sample complexity.
  4. The level set estimation problem seeks to find all points in a domain where the value of an unknown real-valued function 𝑓 exceeds a threshold 𝛼. The estimation is based on noisy function evaluations that may be acquired at sequentially and adaptively chosen locations in the domain. The threshold value 𝛼 can either be explicit and provided a priori, or implicit and defined relative to the optimal function value, i.e., 𝛼 = (1βˆ’πœ–)𝑓(π±βˆ—) for a given πœ– > 0, where 𝑓(π±βˆ—) is the maximal function value and is unknown. In this work we provide a new approach to the level set estimation problem by relating it to recent adaptive experimental design methods for linear bandits in the Reproducing Kernel Hilbert Space (RKHS) setting. We assume that 𝑓 can be approximated by a function in the RKHS up to an unknown misspecification and provide novel algorithms for both the implicit and explicit cases in this setting with strong theoretical guarantees. Moreover, in the linear (kernel) setting, we show that our bounds are nearly optimal, namely, our upper bounds match existing lower bounds for threshold linear bandits. To our knowledge this work provides the first instance-dependent, non-asymptotic upper bounds on the sample complexity of level-set estimation that match information-theoretic lower bounds.
  5. The robust πœ™-regularized Markov Decision Process (RRMDP) framework focuses on designing control policies that are robust against parameter uncertainties due to mismatches between the simulator (nominal) model and real-world settings. This work makes two important contributions. First, we propose a model-free algorithm called Robust πœ™-regularized fitted Q-iteration for learning an πœ–-optimal robust policy that uses only historical data collected by rolling out a behavior policy (satisfying a robust exploratory requirement) on the nominal model. To the best of our knowledge, we provide the first unified analysis for a class of πœ™-divergences achieving robust optimal policies in high-dimensional systems with arbitrarily large state spaces and general function approximation. Second, we introduce the hybrid robust πœ™-regularized reinforcement learning framework to learn an optimal robust policy using both historical data and online sampling. Within this framework, we propose a model-free algorithm called Hybrid robust Total-variation-regularized Q-iteration. To the best of our knowledge, we provide the first improved out-of-data-distribution assumption in large-scale problems with arbitrarily large state spaces and general function approximation under the hybrid robust πœ™-regularized reinforcement learning framework.
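As referenced in item 3 above, the following is a minimal illustrative Python sketch of a nearest-neighbor Q-learning update. The anchor grid, learning rate, toy one-dimensional environment, and the NearestNeighborQ class name are assumptions made for illustration; the NNQL algorithm and its covering-time analysis in the cited work are more involved than this sketch.

import numpy as np

class NearestNeighborQ:
    def __init__(self, anchors, num_actions, gamma=0.95, lr=0.1):
        self.anchors = anchors                     # (N, d) anchor states covering the domain
        self.Q = np.zeros((len(anchors), num_actions))
        self.gamma, self.lr = gamma, lr

    def _nearest(self, state):
        # Index of the anchor closest to the observed continuous state.
        return int(np.argmin(np.linalg.norm(self.anchors - state, axis=1)))

    def update(self, state, action, reward, next_state):
        # Standard Q-learning target, evaluated at the nearest anchors.
        i, j = self._nearest(state), self._nearest(next_state)
        target = reward + self.gamma * self.Q[j].max()
        self.Q[i, action] += self.lr * (target - self.Q[i, action])

    def greedy_action(self, state):
        return int(np.argmax(self.Q[self._nearest(state)]))

# Usage on a toy 1-D state space with two actions (purely illustrative).
anchors = np.linspace(0.0, 1.0, 11).reshape(-1, 1)
agent = NearestNeighborQ(anchors, num_actions=2)
rng = np.random.default_rng(0)
state = np.array([0.5])
for _ in range(1000):
    action = agent.greedy_action(state) if rng.random() > 0.1 else int(rng.integers(2))
    next_state = np.clip(state + (0.05 if action == 1 else -0.05) + 0.01 * rng.normal(), 0.0, 1.0)
    reward = float(next_state[0])               # reward for moving toward 1.0
    agent.update(state, action, reward, next_state)
    state = next_state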