Title: Reinforcement Learning for Adaptive Periodic Linear Quadratic Control
This paper presents a first solution to the problem of adaptive linear quadratic regulation (LQR) for continuous-time linear periodic systems. Specifically, reinforcement learning and adaptive dynamic programming (ADP) techniques are used to develop two algorithms that yield near-optimal controllers. First, policy iteration (PI) and value iteration (VI) methods are proposed for the case where the model is known. Then, PI-based and VI-based off-policy ADP algorithms are derived to find near-optimal solutions directly from input/state data collected along the system trajectories, without exact knowledge of the system dynamics. The effectiveness of the derived algorithms is validated on the well-known lossy Mathieu equation.
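For readers unfamiliar with the policy-iteration structure mentioned in the abstract, the sketch below illustrates the model-based version in the simpler time-invariant LQR setting (Kleinman's algorithm): evaluate the current gain by solving a Lyapunov equation, then improve it. This is a minimal illustration only, not the periodic or data-driven algorithms of the paper; the function name `lqr_policy_iteration`, the system matrices, and the tolerances are illustrative placeholders.

```python
# Minimal sketch of model-based policy iteration for time-invariant LQR
# (Kleinman's algorithm). Illustrative only: it does not implement the
# periodic or data-driven (off-policy) algorithms of the paper, and the
# matrices below are placeholders chosen for the example.
import numpy as np
from scipy.linalg import solve_continuous_are, solve_continuous_lyapunov


def lqr_policy_iteration(A, B, Q, R, K0, tol=1e-9, max_iter=50):
    """Alternate policy evaluation (Lyapunov equation) and policy improvement."""
    K, P_prev = K0, None
    for _ in range(max_iter):
        Acl = A - B @ K
        # Policy evaluation: Acl' P + P Acl + Q + K' R K = 0
        P = solve_continuous_lyapunov(Acl.T, -(Q + K.T @ R @ K))
        # Policy improvement: K <- R^{-1} B' P
        K = np.linalg.solve(R, B.T @ P)
        if P_prev is not None and np.linalg.norm(P - P_prev) < tol:
            break
        P_prev = P
    return P, K


if __name__ == "__main__":
    A = np.array([[0.0, 1.0], [-1.0, -0.5]])  # Hurwitz, so K0 = 0 is admissible
    B = np.array([[0.0], [1.0]])
    Q, R = np.eye(2), np.array([[1.0]])
    P, K = lqr_policy_iteration(A, B, Q, R, K0=np.zeros((1, 2)))
    P_are = solve_continuous_are(A, B, Q, R)
    print("PI matches ARE solution:", np.allclose(P, P_are, atol=1e-6))
```

The data-driven variants in the paper keep this evaluation/improvement loop but replace the model-based Lyapunov solve with least-squares estimates built from input/state data collected along trajectories.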
Award ID(s): 1903781
PAR ID: 10158690
Author(s) / Creator(s):
Date Published:
Journal Name: 2019 IEEE 58th Conference on Decision and Control (CDC)
Page Range / eLocation ID: 3322-3327
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this
  1. This paper studies the infinite-horizon adaptive optimal control of continuous-time linear periodic (CTLP) systems. A novel value iteration (VI)-based off-policy ADP algorithm is proposed for a general class of CTLP systems, so that approximate optimal solutions can be obtained directly from collected data, without exact knowledge of the system dynamics. Under mild conditions, uniform convergence of the proposed algorithm to the optimal solutions is proved for both the model-based and model-free cases. The VI-based ADP algorithm is able to find suboptimal controllers without assuming knowledge of an initial stabilizing controller. Application to the optimal control of a triple inverted pendulum subjected to a periodically varying load demonstrates the feasibility and effectiveness of the proposed method. (A simplified, model-based value-iteration sketch for the time-invariant case is included after this list.)
  2. This paper studies the adaptive optimal control problem for a class of linear time-delay systems described by delay differential equations (DDEs). A crucial strategy is to take advantage of recent developments in reinforcement learning (RL) and adaptive dynamic programming (ADP) to develop novel methods that learn adaptive optimal controllers from finite samples of input and state data. A data-driven policy iteration (PI) is proposed to solve the infinite-dimensional algebraic Riccati equation (ARE) iteratively in the absence of exact model knowledge. Interestingly, the proposed recursive PI algorithm is new in the present context of continuous-time time-delay systems, even when the model is assumed to be known. The efficacy of the proposed learning-based control methods is validated on practical applications arising from metal cutting and autonomous driving.
  3. This paper studies the learning-based optimal control of a class of infinite-dimensional linear time-delay systems. The aim is to fill a gap in adaptive dynamic programming (ADP), where adaptive optimal control of infinite-dimensional systems has not been addressed. A key strategy is to combine the classical model-based linear quadratic (LQ) optimal control of time-delay systems with state-of-the-art reinforcement learning (RL) techniques. Both model-based and data-driven policy iteration (PI) approaches are proposed to solve the corresponding algebraic Riccati equation (ARE) with guaranteed convergence. The proposed PI algorithm can be considered a generalization of ADP to infinite-dimensional time-delay systems. Its efficiency is demonstrated on a practical application arising from autonomous driving in mixed traffic environments, where human drivers' reaction delay is considered.
  4. Alessandro Astolfi (Ed.)
    This article studies the adaptive optimal stationary control of continuous-time linear stochastic systems with both additive and multiplicative noises, using reinforcement learning techniques. Based on policy iteration, a novel off-policy reinforcement learning algorithm, named optimistic least-squares-based policy iteration, is proposed; starting from an initial admissible control policy, it iteratively finds near-optimal policies for the adaptive optimal stationary control problem directly from input/state data, without explicitly identifying any system matrices. Under mild conditions, the solutions given by the proposed optimistic least-squares-based policy iteration are proved to converge to a small neighborhood of the optimal solution with probability one. Application of the proposed algorithm to a triple inverted pendulum example validates its feasibility and effectiveness.
  5. We study the \emph{offline reinforcement learning} (offline RL) problem, where the goal is to learn a reward-maximizing policy in an unknown \emph{Markov Decision Process} (MDP) using data collected by a behavior policy $\mu$. In particular, we consider the sample complexity of offline RL for finite-horizon MDPs. Prior works derive information-theoretic lower bounds under different data-coverage assumptions, and their upper bounds are expressed through covering coefficients that lack an explicit characterization in terms of system quantities. In this work, we analyze the \emph{Adaptive Pessimistic Value Iteration} (APVI) algorithm and derive a suboptimality upper bound that nearly matches $$ O\left(\sum_{h=1}^H\sum_{s_h,a_h}d^{\pi^\star}_h(s_h,a_h)\sqrt{\frac{\mathrm{Var}_{P_{s_h,a_h}}{(V^\star_{h+1}+r_h)}}{d^\mu_h(s_h,a_h)}}\sqrt{\frac{1}{n}}\right). $$ We also prove an information-theoretic lower bound showing that this quantity is required under the weak assumption that $d^\mu_h(s_h,a_h)>0$ whenever $d^{\pi^\star}_h(s_h,a_h)>0$. Here $\pi^\star$ is an optimal policy, $\mu$ is the behavior policy, and $d(s_h,a_h)$ is the marginal state-action probability. We call this adaptive bound the \emph{intrinsic offline reinforcement learning bound}, since it directly implies all the existing optimal results: the minimax rate under the uniform data-coverage assumption, the horizon-free setting, single-policy concentrability, and the tight problem-dependent results. Later, we extend the result to the \emph{assumption-free} regime (where we make no assumption on $\mu$) and obtain the assumption-free intrinsic bound. Due to its generic form, we believe the intrinsic bound could help illuminate what makes a specific problem hard and reveal the fundamental challenges in offline RL.
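As a companion to the policy-iteration sketch above, and referenced from item 1, the snippet below illustrates one common formulation of continuous-time value iteration in the simpler time-invariant LQR case: Euler-type updates along the Riccati differential equation starting from P = 0, so that no initial stabilizing gain is required. This is only a simplified sketch under stabilizability/detectability assumptions; the CTLP algorithms in the works above handle periodic coefficients and learn from data, and the function name `lqr_value_iteration`, the system matrices, the step size, and the iteration budget are illustrative assumptions.

```python
# Minimal sketch of continuous-time value iteration for time-invariant LQR:
# Euler steps along the Riccati differential equation, starting from P = 0,
# so no initial stabilizing gain is needed. Step size, iteration budget, and
# system matrices are illustrative assumptions, not values from the papers.
import numpy as np
from scipy.linalg import solve_continuous_are


def lqr_value_iteration(A, B, Q, R, step=0.01, max_iter=20000, tol=1e-10):
    P = np.zeros_like(A)
    Rinv = np.linalg.inv(R)
    for _ in range(max_iter):
        # Riccati residual: VI nudges P along this direction.
        dP = A.T @ P + P @ A + Q - P @ B @ Rinv @ B.T @ P
        P = P + step * dP
        if np.linalg.norm(dP) < tol:
            break
    K = Rinv @ B.T @ P  # resulting (near-)optimal feedback gain
    return P, K


if __name__ == "__main__":
    A = np.array([[0.0, 1.0], [2.0, -1.0]])  # open-loop unstable
    B = np.array([[0.0], [1.0]])
    Q, R = np.eye(2), np.array([[1.0]])
    P, K = lqr_value_iteration(A, B, Q, R)
    P_are = solve_continuous_are(A, B, Q, R)
    print("VI approaches ARE solution:", np.allclose(P, P_are, atol=1e-4))
```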