skip to main content

Title: Dynamic Regret of Policy Optimization in Non-Stationary Environments
We consider reinforcement learning (RL) in episodic MDPs with adversarial full-information reward feedback and unknown fixed transition kernels. We propose two model-free policy optimization algorithms, POWER and POWER++, and establish guarantees for their dynamic regret. Compared with the classical notion of static regret, dynamic regret is a stronger notion as it explicitly accounts for the non-stationarity of environments. The dynamic regret attained by the proposed algorithms interpolates between different regimes of non-stationarity, and moreover satisfies a notion of adaptive (near-)optimality, in the sense that it matches the (near-)optimal static regret under slow-changing environments. The dynamic regret bound features two components, one arising from exploration, which deals with the uncertainty of transition kernels, and the other arising from adaptation, which deals with non-stationary environments. Specifically, we show that POWER++ improves over POWER on the second component of the dynamic regret by actively adapting to non-stationarity through prediction. To the best of our knowledge, our work is the first dynamic regret analysis of model-free RL algorithms in non-stationary environments.
; ; ;
Award ID(s):
Publication Date:
Journal Name:
Advances in neural information processing systems
Page Range or eLocation-ID:
Sponsoring Org:
National Science Foundation
More Like this
  1. Model-based reinforcement learning (RL), which finds an optimal policy using an empirical model, has long been recognized as one of the cornerstones of RL. It is especially suitable for multi-agent RL (MARL), as it naturally decouples the learning and the planning phases, and avoids the non-stationarity problem when all agents are improving their policies simultaneously using samples. Though intuitive and widely-used, the sample complexity of model-based MARL algorithms has been investigated relatively much less often. In this paper, we aim to address the fundamental open question about the sample complexity of model-based MARL. We study arguably the most basic MARL setting: two-player discounted zero-sum Markov games, given only access to a generative model of state transition. We show that model-based MARL achieves a near optimal sample complexity for finding the Nash equilibrium (NE) \emph{value} up to some additive error. We also show that this method is near-minimax optimal with a tight dependence on the horizon and the number of states. Our results justify the efficiency of this simple model-based approach in the multi-agent RL setting.
  2. We consider the problem of controlling a Linear Quadratic Regulator (LQR) system over a finite horizon T with fixed and known cost matrices Q,R, but unknown and non-stationary dynamics A_t, B_t. The sequence of dynamics matrices can be arbitrary, but with a total variation, V_T, assumed to be o(T) and unknown to the controller. Under the assumption that a sequence of stabilizing, but potentially sub-optimal controllers is available for all t, we present an algorithm that achieves the optimal dynamic regret of O(V_T^2/5 T^3/5 ). With piecewise constant dynamics, our algorithm achieves the optimal regret of O(sqrtST ) where S is the number of switches. The crux of our algorithm is an adaptive non-stationarity detection strategy, which builds on an approach recently developed for contextual Multi-armed Bandit problems. We also argue that non-adaptive forgetting (e.g., restarting or using sliding window learning with a static window size) may not be regret optimal for the LQR problem, even when the window size is optimally tuned with the knowledge of $V_T$. The main technical challenge in the analysis of our algorithm is to prove that the ordinary least squares (OLS) estimator has a small bias when the parameter to be estimated is non-stationary.more »Our analysis also highlights that the key motif driving the regret is that the LQR problem is in spirit a bandit problem with linear feedback and locally quadratic cost. This motif is more universal than the LQR problem itself, and therefore we believe our results should find wider application.« less
  3. We propose Adversarially Trained Actor Critic (ATAC), a new model-free algorithm for offline reinforcement learning (RL) under insufficient data coverage, based on the concept of relative pessimism. ATAC is designed as a two-player Stackelberg game: A policy actor competes against an adversarially trained value critic, who finds data-consistent scenarios where the actor is inferior to the data-collection behavior policy. We prove that, when the actor attains no regret in the two-player game, running ATAC produces a policy that provably 1) outperforms the behavior policy over a wide range of hyperparameters that control the degree of pessimism, and 2) competes with the best policy covered by data with appropriately chosen hyperparameters. Compared with existing works, notably our framework offers both theoretical guarantees for general function approximation and a deep RL implementation scalable to complex environments and large datasets. In the D4RL benchmark, ATAC consistently outperforms state-of-the-art offline RL algorithms on a range of continuous control tasks.
  4. Abstract Achieving sample efficiency in online episodic reinforcement learning (RL) requires optimally balancing exploration and exploitation. When it comes to a finite-horizon episodic Markov decision process with $S$ states, $A$ actions and horizon length $H$, substantial progress has been achieved toward characterizing the minimax-optimal regret, which scales on the order of $\sqrt{H^2SAT}$ (modulo log factors) with $T$ the total number of samples. While several competing solution paradigms have been proposed to minimize regret, they are either memory-inefficient, or fall short of optimality unless the sample size exceeds an enormous threshold (e.g. $S^6A^4 \,\mathrm{poly}(H)$ for existing model-free methods). To overcome such a large sample size barrier to efficient RL, we design a novel model-free algorithm, with space complexity $O(SAH)$, that achieves near-optimal regret as soon as the sample size exceeds the order of $SA\,\mathrm{poly}(H)$. In terms of this sample size requirement (also referred to the initial burn-in cost), our method improves—by at least a factor of $S^5A^3$—upon any prior memory-efficient algorithm that is asymptotically regret-optimal. Leveraging the recently introduced variance reduction strategy (also called reference-advantage decomposition), the proposed algorithm employs an early-settled reference update rule, with the aid of two Q-learning sequences with upper and lower confidence bounds. The design principlemore »of our early-settled variance reduction method might be of independent interest to other RL settings that involve intricate exploration–exploitation trade-offs.« less
  5. The instability and transition to turbulence and its evolution in pulsatile flows, which involve reverse flows and unsteady flow separations, is the primary focus of this experimental work. A piston driven by a programmable DC servo motor was used to set-up a water flow system and provide the pulsation characteristics. Time-resolved particle image velocimetry data were acquired in a refractive index matching set-up by using a continuous wave laser and a high-frame-rate digital camera. The position of the piston was continuously recorded by a laser proximity sensor. Five different experiments were carried out with Reynolds numbers in the range of 535–4825 and Womersley numbers from 11.91 to 23.82. The non-stationarity of the data was addressed by incorporating trend removal methods involving low- and high-pass filtering of the data, and using empirical mode decomposition together with the relevant Hilbert–Huang transform to determine the intrinsic mode functions. This latter method is more appropriate for nonlinear and non-stationary cases, for which traditional analysis involving classical Fourier decomposition is not directly applicable. It was found that transition to turbulence is a spontaneous event covering the whole near-wall region. The instantaneous vorticity profiles show the development of a large-scale ring-like attached wall vortical layer (WVL)more »with smaller vortices of higher frequencies than the pulsation frequency superimposed, which point to a shear layer Kelvin–Helmholtz (K–H) type of instability. Inflectional instability leads to flow separation and the formation of a major roll-up structure with the K–H vortices superimposed. This structure breaks down in the azimuthal direction into smaller turbulence patches with vortical content, which appears to be the prevailing structural content of the flow at each investigated Reynolds number ( Re ). At higher Re numbers, the strength and extent of the vortices are larger and substantial disturbances appear in the free stream region of the flow, which are typical of pipe flows at transitional Re numbers. Turbulence appears to be produced at the locations of maximum or minimum vorticity within the attached WVL, in the ridges between the K–H vortices around the separated WVL and the upstream side of the secondary vortex where the flow impinges on the wall. This wall turbulence breaks away into the middle section of the pipe, at approximately $Re \ge 2200$ , by strong eruptions of the K–H vortices.« less