skip to main content

Title: Adaptive Ensemble Q-learning: Minimizing Estimation Bias via Error Feedback
The ensemble method is a promising way to mitigate the overestimation issue in Q-learning, where multiple function approximators are used to estimate the action values. It is known that the estimation bias hinges heavily on the ensemble size (i.e., the number of Q-function approximators used in the target), and that determining the 'right' ensemble size is highly nontrivial, because of the time-varying nature of the function approximation errors during the learning process. To tackle this challenge, we first derive an upper bound and a lower bound on the estimation bias, based on which the ensemble size is adapted to drive the bias to be nearly zero, thereby coping with the impact of the time-varying approximation errors accordingly. Motivated by the theoretic findings, we advocate that the ensemble method can be combined with Model Identification Adaptive Control (MIAC) for effective ensemble size adaptation. Specifically, we devise Adaptive Ensemble Q-learning (AdaEQ), a generalized ensemble method with two key steps: (a) approximation error characterization which serves as the feedback for flexibly controlling the ensemble size, and (b) ensemble size adaptation tailored towards minimizing the estimation bias. Extensive experiments are carried out to show that AdaEQ can improve the learning performance than the existing methods for the MuJoCo benchmark.  more » « less
Award ID(s):
2203239 2121222 2203412 2203238
Author(s) / Creator(s):
; ;
Date Published:
Journal Name:
AAdvances in Neural Information Processing Systems 34 (NeurIPS 2021)
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. We study model-free reinforcement learning (RL) algorithms for infinite-horizon average-reward Markov decision process (MDP), which is more appropriate for applications that involve continuing operations not divided into episodes. In contrast to episodic/discounted MDPs, theoretical understanding of model-free RL algorithms is relatively inadequate for the average-reward setting. In this paper, we consider both the online setting and the setting with access to a simulator. We develop computationally efficient model-free algorithms that achieve sharper guarantees on regret/sample complexity compared with existing results. In the online setting, we design an algorithm, UCB-AVG, based on an optimistic variant of variance-reduced Q-learning. We show that UCB-AVG achieves a regret bound $\widetilde{O}(S^5A^2sp(h^*)\sqrt{T})$ after $T$ steps, where $S\times A$ is the size of state-action space, and $sp(h^*)$ the span of the optimal bias function. Our result provides the first computationally efficient model-free algorithm that achieves the optimal dependence in $T$ (up to log factors) for weakly communicating MDPs, which is necessary for low regret. In contrast, prior results either are suboptimal in $T$ or require strong assumptions of ergodicity or uniformly mixing of MDPs. In the simulator setting, we adapt the idea of UCB-AVG to develop a model-free algorithm that finds an $\epsilon$-optimal policy with sample complexity $\widetilde{O}(SAsp^2(h^*)\epsilon^{-2} + S^2Asp(h^*)\epsilon^{-1}).$ This sample complexity is near-optimal for weakly communicating MDPs, in view of the minimax lower bound $\Omega(SAsp(^*)\epsilon^{-2})$. Existing work mainly focuses on ergodic MDPs and the results typically depend on $t_{mix},$ the worst-case mixing time induced by a policy. We remark that the diameter $D$ and mixing time $t_{mix}$ are both lower bounded by $sp(h^*)$, and $t_{mix}$ can be arbitrarily large for certain MDPs. On the technical side, our approach integrates two key ideas: learning an $\gamma$-discounted MDP as an approximation, and leveraging reference-advantage decomposition for variance in optimistic Q-learning. As recognized in prior work, a naive approximation by discounted MDPs results in suboptimal guarantees. A distinguishing feature of our method is maintaining estimates of value-difference between state pairs to provide a sharper bound on the variance of reference advantage. We also crucially use a careful choice of the discounted factor $\gamma$ to balance approximation error due to discounting and the statistical learning error, and we are able to maintain a good-quality reference value function with $O(SA)$ space complexity. 
    more » « less
  2. While there are convergence guarantees for temporal difference (TD) learning when using linear function approximators, the situation for nonlinear models is far less understood, and divergent examples are known. Here we take a first step towards extending theoretical convergence guarantees to TD learning with nonlinear function approximation. More precisely, we consider the expected learning dynamics of the TD(0) algorithm for value estimation. As the step-size converges to zero, these dynamics are defined by a nonlinear ODE which depends on the geometry of the space of function approximators, the structure of the underlying Markov chain, and their interaction. We find a set of function approximators that includes ReLU networks and has geometry amenable to TD learning regardless of environment, so that the solution performs about as well as linear TD in the worst case. Then, we show how environments that are more reversible induce dynamics that are better for TD learning and prove global convergence to the true value function for well-conditioned function approximators. Finally, we generalize a divergent counterexample to a family of divergent problems to demonstrate how the interaction between approximator and environment can go wrong and to motivate the assumptions needed to prove convergence. 
    more » « less
  3. Offline reinforcement learning, which aims at optimizing sequential decision-making strategies with historical data, has been extensively applied in real-life applications. State-Of-The-Art algorithms usually leverage powerful function approximators (e.g. neural networks) to alleviate the sample complexity hurdle for better empirical performances. Despite the successes, a more systematic under- standing of the statistical complexity for function approximation remains lacking. Towards bridging the gap, we take a step by considering offline reinforcement learning with differentiable function class approximation (DFA). This function class naturally incorporates a wide range of models with nonlinear/nonconvex structures. We show offline RL with differentiable function approximation is provably efficient by analyzing the pessimistic fitted Q-learning (PFQL) algorithm, and our results provide the theoretical basis for understanding a variety of practical heuristics that rely on Fitted Q-Iteration style design. In addition, we further im- prove our guarantee with a tighter instance-dependent characterization. We hope our work could draw interest in studying reinforcement learning with differentiable function approximation beyond the scope of current research. 
    more » « less
  4. Statistical divergences (SDs), which quantify the dissimilarity between probability distributions, are a basic constituent of statistical inference and machine learning. A modern method for estimating those divergences relies on parametrizing an empirical variational form by a neural network (NN) and optimizing over parameter space. Such neural estimators are abundantly used in practice, but corresponding performance guarantees are partial and call for further exploration. We establish non-asymptotic absolute error bounds for a neural estimator realized by a shallow NN, focusing on four popular 𝖿-divergences---Kullback-Leibler, chi-squared, squared Hellinger, and total variation. Our analysis relies on non-asymptotic function approximation theorems and tools from empirical process theory to bound the two sources of error involved: function approximation and empirical estimation. The bounds characterize the effective error in terms of NN size and the number of samples, and reveal scaling rates that ensure consistency. For compactly supported distributions, we further show that neural estimators of the first three divergences above with appropriate NN growth-rate are minimax rate-optimal, achieving the parametric convergence rate. 
    more » « less
  5. Estimating the evolutionary power spectral density (EPSD) of non-stationary winds (e.g., tropical storms and downbursts) is necessary to predict the response of structures under such extreme winds. Following the review of the existing direct estimation methods of EPSD, this paper offers a two-step unified formulation, i.e., raw estimation and associated error reduction. The raw estimation is expressed in terms of a time-frequency transform with a general kernel function. It is shown that if the kernel function is described by a time-frequency analysis tool such as the short-time Fourier transform, the wavelet transform, and the S-transform, the generalized raw EPSD estimation becomes a particular case of the existing methods. The unified estimation method presented here can be viewed as a filter bank with adjustable time and frequency resolution. The analysis of error in the raw estimation is carried out on the bias and random errors accounting for the approximation in both the time and frequency domains. Various techniques for reducing such errors are then summarized and recast in the unified formulation, including series expansion, short-time window smoothing, and multi-tapering. Based on the unified perspective, a discussion and some prospects of EPSD estimating are provided.

    more » « less