skip to main content

Title: Adaptive Ensemble Q-learning: Minimizing Estimation Bias via Error Feedback
The ensemble method is a promising way to mitigate the overestimation issue in Q-learning, where multiple function approximators are used to estimate the action values. It is known that the estimation bias hinges heavily on the ensemble size (i.e., the number of Q-function approximators used in the target), and that determining the 'right' ensemble size is highly nontrivial, because of the time-varying nature of the function approximation errors during the learning process. To tackle this challenge, we first derive an upper bound and a lower bound on the estimation bias, based on which the ensemble size is adapted to drive the bias to be nearly zero, thereby coping with the impact of the time-varying approximation errors accordingly. Motivated by the theoretic findings, we advocate that the ensemble method can be combined with Model Identification Adaptive Control (MIAC) for effective ensemble size adaptation. Specifically, we devise Adaptive Ensemble Q-learning (AdaEQ), a generalized ensemble method with two key steps: (a) approximation error characterization which serves as the feedback for flexibly controlling the ensemble size, and (b) ensemble size adaptation tailored towards minimizing the estimation bias. Extensive experiments are carried out to show that AdaEQ can improve the learning performance than the existing more » methods for the MuJoCo benchmark. « less
; ;
Award ID(s):
2203239 2121222 2203412 2203238
Publication Date:
Journal Name:
AAdvances in Neural Information Processing Systems 34 (NeurIPS 2021)
Sponsoring Org:
National Science Foundation
More Like this
  1. While there are convergence guarantees for temporal difference (TD) learning when using linear function approximators, the situation for nonlinear models is far less understood, and divergent examples are known. Here we take a first step towards extending theoretical convergence guarantees to TD learning with nonlinear function approximation. More precisely, we consider the expected learning dynamics of the TD(0) algorithm for value estimation. As the step-size converges to zero, these dynamics are defined by a nonlinear ODE which depends on the geometry of the space of function approximators, the structure of the underlying Markov chain, and their interaction. We find a set of function approximators that includes ReLU networks and has geometry amenable to TD learning regardless of environment, so that the solution performs about as well as linear TD in the worst case. Then, we show how environments that are more reversible induce dynamics that are better for TD learning and prove global convergence to the true value function for well-conditioned function approximators. Finally, we generalize a divergent counterexample to a family of divergent problems to demonstrate how the interaction between approximator and environment can go wrong and to motivate the assumptions needed to prove convergence.
  2. We develop a simple Quantile Spacing (QS) method for accurate probabilistic estimation of one-dimensional entropy from equiprobable random samples, and compare it with the popular Bin-Counting (BC) and Kernel Density (KD) methods. In contrast to BC, which uses equal-width bins with varying probability mass, the QS method uses estimates of the quantiles that divide the support of the data generating probability density function (pdf) into equal-probability-mass intervals. And, whereas BC and KD each require optimal tuning of a hyper-parameter whose value varies with sample size and shape of the pdf, QS only requires specification of the number of quantiles to be used. Results indicate, for the class of distributions tested, that the optimal number of quantiles is a fixed fraction of the sample size (empirically determined to be ~0.25–0.35), and that this value is relatively insensitive to distributional form or sample size. This provides a clear advantage over BC and KD since hyper-parameter tuning is not required. Further, unlike KD, there is no need to select an appropriate kernel-type, and so QS is applicable to pdfs of arbitrary shape, including those with discontinuous slope and/or magnitude. Bootstrapping is used to approximate the sampling variability distribution of the resulting entropy estimate,more »and is shown to accurately reflect the true uncertainty. For the four distributional forms studied (Gaussian, Log-Normal, Exponential and Bimodal Gaussian Mixture), expected estimation bias is less than 1% and uncertainty is low even for samples of as few as 100 data points; in contrast, for KD the small sample bias can be as large as −10% and for BC as large as −50%. We speculate that estimating quantile locations, rather than bin-probabilities, results in more efficient use of the information in the data to approximate the underlying shape of an unknown data generating pdf.« less
  3. Statistical divergences (SDs), which quantify the dissimilarity between probability distributions, are a basic constituent of statistical inference and machine learning. A modern method for estimating those divergences relies on parametrizing an empirical variational form by a neural network (NN) and optimizing over parameter space. Such neural estimators are abundantly used in practice, but corresponding performance guarantees are partial and call for further exploration. We establish non-asymptotic absolute error bounds for a neural estimator realized by a shallow NN, focusing on four popular 𝖿-divergences---Kullback-Leibler, chi-squared, squared Hellinger, and total variation. Our analysis relies on non-asymptotic function approximation theorems and tools from empirical process theory to bound the two sources of error involved: function approximation and empirical estimation. The bounds characterize the effective error in terms of NN size and the number of samples, and reveal scaling rates that ensure consistency. For compactly supported distributions, we further show that neural estimators of the first three divergences above with appropriate NN growth-rate are minimax rate-optimal, achieving the parametric convergence rate.
  4. The advent of pervasive autonomous systems such as self-driving cars and drones has raised questions about their safety and trustworthiness. This is particularly relevant in the event of on-board subsystem errors or failures. In this research, we show how encoded Extended Kalman Filter can be used to detect anomalous behaviors of critical components of nonlinear autonomous systems: sensors, actuators, state estimation algorithms and control software. As opposed to prior work that is limited to linear systems or requires the use of cumbersome machine learned checks with fixed detection thresholds, the proposed approach necessitates the use of time-varying checks with dynamically adaptive thresholds. The method is lightweight in comparison to existing methods (does not rely on machine learning paradigms) and achieves high coverage as well as low detection latency of errors. A quadcopter and an automotive steer-by-wire system are used as test vehicles for the research and simulation and hardware results indicate the overhead, coverage and error detection latency benefits of the proposed approach.
  5. Estimating the evolutionary power spectral density (EPSD) of non-stationary winds (e.g., tropical storms and downbursts) is necessary to predict the response of structures under such extreme winds. Following the review of the existing direct estimation methods of EPSD, this paper offers a two-step unified formulation, i.e., raw estimation and associated error reduction. The raw estimation is expressed in terms of a time-frequency transform with a general kernel function. It is shown that if the kernel function is described by a time-frequency analysis tool such as the short-time Fourier transform, the wavelet transform, and the S-transform, the generalized raw EPSD estimation becomes a particular case of the existing methods. The unified estimation method presented here can be viewed as a filter bank with adjustable time and frequency resolution. The analysis of error in the raw estimation is carried out on the bias and random errors accounting for the approximation in both the time and frequency domains. Various techniques for reducing such errors are then summarized and recast in the unified formulation, including series expansion, short-time window smoothing, and multi-tapering. Based on the unified perspective, a discussion and some prospects of EPSD estimating are provided.