We propose a model-based lifelong reinforcement-learning approach that estimates a hierarchical Bayesian posterior distilling the common structure shared across different tasks. The learned posterior combined with a sample-based Bayesian exploration procedure increases the sample efficiency of learning across a family of related tasks. We first derive an analysis of the relationship between the sample complexity and the initialization quality of the posterior in the finite MDP setting. We next scale the approach to continuous-state domains by introducing a Variational Bayesian Lifelong Reinforcement Learning algorithm that can be combined with recent model-based deep RL methods, and that exhibits backward transfer. Experimental results on several challenging domains show that our algorithms achieve both better forward and backward transfer performance than state-of-the-art lifelong RL methods
more »
« less
Accelerating Model Free Reinforcement Learning with Imperfect Model Knowledge in Dynamic Spectrum Access
Current studies that apply reinforcement learning (RL) to dynamic spectrum access (DSA) problems in wireless communications systems are mainly focusing on model-free RL. However, in practice model-free RL requires large number of samples to achieve good performance making it impractical in real time applications such as DSA. Combining model-free and model-based RL can potentially reduce the sample complexity while achieving similar level of performance as model-free RL as long as the learned model is accurate enough. However, in complex environment the learned model is never perfect. In this paper we combine model-free and model-based reinforcement learning, introduce an algorithm that can work with an imperfectly learned model to accelerate the model-free reinforcement learning. Results show our algorithm achieves higher sample efficiency than standard model-free RL algorithm and Dyna algorithm (a standard algorithm that integrating model-based and model-free RL) with much lower computation complexity than the Dyna algorithm. For the extreme case where the learned model is highly inaccurate, the Dyna algorithm performs even worse than the model-free RL algorithm while our algorithm can still outperform the model-free RL algorithm.
more »
« less
- PAR ID:
- 10161733
- Date Published:
- Journal Name:
- IEEE Internet of Things Journal
- ISSN:
- 2372-2541
- Page Range / eLocation ID:
- 1 to 1
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
To mitigate the limitation that the classical reinforcement learning (RL) framework heavily relies on identical training and test environments, Distributionally Robust RL (DRRL) has been proposed to enhance performance across a range of environments, possibly including unknown test environments. As a price for robustness gain, DRRL involves optimizing over a set of distributions, which is inherently more challenging than optimizing over a fixed distribution in the non-robust case. Existing DRRL algorithms are either model-based or fail to learn from a single sample trajectory. In this paper, we design a first fully model-free DRRL algorithm, called distributionally robust Q-learning with single trajectory (DRQ). We delicately design a multi-timescale framework to fully utilize each incrementally arriving sample and directly learn the optimal distributionally robust policy without modeling the environment, thus the algorithm can be trained along a single trajectory in a model-free fashion. Despite the algorithm's complexity, we provide asymptotic convergence guarantees by generalizing classical stochastic approximation tools.Comprehensive experimental results demonstrate the superior robustness and sample complexity of our proposed algorithm, compared to non-robust methods and other robust RL algorithms. @inproceedings{ liang2024singletrajectory, title={Single-Trajectory Distributionally Robust Reinforcement Learning}, author={Zhipeng Liang and Xiaoteng Ma and Jose Blanchet and Jun Yang and Jiheng Zhang and Zhengyuan Zhou}, booktitle={Forty-first International Conference on Machine Learning}, year={2024}, url={https://openreview.net/forum?id=3B6vmW2L80} }more » « less
-
Salakhutdinov, Ruslan; Kolter, Zico; Heller, Katherine; Weller, Adrian; Oliver, Nuria; Scarlett, Jonathan; Berkenkamp, Felix (Ed.)To mitigate the limitation that the classical reinforcement learning (RL) framework heavily relies on identical training and test environments, Distributionally Robust RL (DRRL) has been proposed to enhance performance across a range of environments, possibly including unknown test environments. As a price for robustness gain, DRRL involves optimizing over a set of distributions, which is inherently more challenging than optimizing over a fixed distribution in the non-robust case. Existing DRRL algorithms are either model-based or fail to learn from a single sample trajectory. In this paper, we design a first fully model-free DRRL algorithm, called distributionally robust Q-learning with single trajectory (DRQ). We delicately design a multi-timescale framework to fully utilize each incrementally arriving sample and directly learn the optimal distributionally robust policy without modeling the environment, thus the algorithm can be trained along a single trajectory in a model-free fashion. Despite the algorithm’s complexity, we provide asymptotic convergence guarantees by generalizing classical stochastic approximation tools. Comprehensive experimental results demonstrate the superior robustness and sample complexity of our proposed algorithm, compared to non-robust methods and other robust RL algorithms.more » « less
-
Salakhutdinov, Ruslan; Kolter, Zico; Heller, Katherine; Weller, Adrian; Oliver, Nuria; Scarlett, Jonathan; Berkenkamp, Felix (Ed.)To mitigate the limitation that the classical reinforcement learning (RL) framework heavily relies on identical training and test environments, Distributionally Robust RL (DRRL) has been proposed to enhance performance across a range of environments, possibly including unknown test environments. As a price for robustness gain, DRRL involves optimizing over a set of distributions, which is inherently more challenging than optimizing over a fixed distribution in the non-robust case. Existing DRRL algorithms are either model-based or fail to learn from a single sample trajectory. In this paper, we design a first fully model-free DRRL algorithm, called distributionally robust Q-learning with single trajectory (DRQ). We delicately design a multi-timescale framework to fully utilize each incrementally arriving sample and directly learn the optimal distributionally robust policy without modeling the environment, thus the algorithm can be trained along a single trajectory in a model-free fashion. Despite the algorithm's complexity, we provide asymptotic convergence guarantees by generalizing classical stochastic approximation tools. Comprehensive experimental results demonstrate the superior robustness and sample complexity of our proposed algorithm, compared to non-robust methods and other robust RL algorithms.more » « less
-
The problem of two-player zero-sum Markov games has recently attracted increasing interests in theoretical studies of multi-agent reinforcement learning (RL). In particular, for finite-horizon episodic Markov decision processes (MDPs), it has been shown that model-based algorithms can find an ϵ-optimal Nash Equilibrium (NE) with the sample complexity of O(H3SAB/ϵ2), which is optimal in the dependence of the horizon H and the number of states S (where A and B denote the number of actions of the two players, respectively). However, none of the existing model-free algorithms can achieve such an optimality. In this work, we propose a model-free stage-based Q-learning algorithm and show that it achieves the same sample complexity as the best model-based algorithm, and hence for the first time demonstrate that model-free algorithms can enjoy the same optimality in the H dependence as model-based algorithms. The main improvement of the dependency on H arises by leveraging the popular variance reduction technique based on the reference-advantage decomposition previously used only for single-agent RL. However, such a technique relies on a critical monotonicity property of the value function, which does not hold in Markov games due to the update of the policy via the coarse correlated equilibrium (CCE) oracle. Thus, to extend such a technique to Markov games, our algorithm features a key novel design of updating the reference value functions as the pair of optimistic and pessimistic value functions whose value difference is the smallest in the history in order to achieve the desired improvement in the sample efficiency.more » « less
An official website of the United States government

