 Award ID(s):
 1703574
 NSFPAR ID:
 10177060
 Date Published:
 Journal Name:
 Proceedings of Machine Learning Research
 Volume:
 125
 ISSN:
2640-3498
 Page Range / eLocation ID:
 6783
 Format(s):
 Medium: X
 Sponsoring Org:
 National Science Foundation
More Like this

Model-based reinforcement learning (RL), which finds an optimal policy using an empirical model, has long been recognized as one of the cornerstones of RL. It is especially suitable for multi-agent RL (MARL), as it naturally decouples the learning and planning phases, and avoids the non-stationarity problem that arises when all agents improve their policies simultaneously using samples. Though intuitive and widely used, the sample complexity of model-based MARL algorithms has received relatively little investigation. In this paper, we aim to address this fundamental open question about the sample complexity of model-based MARL. We study arguably the most basic MARL setting: two-player discounted zero-sum Markov games, given only access to a generative model of state transitions. We show that model-based MARL achieves a near-optimal sample complexity for finding the Nash equilibrium (NE) value up to some additive error. We also show that this method is near-minimax optimal, with a tight dependence on the horizon and the number of states. Our results justify the efficiency of this simple model-based approach in the multi-agent RL setting.
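As a rough illustration of the plug-in recipe described above (estimate an empirical transition model from generative-model samples, then plan in it), here is a minimal sketch. Note its simplification: the per-state stage game is solved by pure-strategy max-min, which is exact only for turn-based games, whereas the paper's simultaneous-move setting requires solving a mixed-strategy matrix game at each state. All names below are placeholders, not the paper's algorithm.

```python
import numpy as np

def empirical_model(sample_next_state, S, A, B, N):
    # Plug-in estimate: draw N next-state samples per (s, a, b) from the
    # generative model and record empirical transition frequencies.
    P_hat = np.zeros((S, A, B, S))
    for s in range(S):
        for a in range(A):
            for b in range(B):
                for _ in range(N):
                    P_hat[s, a, b, sample_next_state(s, a, b)] += 1.0 / N
    return P_hat

def value_iteration_zero_sum(P_hat, R, gamma, iters=500):
    # Shapley-style value iteration on the empirical game. For simplicity the
    # stage game is solved by pure-strategy max-min (max player picks a,
    # min player picks b), exact for turn-based games but only a bound
    # in general simultaneous-move games.
    V = np.zeros(P_hat.shape[0])
    for _ in range(iters):
        Q = R + gamma * (P_hat @ V)   # shape (S, A, B)
        V = Q.min(axis=2).max(axis=1)
    return V
```

With a deterministic toy game, the empirical model is exact and the iteration converges to the discounted value of the induced Markov game.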


In offline reinforcement learning (RL), the goal is to learn a highly rewarding policy based solely on a dataset of historical interactions with the environment. The ability to train RL policies offline would greatly expand where RL can be applied, its data efficiency, and its experimental velocity. Prior work in offline RL has been confined almost exclusively to model-free RL approaches. In this work, we present MOReL, an algorithmic framework for model-based offline RL. This framework consists of two steps: (a) learning a pessimistic MDP (P-MDP) using the offline dataset; (b) learning a near-optimal policy in this P-MDP. The learned P-MDP has the property that for any policy, the performance in the real environment is approximately lower-bounded by the performance in the P-MDP. This enables it to serve as a good surrogate for purposes of policy evaluation and learning, and to overcome common pitfalls of model-based RL such as model exploitation. Theoretically, we show that MOReL is minimax optimal (up to log factors) for offline RL. Through experiments, we show that MOReL matches or exceeds state-of-the-art results on widely studied offline RL benchmarks. Moreover, the modular design of MOReL enables future advances in its components (e.g., model learning, planning) to directly translate into improvements for offline RL.
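A tabular caricature of the two-step P-MDP construction may help fix ideas. This is a hedged sketch, not MOReL itself (which learns neural dynamics models and uses ensemble disagreement for uncertainty): here an "unknown" state-action pair is simply one visited fewer than n_min times in the dataset, and unknown pairs are rerouted to an absorbing HALT state with pessimistic reward r_min.

```python
import numpy as np

def build_pessimistic_mdp(dataset, S, A, n_min, r_min):
    # dataset: iterable of (s, a, r, s') transitions logged by a behavior
    # policy. Pairs seen fewer than n_min times are treated as unknown and
    # rerouted to an absorbing HALT state (index S) with reward r_min, so any
    # policy that strays into poorly covered regions is penalized.
    counts = np.zeros((S, A))
    sums = np.zeros((S, A))
    P = np.zeros((S + 1, A, S + 1))
    R = np.full((S + 1, A), float(r_min))
    for s, a, r, s2 in dataset:
        counts[s, a] += 1
        sums[s, a] += r
        P[s, a, s2] += 1
    for s in range(S):
        for a in range(A):
            if counts[s, a] >= n_min:
                P[s, a] /= counts[s, a]              # empirical transitions
                R[s, a] = sums[s, a] / counts[s, a]  # empirical mean reward
            else:
                P[s, a] = 0.0
                P[s, a, S] = 1.0                     # unknown -> HALT
    P[S, :, S] = 1.0                                 # HALT is absorbing
    return P, R

def plan(P, R, gamma=0.95, iters=1000):
    # Step (b): ordinary value iteration in the pessimistic surrogate MDP.
    V = np.zeros(P.shape[0])
    for _ in range(iters):
        V = (R + gamma * (P @ V)).max(axis=1)
    return V
```

Because HALT is absorbing with reward r_min, the surrogate value of any policy that leaves the data's support is driven down, which is what makes the P-MDP an approximate lower bound on real-environment performance.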

The practicality of reinforcement learning algorithms has been limited by poor scaling with respect to the problem size, as the sample complexity of learning an ε-optimal policy is Ω(SAH/ε²) over worst-case instances of an MDP with state space S, action space A, and horizon H. We consider a class of MDPs for which the associated optimal Q* function is low-rank, where the latent features are unknown. While one would hope to achieve linear sample complexity in S and A due to the low-rank structure, we show that, without imposing further assumptions beyond low rank of Q*, if one is constrained to estimate the Q function using only observations from a subset of entries, there is a worst-case instance in which one must incur a sample complexity exponential in the horizon H to learn a near-optimal policy. We subsequently show that under stronger low-rank structural assumptions, given access to a generative model, Low Rank Monte Carlo Policy Iteration (LR-MCPI) and Low Rank Empirical Value Iteration (LR-EVI) achieve the desired sample complexity of Õ((S+A)·poly(d,H)/ε²) for a rank-d setting, which is minimax optimal with respect to the scaling of S, A, and ε. In contrast to the literature on linear and low-rank MDPs, we do not require a known feature mapping, our algorithm is computationally simple, and our results hold for long time horizons. Our results provide insight into the minimal low-rank structural assumptions required on the MDP with respect to the transition kernel versus the optimal action-value function.
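The role of the low-rank assumption can be illustrated with a CUR-style completion: if Q is exactly rank d, then observing d full rows, d full columns, and their intersection determines the entire S x A matrix, which is the intuition behind the (S+A) rather than S·A sample scaling. This toy sketch is not LR-MCPI or LR-EVI themselves, which must handle noisy Monte Carlo estimates and repeat this step across Bellman iterations.

```python
import numpy as np

def cur_complete(C, U, Rb):
    # CUR-style reconstruction: for an exactly rank-d matrix Q, d observed
    # full columns C = Q[:, cols], d observed full rows Rb = Q[rows, :], and
    # their d x d intersection U = Q[rows, cols] determine Q = C @ U^+ @ Rb,
    # provided U itself has rank d.
    return C @ np.linalg.pinv(U) @ Rb
```

For instance, a random rank-2 matrix of shape 6 x 5 is recovered exactly from its first two rows and first two columns, i.e., from 6·2 + 2·5 - 4 observed entries instead of all 30.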

This paper studies a central issue in modern reinforcement learning, sample efficiency, and makes progress toward solving an idealistic scenario that assumes access to a generative model or a simulator. Despite a large number of prior works tackling this problem, a complete picture of the trade-offs between sample complexity and statistical accuracy has yet to be determined. In particular, all prior results suffer from a severe sample size barrier, in the sense that their claimed statistical guarantees hold only when the sample size exceeds some enormous threshold. The current paper overcomes this barrier and fully settles the problem; more specifically, we establish the minimax optimality of the model-based approach for any given target accuracy level. To the best of our knowledge, this work delivers the first minimax-optimal guarantees that accommodate the entire range of sample sizes (beyond which finding a meaningful policy is information-theoretically infeasible).
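For concreteness, the model-based (plug-in) approach analyzed in this line of work amounts to: draw samples from the generative model, form the empirical MDP, and plan in it. The sketch below is a simplified tabular illustration; the small random reward perturbation is one device used in this literature to make the empirical optimal policy unique and stable at small sample sizes, but the exact construction and constants here are assumptions, not the paper's.

```python
import numpy as np

def perturbed_plugin_policy(P_hat, R, gamma=0.9, xi=1e-3, iters=500, seed=0):
    # Plug-in planning: value iteration on the empirical MDP (P_hat, R).
    # A tiny uniform reward perturbation of magnitude at most xi breaks ties
    # between actions, so the greedy policy of the perturbed empirical MDP is
    # unique; the scheme shown is illustrative only.
    rng = np.random.default_rng(seed)
    R_p = R + xi * rng.random(R.shape)
    V = np.zeros(R.shape[0])
    for _ in range(iters):
        V = (R_p + gamma * (P_hat @ V)).max(axis=1)
    # Return the greedy (deterministic) policy of the perturbed empirical MDP.
    return (R_p + gamma * (P_hat @ V)).argmax(axis=1)
```

When the true action gap exceeds the perturbation size, as in any non-degenerate instance, the returned policy matches the unperturbed optimum.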