We study modelfree reinforcement learning (RL) algorithms for infinitehorizon averagereward Markov decision process (MDP), which is more appropriate for applications that involve continuing operations not divided into episodes. In contrast to episodic/discounted MDPs, theoretical understanding of modelfree RL algorithms is relatively inadequate for the averagereward setting. In this paper, we consider both the online setting and the setting with access to a simulator. We develop computationally efficient modelfree algorithms that achieve sharper guarantees on regret/sample complexity compared with existing results. In the online setting, we design an algorithm, UCBAVG, based on an optimistic variant of variancereduced Qlearning. We show that UCBAVG achieves a regret bound $\widetilde{O}(S^5A^2sp(h^*)\sqrt{T})$ after $T$ steps, where $S\times A$ is the size of stateaction space, and $sp(h^*)$ the span of the optimal bias function. Our result provides the first computationally efficient modelfree algorithm that achieves the optimal dependence in $T$ (up to log factors) for weakly communicating MDPs, which is necessary for low regret. In contrast, prior results either are suboptimal in $T$ or require strong assumptions of ergodicity or uniformly mixing of MDPs. In the simulator setting, we adapt the idea of UCBAVG to develop a modelfree algorithm that finds an $\epsilon$optimal policy with sample complexity $\widetilde{O}(SAsp^2(h^*)\epsilon^{2} + S^2Asp(h^*)\epsilon^{1}).$ This sample complexity is nearoptimal for weakly communicating MDPs, in view of the minimax lower bound $\Omega(SAsp(^*)\epsilon^{2})$. Existing work mainly focuses on ergodic MDPs and the results typically depend on $t_{mix},$ the worstcase mixing time induced by a policy. We remark that the diameter $D$ and mixing time $t_{mix}$ are both lower bounded by $sp(h^*)$, and $t_{mix}$ can be arbitrarily large for certain MDPs. On the technical side, our approach integrates two key ideas: learning an $\gamma$discounted MDP as an approximation, and leveraging referenceadvantage decomposition for variance in optimistic Qlearning. As recognized in prior work, a naive approximation by discounted MDPs results in suboptimal guarantees. A distinguishing feature of our method is maintaining estimates of valuedifference between state pairs to provide a sharper bound on the variance of reference advantage. We also crucially use a careful choice of the discounted factor $\gamma$ to balance approximation error due to discounting and the statistical learning error, and we are able to maintain a goodquality reference value function with $O(SA)$ space complexity.
more »
« less
Pessimistic QLearning for Offline Reinforcement Learning: Towards Optimal Sample Complexity
Offline or batch reinforcement learning seeks to learn a nearoptimal policy using history data without active exploration of the environment. To counter the insufficient coverage and sample scarcity of many offline datasets, the principle of pessimism has been recently introduced to mitigate high bias of the estimated values. While pessimistic variants of modelbased algorithms (e.g., value iteration with lower confidence bounds) have been theoretically investigated, their modelfree counterparts — which do not require explicit model estimation — have not been adequately studied, especially in terms of sample efficiency. To address this inadequacy, we study a pessimistic variant of Qlearning in the context of finitehorizon Markov decision processes, and characterize its sample complexity under the singlepolicy concentrability assumption which does not require the full coverage of the stateaction space. In addition, a variancereduced pessimistic Qlearning algorithm is proposed to achieve nearoptimal sample complexity. Altogether, this work highlights the efficiency of modelfree algorithms in offline RL when used in conjunction with pessimism and variance reduction.
more »
« less
 NSFPAR ID:
 10340928
 Date Published:
 Journal Name:
 Proceedings of the 39th International Conference on Machine Learning
 Format(s):
 Medium: X
 Sponsoring Org:
 National Science Foundation
More Like this


We study the \emph{offline reinforcement learning} (offline RL) problem, where the goal is to learn a rewardmaximizing policy in an unknown \emph{Markov Decision Process} (MDP) using the data coming from a policy $\mu$. In particular, we consider the sample complexity problems of offline RL for the finite horizon MDPs. Prior works derive the informationtheoretical lower bounds based on different datacoverage assumptions and their upper bounds are expressed by the covering coefficients which lack the explicit characterization of system quantities. In this work, we analyze the \emph{Adaptive Pessimistic Value Iteration} (APVI) algorithm and derive the suboptimality upper bound that nearly matches $ O\left(\sum_{h=1}^H\sum_{s_h,a_h}d^{\pi^\star}_h(s_h,a_h)\sqrt{\frac{\mathrm{Var}_{P_{s_h,a_h}}{(V^\star_{h+1}+r_h)}}{d^\mu_h(s_h,a_h)}}\sqrt{\frac{1}{n}}\right). $ We also prove an informationtheoretical lower bound to show this quantity is required under the weak assumption that $d^\mu_h(s_h,a_h)>0$ if $d^{\pi^\star}_h(s_h,a_h)>0$. Here $\pi^\star$ is a optimal policy, $\mu$ is the behavior policy and $d(s_h,a_h)$ is the marginal stateaction probability. We call this adaptive bound the \emph{intrinsic offline reinforcement learning bound} since it directly implies all the existing optimal results: minimax rate under uniform datacoverage assumption, horizonfree setting, single policy concentrability, and the tight problemdependent results. Later, we extend the result to the \emph{assumptionfree} regime (where we make no assumption on $ \mu$) and obtain the assumptionfree intrinsic bound. Due to its generic form, we believe the intrinsic bound could help illuminate what makes a specific problem hard and reveal the fundamental challenges in offline RL.more » « less

Abstract Achieving sample efficiency in online episodic reinforcement learning (RL) requires optimally balancing exploration and exploitation. When it comes to a finitehorizon episodic Markov decision process with $S$ states, $A$ actions and horizon length $H$, substantial progress has been achieved toward characterizing the minimaxoptimal regret, which scales on the order of $\sqrt{H^2SAT}$ (modulo log factors) with $T$ the total number of samples. While several competing solution paradigms have been proposed to minimize regret, they are either memoryinefficient, or fall short of optimality unless the sample size exceeds an enormous threshold (e.g. $S^6A^4 \,\mathrm{poly}(H)$ for existing modelfree methods). To overcome such a large sample size barrier to efficient RL, we design a novel modelfree algorithm, with space complexity $O(SAH)$, that achieves nearoptimal regret as soon as the sample size exceeds the order of $SA\,\mathrm{poly}(H)$. In terms of this sample size requirement (also referred to the initial burnin cost), our method improves—by at least a factor of $S^5A^3$—upon any prior memoryefficient algorithm that is asymptotically regretoptimal. Leveraging the recently introduced variance reduction strategy (also called referenceadvantage decomposition), the proposed algorithm employs an earlysettled reference update rule, with the aid of two Qlearning sequences with upper and lower confidence bounds. The design principle of our earlysettled variance reduction method might be of independent interest to other RL settings that involve intricate exploration–exploitation tradeoffs.more » « less

The practicality of reinforcement learning algorithms has been limited due to poor scaling with respect to the problem size, as the sample complexity of learning an εoptimal policy is Ω(SAH/ ε2) over worst case instances of an MDP with state space S, action space A, and horizon H. We consider a class of MDPs for which the associated optimal Q* function is low rank, where the latent features are unknown. While one would hope to achieve linear sample complexity in S and A due to the low rank structure, we show that without imposing further assumptions beyond low rank of Q*, if one is constrained to estimate the Q function using only observations from a subset of entries, there is a worst case instance in which one must incur a sample complexity exponential in the horizon H to learn a near optimal policy. We subsequently show that under stronger low rank structural assumptions, given access to a generative model, Low Rank Monte Carlo Policy Iteration (LRMCPI) and Low Rank Empirical Value Iteration (LREVI) achieve the desired sample complexity of Õ((S+A)poly (d,H)/ε2) for a rank d setting, which is minimax optimal with respect to the scaling of S, A, and ε. In contrast to literature on linear and lowrank MDPs, we do not require a known feature mapping, our algorithm is computationally simple, and our results hold for long time horizons. Our results provide insights on the minimal lowrank structural assumptions required on the MDP with respect to the transition kernel versus the optimal actionvalue function.more » « less

We consider a challenging theoretical problem in offline reinforcement learning (RL): obtaining sampleefficiency guarantees with a dataset lacking sufficient coverage, under only realizabilitytype assumptions for the function approximators. While the existing theory has addressed learning under realizability and under nonexploratory data separately, no work has been able to address both simultaneously (except for a concurrent work which we compare in detail). Under an additional gap assumption, we provide guarantees to a simple pessimistic algorithm based on a version space formed by marginalized importance sampling (MIS), and the guarantee only requires the data to cover the optimal policy and the function classes to realize the optimal value and densityratio functions. While similar gap assumptions have been used in other areas of RL theory, our work is the first to identify the utility and the novel mechanism of gap assumptions in offline RL with weak function approximation.more » « less