We study modelfree reinforcement learning (RL) algorithms for infinitehorizon averagereward Markov decision process (MDP), which is more appropriate for applications that involve continuing operations not divided into episodes. In contrast to episodic/discounted MDPs, theoretical understanding of modelfree RL algorithms is relatively inadequate for the averagereward setting. In this paper, we consider both the online setting and the setting with access to a simulator. We develop computationally efficient modelfree algorithms that achieve sharper guarantees on regret/sample complexity compared with existing results. In the online setting, we design an algorithm, UCBAVG, based on an optimistic variant of variancereduced Qlearning. We show that UCBAVG achieves a regret bound $\widetilde{O}(S^5A^2sp(h^*)\sqrt{T})$ after $T$ steps, where $S\times A$ is the size of stateaction space, and $sp(h^*)$ the span of the optimal bias function. Our result provides the first computationally efficient modelfree algorithm that achieves the optimal dependence in $T$ (up to log factors) for weakly communicating MDPs, which is necessary for low regret. In contrast, prior results either are suboptimal in $T$ or require strong assumptions of ergodicity or uniformly mixing of MDPs. In the simulator setting, we adapt the idea of UCBAVG to develop a modelfree algorithm that finds an $\epsilon$optimal policy with sample complexitymore »
This content will become publicly available on February 21, 2024
Breaking the sample complexity barrier to regretoptimal modelfree reinforcement learning
Abstract Achieving sample efficiency in online episodic reinforcement learning (RL) requires optimally balancing exploration and exploitation. When it comes to a finitehorizon episodic Markov decision process with $S$ states, $A$ actions and horizon length $H$, substantial progress has been achieved toward characterizing the minimaxoptimal regret, which scales on the order of $\sqrt{H^2SAT}$ (modulo log factors) with $T$ the total number of samples. While several competing solution paradigms have been proposed to minimize regret, they are either memoryinefficient, or fall short of optimality unless the sample size exceeds an enormous threshold (e.g. $S^6A^4 \,\mathrm{poly}(H)$ for existing modelfree methods). To overcome such a large sample size barrier to efficient RL, we design a novel modelfree algorithm, with space complexity $O(SAH)$, that achieves nearoptimal regret as soon as the sample size exceeds the order of $SA\,\mathrm{poly}(H)$. In terms of this sample size requirement (also referred to the initial burnin cost), our method improves—by at least a factor of $S^5A^3$—upon any prior memoryefficient algorithm that is asymptotically regretoptimal. Leveraging the recently introduced variance reduction strategy (also called referenceadvantage decomposition), the proposed algorithm employs an earlysettled reference update rule, with the aid of two Qlearning sequences with upper and lower confidence bounds. The design principle more »
 Publication Date:
 NSFPAR ID:
 10406929
 Journal Name:
 Information and Inference: A Journal of the IMA
 Volume:
 12
 Issue:
 2
 Page Range or eLocationID:
 969 to 1043
 ISSN:
 20498772
 Sponsoring Org:
 National Science Foundation
More Like this


Chaudhuri, Kamalika and (Ed.)We study the problem of reinforcement learning (RL) with low (policy) switching cost {—} a problem wellmotivated by reallife RL applications in which deployments of new policies are costly and the number of policy updates must be low. In this paper, we propose a new algorithm based on stagewise exploration and adaptive policy elimination that achieves a regret of $\widetilde{O}(\sqrt{H^4S^2AT})$ while requiring a switching cost of $O(HSA \log\log T)$. This is an exponential improvement over the bestknown switching cost $O(H^2SA\log T)$ among existing methods with $\widetilde{O}(\mathrm{poly}(H,S,A)\sqrt{T})$ regret. In the above, $S,A$ denotes the number of states and actions in an $H$horizon episodic Markov Decision Process model with unknown transitions, and $T$ is the number of steps. As a byproduct of our new techniques, we also derive a rewardfree exploration algorithm with a switching cost of $O(HSA)$. Furthermore, we prove a pair of informationtheoretical lower bounds which say that (1) Any noregret algorithm must have a switching cost of $\Omega(HSA)$; (2) Any $\widetilde{O}(\sqrt{T})$ regret algorithm must incur a switching cost of $\Omega(HSA\log\log T)$. Both our algorithms are thus optimal in their switching costs.

Offline or batch reinforcement learning seeks to learn a nearoptimal policy using history data without active exploration of the environment. To counter the insufficient coverage and sample scarcity of many offline datasets, the principle of pessimism has been recently introduced to mitigate high bias of the estimated values. While pessimistic variants of modelbased algorithms (e.g., value iteration with lower confidence bounds) have been theoretically investigated, their modelfree counterparts — which do not require explicit model estimation — have not been adequately studied, especially in terms of sample efficiency. To address this inadequacy, we study a pessimistic variant of Qlearning in the context of finitehorizon Markov decision processes, and characterize its sample complexity under the singlepolicy concentrability assumption which does not require the full coverage of the stateaction space. In addition, a variancereduced pessimistic Qlearning algorithm is proposed to achieve nearoptimal sample complexity. Altogether, this work highlights the efficiency of modelfree algorithms in offline RL when used in conjunction with pessimism and variance reduction.

Motivated by personalized healthcare and other applications involving sensitive data, we study online exploration in reinforcement learning with differential privacy (DP) constraints. Existing work on this problem established that noregret learning is possible under joint differential privacy (JDP) and local differential privacy (LDP) but did not provide an algorithm with optimal regret. We close this gap for the JDP case by designing an ϵJDP algorithm with a regret of O˜(sqrt(SAH^2T) +S^2AH^3/ϵ) which matches the informationtheoretic lower bound of nonprivate learning for all choices of ϵ>S^1.5A^0.5H^2/sqrt(T). In the above, S, A denote the number of states and actions, H denotes the planning horizon, and T is the number of steps. To the best of our knowledge, this is the first private RL algorithm that achieves \emph{privacy for free} asymptotically as T→∞. Our techniques  which could be of independent interest  include privately releasing Bernsteintype exploration bonuses and an improved method for releasing visitation statistics. The same techniques also imply a slightly improved regret bound for the LDP case.

We consider the problem of offline reinforcement learning (RL)  a wellmotivated setting of RL that aims at policy optimization using only historical data. Despite its wide applicability, theoretical understandings of offline RL, such as its optimal sample complexity, remain largely open even in basic settings such as \emph{tabular} Markov Decision Processes (MDPs). In this paper, we propose OffPolicy Double Variance Reduction (OPDVR), a new variance reduction based algorithm for offline RL. Our main result shows that OPDVR provably identifies an ϵoptimal policy with O˜(H2/dmϵ2) episodes of offline data in the finitehorizon stationary transition setting, where H is the horizon length and dm is the minimal marginal stateaction distribution induced by the behavior policy. This improves over the best known upper bound by a factor of H. Moreover, we establish an informationtheoretic lower bound of Ω(H2/dmϵ2) which certifies that OPDVR is optimal up to logarithmic factors. Lastly, we show that OPDVR also achieves rateoptimal sample complexity under alternative settings such as the finitehorizon MDPs with nonstationary transitions and the infinite horizon MDPs with discounted rewards.