This content will become publicly available on December 1, 2022

Towards instance-optimal offline reinforcement learning with pessimism
We study the \emph{offline reinforcement learning} (offline RL) problem, where the goal is to learn a reward-maximizing policy in an unknown \emph{Markov Decision Process} (MDP) using data collected by a behavior policy $\mu$. In particular, we consider the sample complexity of offline RL for finite-horizon MDPs. Prior works derive information-theoretical lower bounds based on different data-coverage assumptions, and their upper bounds are expressed by covering coefficients that lack an explicit characterization of system quantities. In this work, we analyze the \emph{Adaptive Pessimistic Value Iteration} (APVI) algorithm and derive a suboptimality upper bound that nearly matches $O\left(\sum_{h=1}^H\sum_{s_h,a_h}d^{\pi^\star}_h(s_h,a_h)\sqrt{\frac{\mathrm{Var}_{P_{s_h,a_h}}{(V^\star_{h+1}+r_h)}}{d^\mu_h(s_h,a_h)}}\sqrt{\frac{1}{n}}\right).$ We also prove an information-theoretical lower bound to show this quantity is required under the weak assumption that $d^\mu_h(s_h,a_h)>0$ if $d^{\pi^\star}_h(s_h,a_h)>0$. Here $\pi^\star$ is an optimal policy, $\mu$ is the behavior policy and $d(s_h,a_h)$ is the marginal state-action probability. We call this adaptive bound the \emph{intrinsic offline reinforcement learning bound} since it directly implies all the existing optimal results: minimax rate under uniform data-coverage assumption, horizon-free setting, single policy concentrability, and the tight problem-dependent results. Later, we extend the result to the \emph{assumption-free} regime (where we make no assumption on $\mu$) and obtain the assumption-free intrinsic bound. Due to its generic more »
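The quantity inside the $O(\cdot)$ above can be evaluated directly for a small tabular MDP. The following is a minimal sketch, not the APVI algorithm itself: it assumes the model is fully known, rewards are deterministic (so $r_h$ contributes nothing to the variance), and constants and logarithmic factors are ignored. The function name `intrinsic_bound`, the array layout, and the toy two-state MDP are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def intrinsic_bound(P, r, mu, d1, n):
    """Evaluate sum_h sum_{s,a} d*_h(s,a) * sqrt(Var_P(V*_{h+1}) / d^mu_h(s,a)) / sqrt(n).

    P:  (H, S, A, S) transition probabilities (assumed known here).
    r:  (H, S, A) deterministic rewards (assumption).
    mu: (H, S, A) behavior-policy action probabilities.
    d1: (S,) initial state distribution.
    n:  number of offline episodes.
    """
    H, S, A, _ = P.shape
    V = np.zeros((H + 1, S))              # V*_{H+1} = 0
    pi_star = np.zeros((H, S), dtype=int)
    var = np.zeros((H, S, A))
    # Backward induction for V*, a greedy optimal policy, and the variances.
    for h in range(H - 1, -1, -1):
        ev = P[h] @ V[h + 1]              # E[V*_{h+1} | s, a], shape (S, A)
        ev2 = P[h] @ (V[h + 1] ** 2)
        var[h] = np.maximum(ev2 - ev ** 2, 0.0)
        q = r[h] + ev
        pi_star[h] = q.argmax(axis=1)
        V[h] = q.max(axis=1)
    # Forward propagation of marginal state-action occupancies.
    d_star = np.zeros((H, S, A))
    d_mu = np.zeros((H, S, A))
    s_star, s_mu = d1.copy(), d1.copy()
    for h in range(H):
        d_star[h, np.arange(S), pi_star[h]] = s_star
        d_mu[h] = s_mu[:, None] * mu[h]
        s_star = np.einsum("sa,sat->t", d_star[h], P[h])
        s_mu = np.einsum("sa,sat->t", d_mu[h], P[h])
    mask = d_star > 0
    # The weak coverage assumption: d^mu > 0 wherever d^{pi*} > 0.
    if np.any(d_mu[mask] <= 0):
        raise ValueError("behavior policy does not cover the optimal policy")
    return float(np.sum(d_star[mask] * np.sqrt(var[mask] / d_mu[mask])) / np.sqrt(n))

# Toy MDP (hypothetical): H = 2, S = 2, A = 2, start in state 0.
H, S, A = 2, 2, 2
P = np.zeros((H, S, A, S))
P[0, 0, 0] = [0.5, 0.5]   # risky action: lands in state 0 or 1
P[0, 0, 1] = [1.0, 0.0]   # safe action: stays in state 0
P[0, 1, :] = [0.0, 1.0]
P[1, 0, :] = [1.0, 0.0]
P[1, 1, :] = [0.0, 1.0]
r = np.zeros((H, S, A))
r[1, 1, 0] = 1.0          # only reward: state 1, action 0, at the last step
mu = np.full((H, S, A), 0.5)   # uniform behavior policy
d1 = np.array([1.0, 0.0])

bound = intrinsic_bound(P, r, mu, d1, n=100)
print(bound)
```

In this toy instance only the first step contributes, because $V^\star_{3}=0$ makes the second-step variance vanish; the quantity therefore shrinks as the behavior policy puts more mass on the single high-variance state-action pair that the optimal policy visits.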
NSF-PAR ID:
10346207
Journal Name:
Advances in neural information processing systems
Volume:
34
Page Range or eLocation-ID:
4065--4078
ISSN:
1049-5258
We study the problem of reinforcement learning (RL) with low (policy) switching cost, a problem well motivated by real-life RL applications in which deployments of new policies are costly and the number of policy updates must be low. In this paper, we propose a new algorithm based on stage-wise exploration and adaptive policy elimination that achieves a regret of $\widetilde{O}(\sqrt{H^4S^2AT})$ while requiring a switching cost of $O(HSA \log\log T)$. This is an exponential improvement over the best-known switching cost $O(H^2SA\log T)$ among existing methods with $\widetilde{O}(\mathrm{poly}(H,S,A)\sqrt{T})$ regret. In the above, $S$ and $A$ denote the numbers of states and actions in an $H$-horizon episodic Markov Decision Process model with unknown transitions, and $T$ is the number of steps. As a byproduct of our new techniques, we also derive a reward-free exploration algorithm with a switching cost of $O(HSA)$. Furthermore, we prove a pair of information-theoretical lower bounds which say that (1) any no-regret algorithm must have a switching cost of $\Omega(HSA)$; (2) any $\widetilde{O}(\sqrt{T})$ regret algorithm must incur a switching cost of $\Omega(HSA\log\log T)$. Both our algorithms are thus optimal in their switching costs.
The classic graphical Cheeger inequalities state that if $M$ is an $n\times n$ \emph{symmetric} doubly stochastic matrix, then $\frac{1-\lambda_{2}(M)}{2}\leq\phi(M)\leq\sqrt{2\cdot(1-\lambda_{2}(M))}$ where $\phi(M)=\min_{S\subseteq[n],|S|\leq n/2}\left(\frac{1}{|S|}\sum_{i\in S,j\not\in S}M_{i,j}\right)$ is the edge expansion of $M$, and $\lambda_{2}(M)$ is the second largest eigenvalue of $M$. We study the relationship between $\phi(A)$ and the spectral gap $1-\mathrm{Re}\,\lambda_{2}(A)$ for \emph{any} doubly stochastic matrix $A$ (not necessarily symmetric), where $\lambda_{2}(A)$ is a nontrivial eigenvalue of $A$ with maximum real part. Fiedler showed that the upper bound on $\phi(A)$ is unaffected, i.e., $\phi(A)\leq\sqrt{2\cdot(1-\mathrm{Re}\,\lambda_{2}(A))}$. With regard to the lower bound on $\phi(A)$, there are known constructions with $\phi(A)\in\Theta\left(\frac{1-\mathrm{Re}\,\lambda_{2}(A)}{\log n}\right),$ indicating that at least a mild dependence on $n$ is necessary to lower bound $\phi(A)$. In our first result, we provide an \emph{exponentially} better construction of $n\times n$ doubly stochastic matrices $A_{n}$, for which $\phi(A_{n})\leq\frac{1-\mathrm{Re}\,\lambda_{2}(A_{n})}{\sqrt{n}}.$ In fact, \emph{all} nontrivial eigenvalues of our matrices are $0$, even though the matrices are highly \emph{nonexpanding}. We further show that this bound is in the correct range (up to the exponent of $n$), by showing that for any doubly stochastic matrix $A$, $\phi(A)\geq\frac{1-\mathrm{Re}\,\lambda_{2}(A)}{35\cdot n}.$ As a consequence, unlike the symmetric case, there is a (necessary) loss of a more »
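The inequalities quoted in this abstract can be checked numerically on small matrices. The sketch below is illustrative only: it computes $\phi$ by brute force over all subsets (feasible only for small $n$), and the two test matrices (a lazy walk on a 6-cycle, and an equal mix of the identity with a cyclic shift) are assumptions chosen for convenience, not the constructions from the paper.

```python
import itertools
import numpy as np

def edge_expansion(M):
    """Brute-force phi(M): min over nonempty S with |S| <= n/2 of
    (1/|S|) * sum_{i in S, j not in S} M[i, j]."""
    n = M.shape[0]
    best = float("inf")
    for k in range(1, n // 2 + 1):
        for subset in itertools.combinations(range(n), k):
            inside = list(subset)
            outside = [j for j in range(n) if j not in subset]
            best = min(best, M[np.ix_(inside, outside)].sum() / k)
    return best

def spectral_gap(M):
    """1 - Re(lambda_2), with lambda_2 the nontrivial eigenvalue of maximum real part."""
    eig = np.linalg.eigvals(M)
    trivial = np.argmin(np.abs(eig - 1.0))   # drop one copy of the eigenvalue 1
    rest = np.delete(eig, trivial)
    return 1.0 - rest.real.max()

n = 6
shift = np.roll(np.eye(n), 1, axis=1)        # cyclic shift permutation matrix

# Symmetric case: lazy random walk on the 6-cycle.
M_sym = 0.5 * np.eye(n) + 0.25 * shift + 0.25 * shift.T
phi_sym, gap_sym = edge_expansion(M_sym), spectral_gap(M_sym)
assert gap_sym / 2 <= phi_sym <= np.sqrt(2 * gap_sym)

# Nonsymmetric case: doubly stochastic but directed.
A = 0.5 * np.eye(n) + 0.5 * shift
phi_A, gap_A = edge_expansion(A), spectral_gap(A)
assert gap_A / (35 * n) <= phi_A <= np.sqrt(2 * gap_A)

print(phi_sym, gap_sym, phi_A, gap_A)
```

Both example matrices are well behaved, so they satisfy the bounds comfortably; they do not exhibit the $\sqrt{n}$ separation described in the abstract, which requires the paper's specific construction.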