 NSFPAR ID:
 10440145
 Date Published:
 Journal Name:
 Proceedings of Thirty Fifth Conference on Learning Theory
 Format(s):
 Medium: X
 Sponsoring Org:
 National Science Foundation
More Like this

We study modelfree reinforcement learning (RL) algorithms for infinitehorizon averagereward Markov decision process (MDP), which is more appropriate for applications that involve continuing operations not divided into episodes. In contrast to episodic/discounted MDPs, theoretical understanding of modelfree RL algorithms is relatively inadequate for the averagereward setting. In this paper, we consider both the online setting and the setting with access to a simulator. We develop computationally efficient modelfree algorithms that achieve sharper guarantees on regret/sample complexity compared with existing results. In the online setting, we design an algorithm, UCBAVG, based on an optimistic variant of variancereduced Qlearning. We show that UCBAVG achieves a regret bound $\widetilde{O}(S^5A^2sp(h^*)\sqrt{T})$ after $T$ steps, where $S\times A$ is the size of stateaction space, and $sp(h^*)$ the span of the optimal bias function. Our result provides the first computationally efficient modelfree algorithm that achieves the optimal dependence in $T$ (up to log factors) for weakly communicating MDPs, which is necessary for low regret. In contrast, prior results either are suboptimal in $T$ or require strong assumptions of ergodicity or uniformly mixing of MDPs. In the simulator setting, we adapt the idea of UCBAVG to develop a modelfree algorithm that finds an $\epsilon$optimal policy with sample complexity $\widetilde{O}(SAsp^2(h^*)\epsilon^{2} + S^2Asp(h^*)\epsilon^{1}).$ This sample complexity is nearoptimal for weakly communicating MDPs, in view of the minimax lower bound $\Omega(SAsp(^*)\epsilon^{2})$. Existing work mainly focuses on ergodic MDPs and the results typically depend on $t_{mix},$ the worstcase mixing time induced by a policy. We remark that the diameter $D$ and mixing time $t_{mix}$ are both lower bounded by $sp(h^*)$, and $t_{mix}$ can be arbitrarily large for certain MDPs. On the technical side, our approach integrates two key ideas: learning an $\gamma$discounted MDP as an approximation, and leveraging referenceadvantage decomposition for variance in optimistic Qlearning. As recognized in prior work, a naive approximation by discounted MDPs results in suboptimal guarantees. A distinguishing feature of our method is maintaining estimates of valuedifference between state pairs to provide a sharper bound on the variance of reference advantage. We also crucially use a careful choice of the discounted factor $\gamma$ to balance approximation error due to discounting and the statistical learning error, and we are able to maintain a goodquality reference value function with $O(SA)$ space complexity.more » « less

We study the \emph{offline reinforcement learning} (offline RL) problem, where the goal is to learn a rewardmaximizing policy in an unknown \emph{Markov Decision Process} (MDP) using the data coming from a policy $\mu$. In particular, we consider the sample complexity problems of offline RL for the finite horizon MDPs. Prior works derive the informationtheoretical lower bounds based on different datacoverage assumptions and their upper bounds are expressed by the covering coefficients which lack the explicit characterization of system quantities. In this work, we analyze the \emph{Adaptive Pessimistic Value Iteration} (APVI) algorithm and derive the suboptimality upper bound that nearly matches $ O\left(\sum_{h=1}^H\sum_{s_h,a_h}d^{\pi^\star}_h(s_h,a_h)\sqrt{\frac{\mathrm{Var}_{P_{s_h,a_h}}{(V^\star_{h+1}+r_h)}}{d^\mu_h(s_h,a_h)}}\sqrt{\frac{1}{n}}\right). $ We also prove an informationtheoretical lower bound to show this quantity is required under the weak assumption that $d^\mu_h(s_h,a_h)>0$ if $d^{\pi^\star}_h(s_h,a_h)>0$. Here $\pi^\star$ is a optimal policy, $\mu$ is the behavior policy and $d(s_h,a_h)$ is the marginal stateaction probability. We call this adaptive bound the \emph{intrinsic offline reinforcement learning bound} since it directly implies all the existing optimal results: minimax rate under uniform datacoverage assumption, horizonfree setting, single policy concentrability, and the tight problemdependent results. Later, we extend the result to the \emph{assumptionfree} regime (where we make no assumption on $ \mu$) and obtain the assumptionfree intrinsic bound. Due to its generic form, we believe the intrinsic bound could help illuminate what makes a specific problem hard and reveal the fundamental challenges in offline RL.more » « less

Abstract In this paper, we propose a new framework to construct confidence sets for a $d$dimensional unknown sparse parameter ${\boldsymbol \theta }$ under the normal mean model ${\boldsymbol X}\sim N({\boldsymbol \theta },\sigma ^{2}\bf{I})$. A key feature of the proposed confidence set is its capability to account for the sparsity of ${\boldsymbol \theta }$, thus named as sparse confidence set. This is in sharp contrast with the classical methods, such as the Bonferroni confidence intervals and other resamplingbased procedures, where the sparsity of ${\boldsymbol \theta }$ is often ignored. Specifically, we require the desired sparse confidence set to satisfy the following two conditions: (i) uniformly over the parameter space, the coverage probability for ${\boldsymbol \theta }$ is above a prespecified level; (ii) there exists a random subset $S$ of $\{1,...,d\}$ such that $S$ guarantees the prespecified true negative rate for detecting nonzero $\theta _{j}$’s. To exploit the sparsity of ${\boldsymbol \theta }$, we allow the confidence interval for $\theta _{j}$ to degenerate to a single point 0 for any $j\notin S$. Under this new framework, we first consider whether there exist sparse confidence sets that satisfy the above two conditions. To address this question, we establish a nonasymptotic minimax lower bound for the noncoverage probability over a suitable class of sparse confidence sets. The lower bound deciphers the role of sparsity and minimum signaltonoise ratio (SNR) in the construction of sparse confidence sets. Furthermore, under suitable conditions on the SNR, a twostage procedure is proposed to construct a sparse confidence set. To evaluate the optimality, the proposed sparse confidence set is shown to attain a minimax lower bound of some properly defined risk function up to a constant factor. Finally, we develop an adaptive procedure to the unknown sparsity. Numerical studies are conducted to verify the theoretical results.

This work characterizes the benefits of averaging techniques widely used in conjunction with stochastic gradient descent (SGD). In particular, this work presents a sharp analysis of: (1) mini batching, a method of averaging many samples of a stochastic gradient to both reduce the variance of a stochastic gradient estimate and for parallelizing SGD and (2) tailaveraging, a method involving averaging the final few iterates of SGD in order to decrease the variance in SGD’s final iterate. This work presents sharp finite sample generalization error bounds for these schemes for the stochastic approximation problem of least squares regression. Furthermore, this work establishes a precise problemdependent extent to which minibatching can be used to yield provable nearlinear parallelization speedups over SGD with batch size one. This characterization is used to understand the relationship between learning rate versus batch size when considering the excess risk of the final iterate of an SGD procedure. Next, this minibatching characterization is utilized in providing a highly parallelizable SGD method that achieves the min imax risk with nearly the same number of serial updates as batch gradient descent, improving significantly over existing SGDstyle methods. Following this, a nonasymptotic excess risk bound for model averaging (which is a communication efficient parallelization scheme) is provided. Finally, this work sheds light on fundamental differences in SGD’s behavior when dealing with misspecified models in the nonrealizable least squares problem. This paper shows that maximal stepsizes ensuring minimax risk for the misspecified case must depend on the noise properties. The analysis tools used by this paper generalize the operator view of averaged SGD (De ́fossez and Bach, 2015) followed by developing a novel analysis in bounding these operators to char acterize the generalization error. These techniques are of broader interest in analyzing various computational aspects of stochastic approximation.more » « less

Let
$f$ be analytic on$[0,1]$ with$f^{(k)}(1/2)\leqslant A\alpha ^kk!$ for some constants$A$ and$\alpha >2$ and all$k\geqslant 1$ . We show that the median estimate of$\mu =\int _0^1f(x)\,\mathrm {d} x$ under random linear scrambling with$n=2^m$ points converges at the rate$O(n^{c\log (n)})$ for any$c> 3\log (2)/\pi ^2\approx 0.21$ . We also get a superpolynomial convergence rate for the sample median of$2k1$ random linearly scrambled estimates, when$k/m$ is bounded away from zero. When$f$ has a$p$ ’th derivative that satisfies a$\lambda$ Hölder condition then the median of means has error$O( n^{(p+\lambda )+\epsilon })$ for any$\epsilon >0$ , if$k\to \infty$ as$m\to \infty$ . The proof techniques use methods from analytic combinatorics that have not previously been applied to quasiMonte Carlo methods, most notably an asymptotic expression from Hardy and Ramanujan on the number of partitions of a natural number.