We study model-free reinforcement learning (RL) algorithms for infinite-horizon average-reward Markov decision processes (MDPs), a setting better suited to applications that involve continuing operations not divided into episodes. In contrast to episodic/discounted MDPs, the theoretical understanding of model-free RL algorithms in the average-reward setting remains relatively limited. In this paper, we consider both the online setting and the setting with access to a simulator, and we develop computationally efficient model-free algorithms that achieve sharper guarantees on regret/sample complexity than existing results.

In the online setting, we design an algorithm, UCB-AVG, based on an optimistic variant of variance-reduced Q-learning. We show that UCB-AVG achieves a regret bound $$\widetilde{O}(S^5A^2\,sp(h^*)\sqrt{T})$$ after $$T$$ steps, where $$S\times A$$ is the size of the state-action space and $sp(h^*)$ is the span of the optimal bias function. Our result provides the first computationally efficient model-free algorithm that achieves the optimal dependence on $$T$$ (up to log factors) for weakly communicating MDPs, an assumption that is necessary for low regret. In contrast, prior results are either suboptimal in $$T$$ or require strong assumptions such as ergodicity or uniform mixing. In the simulator setting, we adapt the ideas behind UCB-AVG to develop a model-free algorithm that finds an $$\epsilon$$-optimal policy with sample complexity $$\widetilde{O}(SA\,sp^2(h^*)\epsilon^{-2} + S^2A\,sp(h^*)\epsilon^{-1})$$. This sample complexity is near-optimal for weakly communicating MDPs, in view of the minimax lower bound $$\Omega(SA\,sp(h^*)\epsilon^{-2})$$. Existing work mainly focuses on ergodic MDPs, and the results typically depend on $$t_{mix}$$, the worst-case mixing time induced by a policy. We remark that the diameter $$D$$ and the mixing time $$t_{mix}$$ are both lower bounded by $sp(h^*)$, and $$t_{mix}$$ can be arbitrarily large for certain MDPs.

On the technical side, our approach integrates two key ideas: learning a $$\gamma$$-discounted MDP as an approximation, and leveraging a reference-advantage decomposition for variance reduction in optimistic Q-learning. As recognized in prior work, a naive approximation by discounted MDPs results in suboptimal guarantees. A distinguishing feature of our method is maintaining estimates of the value differences between state pairs, which yields a sharper bound on the variance of the reference advantage. We also crucially rely on a careful choice of the discount factor $$\gamma$$ to balance the approximation error due to discounting against the statistical learning error, and we are able to maintain a good-quality reference value function with $O(SA)$ space complexity.
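To make the discounted-approximation idea concrete, here is a minimal sketch of optimistic Q-learning on a $\gamma$-discounted surrogate, with $\gamma$ tied to the horizon. It is not the paper's UCB-AVG algorithm (which additionally maintains reference values and value-difference estimates for variance reduction); the toy MDP, step size, and bonus constants are illustrative assumptions.

```python
import numpy as np

# Minimal sketch (not the paper's UCB-AVG): optimistic Q-learning on a
# gamma-discounted surrogate of an average-reward MDP, using a simulator.
# The toy MDP, step size, and bonus below are illustrative assumptions.

rng = np.random.default_rng(0)
S, A, T = 3, 2, 20000
span_h = 1.0                       # assumed bound on sp(h*)
gamma = 1.0 - np.sqrt(span_h / T)  # discount chosen to balance approximation vs. estimation error

# Random toy MDP: transition kernel P[s, a] and rewards R[s, a].
P = rng.dirichlet(np.ones(S), size=(S, A))
R = rng.uniform(size=(S, A))

Q = np.full((S, A), 1.0 / (1.0 - gamma))  # optimistic initialization
V = Q.max(axis=1)
N = np.zeros((S, A))

s = 0
for t in range(T):
    a = int(np.argmax(Q[s]))
    s_next = int(rng.choice(S, p=P[s, a]))
    r = R[s, a]
    N[s, a] += 1
    alpha = (span_h + 1.0) / (span_h + N[s, a])        # step size in the style of optimistic Q-learning analyses
    bonus = span_h * np.sqrt(np.log(T + 1) / N[s, a])  # crude UCB-style exploration bonus
    Q[s, a] = min(Q[s, a],
                  (1 - alpha) * Q[s, a] + alpha * (r + gamma * V[s_next] + bonus))
    V[s] = Q[s].max()
    s = s_next

# Average-reward estimate recovered from the discounted surrogate: (1 - gamma) * V.
print("estimated gain per state:", np.round((1 - gamma) * V, 3))
```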
No-Regret Linear Bandits beyond Realizability
We study linear bandits when the underlying reward function is not linear. Existing work relies on a uniform misspecification parameter $$\epsilon$$ that measures the sup-norm error of the best linear approximation, which results in an unavoidable linear regret whenever $$\epsilon > 0$$. We describe a more natural model of misspecification that only requires the approximation error at each input $$x$$ to be proportional to the suboptimality gap at $$x$$. It captures the intuition that, for optimization problems, near-optimal regions should matter more, so larger approximation errors can be tolerated in suboptimal regions. Quite surprisingly, we show that the classical LinUCB algorithm, designed for the realizable case, is automatically robust against such gap-adjusted misspecification: it achieves a near-optimal $$\sqrt{T}$$ regret on problems for which the best previously known regret is almost linear in the time horizon $$T$$. Technically, our proof relies on a novel self-bounding argument that bounds the part of the regret due to misspecification by the regret itself.
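For reference, here is a minimal sketch of classical LinUCB on a finite arm set, the algorithm shown to be robust in the gap-adjusted model. The misspecified reward, feature set, noise level, and confidence width beta below are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

# Minimal LinUCB sketch on a finite arm set; the misspecified reward function,
# features, noise, and confidence width beta are illustrative assumptions.

rng = np.random.default_rng(1)
d, K, T, lam, beta = 3, 20, 2000, 1.0, 1.0
X = rng.normal(size=(K, d)) / np.sqrt(d)          # arm features
theta_star = rng.normal(size=d) / np.sqrt(d)

def reward(x):
    # Nonlinear (misspecified) reward: a linear part plus a small distortion.
    return float(x @ theta_star) + 0.05 * float(x @ theta_star) ** 2

A = lam * np.eye(d)                               # regularized design matrix
b = np.zeros(d)
for t in range(T):
    A_inv = np.linalg.inv(A)
    theta_hat = A_inv @ b                         # ridge estimate of the linear model
    ucb = X @ theta_hat + beta * np.sqrt(np.einsum("kd,dc,kc->k", X, A_inv, X))
    x = X[int(np.argmax(ucb))]                    # optimistic arm selection
    r = reward(x) + 0.1 * rng.normal()
    A += np.outer(x, x)
    b += r * x

best = max(reward(xi) for xi in X)
print("suboptimality of last pulled arm:", round(best - reward(x), 4))
```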
- PAR ID: 10466938
- Editor(s): Evans, Robin J.; Shpitser, Ilya
- Publisher / Repository: UAI 2023
- Journal Name: Proceedings of Machine Learning Research
- Volume: 216
- ISSN: 2640-3498
- Page Range / eLocation ID: 1294--1303
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
- We explore why many recently proposed robust estimation problems are efficiently solvable, even though the underlying optimization problems are non-convex. We study the loss landscape of these robust estimation problems and identify the existence of 'generalized quasi-gradients'. Whenever these quasi-gradients exist, a large family of no-regret algorithms is guaranteed to approximate the global minimum; this includes the commonly used filtering algorithm. For robust mean estimation of distributions under bounded covariance, we show that any first-order stationary point of the associated optimization problem is an approximate global minimum if and only if the corruption level $$\epsilon < 1/3$$. Consequently, any optimization algorithm that approaches a stationary point yields an efficient robust estimator with breakdown point $1/3$. With carefully designed initialization and step size, we improve this to $1/2$, which is optimal. For other tasks, including linear regression and joint mean and covariance estimation, the loss landscape is more rugged: there are stationary points arbitrarily far from the global minimum. Nevertheless, we show that generalized quasi-gradients exist and construct efficient algorithms. These algorithms are simpler than previous ones in the literature, and for linear regression we improve the estimation error from $$O(\sqrt{\epsilon})$$ to the optimal rate of $$O(\epsilon)$$ for small $$\epsilon$$, assuming certified hypercontractivity. For mean estimation with near-identity covariance, we show that a simple gradient descent algorithm achieves breakdown point $1/3$ and iteration complexity $$\tilde{O}(d/\epsilon^2)$$. (A toy filtering sketch appears after this list.)
- Motivated by personalized healthcare and other applications involving sensitive data, we study online exploration in reinforcement learning with differential privacy (DP) constraints. Existing work on this problem established that no-regret learning is possible under joint differential privacy (JDP) and local differential privacy (LDP) but did not provide an algorithm with optimal regret. We close this gap for the JDP case by designing an $$\epsilon$$-JDP algorithm with a regret of $$\widetilde{O}(\sqrt{SAH^2T}+S^2AH^3/\epsilon)$$, which matches the information-theoretic lower bound of non-private learning for all choices of $$\epsilon> S^{1.5}A^{0.5} H^2/\sqrt{T}$$. In the above, $$S$$ and $$A$$ denote the number of states and actions, $$H$$ denotes the planning horizon, and $$T$$ is the number of steps. To the best of our knowledge, this is the first private RL algorithm that achieves privacy for free asymptotically as $$T\rightarrow \infty$$. Our techniques, which could be of independent interest, include privately releasing Bernstein-type exploration bonuses and an improved method for releasing visitation statistics. The same techniques also imply a slightly improved regret bound for the LDP case. (A toy sketch of noisy count release appears after this list.)
- The Prophet Inequality and Pandora's Box problems are fundamental stochastic problems with applications in Mechanism Design, Online Algorithms, Stochastic Optimization, Optimal Stopping, and Operations Research. A usual assumption in these works is that the probability distributions of the $$n$$ underlying random variables are given as input to the algorithm. Since in practice these distributions need to be learned under limited feedback, we initiate the study of such stochastic problems in the Multi-Armed Bandits model. In the Multi-Armed Bandits model, we interact with $$n$$ unknown distributions over $$T$$ rounds: in round $$t$$ we play a policy $$x(t)$$ and only receive the value of $$x(t)$$ as feedback. The goal is to minimize the regret, which is the difference over $$T$$ rounds between the total value of the optimal algorithm that knows the distributions and the total value of our algorithm that learns the distributions from the limited feedback. Our main results give near-optimal $$\widetilde{O}(\mathrm{poly}(n)\sqrt{T})$$ total regret algorithms for both Prophet Inequality and Pandora's Box. Our proofs proceed by maintaining confidence intervals on the unknown indices of the optimal policy. The exploration-exploitation tradeoff prevents us from directly refining these confidence intervals, so the main technique is to design a regret upper bound function that is learnable while playing low-regret Bandit policies. (A toy bandit-over-thresholds sketch appears after this list.)
- We develop a framework for designing simple and efficient policies for a family of online allocation and pricing problems that includes online packing, budget-constrained probing, dynamic pricing, and online contextual bandits with knapsacks. In each case, we evaluate the performance of our policies in terms of their regret (i.e., additive gap) relative to an offline controller that is endowed with more information than the online controller. Our framework is based on Bellman inequalities, which decompose the loss of an algorithm into two distinct sources of error: (1) errors arising from computational tractability issues, and (2) errors arising from estimation/prediction of random trajectories. Balancing these errors guides the choice of benchmarks and leads to policies that are both tractable and have strong performance guarantees. In particular, in all our examples, we demonstrate constant-regret policies that only require resolving a linear program in each period, followed by a simple greedy action-selection rule; thus, our policies are practical as well as provably near optimal. (A toy resolving sketch appears after this list.)
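For the robust estimation item above, here is a minimal filtering-style sketch: points are repeatedly downweighted along the top direction of the weighted covariance until its spectral norm is small. The thresholds, stopping rule, and corruption model are illustrative assumptions, not the exact procedure analyzed in that paper.

```python
import numpy as np

# Filtering-style robust mean estimation sketch: downweight points whose
# projection onto the top covariance direction is largest. Thresholds and
# the corruption model are assumptions for illustration only.

rng = np.random.default_rng(2)
d, n, eps = 10, 2000, 0.1
X = rng.normal(size=(n, d))
X[: int(eps * n)] += 5.0                        # corrupted fraction shifted away from the true mean (zero)

w = np.ones(n) / n
for _ in range(50):
    mu = w @ X
    cov = (w[:, None] * (X - mu)).T @ (X - mu)  # weighted covariance
    eigval, eigvec = np.linalg.eigh(cov)
    if eigval[-1] < 1.2:                        # spectral norm close to identity: stop
        break
    scores = ((X - mu) @ eigvec[:, -1]) ** 2    # outlier scores along the top direction
    w = w * (1.0 - scores / scores.max())       # soft downweighting (a quasi-gradient-style step)
    w = w / w.sum()

print("error of filtered mean:", round(float(np.linalg.norm(w @ X)), 3))
print("error of naive mean:   ", round(float(np.linalg.norm(X.mean(axis=0))), 3))
```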
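For the differentially private RL item above, a minimal sketch of the general flavor of privately releasing visitation statistics and deriving exploration bonuses from them. The Laplace noise calibration and the Hoeffding-style bonus are simplifying assumptions, not that paper's Bernstein-type release mechanism.

```python
import numpy as np

# Illustrative sketch only: privatize visitation counts with Laplace noise and
# build exploration bonuses from the noisy counts. Noise scale and bonus form
# are assumptions, not the paper's exact release mechanism.

rng = np.random.default_rng(3)
S, A, H, eps_dp = 5, 3, 10, 1.0

true_counts = rng.integers(1, 100, size=(S, A))
noisy_counts = true_counts + rng.laplace(scale=H / eps_dp, size=(S, A))
noisy_counts = np.maximum(noisy_counts, 1.0)     # keep released counts usable as denominators

# Hoeffding-style bonus from the privately released counts (a Bernstein-type
# bonus would additionally use privately released empirical variances).
bonus = H * np.sqrt(np.log(S * A * H) / noisy_counts)
print(np.round(bonus, 2))
```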
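For the Prophet Inequality / Pandora's Box item above, a toy reduction that runs UCB1 over a small grid of single-threshold stopping policies. The actual algorithm in that paper maintains confidence intervals on the indices of the optimal policy; the value distributions and threshold grid here are assumptions.

```python
import numpy as np

# Toy sketch: treat a grid of single-threshold prophet-inequality policies as
# bandit arms and run UCB1 over them; distributions and grid are assumptions.

rng = np.random.default_rng(4)
n, T = 5, 5000
means = rng.uniform(0.2, 1.0, size=n)            # unknown exponential-value distributions

def play_threshold(tau):
    # Accept the first value above tau; return the realized payoff of the policy.
    for m in means:
        v = rng.exponential(m)
        if v >= tau:
            return v
    return 0.0

thresholds = np.linspace(0.0, 2.0, 9)            # candidate policies ("arms")
counts = np.zeros(len(thresholds))
sums = np.zeros(len(thresholds))

for t in range(1, T + 1):
    ucb = np.where(counts > 0,
                   sums / np.maximum(counts, 1) + np.sqrt(2 * np.log(t) / np.maximum(counts, 1)),
                   np.inf)                       # pull each arm once before using UCB scores
    k = int(np.argmax(ucb))
    sums[k] += play_threshold(thresholds[k])
    counts[k] += 1

print("empirical value per threshold:", np.round(sums / np.maximum(counts, 1), 2))
```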
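For the online allocation item above, a minimal sketch of the resolve-then-greedy structure on a toy online packing instance: each period the fractional relaxation on the expected remaining demand is re-solved (a fractional knapsack, solved greedily by value density), and the arriving item is accepted only if its type is selected. The instance and acceptance rule are illustrative assumptions.

```python
import numpy as np

# Toy resolving sketch for online packing; instance and rule are assumptions.

rng = np.random.default_rng(5)
T, budget = 200, 60.0
values = np.array([4.0, 3.0, 1.0])
sizes = np.array([2.0, 1.5, 1.0])
probs = np.array([0.3, 0.3, 0.4])               # arrival distribution over item types

def resolve(remaining_budget, remaining_periods):
    # Fractional knapsack on expected remaining arrivals: accept fraction x[j] of type j.
    demand = probs * remaining_periods * sizes
    x = np.zeros(len(values))
    cap = remaining_budget
    for j in np.argsort(-values / sizes):        # fill by decreasing value density
        take = min(1.0, max(cap, 0.0) / demand[j]) if demand[j] > 0 else 0.0
        x[j] = take
        cap -= take * demand[j]
    return x

total = 0.0
for t in range(T):
    j = int(rng.choice(len(values), p=probs))
    x = resolve(budget, T - t)
    if budget >= sizes[j] and x[j] >= 1.0 - 1e-9:  # greedy action from the resolved relaxation
        budget -= sizes[j]
        total += values[j]

print("collected value:", total, "remaining budget:", round(budget, 2))
```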