We study model selection in linear bandits, where the learner must adapt to the dimension (denoted by $d_\star$) of the smallest hypothesis class containing the true linear model while balancing exploration and exploitation. Previous papers provide various guarantees for this model selection problem, but each has limitations: either the analysis requires favorable conditions that allow inexpensive statistical testing to locate the right hypothesis class, or it is based on the idea of "corralling" multiple base algorithms, which often performs relatively poorly in practice. These works also focus mainly on upper bounds. In this paper, we establish the first lower bound for the model selection problem. Our lower bound implies that, even with a fixed action set, adaptation to the unknown dimension $d_\star$ comes at a cost: there is no algorithm that can achieve the regret bound $\tilde{O}(\sqrt{d_\star T})$ simultaneously for all values of $d_\star$. We propose Pareto optimal algorithms that match the lower bound. Empirical evaluations show that our algorithm enjoys superior performance compared to existing ones.
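One way to write the impossibility statement above more explicitly (this is our formalization of the abstract's sentence, not the paper's exact theorem statement) is:

```latex
% A formalization of the abstract's claim (a paraphrase, not the paper's exact statement):
% no single algorithm enjoys \tilde{O}(\sqrt{d_\star T}) regret simultaneously for every d_\star.
\[
\nexists\, \text{algorithm } \mathcal{A}:\quad
\forall\, d_\star,\qquad
\mathbb{E}\big[R_T(\mathcal{A};\, d_\star)\big] \;=\; \tilde{O}\!\big(\sqrt{d_\star T}\big),
\]
% where R_T(A; d_star) denotes the cumulative regret of A over T rounds on instances whose
% true linear model lies in the d_star-dimensional hypothesis class.
```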
An efficient algorithm for generalized linear bandits: Online stochastic gradient descent and Thompson sampling
We consider the contextual bandit problem, where a player sequentially makes decisions based on past observations to maximize the cumulative reward. Although many algorithms have been proposed for contextual bandits, most of them rely on finding the maximum likelihood estimator at each iteration, which requires $O(t)$ time at the $t$-th iteration and is memory inefficient. A natural way to resolve this problem is to apply online stochastic gradient descent (SGD) so that the per-step time and memory complexity can be reduced to constant with respect to $t$, but a contextual bandit policy based on online SGD updates that balances exploration and exploitation has remained elusive. In this work, we show that online SGD can be applied to the generalized linear bandit problem. The proposed SGD-TS algorithm, which uses a single-step SGD update to exploit past information and uses Thompson Sampling for exploration, achieves $\tilde{O}(\sqrt{T})$ regret with a total time complexity that scales linearly in $T$ and $d$, where $T$ is the total number of rounds and $d$ is the number of features. Experimental results show that SGD-TS consistently outperforms existing algorithms on both synthetic and real datasets.
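To make the constant per-round cost concrete, here is a minimal sketch of the single-step-SGD-plus-Thompson-sampling idea described above, for a logistic bandit; the Gaussian perturbation used for exploration, the step size, and the noise scale are illustrative placeholders, not the exact choices of the SGD-TS paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_ts(contexts_fn, reward_fn, T, d, step=0.05, noise=0.1):
    """Minimal sketch of a single-step-SGD + Thompson-sampling bandit.

    contexts_fn(t) -> (K, d) array of arm features for round t
    reward_fn(x)   -> Bernoulli reward for the chosen feature vector x
    The exploration rule (Gaussian perturbation of the SGD iterate) is an
    illustrative stand-in for the sampling scheme used in the paper."""
    theta = np.zeros(d)
    total_reward = 0.0
    for t in range(T):
        X = contexts_fn(t)                                    # candidate arm features
        theta_tilde = theta + noise * rng.standard_normal(d)  # Thompson-style perturbation
        a = int(np.argmax(X @ theta_tilde))                   # act on the sampled parameter
        x = X[a]
        r = reward_fn(x)
        grad = (sigmoid(x @ theta) - r) * x                   # gradient of the logistic loss
        theta -= step * grad                                  # single SGD step: O(d) per round
        total_reward += r
    return theta, total_reward

# toy run: 5 arms, 10 features, rewards from a fixed logistic model
d, K = 10, 5
theta_star = rng.standard_normal(d) / np.sqrt(d)
contexts = lambda t: rng.standard_normal((K, d))
reward = lambda x: float(rng.random() < sigmoid(x @ theta_star))
theta_hat, total = sgd_ts(contexts, reward, T=2000, d=d)
print("total reward:", total)
```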
- Award ID(s): 1712996
- PAR ID: 10402485
- Editor(s): Banerjee, Arindam; Fukumizu, Kenji
- Journal Name: Proceedings of Machine Learning Research
- Volume: 130
- ISSN: 2640-3498
- Page Range / eLocation ID: 1585-1593
- Sponsoring Org: National Science Foundation
More Like this
Thanks to the power of representation learning, neural contextual bandit algorithms demonstrate remarkable performance improvements over their classical counterparts. But because their exploration has to be performed in the entire neural network parameter space to obtain nearly optimal regret, the resulting computational cost is prohibitively high. We perturb the rewards when updating the neural network to eliminate the need for explicit exploration and the corresponding computational overhead. We prove that a $\tilde{O}(\tilde{d}\sqrt{T})$ regret upper bound is still achievable under standard regularity conditions, where $T$ is the number of rounds of interactions and $\tilde{d}$ is the effective dimension of a neural tangent kernel matrix. Extensive comparisons with several benchmark contextual bandit algorithms, including two recent neural contextual bandit models, demonstrate the effectiveness and computational efficiency of our proposed neural bandit algorithm.
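A minimal sketch of the perturbed-reward idea described above, using a tiny PyTorch network as a stand-in for the paper's model; the noise scale, network width, and training schedule are illustrative assumptions rather than the paper's settings.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class RewardNet(nn.Module):
    """Tiny MLP standing in for the paper's reward network."""
    def __init__(self, d, width=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d, width), nn.ReLU(), nn.Linear(width, 1))

    def forward(self, x):
        return self.net(x).squeeze(-1)

def perturbed_reward_bandit(contexts_fn, reward_fn, T, d, sigma=0.5, lr=1e-2, inner_steps=5):
    """Act greedily w.r.t. the network and add fresh Gaussian noise to the observed
    rewards each time the network is updated, so exploration comes from perturbed
    targets rather than from sampling in parameter space. Hyperparameters are illustrative."""
    model = RewardNet(d)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    xs, ys = [], []
    for t in range(T):
        X = contexts_fn(t)                          # (K, d) candidate arm features
        with torch.no_grad():
            a = int(model(X).argmax())              # greedy action: no explicit exploration
        r = reward_fn(X[a])
        xs.append(X[a]); ys.append(torch.tensor(float(r)))
        batch_x, batch_y = torch.stack(xs), torch.stack(ys)
        for _ in range(inner_steps):                # retrain on rewards perturbed with fresh noise
            targets = batch_y + sigma * torch.randn_like(batch_y)
            opt.zero_grad()
            loss = ((model(batch_x) - targets) ** 2).mean()
            loss.backward()
            opt.step()
    return model

# toy run on a synthetic linear reward (illustrative only)
d, K = 8, 4
theta = torch.randn(d) / d ** 0.5
model = perturbed_reward_bandit(lambda t: torch.randn(K, d),
                                lambda x: (x @ theta + 0.1 * torch.randn(())).item(),
                                T=300, d=d)
```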
-
Thompson sampling has become a ubiquitous approach to online decision problems with bandit feedback. The key algorithmic task for Thompson sampling is drawing a sample from the posterior of the optimal action. We propose an alternative arm selection rule we dub TS-UCB, which requires negligible additional computational effort but provides significant performance improvements relative to Thompson sampling. At each step, TS-UCB computes a score for each arm using two ingredients: posterior sample(s) and upper confidence bounds. TS-UCB can be used in any setting where these two quantities are available, and it is flexible in the number of posterior samples it takes as input. TS-UCB achieves materially lower regret on a comprehensive suite of synthetic and real-world datasets, including a personalized article recommendation dataset from Yahoo! and a suite of benchmark datasets from a deep bandit suite proposed in Riquelme et al. (2018). Finally, from a theoretical perspective, we establish optimal regret guarantees for TS-UCB for both the K-armed and linear bandit models.
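The abstract names the two ingredients of the score but not the formula that combines them; the sketch below only shows how both ingredients can be computed per arm for a Beta-Bernoulli bandit, with an explicitly placeholder combination. The actual TS-UCB scoring rule should be taken from the paper itself.

```python
import numpy as np

rng = np.random.default_rng(1)

def ts_ucb_step(successes, failures, t, combine):
    """One round of a TS-UCB-style rule for a Bernoulli bandit.

    Computes the two ingredients named in the abstract, a posterior sample and an
    upper confidence bound, for every arm, then combines them with `combine`.
    The exact scoring rule of TS-UCB is given in the paper; `combine` is a placeholder."""
    n = successes + failures
    mean = successes / np.maximum(n, 1)
    sample = rng.beta(successes + 1, failures + 1)                   # posterior sample (Beta-Bernoulli)
    ucb = mean + np.sqrt(2 * np.log(max(t, 2)) / np.maximum(n, 1))   # standard UCB1 bonus
    return int(np.argmax(combine(sample, ucb)))

# toy run with an illustrative combination (NOT the paper's formula)
p = np.array([0.3, 0.5, 0.7])
S, F = np.zeros(3), np.zeros(3)
for t in range(1, 2001):
    a = ts_ucb_step(S, F, t, combine=lambda s, u: 0.5 * (s + u))
    r = rng.random() < p[a]
    S[a] += r; F[a] += 1 - r
print("pulls per arm:", (S + F).astype(int))
```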
-
Numerical solutions of stochastic problems involving random processes $X(t)$, which constitute infinite families of random variables, require representing these processes by finite-dimensional (FD) models $X_d(t)$, i.e., deterministic functions of time depending on finite numbers $d$ of random variables. Most available FD models match the mean, correlation, and other global properties of $X(t)$. They provide useful information for a broad range of problems, but cannot be used to estimate extremes or other sample properties of $X(t)$. We develop FD models $X_d(t)$ for processes $X(t)$ with continuous samples and establish conditions under which these models converge weakly to $X(t)$ in the space of continuous functions as $d \to \infty$. These theoretical results are illustrated by numerical examples which show that, under the conditions established in this study, samples and extremes of $X(t)$ can be approximated by samples and extremes of $X_d(t)$, and that the discrepancy between samples and extremes of these processes decreases with $d$.
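As a concrete illustration of an FD model $X_d(t)$ built from $d$ random variables, the sketch below uses a truncated Karhunen–Loève-type expansion of a squared-exponential covariance on a grid; this particular construction and its parameters are our illustrative assumptions, not the models developed in the paper.

```python
import numpy as np

rng = np.random.default_rng(2)

def fd_model_samples(d, n_samples, n_grid=512, ell=0.1):
    """Finite-dimensional model X_d(t) = sum_{k<=d} sqrt(lam_k) Z_k phi_k(t) on a grid,
    built from the top-d eigenpairs of a squared-exponential covariance on [0, 1].
    The covariance kernel and truncation level are illustrative assumptions."""
    t = np.linspace(0.0, 1.0, n_grid)
    C = np.exp(-((t[:, None] - t[None, :]) ** 2) / (2 * ell ** 2))  # target covariance matrix
    lam, phi = np.linalg.eigh(C)                                    # eigenpairs in ascending order
    lam = np.clip(lam[::-1][:d], 0.0, None)                         # top-d eigenvalues
    phi = phi[:, ::-1][:, :d]                                       # matching eigenvectors
    Z = rng.standard_normal((n_samples, d))                         # the d random variables
    return t, Z @ (np.sqrt(lam)[:, None] * phi.T)                   # (n_samples, n_grid) sample paths

# sample extremes of X_d(t) stabilize as d grows, illustrating how extremes of X(t)
# can be approximated by extremes of the FD model
for d in (2, 10, 50):
    _, paths = fd_model_samples(d, n_samples=2000)
    print(d, "mean of max_t X_d(t):", round(paths.max(axis=1).mean(), 3))
```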
-
We study a nonparametric contextual bandit problem in which the expected reward functions belong to a Hölder class with smoothness parameter β. We show how this interpolates between two extremes that were previously studied in isolation: nondifferentiable bandits (β at most 1), for which rate-optimal regret is achieved by running separate noncontextual bandits in different context regions, and parametric-response bandits (infinite β), for which rate-optimal regret can be achieved with minimal or no exploration because of infinite extrapolatability. We develop a novel algorithm that carefully adjusts to all smoothness settings, and we prove its regret is rate-optimal by establishing matching upper and lower bounds, recovering the existing results at the two extremes. In this sense, our work bridges the gap between the existing literature on parametric and nondifferentiable contextual bandit problems and between bandit algorithms that exclusively use global or local information, shedding light on the crucial interplay of complexity and regret in contextual bandits.
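For the nondifferentiable extreme (β at most 1) mentioned above, the strategy of running separate noncontextual bandits in different context regions can be sketched as follows; the one-dimensional context space, the bin count, and the UCB1 constants are illustrative choices, not the tuning from the paper.

```python
import numpy as np

rng = np.random.default_rng(3)

def binned_ucb(context_fn, reward_fn, T, n_arms, n_bins):
    """Run an independent UCB1 instance in each context bin, the strategy the
    abstract describes for the nondifferentiable (beta <= 1) regime.
    Contexts are assumed to lie in [0, 1); the bin count is an illustrative choice."""
    counts = np.zeros((n_bins, n_arms))
    sums = np.zeros((n_bins, n_arms))
    total = 0.0
    for t in range(1, T + 1):
        x = context_fn(t)
        b = min(int(x * n_bins), n_bins - 1)       # locate the context's bin
        n = counts[b]
        if (n == 0).any():                         # pull each arm once per bin first
            a = int(np.argmin(n))
        else:
            ucb = sums[b] / n + np.sqrt(2 * np.log(t) / n)
            a = int(np.argmax(ucb))
        r = reward_fn(x, a)
        counts[b, a] += 1; sums[b, a] += r
        total += r
    return total

# toy problem: arm 1 is better for contexts above 0.5, arm 0 otherwise
reward = lambda x, a: float(rng.random() < (0.7 if (x > 0.5) == bool(a) else 0.3))
print(binned_ucb(lambda t: rng.random(), reward, T=5000, n_arms=2, n_bins=8))
```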