

Title: An efficient algorithm for generalized linear bandit: Online stochastic gradient descent and Thompson sampling
We consider the contextual bandit problem, where a player sequentially makes decisions based on past observations to maximize the cumulative reward. Although many algorithms have been proposed for contextual bandit, most of them rely on finding the maximum likelihood estimator at each iteration, which requires $O(t)$ time at the $t$-th iteration and is memory inefficient. A natural way to resolve this problem is to apply online stochastic gradient descent (SGD) so that the per-step time and memory complexity can be reduced to constant with respect to $t$, but a contextual bandit policy based on online SGD updates that balances exploration and exploitation has remained elusive. In this work, we show that online SGD can be applied to the generalized linear bandit problem. The proposed SGD-TS algorithm, which uses a single-step SGD update to exploit past information and uses Thompson Sampling for exploration, achieves $\tilde{O}(\sqrt{T})$ regret with a total time complexity that scales linearly in $T$ and $d$, where $T$ is the total number of rounds and $d$ is the number of features. Experimental results show that SGD-TS consistently outperforms existing algorithms on both synthetic and real datasets.
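To make the exploit/explore loop in the abstract concrete, here is a minimal sketch of a single-step online SGD update paired with Thompson-Sampling-style exploration, assuming a logistic (GLM) reward model; the step size eta, the perturbation scale nu, and the isotropic Gaussian perturbation are illustrative stand-ins, not the tuned choices from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
d, T, K = 5, 1000, 10             # features, rounds, arms per round
eta, nu = 0.05, 0.1               # illustrative step size and exploration scale
theta_star = rng.normal(size=d) / np.sqrt(d)   # unknown true GLM parameter
theta = np.zeros(d)               # running online SGD estimate

for t in range(1, T + 1):
    X = rng.normal(size=(K, d))                      # this round's contexts
    theta_tilde = theta + nu * rng.normal(size=d)    # Thompson-style sample
    a = int(np.argmax(X @ theta_tilde))              # explore via the sample
    r = rng.binomial(1, sigmoid(X[a] @ theta_star))  # observed binary reward
    # single-step SGD on the logistic loss: O(d) time and memory per round
    theta -= eta * (sigmoid(X[a] @ theta) - r) * X[a]
```

The point of the sketch is the complexity claim: each round touches only the current context, so the per-step cost stays constant in $t$.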
Award ID(s):
1712996
PAR ID:
10402485
Author(s) / Creator(s):
; ;
Editor(s):
Banerjee, Arindam; Fukumizu, Kenji
Date Published:
Journal Name:
Proceedings of Machine Learning Research
Volume:
130
ISSN:
2640-3498
Page Range / eLocation ID:
1585-1593
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. We study model selection in linear bandits, where the learner must adapt to the dimension (denoted by $d_\star$) of the smallest hypothesis class containing the true linear model while balancing exploration and exploitation. Previous papers provide various guarantees for this model selection problem, but they have limitations: either the analysis requires favorable conditions that allow for inexpensive statistical testing to locate the right hypothesis class, or it is based on the idea of "corralling" multiple base algorithms, which often performs relatively poorly in practice. These works also mainly focus on upper bounds. In this paper, we establish the first lower bound for the model selection problem. Our lower bound implies that, even with a fixed action set, adaptation to the unknown dimension $d_\star$ comes at a cost: there is no algorithm that can achieve the regret bound $\tilde{O}(\sqrt{d_\star T})$ simultaneously for all values of $d_\star$. We propose Pareto optimal algorithms that match the lower bound. Empirical evaluations show that our algorithm enjoys superior performance compared to existing ones.
  2. Thanks to the power of representation learning, neural contextual bandit algorithms demonstrate remarkable performance improvement over their classical counterparts. But because their exploration has to be performed in the entire neural network parameter space to obtain nearly optimal regret, the resulting computational cost is prohibitively high. We perturb the rewards when updating the neural network to eliminate the need for explicit exploration and the corresponding computational overhead. We prove that an $\tilde{O}(\tilde{d}\sqrt{T})$ regret upper bound is still achievable under standard regularity conditions, where $T$ is the number of rounds of interactions and $\tilde{d}$ is the effective dimension of a neural tangent kernel matrix. Extensive comparisons with several benchmark contextual bandit algorithms, including two recent neural contextual bandit models, demonstrate the effectiveness and computational efficiency of our proposed neural bandit algorithm.
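A minimal sketch of the reward-perturbation idea described above: the network is trained on noisily perturbed rewards and actions are chosen greedily, so no explicit exploration bonus is ever computed. The tiny one-hidden-layer network, the perturbation scale sigma, and the simulated reward are placeholders, not the paper's architecture or training schedule.

```python
import numpy as np

rng = np.random.default_rng(1)
d, h, K, T = 8, 32, 5, 500
lr, sigma = 0.01, 0.5             # learning rate and perturbation scale (illustrative)
W1 = rng.normal(scale=1 / np.sqrt(d), size=(h, d))
w2 = rng.normal(scale=1 / np.sqrt(h), size=h)

def f(x):                         # tiny one-hidden-layer ReLU network
    return w2 @ np.maximum(W1 @ x, 0.0)

data = []                         # stored (context, perturbed reward) pairs
for t in range(T):
    X = rng.normal(size=(K, d))
    a = int(np.argmax([f(x) for x in X]))          # greedy action: no explicit exploration
    r = X[a, 0] + 0.1 * rng.normal()               # simulated reward (stand-in)
    data.append((X[a], r + sigma * rng.normal()))  # perturb the reward before training
    x, y = data[rng.integers(len(data))]           # one SGD step on a stored sample
    z = np.maximum(W1 @ x, 0.0)
    err = f(x) - y                                 # squared-loss residual
    g2, g1 = err * z, err * np.outer(w2 * (z > 0), x)
    w2 -= lr * g2
    W1 -= lr * g1
```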
  3. Numerical solutions of stochastic problems involving random processes $X(t)$, which constitute infinite families of random variables, require these processes to be represented by finite dimensional (FD) models $X_d(t)$, i.e., deterministic functions of time depending on a finite number $d$ of random variables. Most available FD models match the mean, correlation, and other global properties of $X(t)$. They provide useful information for a broad range of problems, but cannot be used to estimate extremes or other sample properties of $X(t)$. We develop FD models $X_d(t)$ for processes $X(t)$ with continuous samples and establish conditions under which these models converge weakly to $X(t)$ in the space of continuous functions as $d \to \infty$. These theoretical results are illustrated by numerical examples which show that, under the conditions established in this study, samples and extremes of $X(t)$ can be approximated by samples and extremes of $X_d(t)$, and that the discrepancy between samples and extremes of these processes decreases with $d$.
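As one concrete instance of the FD models described above, here is a sketch of a truncated Karhunen-Loève-type expansion, which represents a process by a deterministic function of time driven by $d$ random variables; the Brownian-bridge eigenpairs are just a convenient closed-form example, not the construction from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
tau, n, d = 1.0, 200, 10          # horizon, grid points, number of random variables
t = np.linspace(0.0, tau, n)
k = np.arange(1, d + 1)

# X_d(t) = sum_k sqrt(lam_k) * Z_k * phi_k(t): deterministic in t, random through Z.
# Eigenpairs below are those of the Brownian bridge on [0, tau] (a standard example).
lam = (tau / (k * np.pi)) ** 2
phi = np.sqrt(2.0 / tau) * np.sin(np.outer(t, k) * np.pi / tau)
Z = rng.normal(size=d)
X_d = phi @ (np.sqrt(lam) * Z)    # one sample path of the FD model X_d(t)
```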
  4. We study a nonparametric contextual bandit problem in which the expected reward functions belong to a Hölder class with smoothness parameter $\beta$. We show how this interpolates between two extremes that were previously studied in isolation: nondifferentiable bandits ($\beta$ at most 1), for which rate-optimal regret is achieved by running separate noncontextual bandits in different context regions, and parametric-response bandits (infinite $\beta$), for which rate-optimal regret can be achieved with minimal or no exploration because of infinite extrapolatability. We develop a novel algorithm that carefully adjusts to all smoothness settings, and we prove its regret is rate-optimal by establishing matching upper and lower bounds, recovering the existing results at the two extremes. In this sense, our work bridges the gap between the existing literature on parametric and nondifferentiable contextual bandit problems and between bandit algorithms that exclusively use global or local information, shedding light on the crucial interplay of complexity and regret in contextual bandits.
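To illustrate the $\beta$ at most 1 extreme mentioned above, here is a sketch of the "separate noncontextual bandits per context region" strategy: partition a one-dimensional context space into bins and run an independent UCB learner in each bin. The bin count M, the two-arm setup, and the simulated smooth reward are illustrative choices, not the paper's adaptive algorithm.

```python
import numpy as np

rng = np.random.default_rng(3)
T, M = 5000, 10                   # rounds; number of context bins (illustrative)
counts = np.zeros((M, 2))         # pulls per (bin, arm)
sums = np.zeros((M, 2))           # reward totals per (bin, arm)

for t in range(1, T + 1):
    x = rng.random()                              # context in [0, 1]
    m = min(int(x * M), M - 1)                    # which bin the context falls in
    means = sums[m] / np.maximum(counts[m], 1)
    bonus = np.sqrt(2 * np.log(t) / np.maximum(counts[m], 1))
    a = int(np.argmax(means + bonus))             # UCB using this bin's data only
    p = 0.5 + 0.4 * (x if a == 1 else 1.0 - x)    # smooth simulated arm means
    r = rng.binomial(1, p)
    counts[m, a] += 1
    sums[m, a] += r
```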
  5. Conditions under which samples of continuous stochastic processes $X(t)$ on bounded time intervals $[0, \tau]$ can be represented by samples of finite dimensional (FD) processes $X_d(t)$ are augmented such that samples of Slepian models $S_{d,a}(t)$ of $X_d(t)$ can be used as surrogates for samples of Slepian models $S_a(t)$ of $X(t)$. FD processes are deterministic functions of time and of $d < \infty$ random variables. The discrepancy between target and FD samples is quantified by the metric of the space $C[0, \tau]$ of continuous functions. The numerical illustrations, which include Gaussian/non-Gaussian FD processes and solutions of linear/nonlinear random vibration problems, are consistent with the theoretical findings in the paper.
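A small sketch of how the $C[0, \tau]$ metric quantifies the discrepancy between target and FD samples: build a high-order expansion as a stand-in for a sample of $X(t)$, truncate it to obtain $X_d(t)$ from the same random variables, and report the sup-norm of the difference over a time grid. The expansion and the truncation levels are illustrative assumptions, not the paper's Slepian-model construction.

```python
import numpy as np

rng = np.random.default_rng(4)
tau, n, d_big = 1.0, 400, 200
t = np.linspace(0.0, tau, n)
k = np.arange(1, d_big + 1)
phi = np.sqrt(2.0 / tau) * np.sin(np.outer(t, k) * np.pi / tau)
lam = (tau / (k * np.pi)) ** 2
coef = np.sqrt(lam) * rng.normal(size=d_big)
X = phi @ coef                     # high-order stand-in for the target sample X(t)

for d in (5, 20, 80):
    X_d = phi[:, :d] @ coef[:d]    # FD sample X_d(t) built from the first d variables
    print(d, np.max(np.abs(X - X_d)))  # sup-norm distance, the metric of C[0, tau]
```

As expected from the weak-convergence results cited above, the printed sup-norm discrepancy shrinks as $d$ grows.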