We study the problem of online multitask learning where the tasks are performed within similar but not necessarily identical multi-armed bandit environments. In particular, we study how a learner can improve its overall performance across multiple related tasks through robust transfer of knowledge. While an upper confidence bound (UCB)-based algorithm has recently been shown to achieve near-optimal performance guarantees in a setting where all tasks are solved concurrently, it remains unclear whether Thompson sampling (TS) algorithms, which have superior empirical performance in general, share similar theoretical properties. In this work, we present a TS-type algorithm for a more general online multitask learning protocol, which extends the concurrent setting. We provide its frequentist analysis and prove that it is also near-optimal using a novel concentration inequality for multitask data aggregation at random stopping times. Finally, we evaluate the algorithm on synthetic data and show that the TS-type algorithm enjoys superior empirical performance in comparison with the UCB-based algorithm and a baseline algorithm that performs TS for each individual task without transfer.
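As background for the per-task baseline mentioned above, the following is a minimal sketch of Gaussian Thompson sampling on a single K-armed task. The prior, posterior approximation, and constants are illustrative assumptions, not the paper's multitask algorithm.

```python
import numpy as np

def thompson_sampling(means, horizon, noise_sd=1.0, seed=0):
    """Gaussian Thompson sampling for one K-armed bandit task.

    Uses the empirical mean of each arm with posterior variance
    shrinking as 1/(n+1) -- an illustrative approximation, not the
    multitask TS algorithm analyzed in the abstract above."""
    rng = np.random.default_rng(seed)
    k = len(means)
    counts = np.zeros(k)
    sums = np.zeros(k)
    total = 0.0
    for _ in range(horizon):
        # Sample a plausible mean for each arm, then play the argmax.
        post_mean = sums / np.maximum(counts, 1.0)
        post_sd = 1.0 / np.sqrt(counts + 1.0)
        arm = int(np.argmax(rng.normal(post_mean, post_sd)))
        r = rng.normal(means[arm], noise_sd)
        counts[arm] += 1
        sums[arm] += r
        total += r
    return counts, total
```

With a clear gap between arm means, the better arm accumulates most of the pulls well within a few hundred rounds.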
An efficient algorithm for generalized linear bandit: Online stochastic gradient descent and Thompson sampling
We consider the contextual bandit problem, where a player sequentially makes decisions based on past observations to maximize the cumulative reward. Although many algorithms have been proposed for the contextual bandit, most of them rely on finding the maximum likelihood estimator at each iteration, which requires O(t) time at the t-th iteration and is memory-inefficient. A natural way to resolve this problem is to apply online stochastic gradient descent (SGD) so that the per-step time and memory complexity can be reduced to constant with respect to t, but a contextual bandit policy based on online SGD updates that balances exploration and exploitation has remained elusive. In this work, we show that online SGD can be applied to the generalized linear bandit problem. The proposed SGD-TS algorithm, which uses a single-step SGD update to exploit past information and uses Thompson sampling for exploration, achieves Õ(√T) regret with total time complexity that scales linearly in T and d, where T is the total number of rounds and d is the number of features. Experimental results show that SGD-TS consistently outperforms existing algorithms on both synthetic and real datasets.
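The single-step-SGD-plus-TS idea can be sketched for a logistic reward model as follows. The function name, noise schedule, and learning rate are assumptions for illustration, not the exact SGD-TS specification from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_ts_logistic(contexts_fn, theta_star, horizon, lr=0.1,
                    noise_scale=0.5, seed=0):
    """Illustrative SGD + Thompson-sampling policy for a logistic
    (generalized linear) bandit. theta_star is the unknown parameter;
    contexts_fn(rng) returns an (n_arms, d) matrix of contexts."""
    rng = np.random.default_rng(seed)
    d = theta_star.shape[0]
    theta = np.zeros(d)
    total = 0.0
    for t in range(horizon):
        X = contexts_fn(rng)
        # Explore by perturbing the running estimate (TS-style).
        theta_tilde = theta + noise_scale / np.sqrt(t + 1.0) * rng.standard_normal(d)
        arm = int(np.argmax(X @ theta_tilde))
        x = X[arm]
        reward = float(rng.random() < sigmoid(x @ theta_star))  # Bernoulli reward
        # Single-step SGD on the logistic log-loss: O(d) time and memory.
        theta -= lr * (sigmoid(x @ theta) - reward) * x
        total += reward
    return theta, total
```

Each round costs O(d), in contrast to refitting a maximum likelihood estimator over all past data.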
 Award ID(s):
 1712996
 NSF-PAR ID:
 10402485
 Editor(s):
 Banerjee, Arindam; Fukumizu, Kenji
 Date Published:
 Journal Name:
 Proceedings of Machine Learning Research
 Volume:
 130
 ISSN:
 2640-3498
 Page Range / eLocation ID:
 1585-1593
 Format(s):
 Medium: X
 Sponsoring Org:
 National Science Foundation
More Like this


There has been a flurry of recent literature studying streaming algorithms for which the input stream is chosen adaptively by a black-box adversary who observes the output of the streaming algorithm at each time step. However, these algorithms fail when the adversary has access to the internal state of the algorithm, rather than just the output of the algorithm. We study streaming algorithms in the white-box adversarial model, where the stream is chosen adaptively by an adversary who observes the entire internal state of the algorithm at each time step. We show that nontrivial algorithms are still possible. We first give a randomized algorithm for the L1-heavy hitters problem that outperforms the optimal deterministic Misra-Gries algorithm on long streams. If the white-box adversary is computationally bounded, we use cryptographic techniques to reduce the memory of our L1-heavy hitters algorithm even further and to design a number of additional algorithms for graph, string, and linear algebra problems. The existence of such algorithms is surprising, as the streaming algorithm does not even have a secret key in this model, i.e., its state is entirely known to the adversary. One algorithm we design is for estimating the number of distinct elements in a stream with insertions and deletions, achieving a multiplicative approximation and sublinear space; such an algorithm is impossible for deterministic algorithms. We also give a general technique that translates any two-player deterministic communication lower bound to a lower bound for randomized algorithms robust to a white-box adversary. In particular, our results show that for all p ≥ 0, there exists a constant C_p > 1 such that any C_p-approximation algorithm for F_p moment estimation in insertion-only streams with a white-box adversary requires Ω(n) space for a universe of size n.
Similarly, there is a constant C > 1 such that any C-approximation algorithm in an insertion-only stream for matrix rank requires Ω(n) space with a white-box adversary. These results do not contradict our upper bounds, since they assume the adversary has unbounded computational power. Our algorithmic results based on cryptography thus show a separation between computationally bounded and unbounded adversaries. Finally, we prove a lower bound of Ω(log n) bits for the fundamental problem of deterministic approximate counting in a stream of 0s and 1s, which holds even if we know how many total stream updates we have seen so far at each point in the stream. Such a lower bound for approximate counting with additional information was previously unknown, and in our context, it shows a separation between multi-player deterministic maximum communication and the white-box space complexity of a streaming algorithm.
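For reference, the deterministic Misra-Gries summary mentioned above can be sketched as follows. This is the classical algorithm, not the paper's randomized white-box-robust variant.

```python
def misra_gries(stream, k):
    """Classic deterministic Misra-Gries summary with k counters.

    Any item with frequency greater than len(stream)/(k+1) is
    guaranteed to survive in the summary, and each reported count
    underestimates the true count by at most len(stream)/(k+1)."""
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k:
            counters[x] = 1
        else:
            # No free counter: decrement every counter and drop zeros.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters
```

Because its state is a deterministic function of the stream, an adversary who sees that state can force worst-case behavior, which is what motivates the randomized algorithms above.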

In bandit multiple hypothesis testing, each arm corresponds to a different null hypothesis that we wish to test, and the goal is to design adaptive algorithms that correctly identify a large set of interesting arms (true discoveries) while only mistakenly identifying a few uninteresting ones (false discoveries). One common metric in non-bandit multiple testing is the false discovery rate (FDR). We propose a unified, modular framework for bandit FDR control that emphasizes the decoupling of exploration and summarization of evidence. We utilize the powerful martingale-based concept of "e-processes" to ensure FDR control for arbitrary composite nulls, exploration rules, and stopping times in generic problem settings. In particular, valid FDR control holds even if the reward distributions of the arms could be dependent, multiple arms may be queried simultaneously, and multiple (cooperating or competing) agents may be querying arms, covering combinatorial semi-bandit settings as well. Prior work has considered in great detail the setting where each arm's reward distribution is independent and sub-Gaussian, and a single arm is queried at each step. Our framework recovers matching sample complexity guarantees in this special case and performs comparably or better in practice. For other settings, sample complexities will depend on the finer details of the problem (composite nulls being tested, exploration algorithm, data dependence structure, stopping rule) and we do not explore these; our contribution is to show that the FDR guarantee is clean and entirely agnostic to these details.
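The summarization side of such a framework can be illustrated with the e-BH procedure, a standard method that converts e-values into an FDR-controlling rejection set under arbitrary dependence. This is a minimal sketch of that one step; the exploration rules and stopping-time machinery discussed above are omitted, and the paper's full framework is not reproduced here.

```python
def e_bh(e_values, alpha):
    """e-BH procedure: given e-values for K null hypotheses, reject
    the k* hypotheses with the largest e-values, where k* is the
    largest k such that the k-th largest e-value is >= K / (alpha*k).
    This controls FDR at level alpha under arbitrary dependence."""
    K = len(e_values)
    order = sorted(range(K), key=lambda i: e_values[i], reverse=True)
    k_star = 0
    for rank, i in enumerate(order, start=1):
        if e_values[i] >= K / (alpha * rank):
            k_star = rank
    return sorted(order[:k_star])   # indices of rejected hypotheses
```

Because the guarantee does not depend on how the e-values were produced, any exploration rule and any stopping time can feed into this step, which is the decoupling the abstract emphasizes.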

We study a nonparametric contextual bandit problem in which the expected reward functions belong to a Hölder class with smoothness parameter β. We show how this interpolates between two extremes that were previously studied in isolation: non-differentiable bandits (β ≤ 1), for which rate-optimal regret is achieved by running separate non-contextual bandits in different context regions, and parametric-response bandits (β = ∞), for which rate-optimal regret can be achieved with minimal or no exploration because of infinite extrapolatability. We develop a novel algorithm that carefully adjusts to all smoothness settings, and we prove its regret is rate-optimal by establishing matching upper and lower bounds, recovering the existing results at the two extremes. In this sense, our work bridges the gap between the existing literature on parametric and non-differentiable contextual bandit problems and between bandit algorithms that exclusively use global or local information, shedding light on the crucial interplay of complexity and regret in contextual bandits.
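The strategy at the non-differentiable extreme can be sketched as follows: partition the context space into bins and run an independent UCB bandit inside each bin. Function names, the bin count, and the confidence-width constant are illustrative assumptions; the paper's algorithm adapts across smoothness levels rather than fixing a single binning.

```python
import numpy as np

def binned_ucb(reward_fn, horizon, n_bins, n_arms=2, seed=0):
    """Independent UCB bandit per context bin on [0, 1).

    reward_fn(ctx, arm, rng) returns a noisy reward. The bin width
    trades off approximation error against per-bin sample size,
    which is where the smoothness parameter beta enters."""
    rng = np.random.default_rng(seed)
    counts = np.zeros((n_bins, n_arms))
    sums = np.zeros((n_bins, n_arms))
    total = 0.0
    for t in range(horizon):
        ctx = rng.random()                       # context uniform on [0, 1)
        b = min(int(ctx * n_bins), n_bins - 1)
        if counts[b].min() == 0:                 # play each arm once per bin
            arm = int(np.argmin(counts[b]))
        else:                                    # standard UCB index
            ucb = sums[b] / counts[b] + np.sqrt(2 * np.log(t + 1) / counts[b])
            arm = int(np.argmax(ucb))
        r = reward_fn(ctx, arm, rng)
        counts[b, arm] += 1
        sums[b, arm] += r
        total += r
    return total, counts
```

Because each bin ignores the others, no information is extrapolated across context regions, which is exactly what the β = ∞ extreme exploits and this strategy forgoes.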

We consider the bandit problem of selecting K out of N arms at each time step. The joint reward can be a nonlinear function of the rewards of the selected individual arms. The direct use of a multi-armed bandit algorithm requires choosing among all possible combinations, making the action space large. To simplify the problem, existing works on combinatorial bandits typically assume feedback as a linear function of individual rewards. In this paper, we prove the lower bound for top-K subset selection with bandit feedback with possibly correlated rewards. We present a novel algorithm for the combinatorial setting without using individual arm feedback or requiring linearity of the reward function. Additionally, our algorithm works on correlated rewards of individual arms. Our algorithm, aDaptive Accept RejecT (DART), sequentially finds good arms and eliminates bad arms based on confidence bounds. DART is computationally efficient and uses storage linear in N. Further, DART achieves a regret bound of Õ(K√(KNT)) for a time horizon T, which matches the lower bound in bandit feedback up to a factor of √(log 2NT). When applied to the problem of cross-selling optimization and maximizing the mean of individual rewards, the performance of the proposed algorithm surpasses that of state-of-the-art algorithms. We also show that DART significantly outperforms existing methods for both linear and nonlinear joint reward environments.
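A simplified accept/reject loop in the spirit of DART can be sketched as follows. Note that it uses individual arm pulls for clarity, whereas DART itself operates without per-arm feedback, so this is an illustrative analogue with assumed confidence widths, not the paper's algorithm.

```python
import numpy as np

def accept_reject_topk(pull, n_arms, k, max_rounds=2000, delta=0.05, seed=0):
    """Sequentially accept arms that provably belong to the top-k set
    and reject arms that provably do not, using confidence bounds.
    pull(arm, rng) returns a noisy reward for one arm."""
    rng = np.random.default_rng(seed)
    active = set(range(n_arms))
    accepted = set()
    counts = np.zeros(n_arms)
    sums = np.zeros(n_arms)
    for t in range(1, max_rounds + 1):
        need = k - len(accepted)
        if need == 0:
            break
        if len(active) <= need:           # everything left must be accepted
            accepted |= active
            break
        for arm in active:                # pull every still-active arm once
            sums[arm] += pull(arm, rng)
            counts[arm] += 1
        idx = sorted(active)
        mean = sums[idx] / counts[idx]
        rad = np.sqrt(np.log(4 * n_arms * t * t / delta) / (2 * counts[idx]))
        lcb, ucb = mean - rad, mean + rad
        best = idx[int(np.argmax(lcb))]
        worst = idx[int(np.argmin(ucb))]
        # Accept: best arm's lower bound clears the (need+1)-th largest UCB.
        if lcb.max() > np.sort(ucb)[::-1][need]:
            accepted.add(best)
            active.discard(best)
        # Reject: worst arm's upper bound falls below the need-th largest LCB.
        if ucb.min() < np.sort(lcb)[::-1][need - 1]:
            active.discard(worst)
    return sorted(accepted)
```

Accepted and rejected arms stop being sampled, so the budget concentrates on arms whose membership in the top-k set is still ambiguous.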