In this paper, we propose and study opportunistic contextual bandits, a special case of contextual bandits where the exploration cost varies under different environmental conditions, such as network load or return variation in recommendations. When the exploration cost is low, so is the actual regret of pulling a suboptimal arm (e.g., trying a suboptimal recommendation). Therefore, intuitively, we could explore more when the exploration cost is relatively low and exploit more when the exploration cost is relatively high. Inspired by this intuition, for opportunistic contextual bandits with linear payoffs, we propose an Adaptive Upper-Confidence-Bound algorithm (AdaLinUCB) to adaptively balance the exploration-exploitation trade-off for opportunistic learning. We prove that AdaLinUCB achieves an O((log T)^2) problem-dependent regret upper bound, which has a smaller coefficient than that of the traditional LinUCB algorithm. Moreover, based on both synthetic and real-world datasets, we show that AdaLinUCB significantly outperforms other contextual bandit algorithms under large exploration cost fluctuations.
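The adaptive idea described above can be sketched as a LinUCB arm selection whose confidence width shrinks when the current exploration cost is high. This is an illustrative sketch, not the paper's exact AdaLinUCB rule; the `adaptive_alpha` helper and its linear normalization are assumptions.

```python
import numpy as np

def linucb_select(arm_features, A, b, alpha):
    """Standard LinUCB step: pick the arm maximizing
    estimated reward + alpha * confidence width."""
    A_inv = np.linalg.inv(A)
    theta = A_inv @ b                      # ridge-regression estimate
    scores = [x @ theta + alpha * np.sqrt(x @ A_inv @ x) for x in arm_features]
    return int(np.argmax(scores))

def adaptive_alpha(cost, c_min, c_max, alpha_max):
    """Shrink the exploration width as the current exploration cost rises
    (hypothetical linear normalization; the paper's rule may differ)."""
    if c_max == c_min:
        return alpha_max
    level = (cost - c_min) / (c_max - c_min)  # 0 = cheapest, 1 = costliest
    return alpha_max * (1.0 - level)
```

With `alpha = adaptive_alpha(...)` plugged into `linucb_select`, cheap rounds use a wide confidence bound (more exploration) and expensive rounds act nearly greedily.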
Adaptive Exploration-Exploitation Trade-off for Opportunistic Bandits
In this paper, we propose and study opportunistic bandits, a new variant of bandits where the regret of pulling a suboptimal arm varies under different environmental conditions, such as network load or produce price. When the load/price is low, so is the cost/regret of pulling a suboptimal arm (e.g., trying a suboptimal network configuration). Therefore, intuitively, we could explore more when the load/price is low and exploit more when the load/price is high. Inspired by this intuition, we propose an Adaptive Upper-Confidence-Bound (AdaUCB) algorithm to adaptively balance the exploration-exploitation trade-off for opportunistic bandits. We prove that AdaUCB achieves O(log T) regret with a smaller coefficient than the traditional UCB algorithm. Furthermore, AdaUCB achieves O(1) regret with respect to T if the exploration cost is zero when the load level is below a certain threshold. Last, based on both synthetic data and real-world traces, experimental results show that AdaUCB significantly outperforms other bandit algorithms, such as UCB and TS (Thompson Sampling), under large load/price fluctuations.
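The load-aware exploration idea can be illustrated for a standard multi-armed bandit: use the usual UCB index, but add the confidence bonus only when the current load is below a threshold. This is a hypothetical gating rule for illustration, not the paper's exact AdaUCB index.

```python
import math

def adaucb_pick(counts, means, t, load, threshold, c=2.0):
    """UCB with load-gated exploration (illustrative, not the paper's
    exact rule): add the confidence bonus only when the load is low."""
    explore = load <= threshold
    best_score, best_idx = -float("inf"), 0
    for i, (n, mu) in enumerate(zip(counts, means)):
        if n == 0:
            return i  # always initialize unpulled arms first
        bonus = math.sqrt(c * math.log(t) / n) if explore else 0.0
        if mu + bonus > best_score:
            best_score, best_idx = mu + bonus, i
    return best_idx
```

Under high load the rule is greedy on empirical means (cheap to be right, costly to explore); under low load it behaves like standard UCB.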
NSF-PAR ID: 10072465
Journal Name: ICML 2018
Sponsoring Org: National Science Foundation
More Like this


We consider the problem where N agents collaboratively interact with an instance of a stochastic K-armed bandit problem for K > N. The agents aim to simultaneously minimize the cumulative regret over all the agents for a total of T time steps, the number of communication rounds, and the number of bits in each communication round. We present Limited Communication Collaboration Upper Confidence Bound (LCC-UCB), a doubling-epoch based algorithm where each agent communicates only after the end of the epoch and shares the index of the best arm it knows. With our algorithm, LCC-UCB, each agent enjoys a regret of O(√((K/N + N)T)), communicates for O(log T) steps, and broadcasts O(log K) bits in each communication step. We extend the work to sparse graphs with maximum degree K_G and diameter D to propose LCC-UCB-GRAPH, which enjoys a regret bound of O(√(D(K/N + K_G)DT)). Finally, we empirically show that the LCC-UCB and the LCC-UCB-GRAPH algorithms perform well and outperform strategies that communicate through a central node.
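The two communication budgets quoted above can be made concrete: doubling epochs give O(log T) communication rounds, and encoding a best-arm index takes ceil(log2 K) bits per broadcast. A minimal sketch, with the helper names being assumptions rather than the paper's API:

```python
import math

def epoch_lengths(T):
    """Doubling epochs: agents communicate only at epoch ends,
    so the number of communication rounds is O(log T)."""
    lengths, L = [], 1
    while sum(lengths) < T:
        lengths.append(min(L, T - sum(lengths)))
        L *= 2
    return lengths

def encode_arm(index, K):
    """Broadcast a best-arm index using ceil(log2 K) bits."""
    bits = max(1, math.ceil(math.log2(K)))
    return format(index, f"0{bits}b")
```

For T = 10 the schedule is four epochs (1, 2, 4, 3), i.e., roughly log2 T communication rounds; with K = 16 arms each broadcast is a 4-bit string.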

We introduce the E$^4$ algorithm for the batched linear bandit problem, incorporating an Explore-Estimate-Eliminate-Exploit framework. With a proper choice of exploration rate, we prove E$^4$ achieves the finite-time minimax optimal regret with only $O(\log\log T)$ batches, and the asymptotically optimal regret with only $3$ batches as $T\rightarrow\infty$, where $T$ is the time horizon. We further prove a lower bound on the batch complexity of linear contextual bandits showing that any asymptotically optimal algorithm must require at least $3$ batches in expectation as $T\rightarrow\infty$, which indicates E$^4$ achieves the asymptotic optimality in regret and batch complexity simultaneously. To the best of our knowledge, E$^4$ is the first algorithm for linear bandits that simultaneously achieves the minimax and asymptotic optimality in regret with the corresponding optimal batch complexities. In addition, we show that with another choice of exploration rate E$^4$ achieves an instance-dependent regret bound requiring at most $O(\log T)$ batches, and maintains the minimax optimality and asymptotic optimality. We conduct thorough experiments to evaluate our algorithm on randomly generated instances and the challenging \textit{End of Optimism} instances \citep{lattimore2017end}, which were shown to be hard to learn for optimism-based algorithms. Empirical results show that E$^4$ consistently outperforms baseline algorithms with respect to regret minimization, batch complexity, and computational efficiency.

Chaudhuri, Kamalika (Ed.) We study the problem of reinforcement learning (RL) with low (policy) switching cost, a problem well-motivated by real-life RL applications in which deployments of new policies are costly and the number of policy updates must be low. In this paper, we propose a new algorithm based on stage-wise exploration and adaptive policy elimination that achieves a regret of $\widetilde{O}(\sqrt{H^4S^2AT})$ while requiring a switching cost of $O(HSA \log\log T)$. This is an exponential improvement over the best-known switching cost $O(H^2SA\log T)$ among existing methods with $\widetilde{O}(\mathrm{poly}(H,S,A)\sqrt{T})$ regret. In the above, $S,A$ denote the number of states and actions in an $H$-horizon episodic Markov Decision Process model with unknown transitions, and $T$ is the number of steps. As a byproduct of our new techniques, we also derive a reward-free exploration algorithm with a switching cost of $O(HSA)$. Furthermore, we prove a pair of information-theoretic lower bounds which say that (1) any no-regret algorithm must have a switching cost of $\Omega(HSA)$; (2) any $\widetilde{O}(\sqrt{T})$ regret algorithm must incur a switching cost of $\Omega(HSA\log\log T)$. Both our algorithms are thus optimal in their switching costs.
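The $O(\log\log T)$ switching cost comes from a stage schedule that is exponentially sparse: stage $k$ ends around $T^{1-2^{-k}}$, so only about $\log_2\log_2 T$ policy switches occur. A minimal sketch of such a schedule, assuming this standard doubling-of-exponents construction rather than the paper's exact stage lengths:

```python
import math

def stage_endpoints(T):
    """Stage ends at roughly T^(1 - 2^-k): switching the policy only at
    these times yields O(log log T) switches (illustrative schedule)."""
    K = max(1, math.ceil(math.log2(math.log2(T))))
    ends = [math.ceil(T ** (1 - 2.0 ** (-k))) for k in range(1, K + 1)]
    ends.append(T)  # final stage runs to the horizon
    return ends
```

For T = 1000 this gives only five stages (ending near steps 32, 178, 422, 650, 1000), versus the roughly log T stages a plain doubling schedule would use.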

Krause, Andreas (Ed.) We consider the problem of global optimization with noisy zeroth-order oracles, a well-motivated problem useful for various applications ranging from hyperparameter tuning for deep learning to new material design. Existing work relies on Gaussian processes or other nonparametric families, which suffer from the curse of dimensionality. In this paper, we propose a new algorithm GO-UCB that leverages a parametric family of functions (e.g., neural networks) instead. Under a realizability assumption and a few other mild geometric conditions, we show that GO-UCB achieves a cumulative regret of $\tilde{O}(\sqrt{T})$ where $T$ is the time horizon. At the core of GO-UCB is a carefully designed uncertainty set over parameters based on gradients that allows optimistic exploration. Synthetic and real-world experiments illustrate that GO-UCB works better than popular Bayesian optimization approaches, even if the model is misspecified.