We consider the problem where $N$ agents collaboratively interact with an instance of a stochastic $K$-armed bandit problem for $K \gg N$. The agents aim to simultaneously minimize the cumulative regret over all the agents for a total of $T$ time steps, the number of communication rounds, and the number of bits in each communication round. We present Limited Communication Collaboration - Upper Confidence Bound (LCC-UCB), a doubling-epoch-based algorithm where each agent communicates only at the end of an epoch and shares the index of the best arm it knows. With our algorithm, LCC-UCB, each agent enjoys a regret of $\tilde{O}(\sqrt{(K/N + N)T})$, communicates for $O(\log T)$ steps, and broadcasts $O(\log K)$ bits in each communication step. We extend the work to sparse graphs with maximum degree $K_G$ and diameter $D$ to propose LCC-UCB-GRAPH, which enjoys a regret bound of $\tilde{O}(\sqrt{D(K/N + K_G)DT})$. Finally, we empirically show that the LCC-UCB and LCC-UCB-GRAPH algorithms perform well and outperform strategies that communicate through a central node.
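The doubling-epoch idea in the abstract can be illustrated with a toy simulation. This is not the paper's LCC-UCB algorithm, only a minimal sketch of its two ingredients: each agent runs UCB over its own block of arms plus the arms recommended by the others, and communication (one arm index, i.e. $O(\log K)$ bits, per agent) happens only at the ends of doubling epochs, giving $O(\log T)$ communication rounds. All parameter values and the partition scheme are illustrative assumptions.

```python
import math
import random

def lcc_ucb_sketch(K=20, N=4, T=1000, seed=0):
    """Toy doubling-epoch collaboration: NOT the paper's algorithm, just
    an illustration of epoch-based best-arm sharing among N agents."""
    rng = random.Random(seed)
    means = [rng.random() for _ in range(K)]
    best = max(means)
    blocks = [set(range(i, K, N)) for i in range(N)]   # disjoint arm blocks
    recommended = set()          # arm indices broadcast at the last epoch end
    counts = [[0] * K for _ in range(N)]
    sums = [[0.0] * K for _ in range(N)]
    t, epoch_len, comm_rounds, regret = 0, 2, 0, 0.0
    while t < T:
        for _ in range(min(epoch_len, T - t)):
            t += 1
            for a in range(N):
                active = blocks[a] | recommended
                def ucb(i):
                    if counts[a][i] == 0:
                        return float("inf")
                    mean = sums[a][i] / counts[a][i]
                    return mean + math.sqrt(2 * math.log(t) / counts[a][i])
                arm = max(active, key=ucb)
                counts[a][arm] += 1
                sums[a][arm] += means[arm] + rng.gauss(0, 0.1)
                regret += best - means[arm]
        # epoch end: each agent broadcasts one arm index (O(log K) bits)
        recommended = set()
        for a in range(N):
            emp = [sums[a][i] / counts[a][i] if counts[a][i] else -1.0
                   for i in range(K)]
            recommended.add(max(range(K), key=emp.__getitem__))
        comm_rounds += 1
        epoch_len *= 2     # doubling epochs => O(log T) communication rounds
    return regret, comm_rounds
```

With $T = 1000$ the doubling schedule produces roughly $\log_2 T$ epoch boundaries, so the agents communicate only a handful of times over the whole horizon.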
Global Optimization with Parametric Function Approximation
We consider the problem of global optimization with noisy zeroth-order oracles, a well-motivated problem useful for various applications ranging from hyperparameter tuning for deep learning to new material design. Existing work relies on Gaussian processes or other non-parametric families, which suffer from the curse of dimensionality. In this paper, we propose a new algorithm, GO-UCB, that leverages a parametric family of functions (e.g., neural networks) instead. Under a realizability assumption and a few other mild geometric conditions, we show that GO-UCB achieves a cumulative regret of $\tilde{O}(\sqrt{T})$, where $T$ is the time horizon. At the core of GO-UCB is a carefully designed uncertainty set over parameters based on gradients that allows optimistic exploration. Synthetic and real-world experiments illustrate that GO-UCB works better than popular Bayesian optimization approaches, even if the model is misspecified.
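The "gradient-based parameter uncertainty set" can be previewed in the simplest parametric case. For a linear model $f(x;\theta) = \theta^\top x$ the gradient with respect to $\theta$ is $x$ itself, so the gradient-based ellipsoid reduces to the classic ridge-regression confidence set used by linear UCB. The sketch below is that special case, not GO-UCB itself; the candidate grid, noise level, and confidence width $\beta$ are illustrative assumptions.

```python
import math
import random

def parametric_ucb_sketch(T=300, lam=1.0, beta=2.0, seed=0):
    """Optimistic exploration with a 2-D linear parametric model.
    A stand-in for the gradient-based uncertainty set in the linear case."""
    rng = random.Random(seed)
    theta_star = (0.8, -0.3)                       # unknown true parameter
    xs = [(math.cos(0.1 * k), math.sin(0.1 * k)) for k in range(63)]
    A = [[lam, 0.0], [0.0, lam]]                   # V = lam*I + sum x x^T
    b = [0.0, 0.0]
    fstar = max(theta_star[0] * x[0] + theta_star[1] * x[1] for x in xs)
    regret = 0.0
    for _ in range(T):
        det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
        Ainv = [[A[1][1] / det, -A[0][1] / det],
                [-A[1][0] / det, A[0][0] / det]]
        th = [Ainv[0][0] * b[0] + Ainv[0][1] * b[1],    # ridge estimate
              Ainv[1][0] * b[0] + Ainv[1][1] * b[1]]
        def ucb(x):
            mean = th[0] * x[0] + th[1] * x[1]
            vx = [Ainv[0][0] * x[0] + Ainv[0][1] * x[1],
                  Ainv[1][0] * x[0] + Ainv[1][1] * x[1]]
            return mean + beta * math.sqrt(x[0] * vx[0] + x[1] * vx[1])
        x = max(xs, key=ucb)                       # optimistic query point
        y = theta_star[0] * x[0] + theta_star[1] * x[1] + rng.gauss(0, 0.1)
        A[0][0] += x[0] * x[0]; A[0][1] += x[0] * x[1]
        A[1][0] += x[1] * x[0]; A[1][1] += x[1] * x[1]
        b[0] += y * x[0]; b[1] += y * x[1]
        regret += fstar - (theta_star[0] * x[0] + theta_star[1] * x[1])
    return regret
```

The cumulative regret stays far below linear in $T$, consistent with the $\tilde{O}(\sqrt{T})$ flavor of bound the abstract states for the general parametric setting.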
Award ID(s): 2134214
NSF-PAR ID: 10467176
Editor(s): Krause, Andreas and
Publisher / Repository: Proceedings of the 40th International Conference on Machine Learning
Date Published:
Journal Name: Proceedings of Machine Learning Research
Volume: 202
ISSN: 2640-3498
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this


In black-box optimization problems, we aim to maximize an unknown objective function, where the function is only accessible through the feedback of an evaluation or simulation oracle. In real life, the feedback of such oracles is often noisy and available after some unknown delay that may depend on the computation time of the oracle. Additionally, if the exact evaluations are expensive but coarse approximations are available at a lower cost, the feedback can be multi-fidelity. In order to address this problem, we propose a generic extension of hierarchical optimistic tree search (HOO), called ProCrastinated Tree Search (PCTS), that flexibly accommodates a delay- and noise-tolerant bandit algorithm. We provide a generic proof technique to quantify the regret of PCTS under delayed, noisy, and multi-fidelity feedback. Specifically, we derive regret bounds of PCTS enabled with the delayed-UCB1 (DUCB1) and delayed-UCB-V (DUCBV) algorithms. Given a horizon $T$, PCTS retains the regret bound of non-delayed HOO for an expected delay of $O(\log T)$, and worsens by $T^{(1-\alpha)/(d+2)}$ for expected delays of $O(T^{1-\alpha})$ for $\alpha \in (0,1]$. We experimentally validate on multiple synthetic functions and hyperparameter tuning problems that PCTS outperforms state-of-the-art black-box optimization methods for feedback with different noise levels, delays, and fidelities.
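The delayed-feedback ingredient can be seen in isolation with a toy delayed UCB1 loop: rewards are queued and only incorporated into the arm statistics after a fixed delay. This is a minimal sketch of the delay mechanism, not the DUCB1 algorithm or the tree search from the abstract; the Bernoulli arms and fixed delay are illustrative assumptions.

```python
import heapq
import math
import random

def delayed_ucb1(means, T=3000, delay=0, seed=0):
    """UCB1 on Bernoulli arms where each reward is observed `delay` steps
    after the pull, as a toy model of delayed oracle feedback."""
    rng = random.Random(seed)
    K = len(means)
    counts = [0] * K                  # feedback actually observed so far
    sums = [0.0] * K
    pulls = [0] * K                   # pulls issued (feedback may be pending)
    pending = []                      # (arrival_time, arm, reward) min-heap
    regret, best = 0.0, max(means)
    for t in range(T):
        while pending and pending[0][0] <= t:      # deliver due feedback
            _, arm, r = heapq.heappop(pending)
            counts[arm] += 1
            sums[arm] += r
        def ucb(i):
            if counts[i] == 0:
                return float("inf")
            return sums[i] / counts[i] + math.sqrt(
                2 * math.log(t + 1) / counts[i])
        if min(pulls) == 0:
            a = pulls.index(0)        # pull every arm once to start
        else:                         # random tie-break among equal UCBs
            a = max(range(K), key=lambda i: (ucb(i), rng.random()))
        pulls[a] += 1
        r = 1.0 if rng.random() < means[a] else 0.0
        heapq.heappush(pending, (t + delay, a, r))
        regret += best - means[a]
    return regret
```

Running it with a small versus a large delay shows the qualitative effect the regret bounds quantify: the learner keeps pulling under stale statistics while feedback is in flight, so larger delays cost more exploration.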

In this paper, we propose and study opportunistic bandits, a new variant of bandits where the regret of pulling a suboptimal arm varies under different environmental conditions, such as network load or produce price. When the load/price is low, so is the cost/regret of pulling a suboptimal arm (e.g., trying a suboptimal network configuration). Therefore, intuitively, we could explore more when the load/price is low and exploit more when the load/price is high. Inspired by this intuition, we propose an Adaptive Upper-Confidence-Bound (AdaUCB) algorithm to adaptively balance the exploration-exploitation tradeoff for opportunistic bandits. We prove that AdaUCB achieves $O(\log T)$ regret with a smaller coefficient than the traditional UCB algorithm. Furthermore, AdaUCB achieves $O(1)$ regret with respect to $T$ if the exploration cost is zero when the load level is below a certain threshold. Last, based on both synthetic data and real-world traces, experimental results show that AdaUCB significantly outperforms other bandit algorithms, such as UCB and TS (Thompson Sampling), under large load/price fluctuation.
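The opportunistic intuition, explore when the load is low and exploit when it is high, can be sketched with a crude threshold rule. This is only an illustration of the setting, not the AdaUCB algorithm from the abstract (which adapts the confidence bound itself); the i.i.d. uniform load model, threshold, and noise level are illustrative assumptions.

```python
import math
import random

def opportunistic_sketch(means, T=5000, threshold=0.3, seed=0):
    """Toy opportunistic bandit: the cost of a suboptimal pull scales with
    the current load, so we run UCB exploration only under low load and
    pure exploitation otherwise."""
    rng = random.Random(seed)
    K = len(means)
    counts, sums = [0] * K, [0.0] * K
    best, cost = max(means), 0.0
    for t in range(T):
        load = rng.random()                    # i.i.d. load in [0, 1]
        if 0 in counts:                        # play each arm once first
            a = counts.index(0)
        elif load <= threshold:                # low load: explore via UCB
            a = max(range(K), key=lambda i: sums[i] / counts[i]
                    + math.sqrt(2 * math.log(t + 1) / counts[i]))
        else:                                  # high load: exploit
            a = max(range(K), key=lambda i: sums[i] / counts[i])
        r = means[a] + rng.gauss(0, 0.1)
        counts[a] += 1
        sums[a] += r
        cost += load * (best - means[a])       # load-weighted regret
    return cost
```

Because exploratory (possibly suboptimal) pulls are concentrated on low-load steps, their load-weighted cost is small, which is the effect AdaUCB's adaptive confidence bounds exploit rigorously.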

We consider a prototypical path planning problem on a graph with uncertain cost of mobility on its edges. At a given node, the planning agent can access the true cost for edges to its neighbors and uses a noisy simulator to estimate the cost-to-go from the neighboring nodes. The objective of the planning agent is to select a neighboring node such that, with high probability, the cost-to-go is minimized for the worst possible realization of uncertain parameters in the simulator. By modeling the cost-to-go as a Gaussian process (GP) for every realization of the uncertain parameters, we apply a scenario approach in which we draw fixed independent samples of the uncertain parameter. We present a scenario-based iterative algorithm using the upper confidence bound (UCB) of the fixed independent scenarios to compute the choice of the neighbor to go to. We characterize the performance of the proposed algorithm in terms of a novel notion of regret defined with respect to an additional draw of the uncertain parameter, termed scenario regret under redraw. In particular, we characterize a high-probability upper bound on the regret under redraw for any finite number of iterations of the algorithm, and show that this upper bound tends to zero asymptotically with the number of iterations. We supplement our analysis with numerical results.
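The scenario approach can be sketched without the GP machinery: draw a fixed set of scenarios of the uncertain parameter once, maintain per-scenario cost estimates for each neighbor, and choose the neighbor whose optimistic worst-case (over scenarios) cost is smallest. This is a simplified stand-in for the abstract's algorithm, with sample means in place of GP posteriors; the linear cost simulator, its parameters, and the refinement rule are all illustrative assumptions.

```python
import math
import random

def scenario_ucb_sketch(n_neighbors=4, M=8, iters=500, seed=0):
    """Pick the neighbor minimizing the worst-case cost-to-go over M fixed
    scenarios, using optimism in the face of simulator noise."""
    rng = random.Random(seed)
    # hypothetical simulator: cost(j, w) = base[j] + w * sens[j] + noise
    base = [rng.uniform(1.0, 2.0) for _ in range(n_neighbors)]
    sens = [rng.uniform(0.0, 1.0) for _ in range(n_neighbors)]
    scenarios = [rng.gauss(0, 1) for _ in range(M)]   # fixed draws of w
    def simulate(j, w):
        return base[j] + w * sens[j] + rng.gauss(0, 0.2)
    counts = [[1] * M for _ in range(n_neighbors)]    # one sample per pair
    sums = [[simulate(j, w) for w in scenarios] for j in range(n_neighbors)]
    for t in range(1, iters + 1):
        def worst_optimistic_cost(j):
            # optimistic (lower) cost per scenario, then worst case over them
            return max(sums[j][m] / counts[j][m]
                       - math.sqrt(2 * math.log(t + 1) / counts[j][m])
                       for m in range(M))
        j = min(range(n_neighbors), key=worst_optimistic_cost)
        # refine the estimate at j's currently worst-looking scenario
        m = max(range(M), key=lambda m: sums[j][m] / counts[j][m])
        sums[j][m] += simulate(j, scenarios[m])
        counts[j][m] += 1
    # final choice: minimize the estimated worst-case cost-to-go
    return min(range(n_neighbors),
               key=lambda j: max(sums[j][m] / counts[j][m] for m in range(M)))
```

The "regret under redraw" notion in the abstract then asks how this scenario-based choice fares against a fresh, previously unseen draw of the uncertain parameter, which is why the number of scenarios $M$ matters beyond the per-scenario estimation error.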

Yin, George (Ed.) We consider a discrete-time stochastic Markovian control problem under model uncertainty. Such uncertainty not only comes from the fact that the true probability law of the underlying stochastic process is unknown, but the parametric family of probability distributions to which the true law belongs is also unknown. We propose a non-parametric adaptive robust control methodology to deal with such a problem, where the relevant system random noise is, for simplicity, assumed to be i.i.d. and one-dimensional. Our approach hinges on the following building blocks: first, using the adaptive robust paradigm to incorporate online learning and uncertainty reduction into the robust control problem; second, learning the unknown probability law through the empirical distribution, and representing uncertainty reduction in terms of a sequence of Wasserstein balls around the empirical distribution; third, using Lagrangian duality to convert the optimization over Wasserstein balls to a scalar optimization problem, and adopting a machine learning technique to achieve efficient computation of the optimal control. We illustrate our methodology by considering a utility maximization problem. Numerical comparisons show that the non-parametric adaptive robust control approach is preferable to the traditional robust frameworks.
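The second building block, empirical distributions with shrinking Wasserstein balls, is easy to see numerically in one dimension, where the Wasserstein-1 distance between empirical laws is just the average gap between matched quantiles. The sketch below is a hypothetical illustration, not the paper's construction; the Gaussian noise model and sample sizes are assumptions for the demo.

```python
import random

def w1(xs, ys, grid=1000):
    """Approximate 1-D Wasserstein-1 distance between two empirical
    distributions by averaging quantile differences on a common grid."""
    xs, ys = sorted(xs), sorted(ys)
    def q(s, u):
        return s[min(int(u * len(s)), len(s) - 1)]
    return sum(abs(q(xs, (k + 0.5) / grid) - q(ys, (k + 0.5) / grid))
               for k in range(grid)) / grid

rng = random.Random(0)
online = [rng.gauss(0, 1) for _ in range(2000)]     # noise observed so far
reference = [rng.gauss(0, 1) for _ in range(4000)]  # proxy for the true law
d_early = w1(online[:50], reference)   # distance after few observations
d_late = w1(online, reference)         # distance after many observations
```

As more noise samples arrive, the empirical distribution moves closer to the true law, so the radius of a Wasserstein ball guaranteed to contain the truth can shrink over time; that shrinking sequence of balls is what drives the uncertainty reduction in the adaptive robust scheme.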