We consider the bandit problem of selecting $K$ out of $N$ arms at each time step. The joint reward can be a non-linear function of the rewards of the selected individual arms. The direct use of a multi-armed bandit algorithm requires choosing among all possible combinations, making the action space large. To simplify the problem, existing works on combinatorial bandits typically assume feedback as a linear function of individual rewards. In this paper, we prove a lower bound for top-$K$ subset selection with bandit feedback and possibly correlated rewards. We present a novel algorithm for the combinatorial setting that uses neither individual-arm feedback nor linearity of the reward function. Additionally, our algorithm works on correlated rewards of individual arms. Our algorithm, aDaptive Accept RejecT (DART), sequentially finds good arms and eliminates bad arms based on confidence bounds. DART is computationally efficient and uses storage linear in $N$. Further, DART achieves a regret bound of $\tilde{O}(K\sqrt{KNT})$ for a time horizon $T$, which matches the lower bound for bandit feedback up to a factor of $\sqrt{\log 2NT}$. When applied to the problems of cross-selling optimization and maximizing the mean of individual rewards, the proposed algorithm surpasses state-of-the-art algorithms. We also show that DART significantly outperforms existing methods in both linear and non-linear joint-reward environments.
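As a rough, self-contained sketch of the confidence-bound accept/reject idea (not the authors' exact DART procedure; the scoring rule, update schedule, and constants below are illustrative assumptions), a top-$K$ loop under joint-reward-only feedback might look like:

```python
import math
import random

def accept_reject_topk(n, k, joint_reward, rounds, seed=0):
    """Illustrative confidence-bound accept/reject for top-k subset
    selection when only the joint reward of the played subset is seen.
    Each arm is scored by the mean joint reward of the subsets that
    included it; arms whose upper confidence bound falls below the
    k-th largest lower confidence bound are rejected."""
    rng = random.Random(seed)
    active = list(range(n))
    sums = {i: 0.0 for i in active}  # joint reward credited to arm i
    cnts = {i: 0 for i in active}    # number of plays that included arm i
    for t in range(1, rounds + 1):
        if t <= math.ceil(n / k):
            # Warm-start: cycle through all arms so every count is positive.
            start = ((t - 1) * k) % n
            action = [(start + j) % n for j in range(k)]
        elif len(active) > k:
            action = rng.sample(active, k)
        else:
            action = list(active)
        r = joint_reward(action)     # full-bandit feedback: one scalar
        for i in action:
            sums[i] += r
            cnts[i] += 1
        if t % 100 == 0 and len(active) > k:
            mean = {i: sums[i] / cnts[i] for i in active}
            rad = {i: math.sqrt(2 * math.log(rounds) / cnts[i]) for i in active}
            kth_lcb = sorted((mean[i] - rad[i] for i in active), reverse=True)[k - 1]
            # Reject arms that cannot reach the top k even optimistically.
            active = [i for i in active if mean[i] + rad[i] >= kth_lcb]
    mean = {i: sums[i] / cnts[i] for i in active}
    return set(sorted(active, key=lambda i: mean[i], reverse=True)[:k])
```

Note the storage is linear in $n$ (one running sum and one count per arm), and no per-arm reward is ever observed: each arm's score is inferred purely from the joint rewards of the subsets it appeared in.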
Stochastic Top-$K$ Subset Bandits with Linear Space and Non-Linear Feedback
Many real-world problems, such as Social Influence Maximization, face the dilemma of choosing the best $K$ out of $N$ options at a given time instant. This setup can be modeled as a combinatorial bandit that chooses $K$ out of $N$ arms at each time, with the aim of achieving an efficient trade-off between exploration and exploitation. This is the first work on combinatorial bandits where the feedback received can be a non-linear function of the chosen $K$ arms. The direct use of a multi-armed bandit requires choosing among $\binom{N}{K}$ options, making the state space large. In this paper, we present a novel algorithm that is computationally efficient and whose storage is linear in $N$. The proposed algorithm is a divide-and-conquer based strategy, which we call CMAB-SM. Further, the proposed algorithm achieves a \textit{regret bound} of $\tilde O(K^{\frac{1}{2}}N^{\frac{1}{3}}T^{\frac{2}{3}})$ for a time horizon $T$, which is \textit{sub-linear} in all parameters $T$, $N$, and $K$.
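The combinatorial blow-up that motivates the linear-space requirement is easy to quantify: a standard multi-armed bandit over $K$-subsets must track one estimate per subset, whereas a linear-space algorithm keeps only a few numbers per arm. A quick check ($N$ and $K$ chosen arbitrarily for illustration):

```python
import math

N, K = 100, 10

# Treating every K-subset as one "meta-arm" of a standard MAB:
num_meta_arms = math.comb(N, K)
print(num_meta_arms)     # 17310309456440 subsets -- over 17 trillion

# Storage linear in N: e.g. one running mean and one play count per arm.
linear_storage = 2 * N
print(linear_storage)    # 200
```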
Award ID(s): 1742847
NSF-PAR ID: 10309321
Editor(s): Feldman, Vitaly; Ligett, Katrina; Sabato, Sivan
Journal Name: International Conference on Algorithmic Learning Theory
Sponsoring Org: National Science Foundation