We investigate the problem of unconstrained combinatorial multiarmed bandits with fullbandit feedback and stochastic rewards for submodular maximization. Previous works investigate the same problem assuming a submodular and monotone reward function. In this work, we study a more general problem, i.e., when the reward function is not necessarily monotone, and the submodularity is assumed only in expectation. We propose Randomized Greedy Learning (RGL) algorithm and theoretically prove that it achieves a $\frac{1}{2}$regret upper bound of $\Tilde{\mathcal{O}}(n T^{\frac{2}{3}})$ for horizon $T$ and number of arms $n$. We also show in experiments that RGL empirically outperforms other fullbandit variants in submodular and nonsubmodular settings.
more »
« less
DART: Adaptive Accept Reject Algorithm for NonLinear Combinatorial Bandits
We consider the bandit problem of selecting K out of N arms at each time step. The joint reward can be a nonlinear function of the rewards of the selected individual arms. The direct use of a multiarmed bandit algorithm requires choosing among all possible combinations, making the action space large. To simplify the problem, existing works on combinatorial bandits typically assume feedback as a linear function of individual rewards. In this paper, we prove the lower bound for topK subset selection with bandit feedback with possibly correlated rewards. We present a novel algorithm for the combinatorial setting without using individual arm feedback or requiring linearity of the reward function. Additionally, our algorithm works on correlated rewards of individual arms. Our algorithm, aDaptive Accept RejecT (DART), sequentially finds good arms and eliminates bad arms based on confidence bounds. DART is computationally efficient and uses storage linear in N. Further, DART achieves a regret bound of Õ(K√KNT) for a time horizon T, which matches the lower bound in bandit feedback up to a factor of √log 2NT. When applied to the problem of crossselling optimization and maximizing the mean of individual rewards, the performance of the proposed algorithm surpasses that of stateoftheart algorithms. We also show that DART significantly outperforms existing methods for both linear and nonlinear joint reward environments.
more »
« less
 Award ID(s):
 1742847
 NSFPAR ID:
 10309322
 Date Published:
 Journal Name:
 Proceedings of the AAAI Conference on Artificial Intelligence
 ISSN:
 21595399
 Format(s):
 Medium: X
 Sponsoring Org:
 National Science Foundation
More Like this


We investigate the problem of unconstrained combinatorial multiarmed bandits with fullbandit feedback and stochastic rewards for submodular maximization. Previous works investigate the same problem assuming a submodular and monotone reward function. In this work, we study a more general problem, i.e., when the reward function is not necessarily monotone, and the submodularity is assumed only in expectation. We propose Randomized Greedy Learning (RGL) algorithm and theoretically prove that it achieves a 1 2 regret upper bound of O˜(nT 2 3 ) for horizon T and number of arms n. We also show in experiments that RGL empirically outperforms other fullbandit variants in submodular and nonsubmodular settings.more » « less

Feldman, Vitaly ; Ligett, Katrina ; Sabato, Sivan (Ed.)Many realworld problems like Social Influence Maximization face the dilemma of choosing the best $K$ out of $N$ options at a given time instant. This setup can be modeled as a combinatorial bandit which chooses $K$ out of $N$ arms at each time, with an aim to achieve an efficient tradeoff between exploration and exploitation. This is the first work for combinatorial bandits where the feedback received can be a nonlinear function of the chosen $K$ arms. The direct use of multiarmed bandit requires choosing among $N$choose$K$ options making the state space large. In this paper, we present a novel algorithm which is computationally efficient and the storage is linear in $N$. The proposed algorithm is a divideandconquer based strategy, that we call CMABSM. Further, the proposed algorithm achieves a \textit{regret bound} of $\tilde O(K^{\frac{1}{2}}N^{\frac{1}{3}}T^{\frac{2}{3}})$ for a time horizon $T$, which is \textit{sublinear} in all parameters $T$, $N$, and $K$.more » « less

Cussens, James ; Zhang, Kun (Ed.)We investigate the problem of combinatorial multiarmed bandits with stochastic submodular (in expectation) rewards and fullbandit feedback, where no extra information other than the reward of selected action at each time step $t$ is observed. We propose a simple algorithm, ExploreThenCommit Greedy (ETCG) and prove that it achieves a $(11/e)$regret upper bound of $\mathcal{O}(n^\frac{1}{3}k^\frac{4}{3}T^\frac{2}{3}\log(T)^\frac{1}{2})$ for a horizon $T$, number of base elements $n$, and cardinality constraint $k$. We also show in experiments with synthetic and realworld data that the ETCG empirically outperforms other fullbandit methods.more » « less

In many realworld applications, multiple agents seek to learn how to perform highly related yet slightly different tasks in an online bandit learning protocol. We formulate this problem as the ϵmultiplayer multiarmed bandit problem, in which a set of players concurrently interact with a set of arms, and for each arm, the reward distributions for all players are similar but not necessarily identical. We develop an upper confidence boundbased algorithm, RobustAgg(ϵ), that adaptively aggregates rewards collected by different players. In the setting where an upper bound on the pairwise dissimilarities of reward distributions between players is known, we achieve instancedependent regret guarantees that depend on the amenability of information sharing across players. We complement these upper bounds with nearly matching lower bounds. In the setting where pairwise dissimilarities are unknown, we provide a lower bound, as well as an algorithm that trades off minimax regret guarantees for adaptivity to unknown similarity structure.more » « less