We consider the bandit problem of selecting K out of N arms at each time step. The joint reward can be a nonlinear function of the rewards of the selected individual arms. The direct use of a multiarmed bandit algorithm requires choosing among all possible combinations, making the action space large. To simplify the problem, existing works on combinatorial bandits typically assume feedback as a linear function of individual rewards. In this paper, we prove the lower bound for topK subset selection with bandit feedback with possibly correlated rewards. We present a novel algorithm for the combinatorial setting without using individual arm feedback or requiring linearity of the reward function. Additionally, our algorithm works on correlated rewards of individual arms. Our algorithm, aDaptive Accept RejecT (DART), sequentially finds good arms and eliminates bad arms based on confidence bounds. DART is computationally efficient and uses storage linear in N. Further, DART achieves a regret bound of Õ(K√KNT) for a time horizon T, which matches the lower bound in bandit feedback up to a factor of √log 2NT. When applied to the problem of crossselling optimization and maximizing the mean of individual rewards, the performance of the proposed algorithm surpasses that ofmore »
Stochastic Top$K$ Subset Bandits with Linear Space and NonLinear Feedback
Many realworld problems like Social Influence Maximization face the dilemma of choosing the best $K$ out of $N$ options at a given time instant. This setup can be modeled as a combinatorial bandit which chooses $K$ out of $N$ arms at each time, with an aim to achieve an efficient tradeoff between exploration and exploitation. This is the first work for combinatorial bandits where the feedback received can be a nonlinear function of the chosen $K$ arms. The direct use of multiarmed bandit requires choosing among $N$choose$K$ options making the state space large. In this paper, we present a novel algorithm which is computationally efficient and the storage is linear in $N$. The proposed algorithm is a divideandconquer based strategy, that we call CMABSM. Further, the proposed algorithm achieves a \textit{regret bound} of $\tilde O(K^{\frac{1}{2}}N^{\frac{1}{3}}T^{\frac{2}{3}})$ for a time horizon $T$, which is \textit{sublinear} in all parameters $T$, $N$, and $K$.
 Editors:
 Feldman, Vitaly; Ligett, Katrina; Sabato, Sivan
 Award ID(s):
 1742847
 Publication Date:
 NSFPAR ID:
 10309321
 Journal Name:
 International Conference on Algorithmic Learning Theory
 Sponsoring Org:
 National Science Foundation
More Like this


Cussens, James ; Zhang, Kun (Ed.)We investigate the problem of combinatorial multiarmed bandits with stochastic submodular (in expectation) rewards and fullbandit feedback, where no extra information other than the reward of selected action at each time step $t$ is observed. We propose a simple algorithm, ExploreThenCommit Greedy (ETCG) and prove that it achieves a $(11/e)$regret upper bound of $\mathcal{O}(n^\frac{1}{3}k^\frac{4}{3}T^\frac{2}{3}\log(T)^\frac{1}{2})$ for a horizon $T$, number of base elements $n$, and cardinality constraint $k$. We also show in experiments with synthetic and realworld data that the ETCG empirically outperforms other fullbandit methods.

We investigate the problem of unconstrained combinatorial multiarmed bandits with fullbandit feedback and stochastic rewards for submodular maximization. Previous works investigate the same problem assuming a submodular and monotone reward function. In this work, we study a more general problem, i.e., when the reward function is not necessarily monotone, and the submodularity is assumed only in expectation. We propose Randomized Greedy Learning (RGL) algorithm and theoretically prove that it achieves a $\frac{1}{2}$regret upper bound of $\Tilde{\mathcal{O}}(n T^{\frac{2}{3}})$ for horizon $T$ and number of arms $n$. We also show in experiments that RGL empirically outperforms other fullbandit variants in submodular and nonsubmodular settings.

We consider a multiagent multiarmed bandit setting in which n honest agents collaborate over a network to minimize regret but m malicious agents can disrupt learning arbitrarily. Assuming the network is the complete graph, existing algorithms incur O((m + K/n) łog (T) / Δ ) regret in this setting, where K is the number of arms and Δ is the arm gap. For m łl K, this improves over the singleagent baseline regret of O(Kłog(T)/Δ). In this work, we show the situation is murkier beyond the case of a complete graph. In particular, we prove that if the stateoftheart algorithm is used on the undirected line graph, honest agents can suffer (nearly) linear regret until time is doubly exponential in K and n . In light of this negative result, we propose a new algorithm for which the i th agent has regret O(( dmal (i) + K/n) łog(T)/Δ) on any connected and undirected graph, where dmal(i) is the number of i 's neighbors who are malicious. Thus, we generalize existing regret bounds beyond the complete graph (where dmal(i) = m), and show the effect of malicious agents is entirely local (in the sense that only the dmal (i) maliciousmore »

We investigate the problem of unconstrained combinatorial multiarmed bandits with fullbandit feedback and stochastic rewards for submodular maximization. Previous works investigate the same problem assuming a submodular and monotone reward function. In this work, we study a more general problem, i.e., when the reward function is not necessarily monotone, and the submodularity is assumed only in expectation. We propose Randomized Greedy Learning (RGL) algorithm and theoretically prove that it achieves a 1 2 regret upper bound of O˜(nT 2 3 ) for horizon T and number of arms n. We also show in experiments that RGL empirically outperforms other fullbandit variants in submodular and nonsubmodular settings.