Title: An explore-then-commit algorithm for submodular maximization under full-bandit feedback
We investigate the problem of combinatorial multi-armed bandits with stochastic submodular (in expectation) rewards and full-bandit feedback, where no information other than the reward of the selected action is observed at each time step $t$. We propose a simple algorithm, Explore-Then-Commit Greedy (ETCG), and prove that it achieves a $(1-1/e)$-regret upper bound of $\mathcal{O}(n^{1/3} k^{4/3} T^{2/3} \log(T)^{1/2})$ for a horizon $T$, number of base elements $n$, and cardinality constraint $k$. We also show in experiments with synthetic and real-world data that ETCG empirically outperforms other full-bandit methods.
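For readers who want the shape of the procedure, here is a minimal Python sketch of the explore-then-commit greedy pattern the abstract describes. The reward oracle `play` and the per-phase sample budget `m` are illustrative assumptions; the paper's analysis fixes the exact exploration schedule that yields the stated regret bound.

```python
import math

def etcg(base_elements, k, horizon, play):
    """Explore-Then-Commit Greedy (minimal sketch).

    play(S) returns one stochastic reward sample for the action set S;
    under full-bandit feedback this scalar is all that is ever observed.
    """
    n = len(base_elements)
    # Per-(phase, element) exploration budget. The paper derives the exact
    # schedule from n, k, and T; this T^(2/3)-style split is a placeholder.
    m = max(1, int(horizon ** (2 / 3) / (n * k)))
    chosen, t = [], 0
    for _ in range(k):  # one greedy phase per slot of the cardinality budget
        best, best_mean = None, -math.inf
        for a in base_elements:
            if a in chosen:
                continue
            mean = sum(play(chosen + [a]) for _ in range(m)) / m
            t += m
            if mean > best_mean:
                best, best_mean = a, mean
        chosen.append(best)  # keep the empirically best augmentation
    while t < horizon:  # commit: exploit the selected set until time runs out
        play(chosen)
        t += 1
    return chosen
```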
Award ID(s):
2149617
NSF-PAR ID:
10397867
Editor(s):
Cussens, James; Zhang, Kun
Journal Name:
Uncertainty in Artificial Intelligence
Volume:
180
ISSN:
1525-3384
Page Range / eLocation ID:
1541-1551
Sponsoring Org:
National Science Foundation
More Like this
  1. We investigate the problem of unconstrained combinatorial multi-armed bandits with full-bandit feedback and stochastic rewards for submodular maximization. Previous works investigate the same problem assuming a submodular and monotone reward function. In this work, we study a more general problem: the reward function is not necessarily monotone, and submodularity is assumed only in expectation. We propose the Randomized Greedy Learning (RGL) algorithm and theoretically prove that it achieves a $1/2$-regret upper bound of $\tilde{\mathcal{O}}(n T^{2/3})$ for horizon $T$ and number of arms $n$. We also show in experiments that RGL empirically outperforms other full-bandit variants in submodular and non-submodular settings.
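The abstract does not spell the algorithm out, so the following Python sketch is only a guess at its structure, modeled on the classical randomized double-greedy of Buchbinder et al., which attains the same 1/2 factor for offline unconstrained non-monotone submodular maximization. The sample budget `m` and the oracle `play` are assumptions for illustration.

```python
import random

def rgl_sketch(ground_set, horizon, play):
    """Bandit-style randomized double greedy (an illustrative guess at RGL).

    play(S) returns one noisy sample of f(S); only this scalar is observed.
    """
    n = len(ground_set)
    m = max(1, int(horizon ** (2 / 3) / max(n, 1)))  # samples per estimate (heuristic)

    def estimate(S):
        """Empirical mean of f(S) over m plays."""
        return sum(play(S) for _ in range(m)) / m

    X, Y = set(), set(ground_set)
    for i in ground_set:
        a = estimate(X | {i}) - estimate(X)   # estimated gain of adding i to X
        b = estimate(Y - {i}) - estimate(Y)   # estimated gain of removing i from Y
        a, b = max(a, 0.0), max(b, 0.0)
        p = a / (a + b) if a + b > 0 else 0.5
        if random.random() < p:
            X.add(i)
        else:
            Y.discard(i)
    return X  # X == Y here; the learner commits to it for the remaining rounds
```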
  2. Feldman, Vitaly ; Ligett, Katrina ; Sabato, Sivan (Ed.)
    Many real-world problems, like Social Influence Maximization, face the dilemma of choosing the best $K$ out of $N$ options at a given time instant. This setup can be modeled as a combinatorial bandit that chooses $K$ out of $N$ arms at each time step, with the aim of achieving an efficient trade-off between exploration and exploitation. This is the first work for combinatorial bandits where the feedback received can be a non-linear function of the chosen $K$ arms. The direct use of a multi-armed bandit algorithm would require choosing among $\binom{N}{K}$ options, making the state space large. In this paper, we present a novel divide-and-conquer based algorithm, which we call CMAB-SM, that is computationally efficient and whose storage is linear in $N$. Further, the proposed algorithm achieves a regret bound of $\tilde{O}(K^{1/2} N^{1/3} T^{2/3})$ for a time horizon $T$, which is sub-linear in all parameters $T$, $N$, and $K$.
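The state-space claim above is easy to make concrete: treating every size-$K$ subset as its own arm blows up combinatorially, which is exactly what a storage footprint linear in $N$ avoids. A quick standard-library check (the values of N and K are illustrative):

```python
from math import comb

N, K = 100, 10
print(comb(N, K))  # 17310309456440 size-K subsets, i.e. "arms" for a naive MAB
print(N)           # 100: storage that instead scales linearly in N
```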
  3. Abstract

    Given a sequence $\{Z_d\}_{d\in \mathbb{N}}$ of smooth and compact hypersurfaces in ${\mathbb{R}}^{n-1}$, we prove that (up to extracting subsequences) there exists a regular definable hypersurface $\Gamma \subset {\mathbb{R}}\textrm{P}^n$ such that each manifold $Z_d$ is diffeomorphic to a component of the zero set on $\Gamma$ of some polynomial of degree $d$. (This is in sharp contrast with the case when $\Gamma$ is semialgebraic, where for example the homological complexity of the zero set of a polynomial $p$ on $\Gamma$ is bounded by a polynomial in $\deg(p)$.) More precisely, given the above sequence of hypersurfaces, we construct a regular, compact, semianalytic hypersurface $\Gamma \subset {\mathbb{R}}\textrm{P}^{n}$ containing a subset $D$ homeomorphic to a disk, and a family of polynomials $\{p_m\}_{m\in \mathbb{N}}$ of degree $\deg(p_m)=d_m$ such that $(D, Z(p_m)\cap D)\sim ({\mathbb{R}}^{n-1}, Z_{d_m})$, i.e., the zero set of $p_m$ in $D$ is isotopic to $Z_{d_m}$ in ${\mathbb{R}}^{n-1}$. This says that, up to extracting subsequences, the intersection of $\Gamma$ with a hypersurface of degree $d$ can be as complicated as we want. We call these ‘pathological examples’. In particular, we show that for every $0 \leq k \leq n-2$ and every sequence of natural numbers $a=\{a_d\}_{d\in \mathbb{N}}$ there is a regular, compact semianalytic hypersurface $\Gamma \subset {\mathbb{R}}\textrm{P}^n$, a subsequence $\{a_{d_m}\}_{m\in \mathbb{N}}$, and homogeneous polynomials $\{p_{m}\}_{m\in \mathbb{N}}$ of degree $\deg(p_m)=d_m$ such that $$b_k(\Gamma\cap Z(p_m))\geq a_{d_m}. \tag{0.1}$$ (Here $b_k$ denotes the $k$th Betti number.) This generalizes a result of Gwoździewicz et al. [13]. On the other hand, for a given definable $\Gamma$ we show that the Fubini–Study measure, in the Gaussian probability space of polynomials of degree $d$, of the set $\Sigma_{d_m,a,\Gamma}$ of polynomials satisfying (0.1) is positive, but there exists a constant $c_\Gamma$ such that $$0<{\mathbb{P}}(\Sigma_{d_m, a, \Gamma})\leq \frac{c_{\Gamma} d_m^{\frac{n-1}{2}}}{a_{d_m}}.$$ This shows that the set of ‘pathological examples’ has ‘small’ measure (the faster $a$ grows, the smaller the measure, and pathologies are therefore rare). In fact we show that, given $\Gamma$, for most polynomials a Bézout-type bound holds for the intersection $\Gamma \cap Z(p)$: for every $0\leq k\leq n-2$ and $t>0$, $$\mathbb{P}\left(\{b_k(\Gamma\cap Z(p))\geq t d^{n-1}\}\right)\leq \frac{c_\Gamma}{t d^{\frac{n-1}{2}}}.$$
  4. In this paper, we study Federated Bandit, a decentralized multi-armed bandit problem with a set of $N$ agents who can only communicate their local data with neighbors described by a connected graph $G$. Each agent makes a sequence of decisions, selecting an arm from $M$ candidates, yet they only have access to local and potentially biased feedback/evaluations of the true reward for each action taken. Learning only locally would lead agents to sub-optimal actions, while converging to a no-regret strategy requires a collection of distributed data. Motivated by the proposal of federated learning, we aim for a solution in which agents never share their local observations with a central entity and are allowed to share only a private copy of their own information with their neighbors. We first propose a decentralized bandit algorithm, $\texttt{Gossip\_UCB}$, which couples variants of both the classical gossiping algorithm and the celebrated Upper Confidence Bound (UCB) bandit algorithm. We show that $\texttt{Gossip\_UCB}$ successfully adapts local bandit learning into a global gossiping process for sharing information among connected agents, and achieves guaranteed regret of order $O(\max\{\texttt{poly}(N,M)\log T,\ \texttt{poly}(N,M)\log_{\lambda_2^{-1}} N\})$ for all $N$ agents, where $\lambda_2\in(0,1)$ is the second largest eigenvalue of the expected gossip matrix, which is a function of $G$. We then propose $\texttt{Fed\_UCB}$, a differentially private version of $\texttt{Gossip\_UCB}$ in which the agents preserve $\varepsilon$-differential privacy of their local data while achieving $O(\max\{\frac{\texttt{poly}(N,M)}{\varepsilon}\log^{2.5} T,\ \texttt{poly}(N,M)(\log_{\lambda_2^{-1}} N + \log T)\})$ regret.
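As a rough illustration of the coupling described above, here is a hypothetical Python sketch in which each agent runs UCB on its own running estimates and then averages those estimates with its graph neighbors every round. The uniform gossip weights and the `local_reward` oracle are assumptions for illustration; the paper's exact gossip matrix, regret analysis, and Fed_UCB privacy mechanism are not reproduced here.

```python
import math

def gossip_ucb_sketch(agents, arms, horizon, neighbors, local_reward):
    """Gossip + UCB coupling (illustrative only, not the paper's exact updates).

    neighbors[i] lists agent i's neighbors in the graph G; local_reward(i, a)
    draws agent i's local (possibly biased) reward sample for arm a.
    """
    mean = [[0.0] * arms for _ in range(agents)]   # per-agent arm estimates
    count = [[0] * arms for _ in range(agents)]    # per-agent play counts
    for t in range(1, horizon + 1):
        for i in range(agents):
            # Local UCB choice on the agent's current (gossiped) estimates.
            ucb = [
                mean[i][a] + math.sqrt(2 * math.log(t) / count[i][a])
                if count[i][a] > 0 else math.inf
                for a in range(arms)
            ]
            a = ucb.index(max(ucb))
            r = local_reward(i, a)
            count[i][a] += 1
            mean[i][a] += (r - mean[i][a]) / count[i][a]
        # Gossip step: average estimates with graph neighbors. Uniform weights
        # stand in for the paper's expected gossip matrix.
        new_mean = [row[:] for row in mean]
        for i in range(agents):
            group = [i] + list(neighbors[i])
            for a in range(arms):
                new_mean[i][a] = sum(mean[j][a] for j in group) / len(group)
        mean = new_mean
    return mean
```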