Free, publicly accessible full text available January 1, 2024

We investigate the problem of unconstrained combinatorial multi-armed bandits with full-bandit feedback and stochastic rewards for submodular maximization. Previous works investigate the same problem assuming a submodular and monotone reward function. In this work, we study a more general problem, i.e., when the reward function is not necessarily monotone, and the submodularity is assumed only in expectation. We propose the Randomized Greedy Learning (RGL) algorithm and theoretically prove that it achieves a $\frac{1}{2}$-regret upper bound of $\tilde{\mathcal{O}}(n T^{\frac{2}{3}})$ for horizon $T$ and number of arms $n$. We also show in experiments that RGL empirically outperforms other full-bandit variants in submodular and non-submodular settings.
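
The abstract does not spell out the steps of RGL, so the following is only a minimal sketch of one way a randomized greedy rule can be driven by full-bandit feedback. It assumes the classical randomized double-greedy scheme for unconstrained non-monotone submodular maximization, with each marginal gain replaced by an average of `m` noisy plays. The function name, the parameter `m`, the `play` oracle, and the toy reward are all hypothetical and not taken from the paper.

```python
import random


def randomized_double_greedy(n, m, play, rng):
    """Randomized double greedy on estimated marginal gains (illustrative).

    n: number of arms; m: plays per set (hypothetical tuning parameter);
    play: callable mapping a frozenset of arms to one noisy reward;
    rng: random.Random instance used for the randomized decisions.
    """
    lower = set()            # grows from the empty set
    upper = set(range(n))    # shrinks from the full set

    def mean(subset):
        s = frozenset(subset)
        return sum(play(s) for _ in range(m)) / m

    for i in range(n):
        # Estimated gain of adding i to `lower` and of removing i from `upper`.
        gain_add = mean(lower | {i}) - mean(lower)
        gain_remove = mean(upper - {i}) - mean(upper)
        a = max(gain_add, 0.0)
        b = max(gain_remove, 0.0)
        # Keep i with probability a / (a + b); keep it if both gains are zero.
        p_add = 1.0 if a + b == 0.0 else a / (a + b)
        if rng.random() < p_add:
            lower.add(i)
        else:
            upper.discard(i)
    return lower  # equals `upper` once every arm has been decided


# Toy reward (an assumption): additive with mixed-sign weights, hence
# submodular but not monotone, observed with small Gaussian noise.
weights = [1.0, -1.0, 2.0, -0.5]
noise_rng = random.Random(0)


def toy_play(subset):
    return sum(weights[i] for i in subset) + noise_rng.gauss(0.0, 0.05)


selected = randomized_double_greedy(4, 30, toy_play, random.Random(1))
print(sorted(selected))
```

In a bandit setting the plays made inside `mean` are the exploration rounds, and the returned set would be played for the remainder of the horizon; how `m` should scale with $T$ and $n$ is left open here.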

Cussens, James ; Zhang, Kun (Ed.)
We investigate the problem of combinatorial multi-armed bandits with stochastic submodular (in expectation) rewards and full-bandit feedback, where no extra information other than the reward of the selected action at each time step $t$ is observed. We propose a simple algorithm, Explore-Then-Commit Greedy (ETCG), and prove that it achieves a $(1-1/e)$-regret upper bound of $\mathcal{O}(n^\frac{1}{3}k^\frac{4}{3}T^\frac{2}{3}\log(T)^\frac{1}{2})$ for a horizon $T$, number of base elements $n$, and cardinality constraint $k$. We also show in experiments with synthetic and real-world data that ETCG empirically outperforms other full-bandit methods.
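
To make the explore-then-commit idea concrete, here is a minimal sketch under stated assumptions rather than the authors' implementation. In each of $k$ greedy phases it plays every candidate set (the current set plus one new element) `m` times, keeps the element with the best empirical mean, and finally returns the set to commit to. The number of plays `m`, the `play` oracle, and the noisy coverage reward are hypothetical choices made for illustration.

```python
import random


def etc_greedy(n, k, m, play):
    """Explore-then-commit greedy sketch.

    n: number of base elements; k: cardinality constraint;
    m: plays per candidate set (hypothetical tuning parameter);
    play: callable mapping a frozenset to one noisy reward.
    Returns the set to commit to and the number of exploration plays.
    """
    chosen = set()
    plays = 0
    for _ in range(k):
        best_elem, best_mean = None, float("-inf")
        for a in range(n):
            if a in chosen:
                continue
            candidate = frozenset(chosen | {a})
            # Only the reward of the whole played set is observed.
            mean = sum(play(candidate) for _ in range(m)) / m
            plays += m
            if mean > best_mean:
                best_elem, best_mean = a, mean
        chosen.add(best_elem)
    return chosen, plays


# Toy reward (an assumption): number of items covered, plus Gaussian noise.
covers = {0: {1, 2, 3, 4}, 1: {4, 5, 6}, 2: {7}, 3: {1, 2}}
rng = random.Random(0)


def toy_play(subset):
    covered = set().union(*(covers[a] for a in subset))
    return len(covered) + rng.gauss(0.0, 0.1)


chosen, plays = etc_greedy(n=4, k=2, m=50, play=toy_play)
print(sorted(chosen), plays)
```

After exploration, the committed set is played for the remaining $T$ minus `plays` rounds, so `m` trades estimation accuracy against the rounds left for exploitation.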

We consider the bandit problem of selecting K out of N arms at each time step. The joint reward can be a non-linear function of the rewards of the selected individual arms. The direct use of a multi-armed bandit algorithm requires choosing among all possible combinations, making the action space large. To simplify the problem, existing works on combinatorial bandits typically assume feedback as a linear function of individual rewards. In this paper, we prove the lower bound for top-K subset selection with bandit feedback with possibly correlated rewards. We present a novel algorithm for the combinatorial setting without using individual arm feedback or requiring linearity of the reward function. Additionally, our algorithm works on correlated rewards of individual arms. Our algorithm, aDaptive Accept RejecT (DART), sequentially finds good arms and eliminates bad arms based on confidence bounds. DART is computationally efficient and uses storage linear in N. Further, DART achieves a regret bound of $\tilde{\mathcal{O}}(K\sqrt{KNT})$ for a time horizon T, which matches the lower bound in bandit feedback up to a factor of $\sqrt{\log 2NT}$. When applied to the problem of cross-selling optimization and maximizing the mean of individual rewards, the performance of the proposed algorithm surpasses that of …

Feldman, Vitaly ; Ligett, Katrina ; Sabato, Sivan (Ed.)
Many real-world problems like Social Influence Maximization face the dilemma of choosing the best $K$ out of $N$ options at a given time instant. This setup can be modeled as a combinatorial bandit which chooses $K$ out of $N$ arms at each time, with an aim to achieve an efficient trade-off between exploration and exploitation. This is the first work for combinatorial bandits where the feedback received can be a non-linear function of the chosen $K$ arms. The direct use of a multi-armed bandit algorithm requires choosing among $N$-choose-$K$ options, making the state space large. In this paper, we present a novel algorithm that is computationally efficient and uses storage linear in $N$. The proposed algorithm is a divide-and-conquer strategy that we call CMAB-SM. Further, the proposed algorithm achieves a regret bound of $\tilde O(K^{\frac{1}{2}}N^{\frac{1}{3}}T^{\frac{2}{3}})$ for a time horizon $T$, which is sublinear in all parameters $T$, $N$, and $K$.
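
As a quick illustration of why treating every $K$-subset as its own arm is impractical, the short sketch below counts the $N$-choose-$K$ options for a few sizes. The helper name and the example values of $N$ and $K$ are chosen here for illustration and do not come from the paper.

```python
import math


def num_subsets(n_arms, k):
    """Number of distinct K-subsets a naive bandit would treat as arms."""
    return math.comb(n_arms, k)


for n_arms, k in [(10, 3), (50, 5), (100, 10)]:
    print(f"N={n_arms}, K={k}: {num_subsets(n_arms, k)} subsets")
```

Even at $N = 50$ and $K = 5$ there are already more than two million subsets, which is the motivation for methods whose storage grows only linearly in $N$.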

The recent framework for tiered spectrum sharing in the 3.5 GHz band establishes rules in which multiple firms called Environment Sensing Capability operators (ESCs) may measure spectrum occupancy and sell these measurements to other firms to help facilitate spectrum access. Motivated by this, we consider a scenario in which two spectrum access firms (SAs) seek to access a shared band of spectrum and must in turn purchase spectrum measurements from one of two ESCs. Given the measurements they purchase, the SA firms then compete on price to serve customers in a shared band of spectrum. We study how differences in the quality and price of the spectrum measurements impact the resulting market equilibrium between the SAs and find that having different qualities of measurements available to different SAs can lead to better economic welfare.