Title: Active Search using Meta-Bandits
There are many applications where positive instances are rare but important to identify. For example, in NLP, positive sentences for a given relation are rare in a large corpus. Positive data are more informative for learning in these applications, but until some data have been labeled, it is unknown where the rare positives can be found. Since random sampling can waste significant labeling effort, previous "active search" methods use a single bandit model to learn about the data distribution (exploration) while sampling from the regions that potentially contain more positives (exploitation). Many bandit models are possible, and a sub-optimal model reduces labeling efficiency, but the optimal model is unknown before any data are labeled. We propose Meta-AS (Meta Active Search), which uses a meta-bandit to evaluate a set of base bandits and aims to label positive examples as efficiently as the optimal base bandit in hindsight. The meta-bandit estimates the mean and variance of the performance of the base bandits, and selects a base bandit to propose what data to label next, for exploration or exploitation. The feedback in the labels updates both the base bandits and the meta-bandit for the next round. Meta-AS can accommodate a diverse set of base bandits to explore assumptions about the dataset, without over-committing to a single model before labeling starts. Experiments on five datasets for relation extraction demonstrate that Meta-AS labels positives more efficiently than the base bandits and other bandit-selection strategies.
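The selection loop described in the abstract can be sketched as a UCB-style meta-bandit over base bandits. This is a hypothetical illustration only: the paper's meta-bandit also models the variance of base-bandit performance, and the base bandits here are reduced to fixed hit rates for the sake of a self-contained example.

```python
import math
import random

class MetaBandit:
    """Sketch of a meta-bandit that picks among base bandits by a UCB
    score on their observed labeling success (illustrative; not the
    paper's exact estimator)."""

    def __init__(self, n_base):
        self.counts = [0] * n_base     # times each base bandit was chosen
        self.rewards = [0.0] * n_base  # positives found via each base bandit
        self.t = 0

    def select(self):
        self.t += 1
        for i, c in enumerate(self.counts):
            if c == 0:                 # play every base bandit once first
                return i
        # UCB1 score: empirical success rate + exploration bonus
        return max(
            range(len(self.counts)),
            key=lambda i: self.rewards[i] / self.counts[i]
                          + math.sqrt(2 * math.log(self.t) / self.counts[i]),
        )

    def update(self, i, label):
        self.counts[i] += 1
        self.rewards[i] += label       # label is 1 for a positive, 0 otherwise


random.seed(0)
# Two toy "base bandits": one surfaces positives 60% of the time, one 10%.
hit_rates = [0.6, 0.1]
meta = MetaBandit(n_base=2)
for _ in range(500):
    i = meta.select()
    label = 1 if random.random() < hit_rates[i] else 0
    meta.update(i, label)
```

Over the 500 simulated labeling rounds, the meta-bandit concentrates its choices on the base bandit that surfaces positives more often, while the exploration bonus keeps it from committing before both base bandits have been evaluated.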
Award ID(s):
1931042 1757787
PAR ID:
10184358
Journal Name:
The 29th ACM International Conference on Information and Knowledge Management (CIKM'2020)
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. The Prophet Inequality and Pandora's Box problems are fundamental stochastic problems with applications in Mechanism Design, Online Algorithms, Stochastic Optimization, Optimal Stopping, and Operations Research. A usual assumption in these works is that the probability distributions of the n underlying random variables are given as input to the algorithm. Since in practice these distributions need to be learned under limited feedback, we initiate the study of such stochastic problems in the Multi-Armed Bandits model. In the Multi-Armed Bandits model we interact with n unknown distributions over T rounds: in round t we play a policy x(t) and only receive the value of x(t) as feedback. The goal is to minimize the regret, which is the difference over T rounds between the total value of the optimal algorithm that knows the distributions and the total value of our algorithm, which learns the distributions from the limited feedback. Our main results give near-optimal Õ(poly(n)·√T) total-regret algorithms for both Prophet Inequality and Pandora's Box. Our proofs proceed by maintaining confidence intervals on the unknown indices of the optimal policy. The exploration-exploitation tradeoff prevents us from directly refining these confidence intervals, so the main technique is to design a regret upper bound function that is learnable while playing low-regret Bandit policies.
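The regret notion used throughout these summaries can be made concrete with a toy Bernoulli bandit. The explore-then-commit learner below is a generic illustration of learning from limited feedback, not any paper's algorithm; the arm means and horizon are invented for the example.

```python
import random

random.seed(1)
means = [0.3, 0.5, 0.7]          # unknown Bernoulli arm means
T = 1000
best = max(means)                # per-round value of the optimal arm

# Naive explore-then-commit learner, just to make the regret definition concrete
explore = 50                     # pulls per arm during the exploration phase
total, counts, sums = 0.0, [0] * 3, [0.0] * 3
for t in range(T):
    if t < explore * len(means):
        a = t % len(means)       # round-robin exploration
    else:                        # commit to the empirically best arm
        a = max(range(3), key=lambda i: sums[i] / counts[i])
    r = 1.0 if random.random() < means[a] else 0.0
    counts[a] += 1
    sums[a] += r
    total += r

# Regret: value of the distribution-aware optimum minus what we earned
regret = best * T - total
```

The key point the abstract makes is that `regret` is measured against an algorithm that knows the distributions up front, while the learner only sees the realized value of each pull.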
  2. In this paper, we propose and study opportunistic contextual bandits - a special case of contextual bandits where the exploration cost varies under different environmental conditions, such as network load or return variation in recommendations. When the exploration cost is low, so is the actual regret of pulling a sub-optimal arm (e.g., trying a suboptimal recommendation). Therefore, intuitively, we could explore more when the exploration cost is relatively low and exploit more when the exploration cost is relatively high. Inspired by this intuition, for opportunistic contextual bandits with linear payoffs, we propose an Adaptive Upper-Confidence-Bound algorithm (AdaLinUCB) to adaptively balance the exploration-exploitation trade-off for opportunistic learning. We prove that AdaLinUCB achieves an O((log T)^2) problem-dependent regret upper bound, which has a smaller coefficient than that of the traditional LinUCB algorithm. Moreover, based on both synthetic and real-world datasets, we show that AdaLinUCB significantly outperforms other contextual bandit algorithms under large exploration cost fluctuations.
  3. In this paper, we propose and study opportunistic bandits - a new variant of bandits where the regret of pulling a suboptimal arm varies under different environmental conditions, such as network load or produce price. When the load/price is low, so is the cost/regret of pulling a suboptimal arm (e.g., trying a suboptimal network configuration). Therefore, intuitively, we could explore more when the load/price is low and exploit more when the load/price is high. Inspired by this intuition, we propose an Adaptive Upper-Confidence-Bound (AdaUCB) algorithm to adaptively balance the exploration-exploitation tradeoff for opportunistic bandits. We prove that AdaUCB achieves O(log T) regret with a smaller coefficient than the traditional UCB algorithm. Furthermore, AdaUCB achieves O(1) regret with respect to T if the exploration cost is zero when the load level is below a certain threshold. Finally, based on both synthetic data and real-world traces, experimental results show that AdaUCB significantly outperforms other bandit algorithms, such as UCB and TS (Thompson Sampling), under large load/price fluctuations.
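The explore-when-cheap / exploit-when-expensive intuition shared by these two abstracts can be sketched as a UCB rule whose exploration bonus is gated by the observed load. This is an illustrative rule in the spirit of AdaUCB, not either paper's exact algorithm; the arm means, threshold, and load model are invented for the example.

```python
import math
import random

def ada_select(counts, sums, t, load, threshold=0.5):
    """Pick an arm: full UCB exploration bonus when the load is low
    (exploration is cheap), plain greedy choice when the load is high.
    Illustrative only; not the published AdaUCB rule."""
    for i, c in enumerate(counts):
        if c == 0:                       # play every arm once first
            return i
    bonus = 1.0 if load < threshold else 0.0
    return max(
        range(len(counts)),
        key=lambda i: sums[i] / counts[i]
                      + bonus * math.sqrt(2 * math.log(t) / counts[i]),
    )

random.seed(2)
means = [0.4, 0.6]                       # toy Bernoulli arm means
counts, sums = [0, 0], [0.0, 0.0]
for t in range(1, 301):
    load = random.random()               # toy fluctuating load/price
    a = ada_select(counts, sums, t, load)
    r = 1.0 if random.random() < means[a] else 0.0
    counts[a] += 1
    sums[a] += r
```

The gating means suboptimal pulls are concentrated in low-load rounds, where (by the opportunistic-bandit assumption) they cost the least.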
  4.
    Collaborative bandit learning, i.e., bandit algorithms that utilize collaborative filtering techniques to improve sample efficiency in online interactive recommendation, has attracted much research attention as it enjoys the best of both worlds. However, all existing collaborative bandit learning solutions impose a stationary assumption about the environment, i.e., both user preferences and the dependency among users are assumed static over time. Unfortunately, this assumption hardly holds in practice due to users' ever-changing interests and dependency relations, which inevitably costs a recommender system sub-optimal performance in practice. In this work, we develop a collaborative dynamic bandit solution to handle a changing environment for recommendation. We explicitly model the underlying changes in both user preferences and their dependency relation as a stochastic process. Individual user's preference is modeled by a mixture of globally shared contextual bandit models with a Dirichlet process prior. Collaboration among users is thus achieved via Bayesian inference over the global bandit models. To balance exploitation and exploration during the interactions, Thompson sampling is used for both model selection and arm selection. Our solution is proved to maintain a standard $$\tilde O(\sqrt{T})$$ Bayesian regret in this challenging environment. Extensive empirical evaluations on both synthetic and real-world datasets further confirmed the necessity of modeling a changing environment and our algorithm's practical advantages against several state-of-the-art online learning solutions. 
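The Thompson sampling step this abstract relies on for both model and arm selection can be illustrated for plain Bernoulli arms with Beta posteriors. This generic sketch omits the paper's collaborative dynamic model (Dirichlet process prior, shared global bandits) entirely; the arm means and horizon are invented for the example.

```python
import random

def thompson_step(successes, failures):
    """One round of Bernoulli Thompson sampling: draw a mean from each
    arm's Beta(successes+1, failures+1) posterior and play the arm with
    the largest draw. Randomness in the draws provides exploration."""
    samples = [random.betavariate(s + 1, f + 1)
               for s, f in zip(successes, failures)]
    return max(range(len(samples)), key=lambda i: samples[i])

random.seed(3)
means = [0.2, 0.8]                 # toy Bernoulli arm means
succ, fail = [0, 0], [0, 0]
for _ in range(200):
    a = thompson_step(succ, fail)
    if random.random() < means[a]:
        succ[a] += 1               # observed reward updates the posterior
    else:
        fail[a] += 1
```

As the posteriors concentrate, draws for the better arm dominate, so exploration fades naturally without an explicit bonus term - the property that makes Thompson sampling a convenient choice for both levels of selection in the paper.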
  5. We study model selection in linear bandits, where the learner must adapt to the dimension (denoted by $$d_\star$$) of the smallest hypothesis class containing the true linear model while balancing exploration and exploitation. Previous papers provide various guarantees for this model selection problem but have limitations: either the analysis requires favorable conditions that allow inexpensive statistical testing to locate the right hypothesis class, or it is based on the idea of "corralling" multiple base algorithms, which often performs relatively poorly in practice. These works also mainly focus on upper bounds. In this paper, we establish the first lower bound for the model selection problem. Our lower bound implies that, even with a fixed action set, adaptation to the unknown dimension $$d_\star$$ comes at a cost: no algorithm can achieve the regret bound $$\tilde O(\sqrt{d_\star T})$$ simultaneously for all values of $$d_\star$$. We propose Pareto optimal algorithms that match the lower bound. Empirical evaluations show that our algorithm enjoys superior performance compared to existing ones.