Thompson sampling has become a ubiquitous approach to online decision problems with bandit feedback. The key algorithmic task for Thompson sampling is drawing a sample from the posterior of the optimal action. We propose an alternative arm selection rule, which we dub TS-UCB, that requires negligible additional computational effort but provides significant performance improvements relative to Thompson sampling. At each step, TS-UCB computes a score for each arm using two ingredients: posterior sample(s) and upper confidence bounds. TS-UCB can be used in any setting where these two quantities are available, and it is flexible in the number of posterior samples it takes as input. TS-UCB achieves materially lower regret on a comprehensive suite of synthetic and real-world datasets, including a personalized article recommendation dataset from Yahoo! and a suite of benchmark datasets from the deep bandit suite proposed in Riquelme et al. (2018). Finally, from a theoretical perspective, we establish optimal regret guarantees for TS-UCB for both the K-armed and linear bandit models.
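The abstract names TS-UCB's two ingredients but not the exact scoring rule. The sketch below, for a Bernoulli bandit, is a minimal illustration of how posterior draws and UCB1-style bounds could be combined into a per-arm score; the particular ratio used here and the function name `ts_ucb_select` are assumptions for illustration, not the paper's definition.

```python
import numpy as np

def ts_ucb_select(successes, failures, t, n_samples=1, rng=None):
    """Illustrative TS-UCB-style arm selection for a Bernoulli bandit.

    Ingredients (per the abstract): posterior sample(s) and per-arm UCBs.
    The specific score below -- distance to the sampled optimum divided by
    the arm's confidence width -- is an assumed combination for illustration.
    """
    rng = np.random.default_rng() if rng is None else rng
    pulls = np.maximum(successes + failures, 1)
    means = successes / pulls

    # Ingredient 1: posterior draw(s) from each arm's Beta posterior;
    # the sampled value of the best arm serves as a target.
    draws = rng.beta(successes + 1, failures + 1, size=(n_samples, len(means)))
    target = draws.max(axis=1).mean()

    # Ingredient 2: standard UCB1-style upper confidence bounds.
    ucb = means + np.sqrt(2.0 * np.log(max(t, 2)) / pulls)

    # Hypothetical score: arms whose confidence width covers the gap to the
    # sampled optimum most cheaply get the smallest score.
    score = (target - means) / np.maximum(ucb - means, 1e-12)
    return int(np.argmin(score))
```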
Thompson Sampling for Robust Transfer in Multi-Task Bandits
We study the problem of online multi-task learning where the tasks are performed within similar but not necessarily identical multi-armed bandit environments. In particular, we study how a learner can improve its overall performance across multiple related tasks through robust transfer of knowledge. While an upper confidence bound (UCB)-based algorithm has recently been shown to achieve nearly-optimal performance guarantees in a setting where all tasks are solved concurrently, it remains unclear whether Thompson sampling (TS) algorithms, which have superior empirical performance in general, share similar theoretical properties. In this work, we present a TS-type algorithm for a more general online multi-task learning protocol, which extends the concurrent setting. We provide its frequentist analysis and prove that it is also nearly-optimal using a novel concentration inequality for multi-task data aggregation at random stopping times. Finally, we evaluate the algorithm on synthetic data and show that the TS-type algorithm enjoys superior empirical performance in comparison with the UCB-based algorithm and a baseline algorithm that performs TS for each individual task without transfer.
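As a rough illustration of the multi-task protocol, the sketch below performs one Thompson-sampling step for a given task over Bernoulli arms, optionally pooling down-weighted observations from the other tasks into that task's posterior. The pooling rule and the `pool_weight` parameter are illustrative assumptions, not the paper's actual transfer mechanism, and the sketch does not capture its robustness guarantees.

```python
import numpy as np

def multi_task_ts_step(counts, task, pool_weight=0.5, rng=None):
    """One Thompson-sampling arm choice for `task`, borrowing data from other tasks.

    counts: array of shape (n_tasks, n_arms, 2) holding per-task (successes, failures).
    pool_weight: illustrative down-weighting of other tasks' observations;
                 the paper's aggregation rule may differ.
    """
    rng = np.random.default_rng() if rng is None else rng
    own = counts[task]                      # this task's (successes, failures)
    others = counts.sum(axis=0) - own       # pooled data from the other tasks

    # Beta posterior per arm: own data at full weight, foreign data discounted.
    alpha = 1.0 + own[:, 0] + pool_weight * others[:, 0]
    beta = 1.0 + own[:, 1] + pool_weight * others[:, 1]

    sample = rng.beta(alpha, beta)          # one posterior draw per arm
    return int(np.argmax(sample))
```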
- Award ID(s): 1915734
- PAR ID: 10441033
- Date Published:
- Journal Name: Proceedings of the 39th International Conference on Machine Learning
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
We investigate robust data aggregation in a multi-agent online learning setting. In reality, multiple online learning agents are often deployed to perform similar tasks and receive similar feedback. We study how agents can improve their collective performance by sharing information among each other. In this paper, we formulate the ε-multi-player multi-armed bandit problem, in which a set of M players that have similar reward distributions for each arm play concurrently. We develop an upper confidence bound-based algorithm that adaptively aggregates rewards collected by different players. To the best of our knowledge, we are the first to develop such a scheme in a multi-player bandit learning setting. We show that under the assumption that pairwise distances between the means of the player-dependent distributions for each arm are small, we improve the (collective) regret bound by nearly a factor of M, in comparison with a baseline algorithm in which the players learn individually using the UCB-1 algorithm (Auer et al., 2002). Our algorithm also exhibits a fallback guarantee, namely, if our task similarity assumption fails to hold, our algorithm still has a performance guarantee that cannot be worse than the baseline by a constant factor. Empirically, we validate our algorithm on synthetic data.
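The abstract does not give the aggregation rule itself. One simple way to obtain both a pooling benefit and a fallback to individual UCB1 is to take, per arm, the tighter of an individual index and a pooled index inflated by the assumed similarity level ε; the sketch below illustrates that idea only, and the paper's actual scheme may differ.

```python
import numpy as np

def aggregated_ucb_select(own_sum, own_n, all_sum, all_n, t, eps):
    """Illustrative epsilon-aware UCB choice for one player in an M-player bandit.

    own_sum/own_n: this player's per-arm reward sums and pull counts.
    all_sum/all_n: per-arm sums and counts pooled over all M players.
    eps: assumed bound on pairwise distances between players' arm means.
    """
    own_n = np.maximum(own_n, 1)
    all_n = np.maximum(all_n, 1)
    log_t = np.log(max(t, 2))

    # Individual UCB1 index and a pooled index whose bias is covered by eps.
    ucb_own = own_sum / own_n + np.sqrt(2.0 * log_t / own_n)
    ucb_pool = all_sum / all_n + np.sqrt(2.0 * log_t / all_n) + eps

    # Use whichever index is tighter per arm, then play optimistically.
    return int(np.argmax(np.minimum(ucb_own, ucb_pool)))
```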
Li, Y; Mandt, S; Agrawal, S; Khan, E (Ed.)
We study the problem of representational transfer in offline Reinforcement Learning (RL), where a learner has access to episodic data from a number of source tasks collected a priori, and aims to learn a shared representation to be used in finding a good policy for a target task. Unlike in online RL where the agent interacts with the environment while learning a policy, in the offline setting there cannot be such interactions in either the source tasks or the target task; thus multi-task offline RL can suffer from incomplete coverage. We propose an algorithm to compute pointwise uncertainty measures for the learnt representation in low-rank MDPs, and establish a data-dependent upper bound for the suboptimality of the learnt policy for the target task. Our algorithm leverages the collective exploration done by source tasks to mitigate poor coverage at some points by a few tasks, thus overcoming the limitation of needing uniformly good coverage for a meaningful transfer by existing offline algorithms. We complement our theoretical results with empirical evaluation on a rich-observation MDP which requires many samples for complete coverage. Our findings illustrate the benefits of penalizing and quantifying the uncertainty in the learnt representation.
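The "penalize the uncertainty in the learnt representation" idea can be pictured as a pessimistic value estimate: a point estimate minus a pointwise uncertainty bonus computed from the learnt features. The sketch below uses a standard elliptical bonus as a stand-in for that pointwise measure; it is illustrative only and not the paper's construction for low-rank MDPs.

```python
import numpy as np

def pessimistic_scores(phi, q_hat, cov_inv, beta=1.0):
    """Pessimistic action scores from a learnt representation.

    phi:     (n_actions, d) learnt features for each candidate action.
    q_hat:   (n_actions,) point estimates of the action values.
    cov_inv: (d, d) inverse feature covariance from the offline data.
    beta:    penalty scale (hypothetical choice).
    """
    # Pointwise uncertainty as an elliptical bonus sqrt(phi^T Sigma^{-1} phi),
    # used only to illustrate penalizing uncertainty in the representation.
    bonus = np.sqrt(np.einsum("ad,dk,ak->a", phi, cov_inv, phi))
    return q_hat - beta * bonus
```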
This paper studies the transfer learning problem for convolutional neural network models. A phase transition phenomenon has been empirically validated: convolutional layers shift from general to specific with respect to the target task as depth increases. The paper suggests measuring the generality of convolutional layers through an easy-to-compute and tuning-free statistic named projection correlation. Non-asymptotic upper bounds for the estimation error of the proposed generality measure are provided. Based on this generality measure, the paper proposes a forward-adding-layer-selection algorithm to select general layers. The algorithm aims to find a cut-off in the pre-trained model according to where the phase transition from general to specific happens. Then, we propose to transfer only the general layers, as specific layers can cause overfitting issues and hence hurt prediction performance. The proposed algorithm is computationally efficient and can consistently estimate the true beginning of the phase transition under mild conditions. Its superior empirical performance is justified by various numerical experiments.
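Since the projection-correlation statistic is not defined in the abstract, the forward layer-selection step can only be sketched against a placeholder generality function. The sketch below captures the stated idea of adding layers until the generality statistic drops past a cut-off and transferring only those layers; the `generality` callable and `threshold` are hypothetical.

```python
def select_general_layers(layers, generality, threshold):
    """Forward layer-selection sketch.

    layers:     ordered list of pre-trained convolutional layers.
    generality: callable layer -> float; placeholder for the paper's
                projection-correlation statistic (not defined in the abstract).
    threshold:  assumed cut-off value for "general enough".
    """
    cut = 0
    for layer in layers:
        if generality(layer) < threshold:
            break           # phase transition reached: remaining layers are task-specific
        cut += 1
    return layers[:cut]     # transfer only the general layers; retrain the rest
```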
A major challenge in real-world reinforcement learning (RL) is the sparsity of reward feedback. Often, what is available is an intuitive but sparse reward function that only indicates whether the task is completed partially or fully. However, the lack of carefully designed, fine-grained feedback implies that most existing RL algorithms fail to learn an acceptable policy in a reasonable time frame. This is because of the large number of exploration actions that the policy has to perform before it gets any useful feedback that it can learn from. In this work, we address this challenging problem by developing an algorithm that exploits the offline demonstration data generated by a sub-optimal behavior policy for faster and more efficient online RL in such sparse reward settings. The proposed algorithm, which we call the Learning Online with Guidance Offline (LOGO) algorithm, merges a policy improvement step with an additional policy guidance step that uses the offline demonstration data. The key idea is that by obtaining guidance from, rather than imitating, the offline data, LOGO orients its policy toward the sub-optimal policy while still being able to learn beyond it and approach optimality. We provide a theoretical analysis of our algorithm and a lower bound on the performance improvement in each learning episode. We also extend our algorithm to the even more challenging incomplete observation setting, where the demonstration data contains only a censored version of the true state observation. We demonstrate the superior performance of our algorithm over state-of-the-art approaches on a number of benchmark environments with sparse rewards and censored states. Further, we demonstrate the value of our approach by implementing LOGO on a mobile robot for trajectory tracking and obstacle avoidance, where it shows excellent performance.
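The guidance-not-imitation idea can be illustrated on tabular policies: after an ordinary policy improvement step, nudge the improved policy toward a behavior policy estimated from the demonstrations without collapsing onto it. The geometric mixing and the `guidance_weight` parameter below are assumptions used purely for illustration; LOGO's actual guidance step is not specified in the abstract.

```python
import numpy as np

def logo_style_update(improved, demo_policy, guidance_weight=0.3):
    """Illustrative two-step update in the spirit of "guidance, not imitation".

    improved:    (n_states, n_actions) action probabilities after an ordinary
                 policy improvement step.
    demo_policy: (n_states, n_actions) behavior policy estimated (e.g. by
                 behavior cloning) from the sub-optimal offline demonstrations.
    """
    # Geometric mixing pulls the improved policy toward the demonstrations
    # without copying them outright (assumed stand-in for the guidance step).
    mixed = improved ** (1.0 - guidance_weight) * demo_policy ** guidance_weight
    mixed /= mixed.sum(axis=1, keepdims=True)   # renormalize per state
    return mixed
```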