In the Hidden-Parameter MDP (HiP-MDP) framework, a family of reinforcement learning tasks is generated by varying hidden parameters that specify the dynamics and reward function of each individual task. The HiP-MDP is a natural model for families of tasks in which meta- and lifelong-reinforcement-learning approaches can succeed. Given a learned context encoder that infers the hidden parameters from previous experience, most existing algorithms fall into two categories, model transfer and policy transfer, depending on which function the inferred hidden parameters parameterize. We characterize the robustness of model and policy transfer algorithms with respect to hidden-parameter estimation error. We first show that the value function of a HiP-MDP is Lipschitz continuous under certain conditions. We then derive regret bounds for both settings through the lens of Lipschitz continuity. Finally, we empirically corroborate our theoretical analysis by varying the hyper-parameters governing the Lipschitz constants of two continuous control problems; the resulting performance is consistent with our theoretical results.
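To make the role of Lipschitz continuity concrete, one plausible form of the resulting bound is sketched below; the constant C is illustrative and not taken from the paper.

```latex
% Illustrative only: if the optimal value function is L_V-Lipschitz in the
% hidden parameter theta, then acting on an estimate \hat\theta costs at
% most a term linear in the estimation error,
\[
\bigl| V^{*}_{\theta}(s) \;-\; V^{\pi_{\hat\theta}}_{\theta}(s) \bigr|
\;\le\; C \, L_V \, \lVert \theta - \hat\theta \rVert ,
\]
% so regret under both model and policy transfer degrades gracefully as the
% context encoder's estimate \hat\theta approaches the true theta.
```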
Transfer Learning in Bandits with Latent Continuity
A continuity structure in the correlations among the arms of a multi-armed bandit can significantly accelerate exploration and reduce regret, particularly when there are many arms. In practice, however, this structure is often latent. To cope with the latent continuity, we consider a transfer learning setting in which an agent learns the structural information, parameterized by a Lipschitz constant and an embedding of the arms, from a sequence of past tasks and transfers it to a new one. We propose a simple but provably efficient algorithm that accurately estimates and fully exploits the Lipschitz continuity, with sample complexity on the previous tasks of the same asymptotic order as the lower bound. The proposed algorithm can estimate not only a latent Lipschitz constant given an embedding, but also a latent embedding itself, although the latter requires slightly higher sample complexity. Specifically, we analyze the efficiency of the proposed framework in two respects: (i) under mild assumptions, our regret bound on the new task is close to that of the oracle algorithm with full knowledge of the Lipschitz continuity; and (ii) the sample complexity of our estimator matches the information-theoretic fundamental limit. Our analysis reveals a set of useful insights into transfer learning of latent Lipschitz continuity. In a numerical evaluation on a real-world dataset of rate adaptation over time-varying wireless channels, we corroborate the theoretical findings and show the superiority of the proposed framework over baselines.
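For intuition, here is a minimal Python sketch of how a transferred Lipschitz constant can tighten a standard UCB index over an arm embedding. The function name, the one-dimensional embedding, and the min-over-neighbors rule are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np

# Minimal sketch: a UCB index tightened by a transferred Lipschitz constant
# L_hat over a 1-D arm embedding. The coupling
# |mu(x_i) - mu(x_j)| <= L_hat * |x_i - x_j| lets each arm borrow every
# neighbor's confidence bound, which is where the regret reduction comes from.
def lipschitz_ucb(means, embed, L_hat, horizon, noise=0.1, seed=0):
    rng = np.random.default_rng(seed)
    K = len(means)
    counts, sums = np.zeros(K), np.zeros(K)
    dist = np.abs(embed[:, None] - embed[None, :])  # pairwise embedding distances
    for t in range(1, horizon + 1):
        if t <= K:                                  # initialize: pull each arm once
            arm = t - 1
        else:
            naive = sums / counts + np.sqrt(2.0 * np.log(t) / counts)
            # Arm i's mean cannot exceed naive_j + L_hat * d(i, j) for any j,
            # so take the tightest such bound over all arms.
            index = np.min(naive[None, :] + L_hat * dist, axis=1)
            arm = int(np.argmax(index))
        counts[arm] += 1
        sums[arm] += means[arm] + rng.normal(0.0, noise)
    return counts

# With many arms on a line, an accurate L_hat concentrates pulls near the optimum.
embed = np.linspace(0.0, 1.0, 50)
pulls = lipschitz_ucb(1.0 - np.abs(embed - 0.3), embed, L_hat=1.0, horizon=5000)
```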
- Award ID(s):
- 2327013
- PAR ID:
- 10546614
- Editor(s):
- Barg, Alexander; Sason, Igal; Loeliger, Hans-Andrea; Richardson, Tom; Vardy, Alexander; Wornell, Gregory
- Publisher / Repository:
- Institute of Electrical and Electronics Engineers
- Date Published:
- Journal Name:
- IEEE Transactions on Information Theory
- ISSN:
- 0018-9448
- Subject(s) / Keyword(s):
- Multi-armed bandits; Lipschitz continuity; transfer learning; wireless rate adaptation
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
We develop a novel and generic algorithm for the adversarial multi-armed bandit problem (or, more generally, the combinatorial semi-bandit problem). When instantiated differently, our algorithm achieves various new data-dependent regret bounds improving on previous work. Examples include: 1) a regret bound depending on the variance of only the best arm; 2) a regret bound depending on the first-order path-length of only the best arm; 3) a regret bound depending on the sum of the first-order path-lengths of all arms as well as an important negative term, which together lead to faster convergence rates for some normal form games with partial feedback; 4) a regret bound that simultaneously implies small regret when the best arm has small loss and logarithmic regret when there exists an arm whose expected loss is always smaller than those of other arms by a fixed gap (e.g., the classic i.i.d. setting). In some cases, such as the last two results, our algorithm is completely parameter-free. The main idea of our algorithm is to apply optimism and adaptivity techniques to the well-known Online Mirror Descent framework with a special log-barrier regularizer. The challenges are to come up with appropriate optimistic predictions and correction terms in this framework. Some of our results also crucially rely on a sophisticated increasing learning rate schedule.
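As a point of reference, the following is a minimal Python sketch of plain Online Mirror Descent with a log-barrier regularizer, the core component the abstract names. The paper's algorithm additionally uses optimistic predictions, correction terms, and an increasing learning-rate schedule, all of which are omitted here; the binary-search solver is one standard way to implement the update, not necessarily the authors'.

```python
import numpy as np

def log_barrier_step(w, loss_est, eta):
    # OMD update with psi(w) = -(1/eta) * sum_i ln(w_i):
    # 1/w'_i = 1/w_i + eta * (loss_est_i - lam), where the multiplier lam
    # is found by binary search so that sum_i w'_i = 1 (simplex constraint).
    hi = np.min(loss_est + 1.0 / (eta * w)) - 1e-12  # keep denominators positive
    lo = hi - 1.0
    while (1.0 / (1.0 / w + eta * (loss_est - lo))).sum() > 1.0:
        lo -= 1.0                                    # expand bracket until feasible
    for _ in range(60):
        lam = 0.5 * (lo + hi)
        w_new = 1.0 / (1.0 / w + eta * (loss_est - lam))
        lo, hi = (lam, hi) if w_new.sum() < 1.0 else (lo, lam)
    return w_new / w_new.sum()

def bandit_omd(loss_matrix, eta=0.1, seed=0):
    rng = np.random.default_rng(seed)
    T, K = loss_matrix.shape
    w, total = np.full(K, 1.0 / K), 0.0
    for t in range(T):
        arm = rng.choice(K, p=w)
        est = np.zeros(K)
        est[arm] = loss_matrix[t, arm] / w[arm]      # importance-weighted estimate
        total += loss_matrix[t, arm]
        w = log_barrier_step(w, est, eta)
    return total
```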
-
In many real-world applications, multiple agents seek to learn how to perform highly related yet slightly different tasks in an online bandit learning protocol. We formulate this problem as the ϵ-multi-player multi-armed bandit problem, in which a set of players concurrently interact with a set of arms, and for each arm, the reward distributions for all players are similar but not necessarily identical. We develop an upper confidence bound-based algorithm, RobustAgg(ϵ), that adaptively aggregates rewards collected by different players. In the setting where an upper bound on the pairwise dissimilarities of reward distributions between players is known, we achieve instance-dependent regret guarantees that depend on the amenability of information sharing across players. We complement these upper bounds with nearly matching lower bounds. In the setting where pairwise dissimilarities are unknown, we provide a lower bound, as well as an algorithm that trades off minimax regret guarantees for adaptivity to unknown similarity structure.
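The key idea behind such ϵ-robust aggregation can be sketched as follows; this hypothetical index rule (names and the min-of-two-bounds form are illustrative assumptions, not the exact RobustAgg(ϵ)) shows why pooling helps only when the dissimilarity bound ϵ is small.

```python
import numpy as np

# Hypothetical sketch: a player may pool other players' samples of an arm,
# but the pooled confidence radius must be inflated by eps, the known upper
# bound on pairwise dissimilarity of reward means across players.
# Assumes at least one sample exists in total (own_n + other_n > 0).
def robust_index(own_sum, own_n, other_sum, other_n, t, eps):
    pooled_n = own_n + other_n
    pooled = (own_sum + other_sum) / pooled_n \
        + np.sqrt(2 * np.log(t) / pooled_n) + eps    # bias term from dissimilarity
    solo = (own_sum / own_n + np.sqrt(2 * np.log(t) / own_n)) \
        if own_n > 0 else np.inf
    # Both are valid upper bounds on the player's own mean; keep the tighter
    # one, so pooling only helps when eps is small relative to the noise.
    return min(pooled, solo)
```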
-
Dueling bandits is a prominent framework for decision-making involving preferential feedback, a valuable feature that fits various applications involving human interaction, such as ranking, information retrieval, and recommendation systems. While substantial efforts have been made to minimize the cumulative regret in dueling bandits, a notable gap in the current research is the absence of regret bounds that account for the inherent uncertainty in pairwise comparisons between the dueling arms. Intuitively, greater uncertainty suggests a higher level of difficulty in the problem. To bridge this gap, this paper studies the problem of contextual dueling bandits, where the binary comparison of dueling arms is generated from a generalized linear model (GLM). We propose a new SupLinUCB-type algorithm that enjoys computational efficiency and a variance-aware regret bound $\tilde O\big(d\sqrt{\sum_{t=1}^T\sigma_t^2} + d\big)$, where $\sigma_t$ is the variance of the pairwise comparison in round $t$, $d$ is the dimension of the context vectors, and $T$ is the time horizon. Our regret bound naturally aligns with this intuition: in scenarios where the comparison is deterministic, the algorithm suffers only $\tilde O(d)$ regret. We perform empirical experiments on synthetic data to confirm the advantage of our method over previous variance-agnostic algorithms.
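A short sketch of the comparison model the abstract describes, with a logistic link as one concrete GLM instance (the function name and parameter values are illustrative):

```python
import numpy as np

# Sketch of the GLM comparison model: the binary outcome of a duel between
# arms with features x and y follows a logistic link on the feature
# difference, and the per-round variance sigma_t^2 that drives the
# variance-aware regret bound is p_t * (1 - p_t).
def duel(theta, x, y, rng):
    p = 1.0 / (1.0 + np.exp(-theta @ (x - y)))  # P(x preferred over y)
    outcome = rng.binomial(1, p)                # observed binary preference
    variance = p * (1.0 - p)                    # vanishes as the duel becomes deterministic
    return outcome, variance

rng = np.random.default_rng(0)
theta = np.array([1.5, -0.5])
outcome, var = duel(theta, np.array([1.0, 0.0]), np.array([0.0, 1.0]), rng)
```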
-
We study multi-task representation learning for the problem of pure exploration in bilinear bandits. In bilinear bandits, an action takes the form of a pair of arms from two different entity types, and the reward is a bilinear function of the known feature vectors of the arms. In the multi-task bilinear bandit problem, we aim to find optimal actions for multiple tasks that share a common low-dimensional linear representation. The objective is to leverage this characteristic to expedite the process of identifying the best pair of arms for all tasks. We propose the algorithm GOBLIN, which uses an experimental design approach to optimize sample allocations for learning the global representation as well as to minimize the number of samples needed to identify the optimal pair of arms in individual tasks. To the best of our knowledge, this is the first study to give a sample complexity analysis for pure exploration in bilinear bandits with shared representation. Our results demonstrate that by learning the shared representation across tasks, we achieve significantly improved sample complexity compared to the traditional approach of solving tasks independently.
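The shared-representation structure can be written down in a few lines; in this sketch the factor names A, B, and M_t are illustrative assumptions rather than the paper's notation.

```python
import numpy as np

# Sketch of the multi-task bilinear reward model: an action is a pair of
# arms (x, z) with expected reward x^T Theta_t z, where every task's
# parameter Theta_t = B @ M_t @ A.T factors through low-dimensional
# representations B, A common to all tasks; only the small core M_t is
# task-specific. Learning B and A once shrinks each task's search space.
rng = np.random.default_rng(0)
d1, d2, k = 8, 6, 2                              # ambient dims, shared latent dim
B = np.linalg.qr(rng.normal(size=(d1, k)))[0]    # shared left representation
A = np.linalg.qr(rng.normal(size=(d2, k)))[0]    # shared right representation
M_t = rng.normal(size=(k, k))                    # task-specific low-rank core
Theta_t = B @ M_t @ A.T                          # rank-k bilinear parameter
x, z = rng.normal(size=d1), rng.normal(size=d2)  # a candidate pair of arms
expected_reward = x @ Theta_t @ z
```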