In many realworld applications, multiple agents seek to learn how to perform highly related yet slightly different tasks in an online bandit learning protocol. We formulate this problem as the ฯตmultiplayer multiarmed bandit problem, in which a set of players concurrently interact with a set of arms, and for each arm, the reward distributions for all players are similar but not necessarily identical. We develop an upper confidence boundbased algorithm, RobustAgg(ฯต), that adaptively aggregates rewards collected by different players. In the setting where an upper bound on the pairwise dissimilarities of reward distributions between players is known, we achieve instancedependent regret guarantees that depend on the amenability of information sharing across players. We complement these upper bounds with nearly matching lower bounds. In the setting where pairwise dissimilarities are unknown, we provide a lower bound, as well as an algorithm that trades off minimax regret guarantees for adaptivity to unknown similarity structure.
more »
« less
Competing Bandits in Time Varying Matching Markets
We study the problem of online learning in twosided nonstationary matching markets, where the objective is to converge to a stable match. In particular, we consider the setting where one side of the market, the arms, has fixed known set of preferences over the other side, the players. While this problem has been studied when the players have fixed but unknown preferences, in this work we study the problem of how to learn when the preferences of the players are time varying and unknown. Our contribution is a methodology that can handle any type of preference structure and variation scenario. We show that, with the proposed algorithm, each player receives a uniform sublinear regret of {Oห(๐ฟ1/2๐๐1/2)} up to the number of changes in the underlying preferences of the agents, ๐ฟ๐. Therefore, we show that the optimal rates for singleagent learning can be achieved in spite of the competition up to a difference of a constant factor. We also discuss extensions of this algorithm to the case where the number of changes need not be known a priori.
more »
« less
 Award ID(s):
 2125913
 NSFPAR ID:
 10434399
 Date Published:
 Journal Name:
 The 5th Annual Learning for Dynamics and Control Conference
 Volume:
 211
 Page Range / eLocation ID:
 10201031
 Format(s):
 Medium: X
 Sponsoring Org:
 National Science Foundation
More Like this


A fundamental task in active learning involves performing a sequence of tests to identify an unknown hypothesis that is drawn from a known distribution. This problem, known as optimal decision tree induction, has been widely studied for decades and the asymptotically bestpossible approximation algorithm has been devised for it. We study a generalization where certain test outcomes are noisy, even in the more general case when the noise is persistent, i.e., repeating the test on the scenario gives the same noisy output, disallowing simple repetition as a way to gain confidence. We design new approximation algorithms for both the nonadaptive setting, where the test sequence must be fixed apriori, and the adaptive setting where the test sequence depends on the outcomes of prior tests. Previous work in the area assumed at most a constant number of noisy outcomes per test and per scenario and provided approximation ratios that were problem dependent (such as the minimum probability of a hypothesis). Our new approximation algorithms provide guarantees that are nearly bestpossible and work for the general case of a large number of noisy outcomes per test or per hypothesis where the performance degrades smoothly with this number. Our results adapt and generalize methods used for submodular ranking and stochastic set cover. We evaluate the performance of our algorithms on two natural applications with noise: toxic chemical identification and active learning of linear classifiers. Despite our logarithmic theoretical approximation guarantees, our methods give solutions with cost very close to the information theoretic minimum, demonstrating the effectiveness of our methods.more » « less

null (Ed.)We study the maxaffine regression model, where the unknown regression function is modeled as a maximum of a fixed number of affine functions. In recent work [1], we showed that endtoend parameter estimates were obtainable using this model with an alternating minimization (AM) algorithm provided the covariates (or designs) were normally distributed, and chosen independently of the underlying parameters. In this paper, we show that AM is significantly more robust than the setting of [1]: It converges locally under smallball design assumptions (which is a much broader class, including bounded logconcave distributions), and even when the underlying parameters are chosen with knowledge of the realized covariates. Once again, the final rate obtained by the procedure is nearparametric and minimax optimal (up to a polylogarithmic factor) as a function of the dimension, sample size, and noise variance. As a byproduct of our analysis, we obtain convergence guarantees on a classical algorithm for the (real) phase retrieval problem in the presence of noise under considerably weaker assumptions on the design distribution than was previously known.more » « less

We consider a stochastic inventory control problem under censored demand, lost sales, and positive lead times. This is a fundamental problem in inventory management, with significant literature establishing near optimality of a simple class of policies called โbasestock policiesโ as well as the convexity of longrun average cost under those policies. We consider a relatively less studied problem of designing a learning algorithm for this problem when the underlying demand distribution is unknown. The goal is to bound the regret of the algorithm when compared with the best basestock policy. Our main contribution is a learning algorithm with a regret bound of [Formula: see text] for the inventory control problem. Here, [Formula: see text] is the fixed and known lead time, and D is an unknown parameter of the demand distribution described roughly as the expected number of time steps needed to generate enough demand to deplete one unit of inventory. Notably, our regret bounds depend linearly on L, which significantly improves the previously bestknown regret bounds for this problem where the dependence on L was exponential. Our techniques utilize the convexity of the longrun average cost and a newly derived bound on the โbiasโ of basestock policies to establish an almost black box connection between the problem of learning in Markov decision processes (MDPs) with these properties and the stochastic convex bandit problem. The techniques presented here may be of independent interest for other settings that involve large structured MDPs but with convex asymptotic average cost functions.more » « less

We consider the problem of controlling a Linear Quadratic Regulator (LQR) system over a finite horizon T with fixed and known cost matrices Q,R, but unknown and nonstationary dynamics A_t, B_t. The sequence of dynamics matrices can be arbitrary, but with a total variation, V_T, assumed to be o(T) and unknown to the controller. Under the assumption that a sequence of stabilizing, but potentially suboptimal controllers is available for all t, we present an algorithm that achieves the optimal dynamic regret of O(V_T^2/5 T^3/5 ). With piecewise constant dynamics, our algorithm achieves the optimal regret of O(sqrtST ) where S is the number of switches. The crux of our algorithm is an adaptive nonstationarity detection strategy, which builds on an approach recently developed for contextual Multiarmed Bandit problems. We also argue that nonadaptive forgetting (e.g., restarting or using sliding window learning with a static window size) may not be regret optimal for the LQR problem, even when the window size is optimally tuned with the knowledge of $V_T$. The main technical challenge in the analysis of our algorithm is to prove that the ordinary least squares (OLS) estimator has a small bias when the parameter to be estimated is nonstationary. Our analysis also highlights that the key motif driving the regret is that the LQR problem is in spirit a bandit problem with linear feedback and locally quadratic cost. This motif is more universal than the LQR problem itself, and therefore we believe our results should find wider application.more » « less