One of the primary challenges in large-scale distributed learning stems from stringent communication constraints. While several recent works address this challenge for static optimization problems, sequential decision-making under uncertainty has remained much less explored in this regard. Motivated by this gap, we introduce a new linear stochastic bandit formulation over a bit-constrained channel. Specifically, in our setup, an agent interacting with an environment transmits encoded estimates of an unknown model parameter to a server over a communication channel of finite capacity. The goal of the server is to take actions based on these estimates so as to minimize cumulative regret. To this end, we develop a novel and general algorithmic framework that hinges on two main components: (i) an adaptive encoding mechanism that exploits statistical concentration bounds, and (ii) a decision-making principle based on confidence sets that account for encoding errors. As our main result, we prove that when the unknown model is d-dimensional, a channel capacity of O(d) bits suffices to achieve order-optimal regret. We also establish that for the simpler unstructured multi-armed bandit problem, a channel capacity of 1 bit is sufficient to achieve optimal regret bounds.
Keywords: Linear Bandits, Distributed Learning, Communication Constraints
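To make component (i) concrete, the sketch below shows one simple way such an adaptive encoder could work: if a concentration bound guarantees the fresh estimate lies within radius r of a reference point already known to both agent and server, each coordinate can be quantized with b bits, so a d-dimensional estimate costs only O(d) bits per round. This is a minimal illustrative sketch (the function names `encode`/`decode` and the uniform quantizer are assumptions), not the paper's exact scheme.

```python
import numpy as np

def encode(theta_hat, center, radius, bits_per_dim):
    """Quantize each coordinate of theta_hat, assumed (by a concentration
    bound) to lie in [center - radius, center + radius], to an integer
    codeword in {0, ..., 2^bits_per_dim - 1}."""
    levels = 2 ** bits_per_dim
    # Clip in case the estimate falls just outside the confidence interval.
    clipped = np.clip(theta_hat, center - radius, center + radius)
    # Map each coordinate to a quantization-cell index.
    scaled = (clipped - (center - radius)) / (2 * radius)
    return np.minimum((scaled * levels).astype(int), levels - 1)

def decode(codeword, center, radius, bits_per_dim):
    """Reconstruct the estimate at the midpoint of each quantization cell."""
    levels = 2 ** bits_per_dim
    cell = 2 * radius / levels
    return (center - radius) + (codeword + 0.5) * cell
```

With b bits per coordinate, the round-trip error is at most radius / 2^b in each coordinate, so as the confidence radius shrinks over time, the encoding error shrinks with it at no extra communication cost.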
Linear stochastic bandits over a bit-constrained channel
Award ID(s): 1910056
NSF-PAR ID: 10490381
Publisher / Repository: Proceedings of Machine Learning Research
Date Published:
Journal Name: Learning for Dynamics and Control Conference
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this



We consider a stochastic inventory control problem under censored demand, lost sales, and positive lead times. This is a fundamental problem in inventory management, with significant literature establishing the near-optimality of a simple class of policies called "base-stock policies," as well as the convexity of the long-run average cost under those policies. We consider the relatively less studied problem of designing a learning algorithm for this setting when the underlying demand distribution is unknown. The goal is to bound the regret of the algorithm when compared with the best base-stock policy. Our main contribution is a learning algorithm with a regret bound of [Formula: see text] for the inventory control problem. Here, [Formula: see text] is the fixed and known lead time, and D is an unknown parameter of the demand distribution, described roughly as the expected number of time steps needed to generate enough demand to deplete one unit of inventory. Notably, our regret bounds depend linearly on L, which significantly improves the previously best-known regret bounds for this problem, where the dependence on L was exponential. Our techniques utilize the convexity of the long-run average cost and a newly derived bound on the "bias" of base-stock policies to establish an almost black-box connection between the problem of learning in Markov decision processes (MDPs) with these properties and the stochastic convex bandit problem. The techniques presented here may be of independent interest for other settings that involve large structured MDPs but with convex asymptotic average cost functions.
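For readers unfamiliar with the policy class, the sketch below simulates a base-stock policy under lost sales and a positive lead time: each period the agent orders just enough to bring the inventory position (on-hand stock plus orders in transit) back up to a target level S. The function name and cost parameters are illustrative assumptions, not from the paper.

```python
import random
from collections import deque

def simulate_base_stock(S, lead_time, horizon, demand_sampler,
                        holding_cost=1.0, lost_sale_cost=4.0, seed=0):
    """Simulate a base-stock policy and return its average per-period cost.

    Each period: order up to level S based on the inventory position,
    receive the order placed lead_time periods ago, then serve demand;
    unmet demand is lost (not backlogged)."""
    random.seed(seed)
    on_hand = S
    pipeline = deque([0] * lead_time)  # orders placed but not yet arrived
    total_cost = 0.0
    for _ in range(horizon):
        position = on_hand + sum(pipeline)
        pipeline.append(max(0, S - position))  # order up to S
        on_hand += pipeline.popleft()          # oldest order arrives
        demand = demand_sampler()
        lost = max(0, demand - on_hand)        # lost sales
        on_hand = max(0, on_hand - demand)
        total_cost += holding_cost * on_hand + lost_sale_cost * lost
    return total_cost / horizon
```

The learning question the paper studies is then: without knowing the demand distribution, how quickly can one home in on the level S minimizing this long-run average cost, given that lost sales are censored (unobserved)?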

We consider the problem of controlling a Linear Quadratic Regulator (LQR) system over a finite horizon T with fixed and known cost matrices Q, R, but unknown and non-stationary dynamics A_t, B_t. The sequence of dynamics matrices can be arbitrary, but with a total variation V_T assumed to be o(T) and unknown to the controller. Under the assumption that a sequence of stabilizing, but potentially suboptimal, controllers is available for all t, we present an algorithm that achieves the optimal dynamic regret of O(V_T^{2/5} T^{3/5}). With piecewise-constant dynamics, our algorithm achieves the optimal regret of O(sqrt(ST)), where S is the number of switches. The crux of our algorithm is an adaptive non-stationarity detection strategy, which builds on an approach recently developed for contextual multi-armed bandit problems. We also argue that non-adaptive forgetting (e.g., restarting, or using sliding-window learning with a static window size) may not be regret-optimal for the LQR problem, even when the window size is optimally tuned with knowledge of V_T. The main technical challenge in the analysis of our algorithm is to prove that the ordinary least squares (OLS) estimator has a small bias when the parameter to be estimated is non-stationary. Our analysis also highlights that the key motif driving the regret is that the LQR problem is in spirit a bandit problem with linear feedback and locally quadratic cost. This motif is more universal than the LQR problem itself, and we therefore believe our results should find wider application.
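The OLS estimation step at the heart of this line of work is easiest to see in the static-window baseline the abstract contrasts with: sliding-window least squares, which estimates (A_t, B_t) from only the most recent transitions. The sketch below is a generic illustration of that baseline (function name and interface are assumptions), not the paper's adaptive detection algorithm.

```python
import numpy as np

def sliding_window_ols(states, actions, window):
    """Estimate (A, B) in x_{t+1} = A x_t + B u_t + noise by ordinary
    least squares over the last `window` transitions.

    states:  array of shape (T+1, n), actions: array of shape (T, m)."""
    X_next = states[1:]                       # targets x_1, ..., x_T
    Z = np.hstack([states[:-1], actions])     # regressors [x_t, u_t]
    Z, X_next = Z[-window:], X_next[-window:] # keep the recent window only
    # Solve min_Theta ||Z Theta - X_next||_F; then Theta^T = [A B].
    Theta, *_ = np.linalg.lstsq(Z, X_next, rcond=None)
    n = states.shape[1]
    AB = Theta.T
    return AB[:, :n], AB[:, n:]
```

When the dynamics truly drift inside the window, old transitions generated under different (A_t, B_t) bias this estimator; bounding that bias, and choosing what to forget adaptively rather than with a fixed window, is exactly the technical challenge the abstract highlights.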

Motivated by robust dynamic resource allocation in operations research, we study the Online Learning to Transport (OLT) problem, where the decision variable is a probability measure, an infinite-dimensional object. We draw connections between online learning, optimal transport, and partial differential equations through an insight called the minimal selection principle, originally studied in the Wasserstein gradient flow setting by Ambrosio et al. (2005). This allows us to extend the standard online learning framework to the infinite-dimensional setting seamlessly. Based on our framework, we derive a novel method called the minimal selection or exploration (MSoE) algorithm to solve OLT problems using mean-field approximation and discretization techniques. In the displacement-convex setting, the main theoretical message underpinning our approach is that minimizing transport cost over time (via the minimal selection principle) ensures optimal cumulative regret upper bounds. On the algorithmic side, our MSoE algorithm applies beyond the displacement-convex setting, making the mathematical theory of optimal transport practically relevant to non-convex settings common in dynamic resource allocation.