skip to main content


Title: Link rate selection using constrained Thompson sampling
We consider the optimal link rate selection problem in time-varying wireless channels with unknown channel statistics. The aim of optimal link rate selection is to transmit at the optimal rate at each time slot in order to maximize the expected throughput of the wireless channel/link or equivalently minimize the expected regret. Lack of information about channel state or channel statistics necessitates the use of online/sequential learning algorithms to determine the optimal rate. We present an algorithm called CoTS - Constrained Thompson sampling algorithm which improves upon the current state-of-the-art, is fast and is also general in the sense that it can handle several different constraints in the problem with the same algorithm. We also prove an asymptotic lower bound on the expected regret and a high probability large-horizon upper bound on the regret, which show that the regret grows logarithmically with time in an order sense. We also provide numerical results which establish that CoTS significantly outperforms the current state-of-the-art algorithms.  more » « less
Award ID(s):
1718203 1562276 1609370
NSF-PAR ID:
10093904
Author(s) / Creator(s):
; ;
Date Published:
Journal Name:
IEEE INFOCOM
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. We consider the problem of transmitting at the optimal rate over a rapidly-varying wireless channel with unknown statistics when the feedback about channel quality is very limited. One motivation for this problem is that, in emerging wireless networks, the use of mmWave bands means that the channel quality can fluctuate rapidly and thus, one cannot rely on full channel-state feedback to make transmission rate decisions. Inspired by related problems in the context of multi-armed bandits, we consider a well-known algorithm called Thompson sampling to address this problem. However, unlike the traditional multi-armed bandit problem, a direct application of Thompson sampling results in a computational and storage complexity that grows exponentially with time. Therefore, we propose an algorithm called Modified Thompson sampling (MTS), whose computational and storage complexity is simply linear in the number of channel states and which achieves at most logarithmic regret as a function of time when compared to an optimal algorithm which knows the probability distribution of the channel states. 
    more » « less
  2. SkinnerDB uses reinforcement learning for reliable join ordering, exploiting an adaptive processing engine with specialized join algorithms and data structures. It maintains no data statistics and uses no cost or cardinality models. Also, it uses no training workloads nor does it try to link the current query to seemingly similar queries in the past. Instead, it uses reinforcement learning to learn optimal join orders from scratch during the execution of the current query. To that purpose, it divides the execution of a query into many small time slices. Different join orders are tried in different time slices. SkinnerDB merges result tuples generated according to different join orders until a complete query result is obtained. By measuring execution progress per time slice, it identifies promising join orders as execution proceeds. Along with SkinnerDB, we introduce a new quality criterion for query execution strategies. We upper-bound expected execution cost regret, i.e., the expected amount of execution cost wasted due to sub-optimal join order choices. SkinnerDB features multiple execution strategies that are optimized for that criterion. Some of them can be executed on top of existing database systems. For maximal performance, we introduce a customized execution engine, facilitating fast join order switching via specialized multi-way join algorithms and tuple representations. We experimentally compare SkinnerDB’s performance against various baselines, including MonetDB, Postgres, and adaptive processing methods. We consider various benchmarks, including the join order benchmark, TPC-H, and JCC-H, as well as benchmark variants with user-defined functions. Overall, the overheads of reliable join ordering are negligible compared to the performance impact of the occasional, catastrophic join order choice. 
    more » « less
  3. Current wireless networks employ sophisticated multi-user transmission techniques to fully utilize the physical layer resources for data transmission. At the MAC layer, these techniques rely on a semi-static map that translates the channel quality of users to the potential transmission rate (more precisely, a map from the Channel Quality Index to the Modulation and Coding Scheme) for user selection and scheduling decisions. However, such a static map does not adapt to the actual deployment scenario and can lead to large performance losses. Furthermore, adaptively learning this map can be inefficient, particularly when there are a large number of users. In this work, we make this learning efficient by clustering users. Specifically, we develop an online learning approach that jointly clusters users and channel-states, and learns the associated rate regions of each cluster. This approach generates a scenario-specific map that replaces the static map that is currently used in practice. Furthermore, we show that our learning algorithm achieves sub- linear regret when compared to an omniscient genie. Next, we develop a user selection algorithm for multi-user scheduling using the learned user-clusters and associated rate regions. Our algorithms are validated on the WiNGS simulator from AT&T Labs, that implements the PHY/MAC stack and simulates the channel. We show that our algorithm can efficiently learn user clusters and the rate regions associated with the user sets for any observed channel state. Moreover, our simulations show that a deployment-scenario-specific map significantly outperforms the current static map approach for resource allocation at the MAC layer. 
    more » « less
  4. We consider the bandit problem of selecting K out of N arms at each time step. The joint reward can be a non-linear function of the rewards of the selected individual arms. The direct use of a multi-armed bandit algorithm requires choosing among all possible combinations, making the action space large. To simplify the problem, existing works on combinatorial bandits typically assume feedback as a linear function of individual rewards. In this paper, we prove the lower bound for top-K subset selection with bandit feedback with possibly correlated rewards. We present a novel algorithm for the combinatorial setting without using individual arm feedback or requiring linearity of the reward function. Additionally, our algorithm works on correlated rewards of individual arms. Our algorithm, aDaptive Accept RejecT (DART), sequentially finds good arms and eliminates bad arms based on confidence bounds. DART is computationally efficient and uses storage linear in N. Further, DART achieves a regret bound of Õ(K√KNT) for a time horizon T, which matches the lower bound in bandit feedback up to a factor of √log 2NT. When applied to the problem of cross-selling optimization and maximizing the mean of individual rewards, the performance of the proposed algorithm surpasses that of state-of-the-art algorithms. We also show that DART significantly outperforms existing methods for both linear and non-linear joint reward environments. 
    more » « less
  5. In black-box optimization problems, we aim to maximize an unknown objective function, where the function is only accessible through feedbacks of an evaluation or simulation oracle. In real-life, the feedbacks of such oracles are often noisy and available after some unknown delay that may depend on the computation time of the oracle. Additionally, if the exact evaluations are expensive but coarse approximations are available at a lower cost, the feedbacks can have multi-fidelity. In order to address this problem, we propose a generic extension of hierarchical optimistic tree search (HOO), called ProCrastinated Tree Search (PCTS), that flexibly accommodates a delay and noise-tolerant bandit algorithm. We provide a generic proof technique to quantify regret of PCTS under delayed, noisy, and multi-fidelity feedbacks. Specifically, we derive regret bounds of PCTS enabled with delayed-UCB1 (DUCB1) and delayed-UCB-V (DUCBV) algorithms. Given a horizon T, PCTS retains the regret bound of non-delayed HOO for expected delay of O(log T), and worsens by T^((1-α)/(d+2)) for expected delays of O(T^(1-α)) for α ∈ (0,1]. We experimentally validate on multiple synthetic functions and hyperparameter tuning problems that PCTS outperforms the state-of-the-art black-box optimization methods for feedbacks with different noise levels, delays, and fidelity. 
    more » « less