skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.

Attention:

The NSF Public Access Repository (PAR) system and access will be unavailable from 10:00 PM to 12:00 AM ET on Tuesday, March 25 due to maintenance. We apologize for the inconvenience.


Title: Index-aware reinforcement learning for adaptive video streaming at the wireless edge
We study adaptive video streaming for multiple users in wireless access edge networks with unreliable channels. The key challenge is to jointly optimize the video bitrate adaptation and resource allocation such that the users' cumulative quality of experience is maximized. This problem is a finite-horizon restless multi-armed multi-action bandit problem and is provably hard to solve. To overcome this challenge, we propose a computationally appealing index policy entitled Quality Index Policy, which is well-defined without the Whittle indexability condition and is provably asymptotically optimal without the global attractor condition. These two conditions are widely needed in the design of most existing index policies, which are difficult to establish in general. Since the wireless access edge network environment is highly dynamic with system parameters unknown and time-varying, we further develop an index-aware reinforcement learning (RL) algorithm dubbed QA-UCB. We show that QA-UCB achieves a sub-linear regret with a low-complexity since it fully exploits the structure of the Quality Index Policy for making decisions. Extensive simulations using real-world traces demonstrate significant gains of proposed policies over conventional approaches. We note that the proposed framework for designing index policy and index-aware RL algorithm is of independent interest and could be useful for other large-scale multi-user problems.  more » « less
Award ID(s):
2148309
PAR ID:
10410843
Author(s) / Creator(s):
; ; ; ;
Date Published:
Journal Name:
MobiHoc '22: Proceedings of the Twenty-Third International Symposium on Theory, Algorithmic Foundations, and Protocol Design for Mobile Networks and Mobile Computing
Page Range / eLocation ID:
81 to 90
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. null (Ed.)
    We present Revel, a partially neural reinforcement learning (RL) framework for provably safe exploration in continuous state and action spaces. A key challenge for provably safe deep RL is that repeatedly verifying neural networks within a learning loop is computationally infeasible. We address this challenge using two policy classes: a general, neurosymbolic class with approximate gradients and a more restricted class of symbolic policies that allows efficient verification. Our learning algorithm is a mirror descent over policies: in each iteration, it safely lifts a symbolic policy into the neurosymbolic space, performs safe gradient updates to the resulting policy, and projects the updated policy into the safe symbolic subset, all without requiring explicit verification of neural networks. Our empirical results show that Revel enforces safe exploration in many scenarios in which Constrained Policy Optimization does not, and that it can discover policies that outperform those learned through prior approaches to verified exploration. 
    more » « less
  2. We study the dynamic cache dimensioning problem, where the objective is to decide how much storage to place in the cache to minimize the total costs with respect to the storage and content delivery latency. We formulate this problem as a Markov decision process, which turns out to be a restless multi-armed bandit problem and is provably hard to solve. For given dimensioning decisions, it is possible to develop solutions based on the celebrated Whittle index policy. However, Whittle index policy has not been studied for dynamic cache dimensioning, mainly because cache dimensioning needs to be repeatedly solved and jointly optimized with content caching. To overcome this difficulty, we propose a low-complexity fluid Whittle index policy, which jointly determines dimensioning and content caching. We show that this policy is asymptotically optimal. We further develop a lightweight reinforcement learning augmented algorithm dubbed fW-UCB when the content request and delivery rates are unavailable. fW-UCB is shown to achieve a sub-linear regret as it fully exploits the structure of the near-optimal fluid Whittle index policy and hence can be easily implemented. Extensive simulations using real traces support our theoretical results. 
    more » « less
  3. We study the dynamic cache dimensioning problem, where the objective is to decide how much storage to place in the cache to minimize the total costs with respect to the storage and content delivery latency. We formulate this problem as a Markov decision process, which turns out to be a restless multi-armed bandit problem and is provably hard to solve. For given dimensioning decisions, it is possible to develop solutions based on the celebrated Whittle index policy. However, Whittle index policy has not been studied for dynamic cache dimensioning, mainly because cache dimensioning needs to be repeatedly solved and jointly optimized with content caching. To overcome this difficulty, we propose a low-complexity fluid Whittle index policy, which jointly determines dimensioning and content caching. We show that this policy is asymptotically optimal. We further develop a lightweight reinforcement learning augmented algorithm dubbed fW-UCB when the content request and delivery rates are unavailable. fW-UCB is shown to achieve a sub-linear regret as it fully exploits the structure of the near-optimal fluid Whittle index policy and hence can be easily implemented. Extensive simulations using real traces support our theoretical results. 
    more » « less
  4. Multi-band operation in wireless networks can improve data rates by leveraging the benefits of propagation in different frequency ranges. Distinctive beam management procedures in different bands complicate band assignment because they require considering not only the channel quality but also the associated beam management overhead. Reinforcement learning (RL) is a promising approach for multi-band operation as it enables the system to learn and adjust its behavior through environmental feedback. In this paper, we formulate a sequential decision problem to jointly perform band assignment and beam management. We propose a method based on hierarchical RL (HRL) to handle the complexity of the problem by separating the policies for band selection and beam management. We evaluate the proposed HRL-based algorithm on a realistic channel generated based on ray-tracing simulators. Our results show that the proposed approach outperforms traditional RL approaches in terms of reduced beam training overhead and increased data rates under a realistic vehicular channel. 
    more » « less
  5. We study off-dynamics Reinforcement Learning (RL), where the policy is trained on a source domain and deployed to a distinct target domain. We aim to solve this problem via online distributionally robust Markov decision processes (DRMDPs), where the learning algorithm actively interacts with the source domain while seeking the optimal performance under the worst possible dynamics that is within an uncertainty set of the source domain's transition kernel. We provide the first study on online DRMDPs with function approximation for off-dynamics RL. We find that DRMDPs' dual formulation can induce nonlinearity, even when the nominal transition kernel is linear, leading to error propagation. By designing a $$d$$-rectangular uncertainty set using the total variation distance, we remove this additional nonlinearity and bypass the error propagation. We then introduce DR-LSVI-UCB, the first provably efficient online DRMDP algorithm for off-dynamics RL with function approximation, and establish a polynomial suboptimality bound that is independent of the state and action space sizes. Our work makes the first step towards a deeper understanding of the provable efficiency of online DRMDPs with linear function approximation. Finally, we substantiate the performance and robustness of DR-LSVI-UCB through different numerical experiments. 
    more » « less