In this paper, we propose and study opportunistic contextual bandits - a special case of contextual bandits where the exploration cost varies under different environmental conditions, such as network load or return variation in recommendations. When the exploration cost is low, so is the actual regret of pulling a suboptimal arm (e.g., trying a suboptimal recommendation). Therefore, intuitively, we could explore more when the exploration cost is relatively low and exploit more when it is relatively high. Inspired by this intuition, for opportunistic contextual bandits with linear payoffs, we propose an Adaptive Upper-Confidence-Bound algorithm (AdaLinUCB) to adaptively balance the exploration-exploitation trade-off for opportunistic learning. We prove that AdaLinUCB achieves an O((log T)^2) problem-dependent regret upper bound, which has a smaller coefficient than that of the traditional LinUCB algorithm. Moreover, based on both synthetic and real-world datasets, we show that AdaLinUCB significantly outperforms other contextual bandit algorithms under large exploration cost fluctuations.
-
Cellular network performance depends heavily on the configuration of its network parameters. Current practice of parameter configuration relies largely on expert experience, which is often suboptimal, time-consuming, and error-prone. Therefore, it is desirable to automate this process to improve accuracy and efficiency via learning-based approaches. However, such approaches need to address several challenges in real operational networks: the lack of diverse historical data, a limited experiment budget set by network operators, and highly complex and unknown network performance functions. To address those challenges, we propose a collaborative learning approach that leverages data from different cells to boost learning efficiency and to improve network performance. Specifically, we formulate the problem as a transferable contextual bandit problem, and prove that by transfer learning, one could significantly reduce the regret bound. Based on the theoretical result, we further develop a practical algorithm that decomposes a cell's policy into a common homogeneous policy learned using all cells' data and a cell-specific policy that captures each individual cell's heterogeneous behavior. We evaluate our proposed algorithm via a simulator constructed using real network data and demonstrate faster convergence compared to baselines. More importantly, a live field test was also conducted on a real metropolitan cellular network consisting of 1,700+ cells to optimize five parameters for two weeks. Our proposed algorithm shows a significant performance improvement of 20%.
-
In this paper, we propose and study opportunistic bandits - a new variant of bandits where the regret of pulling a suboptimal arm varies under different environmental conditions, such as network load or produce price. When the load/price is low, so is the cost/regret of pulling a suboptimal arm (e.g., trying a suboptimal network configuration). Therefore, intuitively, we could explore more when the load/price is low and exploit more when the load/price is high. Inspired by this intuition, we propose an Adaptive Upper-Confidence-Bound (AdaUCB) algorithm to adaptively balance the exploration-exploitation tradeoff for opportunistic bandits. We prove that AdaUCB achieves O(log T) regret with a smaller coefficient than the traditional UCB algorithm. Furthermore, AdaUCB achieves O(1) regret with respect to T if the exploration cost is zero when the load level is below a certain threshold. Finally, based on both synthetic data and real-world traces, experimental results show that AdaUCB significantly outperforms other bandit algorithms, such as UCB and TS (Thompson Sampling), under large load/price fluctuations.
-
Cellular network configuration is critical for network performance. Current practice is labor-intensive, error-prone, and far from optimal. To automate efficient cellular network configuration, in this work, we propose an online-learning-based joint-optimization approach that addresses a few specific challenges: limited data availability, convoluted sample data, highly complex optimization due to interactions among neighboring cells, and the need to adapt to network dynamics. In our approach, to learn an appropriate utility function for a cell, we develop a neural-network-based model that addresses the convoluted sample data issue and achieves good accuracy based on data aggregation. Based on the utility function learned, we formulate a global network configuration optimization problem. To solve this high-dimensional nonconcave maximization problem, we design a Gibbs-sampling-based algorithm that converges to an optimal solution when a technical parameter is small enough. Furthermore, we design an online scheme that updates the learned utility function and solves the corresponding maximization problem efficiently to adapt to network dynamics. To illustrate the idea, we use the case study of pilot power configuration. Numerical results illustrate the effectiveness of the proposed approach.