Increasing e-commerce activity, competition for shorter delivery times, and innovations in transportation technologies have pushed the industry toward instant delivery logistics. This paper studies a facility location and online demand allocation problem applicable to a logistics company expanding to offer instant delivery service using unmanned aerial vehicles, or drones. The problem is decomposed into two stages. During the planning stage, facilities are located and product and battery capacity are allocated. During the operational stage, customers place orders dynamically and real-time demand allocation decisions are made. The paper explores a multi-armed bandit framework for maximizing the cumulative reward realized by the logistics company subject to various capacity constraints and compares it with other strategies. The multi-armed bandit framework provides about 7% more reward than the second-best strategy when tested on standard test instances. A case study based in the Portland Metro Area showed that multi-armed bandits can outperform the second-best strategy by more than 20%.
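The paper's operational-stage algorithm is not reproduced here, but the core bandit idea it describes can be sketched: treat each facility as an arm, route each arriving order to the facility chosen by a UCB1-style index, and exclude facilities whose allocated capacity is exhausted. Everything below (function names, reward model, capacities) is an illustrative assumption, not the authors' implementation.

```python
import math
import random

def ucb1_allocate(n_facilities, reward_fn, capacities, horizon):
    """Serve a stream of orders with UCB1, treating each facility as an arm.

    Facilities with exhausted capacity are excluded from selection.
    Illustrative sketch only -- not the paper's algorithm.
    """
    counts = [0] * n_facilities
    means = [0.0] * n_facilities
    total = 0.0
    for t in range(1, horizon + 1):
        avail = [i for i in range(n_facilities) if capacities[i] > 0]
        if not avail:
            break
        untried = [i for i in avail if counts[i] == 0]
        if untried:
            arm = untried[0]  # pull every arm once before trusting the index
        else:
            arm = max(avail, key=lambda i: means[i]
                      + math.sqrt(2 * math.log(t) / counts[i]))
        r = reward_fn(arm)  # reward observed for serving this order
        counts[arm] += 1
        means[arm] += (r - means[arm]) / counts[arm]
        capacities[arm] -= 1
        total += r
    return total, counts

# toy run: three hypothetical facilities with Bernoulli order rewards
random.seed(0)
p = [0.3, 0.7, 0.5]
total, counts = ucb1_allocate(3, lambda i: float(random.random() < p[i]),
                              capacities=[100, 100, 100], horizon=200)
print(counts)
```

In a fuller sketch, the planning-stage decisions (which facilities to open, how much product and battery capacity to stock) would fix `capacities` before the operational loop runs.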
- PAR ID: 10379476
- Publisher / Repository: SAGE Publications
- Journal Name: Transportation Research Record: Journal of the Transportation Research Board
- Volume: 2676
- Issue: 7
- ISSN: 0361-1981
- Pages: 697-710
- Sponsoring Org: National Science Foundation
More Like this
- On-demand warehousing platforms match companies that have underutilized warehouse and distribution capabilities with customers who need extra space or distribution services. These new business models have unique advantages, in terms of reduced capacity and commitment granularity, but also have different cost structures compared with traditional ways of obtaining distribution capabilities. This research is the first quantitative analysis to consider distribution network strategies given the advent of on-demand warehousing. Our multi-period facility location model, a mixed-integer linear program, simultaneously determines location-allocation decisions for three distribution center types (self-distribution, 3PL/lease, on-demand). A simulation model operationally evaluates the impact of the planned distribution strategy when various uncertainties occur. Computational experiments for a company receiving products produced internationally to fulfil a set of regional customer demands illustrate that the power of on-demand warehousing lies in creating hybrid network designs that use self-distribution facilities more efficiently through improved capacity utilization. However, the business case for on-demand warehousing is shown to be influenced by several factors, namely on-demand capacity availability, responsiveness requirements, and demand patterns. This work supports a firm's use of on-demand warehousing if it has tight response requirements, for example for same-day delivery; however, if a firm has relaxed response requirements, then on-demand warehousing is only recommended if capacity availability of planned on-demand services is high. We also analyze capacity flexibility options leased by third-party logistics companies for a premium price and draw attention to the importance of offering more granular solutions to stay competitive in the market.
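The fixed-commitment versus pay-per-use trade-off this study formalizes can be illustrated with a toy, single-location relaxation of the multi-period model. The cost figures below are invented for illustration; the actual MILP links periods together through location-allocation and commitment decisions.

```python
# Hypothetical (fixed, per-unit) costs per period for the three DC types.
# These numbers are invented for illustration only.
COSTS = {
    "self-distribution": (900, 1.0),  # high fixed cost, cheap per unit
    "3PL/lease": (400, 2.5),          # medium fixed, medium per unit
    "on-demand": (0, 4.0),            # no commitment, expensive per unit
}

def period_cost(dc_type, demand):
    fixed, per_unit = COSTS[dc_type]
    return fixed + per_unit * demand

def best_plan(demands):
    """Choose the cheapest DC type independently in each period.

    A toy relaxation: the paper's model ties periods together, so the
    real optimum cannot be computed period by period like this.
    """
    plan = []
    total = 0.0
    for d in demands:
        choice = min(COSTS, key=lambda t: period_cost(t, d))
        plan.append(choice)
        total += period_cost(choice, d)
    return plan, total

plan, total = best_plan([50, 400, 120])
print(plan)   # low demand favors on-demand; high demand favors self-distribution
print(total)
```

Even this toy version reproduces the abstract's qualitative finding: on-demand capacity wins at low or spiky demand, while committed self-distribution wins at steady high volume, which is what makes hybrid designs attractive.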
- We propose and evaluate a learning-based framework to address multi-agent resource allocation in coupled wireless systems. In particular, we consider multiple agents (e.g., base stations, access points, etc.) that choose among a set of resource allocation options toward achieving their own performance objectives/requirements, where the performance observed at each agent is further coupled with the actions chosen by the other agents, e.g., through interference, channel leakage, etc. The challenge is to find the best collective action. To that end we propose a Multi-Armed Bandit (MAB) framework wherein the best actions (aka arms) are adaptively learned through online reward feedback. Our focus is on systems that are "weakly coupled," wherein the best arm of each agent is invariant to the others' arm selections the majority of the time; this majority structure enables the development of lightweight, efficient algorithms. This structure is commonly found in many wireless settings such as channel selection and power control. We develop a bandit algorithm based on the Track-and-Stop strategy, which achieves logarithmic regret with respect to a genie. Finally, through simulation, we exhibit the potential use of our model and algorithm in several wireless application scenarios.
- (Feldman, Vitaly; Ligett, Katrina; Sabato, Sivan, Eds.) Many real-world problems like Social Influence Maximization face the dilemma of choosing the best $K$ out of $N$ options at a given time instant. This setup can be modeled as a combinatorial bandit which chooses $K$ out of $N$ arms at each time, with an aim to achieve an efficient trade-off between exploration and exploitation. This is the first work on combinatorial bandits where the feedback received can be a non-linear function of the chosen $K$ arms. The direct use of a multi-armed bandit would require choosing among $N$-choose-$K$ options, making the state space large. In this paper, we present a novel algorithm that is computationally efficient and whose storage is linear in $N$. The proposed algorithm is a divide-and-conquer based strategy that we call CMAB-SM. Further, the proposed algorithm achieves a regret bound of $\tilde{O}(K^{\frac{1}{2}}N^{\frac{1}{3}}T^{\frac{2}{3}})$ for a time horizon $T$, which is sub-linear in all parameters $T$, $N$, and $K$.
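CMAB-SM itself is not reproduced here. Under the simpler semi-bandit assumption (per-arm feedback, rather than the non-linear joint feedback the paper handles), a naive top-$K$ UCB baseline looks like the following; all constants and the noiseless feedback model are chosen purely for illustration.

```python
import math

def topk_ucb(est, counts, t, k):
    """Pick the k arms with the highest UCB indices (semi-bandit baseline;
    CMAB-SM addresses the harder full-bandit, non-linear-feedback setting)."""
    def index(i):
        if counts[i] == 0:
            return float("inf")  # force one pull of every arm first
        return est[i] + math.sqrt(2 * math.log(t) / counts[i])
    return sorted(range(len(counts)), key=index, reverse=True)[:k]

# deterministic toy run: 5 arms, choose K = 2 per round, noiseless feedback
true = [0.1, 0.9, 0.4, 0.8, 0.2]
est, cnt = [0.0] * 5, [0] * 5
for t in range(1, 101):
    for i in topk_ucb(est, cnt, t, 2):
        cnt[i] += 1
        est[i] += (true[i] - est[i]) / cnt[i]
print(cnt)  # arms 1 and 3, the two best, accumulate the most pulls
```

Note that this baseline still only maintains $N$ per-arm statistics; the $N$-choose-$K$ blow-up the abstract mentions arises when each *subset* of $K$ arms must be treated as its own arm because the feedback is a non-linear function of the subset.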
- This study investigates the problem of decentralized dynamic resource allocation optimization for ad-hoc network communication with the support of reconfigurable intelligent surfaces (RIS), leveraging a reinforcement learning framework. In the present context of cellular networks, device-to-device (D2D) communication stands out as a promising technique for enhancing spectrum efficiency. Simultaneously, RIS have gained considerable attention for their ability to enhance the quality of dynamic wireless networks by maximizing spectrum efficiency without increasing power consumption. However, prevalent centralized D2D transmission schemes require global information, leading to significant signaling overhead. Conversely, existing distributed schemes, while avoiding the need for global information, often demand frequent information exchange among D2D users and fall short of achieving global optimization. This paper introduces a framework comprising an outer loop and an inner loop. In the outer loop, decentralized dynamic resource allocation optimization is developed for self-organizing network communication aided by RIS, accomplished through a multi-player multi-armed bandit approach that yields strategies for RIS and resource block selection. Notably, these strategies operate without requiring signal interaction during execution. Meanwhile, in the inner loop, the Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm is adopted for cooperative learning with neural networks (NNs) to obtain optimal transmit power control and RIS phase shift control for multiple users, given the RIS and resource block selection policy from the outer loop. Through the utilization of optimization theory, distributed optimal resource allocation can be attained as the outer and inner reinforcement learning algorithms converge over time. Finally, a series of numerical simulations is presented to validate and illustrate the effectiveness of the proposed scheme.
- We develop a novel and generic algorithm for the adversarial multi-armed bandit problem (or, more generally, the combinatorial semi-bandit problem). When instantiated differently, our algorithm achieves various new data-dependent regret bounds improving on previous work. Examples include: 1) a regret bound depending on the variance of only the best arm; 2) a regret bound depending on the first-order path-length of only the best arm; 3) a regret bound depending on the sum of the first-order path-lengths of all arms as well as an important negative term, which together lead to faster convergence rates for some normal-form games with partial feedback; 4) a regret bound that simultaneously implies small regret when the best arm has small loss and logarithmic regret when there exists an arm whose expected loss is always smaller than those of the other arms by a fixed gap (e.g., the classic i.i.d. setting). In some cases, such as the last two results, our algorithm is completely parameter-free. The main idea of our algorithm is to apply optimism and adaptivity techniques to the well-known Online Mirror Descent framework with a special log-barrier regularizer. The challenges are to come up with appropriate optimistic predictions and correction terms in this framework. Some of our results also crucially rely on a sophisticated increasing learning rate schedule.
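The optimistic OMD-with-log-barrier algorithm this abstract describes is substantially more involved; for orientation, here is the classic Exp3 baseline for the adversarial bandit, which the data-dependent bounds above improve upon. The loss sequence and step size are illustrative assumptions.

```python
import math
import random

def exp3(loss_fn, n_arms, horizon, eta=0.1):
    """Classic Exp3 with importance-weighted loss estimates.

    Baseline only: the abstract's algorithm instead runs Online Mirror
    Descent with a log-barrier regularizer and optimistic predictions.
    """
    weights = [1.0] * n_arms
    counts = [0] * n_arms
    total_loss = 0.0
    for _ in range(horizon):
        z = sum(weights)
        probs = [w / z for w in weights]
        arm = random.choices(range(n_arms), weights=probs)[0]
        loss = loss_fn(arm)            # only the played arm's loss is observed
        total_loss += loss
        est = loss / probs[arm]        # unbiased importance-weighted estimate
        weights[arm] *= math.exp(-eta * est)
        counts[arm] += 1
    return total_loss, counts

random.seed(1)
# fixed losses for illustration: arm 0 is consistently the best
total, counts = exp3(lambda i: 0.2 if i == 0 else 0.6, n_arms=3, horizon=500)
print(counts)
```

Exp3's guarantee is a worst-case $O(\sqrt{TN\log N})$ regret regardless of the loss sequence; the bounds listed in the abstract are data-dependent refinements that shrink on benign instances (small best-arm variance or path-length) while retaining adversarial robustness.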