This paper studies multi-stage systems with end-to-end bandit feedback. In such systems, each job needs to go through multiple stages, each managed by a different agent, before generating an outcome. Each agent can only control its own action and observe the final outcome of the job; it has neither knowledge of, nor control over, the actions taken by agents in the next stage. The goal of this paper is to develop distributed online learning algorithms that achieve sublinear regret in adversarial environments. This setting significantly expands the traditional multi-armed bandit problem, which considers only one agent and one stage. In addition to the exploration-exploitation dilemma of the traditional multi-armed bandit problem, we show that the consideration of multiple stages introduces a third component, education, where an agent needs to choose its actions to facilitate the learning of agents in the next stage. To solve this newly introduced exploration-exploitation-education trilemma, we propose a simple distributed online learning algorithm, ϵ-EXP3. We theoretically prove that ϵ-EXP3 is a no-regret policy, i.e., it achieves sublinear regret. Simulation results show that the ϵ-EXP3 algorithm significantly outperforms existing no-regret online learning algorithms for the traditional multi-armed bandit problem.
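The record stops at the abstract, so the exact update rule of ϵ-EXP3 is not given here. Below is a minimal single-agent sketch, assuming ϵ-EXP3 mixes a uniform exploration rate ε into standard EXP3 exponential weights; the class name, learning rate η, and the toy two-stage usage are illustrative, not the paper's specification.

```python
import math
import random

class EpsilonEXP3:
    """Sketch of an EXP3 learner with an explicit uniform-exploration
    rate epsilon, following the high-level description in the abstract.
    The paper's actual eps-EXP3 update rule may differ."""

    def __init__(self, n_actions, eta=0.05, epsilon=0.1):
        self.n = n_actions
        self.eta = eta                  # learning rate (assumed parameter)
        self.eps = epsilon              # forced uniform exploration rate
        self.weights = [1.0] * n_actions

    def probabilities(self):
        total = sum(self.weights)
        # Mix exponential weights with uniform exploration.
        return [(1 - self.eps) * w / total + self.eps / self.n
                for w in self.weights]

    def select(self):
        p = self.probabilities()
        return random.choices(range(self.n), weights=p)[0], p

    def update(self, action, reward, p):
        # Importance-weighted estimate for the played action only,
        # as in standard EXP3 (reward assumed in [0, 1]).
        est = reward / p[action]
        self.weights[action] *= math.exp(self.eta * est)

# Toy two-stage usage: each agent runs its own learner and sees only
# the end-to-end reward, never the other stage's action.
stage1, stage2 = EpsilonEXP3(3), EpsilonEXP3(3)
a1, p1 = stage1.select()
a2, p2 = stage2.select()
reward = 1.0 if (a1, a2) == (0, 2) else 0.0   # hypothetical outcome
stage1.update(a1, reward, p1)
stage2.update(a2, reward, p2)
```

The forced ε-uniform exploration is what makes the first stage keep "educating" the downstream learner, since the downstream agent only learns about pairs of actions the upstream agent actually plays.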
Multi-Armed Bandit On-Time Arrival Algorithms for Sequential Reliable Route Selection under Uncertainty
Traditionally, vehicles act only as servers, transporting passengers and goods. With the increasing amount of sensor equipment in vehicles, including automated vehicles, there is a need to test algorithms that consider the dual role of vehicles as both servers and sensors. This paper formulates a sequential route selection problem as a shortest path problem with on-time arrival reliability under a multi-armed bandit setting, a type of reinforcement learning model. A decision-maker has to make a finite set of decisions sequentially on departure time and path between a fixed origin-destination pair such that on-time reliability is maximized while travel time is minimized. The upper confidence bound algorithm is extended to handle this problem. Several tests are conducted: first, the method is verified on simulated data; then, a real-data scenario is constructed of a hotel shuttle service from midtown Manhattan in New York City providing hourly access to John F. Kennedy International Airport. Results suggest that route selection with multi-armed bandit learning algorithms can be effective, but neglecting passenger scheduling constraints can degrade on-time arrival reliability by as much as 4.8% and the combined reliability and travel time measure by 66.1%.
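The paper's extended upper confidence bound algorithm is not reproduced in this record. A plain UCB1 sketch is given below, assuming a reward that blends an on-time indicator with a decreasing function of travel time; `sample_travel_time`, the weight `alpha`, and the reward shape are assumptions, not the authors' formulation.

```python
import math

def ucb_route_selection(routes, sample_travel_time, deadline, horizon,
                        alpha=0.5):
    """Minimal UCB1 sketch for sequential route/departure-time choice.
    `routes` is a list of (departure_time, path) arms and
    `sample_travel_time(arm)` draws a noisy travel time for that arm."""
    k = len(routes)
    n = [0] * k          # pulls per arm
    mean = [0.0] * k     # empirical mean reward per arm
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1  # play each arm once to initialize estimates
        else:
            arm = max(range(k),
                      key=lambda i: mean[i] + math.sqrt(2 * math.log(t) / n[i]))
        tt = sample_travel_time(routes[arm])
        on_time = 1.0 if tt <= deadline else 0.0
        # Assumed reward: reliability term plus a travel-time penalty term.
        reward = alpha * on_time + (1 - alpha) * (1.0 / (1.0 + tt))
        n[arm] += 1
        mean[arm] += (reward - mean[arm]) / n[arm]
    return mean, n
```

Each (departure time, path) pair is treated as one arm; the confidence radius shrinks as an arm is sampled, shifting pulls toward reliably fast options.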
- Award ID(s): 1652735
- PAR ID: 10130355
- Date Published:
- Journal Name: Transportation Research Record: Journal of the Transportation Research Board
- Volume: 2673
- Issue: 10
- ISSN: 0361-1981
- Page Range / eLocation ID: 673 to 682
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
Increasing e-commerce activity, competition for shorter delivery times, and innovations in transportation technologies have pushed the industry toward instant delivery logistics. This paper studies a facility location and online demand allocation problem applicable to a logistics company expanding to offer instant delivery service using unmanned aerial vehicles, or drones. The problem is decomposed into two stages. During the planning stage, the facilities are located and product and battery capacities are allocated. During the operational stage, customers place orders dynamically and real-time demand allocation decisions are made. The paper explores a multi-armed bandit framework for maximizing the cumulative reward realized by the logistics company subject to various capacity constraints and compares it with other strategies. The multi-armed bandit framework provides about 7% more reward than the second-best strategy when tested on standard test instances. A case study based in the Portland Metro Area showed that multi-armed bandits can outperform the second-best strategy by more than 20%.
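As a concrete illustration of the operational stage, here is a hedged sketch of a capacity-aware UCB allocation rule: each facility is an arm, and an incoming order goes to the feasible facility with the highest upper confidence bound. The data structures and reward model are assumptions; the paper's framework handles product and battery capacities jointly.

```python
import math

def allocate_order(stats, capacity, t):
    """Illustrative capacity-aware UCB step for one incoming order.
    `stats` maps facility -> (pulls, mean_reward); `capacity` maps
    facility -> remaining capacity; `t` is the order index."""
    feasible = [f for f in stats if capacity[f] > 0]
    if not feasible:
        return None                      # reject order: no capacity left

    def ucb(f):
        pulls, mean = stats[f]
        if pulls == 0:
            return float("inf")          # force initial exploration
        return mean + math.sqrt(2 * math.log(t) / pulls)

    return max(feasible, key=ucb)
```

Filtering to feasible arms before taking the UCB maximum is one simple way to respect the planning-stage capacity decisions while still learning demand online.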
As Autonomous Vehicles (AVs) become viable for e-hailing services to operate, especially as telecom companies begin deploying next-generation wireless networks (known as 5G), many new technologies may be applied in these vehicles. Dynamic route switching is one such technology: it helps vehicles find the best possible route based on real-time traffic information. However, allowing all AVs to choose their own optimal routes is not the best solution for a complex city network, since each vehicle ignores the negative effect its additional congestion imposes on the road system. As a result, some links may become over-congested, degrading the performance of the whole road network. Meanwhile, travel time reliability, especially during peak hours, is an essential factor in improving customers' ride experience. Unfortunately, these two issues have received relatively little attention. In this paper, we design a link-based dynamic pricing model to improve road network performance and travel time reliability at the same time. In this approach, we assume that all links are eligible for dynamic pricing and that AVs are perfectly informed of updated traffic conditions and follow the dynamic road pricing. A heuristic approach is developed to address this computationally difficult problem. The output includes link-based surcharges, new travel demand, and traffic conditions that bring system performance close to the System Optimal (SO) solution while maintaining travel time reliability. Finally, we evaluate the effectiveness and efficiency of the proposed model on the well-known Sioux Falls test network.
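The paper's heuristic is not detailed in this abstract. One standard way to realize link-based pricing toward the system optimum is to move each link's toll toward its marginal external cost; the sketch below illustrates that idea under an assumed BPR-style link model, and is not the authors' method.

```python
class Link:
    """Toy BPR-style link: t(x) = t0 * (1 + 0.15 * (x / cap)**4)."""
    def __init__(self, t0, cap, toll=0.0):
        self.t0, self.cap, self.current_toll = t0, cap, toll

    def travel_time_derivative(self, x):
        # d/dx of the BPR function above: t0 * 0.6 * x^3 / cap^4
        return self.t0 * 0.6 * x ** 3 / self.cap ** 4

def update_surcharges(links, flows, step=0.5):
    """Heuristic sketch: nudge each link's toll toward its marginal
    external cost so informed vehicles' route choices approach the
    system-optimal flow pattern. `step` damps oscillation between
    repricing and reassignment iterations (an assumed parameter)."""
    tolls = {}
    for a, link in links.items():
        x = flows[a]
        # Marginal external cost: extra delay one more vehicle imposes
        # on all x vehicles already on the link.
        mec = link.travel_time_derivative(x) * x
        tolls[a] = (1 - step) * link.current_toll + step * mec
    return tolls
```

Iterating between this surcharge update and a traffic reassignment of perfectly informed AVs is one plausible reading of the heuristic loop the abstract describes.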
Building an integrated human-machine decision-making system requires developing effective interfaces between the human and the machine. We develop such an interface by studying the multi-armed bandit problem, a simple sequential decision-making paradigm that can model a variety of tasks. We construct Bayesian algorithms for the multi-armed bandit problem, prove conditions under which these algorithms achieve good performance, and empirically show that, with appropriate priors, these algorithms effectively model human choice behavior; the priors then form a principled interface from human to machine. We take a signal processing perspective on the prior estimation problem and develop methods to estimate the priors given human choice data.
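The specific Bayesian algorithms and prior-estimation methods are not spelled out in this record. A common instance of the idea, for a Bernoulli bandit with Beta priors, is Thompson sampling; the fitted prior values below are hypothetical stand-ins for priors estimated from human choice data.

```python
import random

def thompson_step(priors, counts):
    """One Thompson-sampling step for Bernoulli arms. `priors[i]` is a
    (alpha, beta) Beta prior, e.g. fit to human choice data as the
    abstract suggests; `counts[i]` is (successes, failures) observed so
    far. The paper's exact Bayesian algorithm may differ."""
    samples = [random.betavariate(a + s, b + f)
               for (a, b), (s, f) in zip(priors, counts)]
    # Play the arm whose sampled success probability is largest.
    return max(range(len(samples)), key=lambda i: samples[i])

# Hypothetical fitted priors for three arms, no observations yet.
priors = [(2.0, 2.0), (1.0, 3.0), (3.0, 1.0)]
counts = [(0, 0)] * 3
arm = thompson_step(priors, counts)
```

Because the posterior starts at the prior, the estimated priors directly shape early machine behavior, which is exactly what makes them a usable human-to-machine interface.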
In multi-server queueing systems where there is no central queue holding all incoming jobs, job dispatching policies are used to assign incoming jobs to the queue at one of the servers. Classic job dispatching policies such as join-the-shortest-queue and shortest-expected-delay assume that the service rates and queue lengths of the servers are known to the dispatcher. In this work, we tackle the problem of job dispatching without knowledge of service rates or queue lengths, where the dispatcher can only obtain noisy estimates of the service rates by observing job departures. This problem presents a novel exploration-exploitation trade-off between sending jobs to all the servers to estimate their service rates and exploiting the currently known fastest servers to minimize the expected queueing delay. We propose a bandit-based exploration policy that learns the service rates from observed job departures. Unlike the standard multi-armed bandit problem, where only one out of a finite set of actions is optimal, here the optimal policy requires identifying the optimal fraction of incoming jobs to be sent to each server. We present a regret analysis and simulations to demonstrate the effectiveness of the proposed bandit-based exploration policy.
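The proposed policy and its learned routing fractions are not reproduced in this record. A hedged sketch of the trade-off follows, assuming service-rate estimates built from observed departures, ε-uniform exploration, and exploitation that splits jobs across servers in proportion to estimated rates; the names and the proportional rule are illustrative.

```python
import random

class BanditDispatcher:
    """Sketch of dispatching with unknown service rates: occasionally
    send a job to a random server (explore), otherwise route in
    proportion to estimated rates (exploit). The paper's policy and
    regret analysis may use a different routing rule."""

    def __init__(self, n_servers, epsilon=0.1):
        self.eps = epsilon
        self.busy_time = [1e-9] * n_servers   # observed busy time per server
        self.departures = [0] * n_servers     # observed departures per server

    def rate_estimate(self, i):
        # Noisy service-rate estimate: departures per unit busy time.
        return self.departures[i] / self.busy_time[i]

    def dispatch(self):
        n = len(self.departures)
        rates = [self.rate_estimate(i) for i in range(n)]
        if random.random() < self.eps or sum(rates) == 0:
            return random.randrange(n)        # explore every server
        # Exploit: split load across servers in proportion to rates,
        # approximating an optimal routing *fraction* per server.
        return random.choices(range(n), weights=rates)[0]

    def observe_departure(self, server, service_time):
        self.departures[server] += 1
        self.busy_time[server] += service_time
```

Routing probabilistically rather than always to the single fastest server reflects the abstract's point that the optimal policy is a fraction of jobs per server, not a single best arm.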