Title: DYNAMIC OPTIMIZATION OF DRONE DISPATCH FOR SUBSTANCE OVERDOSE RESCUE
Opioid overdose rescue is highly time-sensitive, so drone-delivered naloxone has the potential to be a transformative innovation thanks to its easily deployable and flexible nature. We formulate a Markov Decision Process (MDP) model that dispatches the appropriate drone when an overdose request arrives and relocates the drone to its next waiting location once it has completed its current task. Because the underlying optimization problem suffers from the curse of dimensionality, we solve it using ad-hoc state aggregation and evaluate it through a simulation with higher granularity. Our simulation-based comparative study draws on emergency medical service data from the state of Indiana. We compare the optimal policy resulting from the scaled-down MDP model against a myopic baseline policy, and we examine the impact of drone type and service area type on outcomes, offering insight into the performance of the suboptimal MDP policy under various settings.
Award ID(s): 1761022
PAR ID: 10190834
Author(s) / Creator(s):
Date Published:
Journal Name: Proceedings of the 2020 Winter Simulation Conference
Format(s): Medium: X
Sponsoring Org: National Science Foundation
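To make the dispatch-and-relocation formulation concrete, below is a minimal sketch of value iteration on an aggregated MDP of this kind. The three zones, call probabilities, travel costs, and discount factor are illustrative placeholders, not the paper's calibrated Indiana model.

    import numpy as np

    # State: zone where the idle drone waits; action: zone to relocate to.
    # All numbers below are illustrative, not from the paper.
    N_ZONES = 3
    P_CALL = np.array([0.5, 0.3, 0.2])        # assumed call probability per zone
    TRAVEL = np.array([[0., 4., 8.],          # assumed travel-time cost zone -> zone
                       [4., 0., 5.],
                       [8., 5., 0.]])
    GAMMA = 0.95

    def q_value(V, s, a):
        # Relocate s -> a, pay the expected response cost of waiting in a,
        # then (as a simplification) resume waiting in zone a.
        return TRAVEL[s, a] + float(P_CALL @ TRAVEL[a]) + GAMMA * V[a]

    V = np.zeros(N_ZONES)
    for _ in range(1000):
        V_new = np.array([min(q_value(V, s, a) for a in range(N_ZONES))
                          for s in range(N_ZONES)])
        if np.max(np.abs(V_new - V)) < 1e-9:
            break
        V = V_new

    policy = [min(range(N_ZONES), key=lambda a: q_value(V, s, a))
              for s in range(N_ZONES)]
    print("relocation target per waiting zone:", policy)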
More Like this
  1. In offline reinforcement learning (RL), the goal is to learn a highly rewarding policy based solely on a dataset of historical interactions with the environment. The ability to train RL policies offline would greatly expand where RL can be applied, its data efficiency, and its experimental velocity. Prior work in offline RL has been confined almost exclusively to model-free RL approaches. In this work, we present MOReL, an algorithmic framework for model-based offline RL. This framework consists of two steps: (a) learning a pessimistic MDP (P-MDP) using the offline dataset; (b) learning a near-optimal policy in this P-MDP. The learned P-MDP has the property that for any policy, the performance in the real environment is approximately lower-bounded by the performance in the P-MDP. This enables the P-MDP to serve as a good surrogate for policy evaluation and learning, and to avoid common pitfalls of model-based RL such as model exploitation. Theoretically, we show that MOReL is minimax optimal (up to log factors) for offline RL. Through experiments, we show that MOReL matches or exceeds state-of-the-art results on widely studied offline RL benchmarks. Moreover, the modular design of MOReL enables future advances in its components (e.g., model learning, planning) to translate directly into improvements for offline RL.
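    As a rough illustration of step (a), the sketch below shows one way a pessimistic transition function can be built from an ensemble of learned dynamics models: where the ensemble disagrees beyond a threshold, the P-MDP falls into an absorbing, heavily penalized HALT state. The predict interface, the threshold, and the penalty are assumptions for the sketch, not MOReL's exact construction.

        import numpy as np

        HALT_PENALTY = -100.0   # assumed lower bound on per-step reward
        THRESHOLD = 0.1         # assumed ensemble-disagreement threshold

        def pmdp_step(ensemble, reward_fn, s, a):
            """One P-MDP transition: follow the learned model where the
            ensemble agrees; otherwise enter an absorbing HALT state."""
            # `ensemble` is a list of learned dynamics models with a
            # hypothetical .predict(s, a) -> next-state method.
            preds = np.stack([m.predict(s, a) for m in ensemble])
            disagreement = np.max(np.linalg.norm(preds - preds.mean(0), axis=-1))
            if disagreement > THRESHOLD:
                return None, HALT_PENALTY, True   # done=True: HALT is absorbing
            return preds.mean(0), reward_fn(s, a), False

    A policy optimized inside this surrogate is steered away from regions the offline data does not support, which is what underlies the approximate lower bound on real-environment performance.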
  2. We study optimal pricing in a single-server queueing system that can be observable or unobservable, depending on how customers receive information to estimate sojourn time. Our primary objective is to determine whether the service provider is better off making the system observable or unobservable under optimal pricing. We formulate the optimal pricing problem using Markov decision process (MDP) models for both observable and unobservable systems. For unobservable systems, the problem is studied using an MDP with a fixed-point equation as an equilibrium constraint. We show that the MDPs for both observable and unobservable queues are special cases of a generalized arrivals-based MDP model, in which the optimal arrival rate (rather than price) is set in each state. We then show that the optimal policy solving the generalized MDP exhibits a monotone structure: the optimal arrival rate is non-increasing in the queue length, which allows efficient algorithms for determining optimal pricing policies to be developed. Next, we show that if no customers overestimate sojourn time in the observable system, it is in the interest of the service provider to make the system observable. We also show that if all customers overestimate sojourn time, the service provider is better off making the system unobservable. Lastly, numerical results indicate that when customers are heterogeneous in estimating their sojourn time, the service provider gains more by making the system observable, provided that customers on average do not significantly overestimate sojourn time.
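    A minimal sketch of that generalized arrivals-based MDP appears below: states are queue lengths, the action is the admitted arrival rate, and value iteration runs on the uniformized chain. The revenue curve, holding cost, and rate grid are hypothetical; in the paper, the arrival rate is induced by the posted price.

        import numpy as np

        MU, HOLD, BETA, N_MAX = 1.0, 0.2, 0.01, 20   # assumed rates and costs
        LAMBDAS = np.linspace(0.0, 0.9, 10)          # candidate arrival rates
        UNIF = LAMBDAS.max() + MU                    # uniformization constant

        def revenue(lam):
            # Assumed concave revenue-rate curve induced by the demand relation.
            return lam * (2.0 - lam)

        def q_value(V, n, lam):
            mu_eff = MU if n > 0 else 0.0
            up, down = min(n + 1, N_MAX), max(n - 1, 0)
            flow = lam * V[up] + mu_eff * V[down] + (UNIF - lam - mu_eff) * V[n]
            return (revenue(lam) - HOLD * n + flow) / (UNIF + BETA)

        V = np.zeros(N_MAX + 1)
        for _ in range(20000):
            V_new = np.array([max(q_value(V, n, lam) for lam in LAMBDAS)
                              for n in range(N_MAX + 1)])
            if np.max(np.abs(V_new - V)) < 1e-10:
                break
            V = V_new

        policy = [max(LAMBDAS, key=lambda lam: q_value(V, n, lam))
                  for n in range(N_MAX + 1)]
        print(np.round(policy, 2))   # rates should be non-increasing in n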
  3. In offline reinforcement learning (RL), the goal is to learn a highly rewarding policy based solely on a dataset of historical interactions with the environment. This serves as an extreme test of an agent's ability to make effective use of historical data, which is known to be critical for efficient RL. Prior work in offline RL has been confined almost exclusively to model-free RL approaches. In this work, we present MOReL, an algorithmic framework for model-based offline RL. This framework consists of two steps: (a) learning a pessimistic MDP using the offline dataset; (b) learning a near-optimal policy in this pessimistic MDP. The design of the pessimistic MDP is such that for any policy, the performance in the real environment is approximately lower-bounded by the performance in the pessimistic MDP. This enables the pessimistic MDP to serve as a good surrogate for policy evaluation and learning. Theoretically, we show that MOReL is minimax optimal (up to log factors) for offline RL. Empirically, MOReL matches or exceeds state-of-the-art results on widely used offline RL benchmarks. Overall, the modular design of MOReL enables translating advances in its components (e.g., model learning, planning) into improvements in offline RL.
  4. This paper introduces a control-theoretic approach to model, analyze, and select optimal security policies for Moving Target Defense (MTD) deployment strategies. A Markov Decision Process (MDP) scheme is presented to model the states of the system from the attacker's point of view. A value iteration method based on the Bellman optimality equation is employed to select an optimal policy for each state defined in the system. The model is then used to analyze the impact of various costs on the optimal policy, and it is applied to two case studies to evaluate its performance.
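    For concreteness, here is a generic value-iteration loop of the kind described, over a toy attacker-progress MDP. The three attack states, the two defense actions (keep the current configuration versus move/shuffle it), and all probabilities and costs are placeholders rather than the paper's model.

        import numpy as np

        GAMMA = 0.9
        # P[a, s, s']: probability the system moves from attack state s to s'
        P = np.array([
            [[0.7, 0.2, 0.1],      # keep: the attacker tends to progress
             [0.1, 0.6, 0.3],
             [0.0, 0.1, 0.9]],
            [[0.9, 0.1, 0.0],      # move: progress is mostly reset, at a cost
             [0.7, 0.2, 0.1],
             [0.5, 0.3, 0.2]],
        ])
        # R[a, s]: reward = -(damage in state s) - (cost of defense action a)
        R = np.array([[ 0.0, -2.0, -10.0],
                      [-1.0, -3.0, -11.0]])

        V = np.zeros(3)
        while True:
            # Bellman optimality backup over both actions at once
            Q = R + GAMMA * np.einsum("asn,n->as", P, V)
            V_new = Q.max(axis=0)
            if np.max(np.abs(V_new - V)) < 1e-9:
                break
            V = V_new
        policy = Q.argmax(axis=0)
        print("defense action per state:", policy)   # states that justify a move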
  5. The flexibility and scale of networks achievable by modern cloud computing architectures, particularly Kubernetes (K8s)-based applications, are rivaled only by the complexity required to operate at scale in a responsive manner. This leaves applications vulnerable to Economic Denial of Sustainability (EDoS) attacks, which aim to force service withdrawal by draining the target of the financial means to support the application. With the public cloud market projected to reach three quarters of a trillion US dollars by the end of 2025, this is a major consideration. In this paper, we develop a theoretical model to reason about EDoS attacks on K8s. We determine scaling thresholds based on Markov Decision Processes (MDPs), incorporating the costs of operating K8s replicas, Service Level Agreement violations, and the minimum service charges imposed by billing structures. On top of the MDP model, we build a Stackelberg game that determines the circumstances under which an adversary injects traffic. The optimal policy returned by the MDP is generally, but not always, of hysteresis type. Specifically, through numerical evaluations we show examples where charges at an hourly resolution eliminate incentives for scaling down resources. Furthermore, through experiments on a realistic K8s cluster, we show that, depending on the billing model employed and the customer workload characteristics, an EDoS attack can produce a 4× increase in traffic intensity and a 3.6× decrease in efficiency. Interestingly, increasing the intensity of an attack may render it less efficient per unit of attack power. Finally, we demonstrate a proof-of-concept countermeasure involving custom scaling metrics in which autoscaling decisions are randomized, and we show that per-minute utilization charges are reduced compared to standard scaling, with a negligible drop in requests served.
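    Below is a minimal sketch of an MDP in this spirit: the state is a (load level, replica count) pair, the action selects the next replica count, and costs combine replica charges, SLA penalties, and a switching charge. All numbers are assumptions; the paper's model additionally captures minimum service charges and the attacker's Stackelberg response.

        LOADS = range(5)           # discretized traffic-intensity levels (assumed)
        REPLICAS = range(1, 5)     # allowed replica counts (assumed)
        GAMMA = 0.95
        C_REPLICA, C_SLA, C_SWITCH, CAP = 1.0, 6.0, 0.8, 1.5   # assumed costs

        def load_next(l):
            # Assumed load dynamics: random walk with equal up/stay/down probability.
            return [(max(l - 1, 0), 1/3), (l, 1/3), (min(l + 1, 4), 1/3)]

        def cost(l, k, k2):
            sla = C_SLA if l > CAP * k2 else 0.0   # SLA violated when under-provisioned
            return C_REPLICA * k2 + sla + C_SWITCH * abs(k2 - k)

        def q_value(V, l, k, k2):
            return cost(l, k, k2) + GAMMA * sum(p * V[(l2, k2)] for l2, p in load_next(l))

        states = [(l, k) for l in LOADS for k in REPLICAS]
        V = {s: 0.0 for s in states}
        for _ in range(2000):
            V_new = {(l, k): min(q_value(V, l, k, k2) for k2 in REPLICAS)
                     for (l, k) in states}
            if max(abs(V_new[s] - V[s]) for s in states) < 1e-9:
                V = V_new
                break
            V = V_new

        policy = {(l, k): min(REPLICAS, key=lambda k2: q_value(V, l, k, k2))
                  for (l, k) in states}
        # With the switching charge, scale-up and scale-down thresholds can
        # differ at the same load level, i.e. a hysteresis-type structure.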