Title: DYNAMIC OPTIMIZATION OF DRONE DISPATCH FOR SUBSTANCE OVERDOSE RESCUE
Opioid overdose rescue is highly time-sensitive, so drone-delivered naloxone has the potential to be a transformative innovation because drones are flexible and easy to deploy. We formulate a Markov Decision Process (MDP) model to dispatch the appropriate drone after an overdose request arrives and to relocate the drone to its next waiting location after it completes its current task. Since the underlying optimization problem is subject to the curse of dimensionality, we solve it using ad-hoc state aggregation and evaluate the resulting policy through a simulation with higher granularity. Our simulation-based comparative study is based on emergency medical service data from the state of Indiana. We compare the optimal policy resulting from the scaled-down MDP model with a myopic policy as the baseline. We consider the impact of drone type and service area type on outcomes, which offers insights into the performance of the MDP-based suboptimal policy under various settings.
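The comparison described in the abstract, an MDP policy against a myopic baseline, can be illustrated with a minimal sketch. This is not the authors' model: the three aggregated zones, the request probabilities, the travel times, the relocation weight, the return probability, and the discount factor are all hypothetical placeholders, and discounted value iteration is used here only as one common way to solve a small MDP.

```python
# Toy relocation MDP. Every number below is a hypothetical placeholder,
# not data or parameters from the paper.
GAMMA = 0.95                     # discount factor (assumed)
ZONES = [0, 1, 2]                # aggregated zones; state = zone where the drone waits
REQUEST_PROB = [0.2, 0.5, 0.3]   # probability the next request comes from each zone
TRAVEL_MIN = [[0, 4, 9],
              [4, 0, 5],
              [9, 5, 0]]         # travel time between zones, in minutes
RELOCATION_WEIGHT = 0.3          # cost per minute of empty relocation flight
RETURN_PROB = 0.7                # chance the drone ends up back at its waiting zone


def reward(state, action):
    """Negative cost of relocating to `action` plus expected response time from there."""
    relocation = RELOCATION_WEIGHT * TRAVEL_MIN[state][action]
    response = sum(p * TRAVEL_MIN[action][z] for z, p in zip(ZONES, REQUEST_PROB))
    return -(relocation + response)


def transition(action):
    """Distribution over the next waiting zone after one request is served."""
    probs = [(1 - RETURN_PROB) * p for p in REQUEST_PROB]
    probs[action] += RETURN_PROB
    return probs


def q_value(state, action, values):
    """One-step reward plus discounted expected value of the next state."""
    nxt = transition(action)
    return reward(state, action) + GAMMA * sum(p * v for p, v in zip(nxt, values))


def value_iteration(tol=1e-9):
    """Repeat Bellman optimality updates until the values stop changing."""
    values = [0.0] * len(ZONES)
    while True:
        new = [max(q_value(s, a, values) for a in ZONES) for s in ZONES]
        if max(abs(n - v) for n, v in zip(new, values)) < tol:
            return new
        values = new


def greedy_policy(values):
    """Policy that is greedy with respect to a given value function."""
    return [max(ZONES, key=lambda a: q_value(s, a, values)) for s in ZONES]


def myopic_policy():
    """Baseline that maximizes only the immediate reward."""
    return [max(ZONES, key=lambda a: reward(s, a)) for s in ZONES]


def evaluate(policy, tol=1e-9):
    """Iterative policy evaluation for a fixed policy."""
    values = [0.0] * len(ZONES)
    while True:
        new = [q_value(s, policy[s], values) for s in ZONES]
        if max(abs(n - v) for n, v in zip(new, values)) < tol:
            return new
        values = new


if __name__ == "__main__":
    v_opt = value_iteration()
    print("MDP policy:   ", greedy_policy(v_opt))
    print("Myopic policy:", myopic_policy())
    print("MDP values:   ", v_opt)
    print("Myopic values:", evaluate(myopic_policy()))
```

In a sketch like this, the MDP policy can never be worse than the myopic one in any state of the toy model; whether that advantage survives in a finer-grained simulation is the kind of question the study addresses.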
Award ID(s):
1761022
NSF-PAR ID:
10190834
Author(s) / Creator(s):
Date Published:
Journal Name:
Proceedings of the 2020 Winter Simulation Conference
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. We investigate the management of a merchant wind energy farm co‐located with a grid‐level storage facility and connected to a market through a transmission line. We formulate this problem as a Markov decision process (MDP) with stochastic wind speed and electricity prices. Consistent with most deregulated electricity markets, our model allows these prices to be negative. As this feature makes it difficult to characterize any optimal policy of our MDP, we show the optimality of a stage‐ and partial‐state‐dependent‐threshold policy when prices can only be positive. We extend this structure when prices can also be negative to develop heuristic one (H1) that approximately solves a stochastic dynamic program. We then simplify H1 to obtain heuristic two (H2) that relies on a price‐dependent‐threshold policy and derivative‐free deterministic optimization embedded within a Monte Carlo simulation of the random processes of our MDP. We conduct an extensive and data‐calibrated numerical study to assess the performance of these heuristics and variants of known ones against the optimal policy, as well as to quantify the effect of negative prices on the value added by and environmental benefit of storage. We find that (i) H1 computes an optimal policy and on average is about 17 times faster to execute than directly obtaining an optimal policy; (ii) H2 has a near optimal policy (with a 2.86% average optimality gap), exhibits a two orders of magnitude average speed advantage over H1, and outperforms the remaining considered heuristics; (iii) storage brings in more value but its environmental benefit falls as negative electricity prices occur more frequently in our model.
  2. In offline reinforcement learning (RL), the goal is to learn a highly rewarding policy based solely on a dataset of historical interactions with the environment. The ability to train RL policies offline would greatly expand where RL can be applied, its data efficiency, and its experimental velocity. Prior work in offline RL has been confined almost exclusively to model-free RL approaches. In this work, we present MOReL, an algorithmic framework for model-based offline RL. This framework consists of two steps: (a) learning a pessimistic MDP (P-MDP) using the offline dataset; (b) learning a near-optimal policy in this P-MDP. The learned P-MDP has the property that for any policy, the performance in the real environment is approximately lower-bounded by the performance in the P-MDP. This enables it to serve as a good surrogate for purposes of policy evaluation and learning, and overcome common pitfalls of model-based RL like model exploitation. Theoretically, we show that MOReL is minimax optimal (up to log factors) for offline RL. Through experiments, we show that MOReL matches or exceeds state-of-the-art results in widely studied offline RL benchmarks. Moreover, the modular design of MOReL enables future advances in its components (e.g., in model learning, planning etc.) to directly translate into improvements for offline RL. 
  3. We consider the problem faced by a service platform that needs to match limited supply with demand while learning the attributes of new users to match them better in the future. We introduce a benchmark model with heterogeneous workers (demand) and a limited supply of jobs that arrive over time. Job types are known to the platform, but worker types are unknown and must be learned by observing match outcomes. Workers depart after performing a certain number of jobs. The expected payoff from a match depends on the pair of types, and the goal is to maximize the steady-state rate of accumulation of payoff. Although we use terminology inspired by labor markets, our framework applies more broadly to platforms where a limited supply of heterogeneous products is matched to users over time. Our main contribution is a complete characterization of the structure of the optimal policy in the limit that each worker performs many jobs. The platform faces a tradeoff for each worker between myopically maximizing payoffs (exploitation) and learning the type of the worker (exploration). This creates a multitude of multiarmed bandit problems, one for each worker, coupled together by the constraint on availability of jobs of different types (capacity constraints). We find that the platform should estimate a shadow price for each job type and use the payoffs adjusted by these prices first to determine its learning goals and then for each worker (i) to balance learning with payoffs during the exploration phase and (ii) to myopically match after it has achieved its learning goals during the exploitation phase.
  4. We study model-free reinforcement learning (RL) algorithms for infinite-horizon average-reward Markov decision processes (MDPs), a setting that is more appropriate for applications that involve continuing operations not divided into episodes. In contrast to episodic/discounted MDPs, theoretical understanding of model-free RL algorithms is relatively inadequate for the average-reward setting. In this paper, we consider both the online setting and the setting with access to a simulator. We develop computationally efficient model-free algorithms that achieve sharper guarantees on regret/sample complexity compared with existing results. In the online setting, we design an algorithm, UCB-AVG, based on an optimistic variant of variance-reduced Q-learning. We show that UCB-AVG achieves a regret bound $\widetilde{O}(S^5A^2sp(h^*)\sqrt{T})$ after $T$ steps, where $S\times A$ is the size of the state-action space, and $sp(h^*)$ is the span of the optimal bias function. Our result provides the first computationally efficient model-free algorithm that achieves the optimal dependence on $T$ (up to log factors) for weakly communicating MDPs, an assumption that is necessary for low regret. In contrast, prior results either are suboptimal in $T$ or require strong assumptions of ergodicity or uniform mixing of MDPs. In the simulator setting, we adapt the idea of UCB-AVG to develop a model-free algorithm that finds an $\epsilon$-optimal policy with sample complexity $\widetilde{O}(SAsp^2(h^*)\epsilon^{-2} + S^2Asp(h^*)\epsilon^{-1})$. This sample complexity is near-optimal for weakly communicating MDPs, in view of the minimax lower bound $\Omega(SAsp(h^*)\epsilon^{-2})$. Existing work mainly focuses on ergodic MDPs and the results typically depend on $t_{mix}$, the worst-case mixing time induced by a policy. We remark that the diameter $D$ and mixing time $t_{mix}$ are both lower bounded by $sp(h^*)$, and $t_{mix}$ can be arbitrarily large for certain MDPs.
    On the technical side, our approach integrates two key ideas: learning a $\gamma$-discounted MDP as an approximation, and leveraging reference-advantage decomposition for variance in optimistic Q-learning. As recognized in prior work, a naive approximation by discounted MDPs results in suboptimal guarantees. A distinguishing feature of our method is maintaining estimates of value-difference between state pairs to provide a sharper bound on the variance of reference advantage. We also crucially use a careful choice of the discount factor $\gamma$ to balance approximation error due to discounting and the statistical learning error, and we are able to maintain a good-quality reference value function with $O(SA)$ space complexity.
  5. In offline reinforcement learning (RL), the goal is to learn a highly rewarding policy based solely on a dataset of historical interactions with the environment. This serves as an extreme test of an agent's ability to effectively use historical data, which is known to be critical for efficient RL. Prior work in offline RL has been confined almost exclusively to model-free RL approaches. In this work, we present MOReL, an algorithmic framework for model-based offline RL. This framework consists of two steps: (a) learning a pessimistic MDP using the offline dataset; (b) learning a near-optimal policy in this pessimistic MDP. The design of the pessimistic MDP is such that for any policy, the performance in the real environment is approximately lower-bounded by the performance in the pessimistic MDP. This enables the pessimistic MDP to serve as a good surrogate for purposes of policy evaluation and learning. Theoretically, we show that MOReL is minimax optimal (up to log factors) for offline RL. Empirically, MOReL matches or exceeds state-of-the-art results on widely used offline RL benchmarks. Overall, the modular design of MOReL enables translating advances in its components (e.g., in model learning, planning, etc.) to improvements in offline RL.