Title: Good-for-MDPs Automata for Probabilistic Analysis and Reinforcement Learning
We characterize the class of nondeterministic 𝜔-automata that can be used for the analysis of finite Markov decision processes (MDPs). We call these automata ‘good-for-MDPs’ (GFM). We show that GFM automata are closed under classic simulation as well as under more powerful simulation relations that leverage properties of optimal control strategies for MDPs. This closure enables us to exploit state-space reduction techniques, such as those based on direct and delayed simulation, that guarantee simulation equivalence. We demonstrate the promise of GFM automata by defining a new class of automata with favorable properties (they are Büchi automata with low branching degree obtained through a simple construction) and show that going beyond limit-deterministic automata may significantly benefit reinforcement learning.
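To make the GFM idea concrete, here is a minimal, illustrative Python sketch (the toy MDP, automaton, and all names are our own assumptions, not taken from the paper) of the product construction that GFM automata enable: because the controller may resolve the automaton's nondeterminism on the fly, the product of an MDP and a GFM Büchi automaton is again an MDP, and maximizing the probability of taking accepting transitions infinitely often in the product matches the optimal satisfaction probability in the original MDP.

```python
# A minimal sketch of the MDP x Buechi-automaton product, where the agent's
# action includes the resolution of the automaton's nondeterminism.
from itertools import product

# Hypothetical MDP: mdp_trans(s, a) -> list of (probability, successor).
mdp_states = ["s0", "s1"]
mdp_actions = ["a", "b"]
def mdp_trans(s, a):
    if s == "s0" and a == "a":
        return [(0.5, "s0"), (0.5, "s1")]
    return [(1.0, "s0")]

# Labels of MDP states, used as the automaton's input alphabet.
label = {"s0": "p", "s1": "q"}

# Hypothetical nondeterministic Buechi automaton:
# aut_trans(q, letter) -> set of successor automaton states.
aut_states = ["q0", "q1"]
def aut_trans(q, letter):
    if q == "q0":
        return {"q0", "q1"} if letter == "p" else {"q0"}
    return {"q1"} if letter == "q" else set()

def product_mdp_trans(state, action):
    """Transitions of the product MDP. An action is a pair: an MDP action
    plus the controller's choice of automaton successor, which is exactly
    how a GFM automaton's nondeterminism is resolved on the fly."""
    (s, q), (a, q_next) = state, action
    if q_next not in aut_trans(q, label[s]):
        return []  # illegal resolution of the nondeterminism
    return [(p, (s2, q_next)) for (p, s2) in mdp_trans(s, a)]

# Enumerate and print the product transition structure.
for st in product(mdp_states, aut_states):
    for ac in product(mdp_actions, aut_states):
        for (p, succ) in product_mdp_trans(st, ac):
            print(f"{st} --{ac}/{p}--> {succ}")
```

Since the product is itself an MDP, standard probabilistic model checking or reinforcement learning applies to it directly; the GFM property is what guarantees that resolving the nondeterminism this way loses no probability mass.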
Award ID(s):
2009022
NSF-PAR ID:
10216133
Editor(s):
Biere, Armin; Parker, David
Journal Name:
Tools and Algorithms for the Construction and Analysis of Systems
Page Range / eLocation ID:
306-323
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Bouajjani, A.; Holík, L.; Wu, Z. (Ed.)
    When omega-regular objectives were first proposed in model-free reinforcement learning (RL) for controlling MDPs, deterministic Rabin automata were used in an attempt to provide a direct translation from their transitions to scalar values. While these translations failed, it has turned out that it is possible to repair them by using good-for-MDPs (GFM) Büchi automata instead. These are nondeterministic Büchi automata with a restricted type of nondeterminism, albeit not as restricted as in good-for-games automata. Indeed, deterministic Rabin automata have a pretty straightforward translation to such GFM automata, which is bi-linear in the number of states and pairs. Interestingly, the same cannot be said for deterministic Streett automata: a translation to nondeterministic Rabin or Büchi automata comes at an exponential cost, even without requiring the target automaton to be good-for-MDPs. Do we have to pay more than that to obtain a good-for-MDPs automaton? The surprising answer is that we have to pay significantly less when we instead expand the good-for-MDPs property to alternating automata: like the nondeterministic GFM automata obtained from deterministic Rabin automata, the alternating good-for-MDPs automata we produce from deterministic Streett automata are bi-linear in the size of the deterministic automaton and its index. They can therefore be exponentially more succinct than the minimal nondeterministic Büchi automaton.
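    As a companion to this abstract, the following Python sketch (our illustration; the example sets and names are assumptions) shows the classic bi-linear Rabin-to-Büchi translation it alludes to: the automaton nondeterministically guesses a Rabin pair (E_i, F_i) and a point after which E_i is avoided forever, then accepts on every visit to F_i. With n states and k pairs this uses n·(k+1) states; the paper's observation is that such a translation can moreover be made good-for-MDPs.

```python
def rabin_to_buechi(delta, pairs):
    """delta: dict (state, letter) -> state (deterministic Rabin automaton).
    pairs: list of (E, F) pairs of state sets.
    Returns nondeterministic Buechi transitions as a dict
    (state, letter) -> set of states, plus the accepting-state set.
    States are tagged: ("main", q) for the initial copy, (i, q) for the
    copy committed to pair i."""
    trans = {}
    accepting = set()
    for (q, a), q2 in delta.items():
        # In the main copy we can keep waiting, or guess a pair i and
        # commit to avoiding E_i from now on.
        succs = {("main", q2)}
        for i, (E, F) in enumerate(pairs):
            if q2 not in E:
                succs.add((i, q2))
        trans[("main", q), a] = succs
        # Inside copy i we must stay out of E_i; an empty successor set
        # kills runs that guessed wrong.
        for i, (E, F) in enumerate(pairs):
            trans[(i, q), a] = {(i, q2)} if q2 not in E else set()
    for i, (E, F) in enumerate(pairs):
        accepting.update((i, q) for q in F)
    return trans, accepting

# Tiny example: Rabin pair (E, F) = ({2}, {1}) over states {0, 1, 2}:
# accept runs visiting 1 infinitely often and 2 only finitely often.
delta = {(0, "a"): 1, (0, "b"): 2, (1, "a"): 1, (1, "b"): 0,
         (2, "a"): 2, (2, "b"): 2}
trans, acc = rabin_to_buechi(delta, [({2}, {1})])
print("accepting states:", acc)
```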
  2. Computation of optimal recovery decisions for community resilience assurance post-hazard is a combinatorial decision-making problem under uncertainty. It involves solving a large-scale optimization problem, which is significantly aggravated by the introduction of uncertainty. In this paper, we draw upon established tools from multiple research communities to provide an effective solution to this challenging problem. We provide a stochastic model of damage to the water network (WN) within a testbed community following a severe earthquake and compute near-optimal recovery actions for restoration of the water network. We formulate this stochastic decision-making problem as a Markov Decision Process (MDP), and solve it using a popular class of heuristic algorithms known as rollout. A simulation-based representation of MDPs is utilized in conjunction with rollout and the Optimal Computing Budget Allocation (OCBA) algorithm to address the resulting stochastic simulation optimization problem. Our method employs non-myopic planning with efficient use of simulation budget. We show, through simulation results, that rollout fused with OCBA performs competitively with respect to rollout with total equal allocation (TEA) at a meagre simulation budget of 5-10% of rollout with TEA, which is a crucial step towards addressing large-scale community recovery problems following natural disasters. 
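    For illustration only (the function names, the toy repair problem, and the equal budget split are our assumptions), here is a minimal one-step rollout in Python: estimate each action's cost-to-go by Monte Carlo simulation of the base (heuristic) policy, then act greedily. The abstract's contribution is to allocate this simulation budget adaptively with OCBA rather than equally (TEA), which this sketch does not attempt.

```python
import random

def rollout_action(state, actions, simulate_step, base_policy,
                   horizon=20, budget_per_action=30):
    """simulate_step(s, a) -> (cost, next_state) samples one transition;
    base_policy(s) -> action.  Returns the action with the lowest
    estimated cost-to-go under total equal allocation (TEA)."""
    best_action, best_cost = None, float("inf")
    for a in actions:
        total = 0.0
        for _ in range(budget_per_action):
            cost, s = simulate_step(state, a)
            for _ in range(horizon - 1):      # follow the base policy
                c, s = simulate_step(s, base_policy(s))
                cost += c
            total += cost
        avg = total / budget_per_action
        if avg < best_cost:
            best_action, best_cost = a, avg
    return best_action

# Toy usage: a two-state repair problem where "fix" is costly up front
# but moves the system toward the absorbing "ok" state.
def simulate_step(s, a):
    if s == "ok":
        return 0.0, "ok"
    if a == "fix":
        return 1.0, ("ok" if random.random() < 0.7 else "broken")
    return 0.5, "broken"

print(rollout_action("broken", ["fix", "wait"], simulate_step,
                     base_policy=lambda s: "wait"))
```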
  3. We propose a novel framework for modeling attack scenarios in cyber-physical control systems: we represent a cyber-physical system as a constrained switching system, where a single model embeds the dynamics of the physical process, the attack patterns, and the attack detection schemes. We show that this is compatible with established results in hybrid automata, namely, constrained switching systems. The proposed attack modeling approach admits a large class of non-deterministic attack policies and permits the characterization of system safety as an asymptotic property. By calculating the maximal safe set, the resulting new impact metrics intuitively quantify the degradation of safety and the impact of cyber attacks on the safety properties of the system under attack. We showcase our results via an illustrative example. 
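    The maximal safe set mentioned here can be computed, for a finite abstraction, by a standard backward fixed point; the Python sketch below is our own toy illustration (the dynamics, modes, and detector rule are assumptions, not the paper's computation), shown to make the safety metric concrete.

```python
def maximal_safe_set(states, modes, step, admissible, safe):
    """Backward fixed point for a finite constrained switching system.
    step(s, m) -> successor under mode m;
    admissible(s) -> set of modes the attacker may choose in s;
    safe: set of states satisfying the safety constraint."""
    S = {s for s in states if s in safe}
    while True:
        # Keep s only if every admissible mode keeps the system in S.
        S_next = {s for s in S
                  if all(step(s, m) in S for m in admissible(s))}
        if S_next == S:
            return S
        S = S_next

# Toy system: states 0..4; mode 0 is nominal decay toward 0, mode 1 is an
# attack pushing the state up; state 4 violates safety.
states = set(range(5))
safe = {0, 1, 2, 3}
def step(s, m):
    return max(s - 1, 0) if m == 0 else min(s + 1, 4)
def admissible(s):
    # the detection scheme blocks attacks (mode 1) outside states 2 and 3
    return {0, 1} if s in (2, 3) else {0}

print(sorted(maximal_safe_set(states, {0, 1}, step, admissible, safe)))
# -> [0, 1]: from 2 or 3 the attacker can force the system into state 4
```

The gap between the constraint set {0, 1, 2, 3} and the computed set {0, 1} is exactly the kind of safety degradation the proposed impact metrics quantify.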
  4. We consider a stochastic inventory control problem under censored demand, lost sales, and positive lead times. This is a fundamental problem in inventory management, with significant literature establishing near optimality of a simple class of policies called “base-stock policies” as well as the convexity of long-run average cost under those policies. We consider a relatively less studied problem of designing a learning algorithm for this problem when the underlying demand distribution is unknown. The goal is to bound the regret of the algorithm when compared with the best base-stock policy. Our main contribution is a learning algorithm with a regret bound of [Formula: see text] for the inventory control problem. Here, [Formula: see text] is the fixed and known lead time, and D is an unknown parameter of the demand distribution described roughly as the expected number of time steps needed to generate enough demand to deplete one unit of inventory. Notably, our regret bounds depend linearly on L, which significantly improves the previously best-known regret bounds for this problem where the dependence on L was exponential. Our techniques utilize the convexity of the long-run average cost and a newly derived bound on the “bias” of base-stock policies to establish an almost black box connection between the problem of learning in Markov decision processes (MDPs) with these properties and the stochastic convex bandit problem. The techniques presented here may be of independent interest for other settings that involve large structured MDPs but with convex asymptotic average cost functions. 
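    To ground the terminology, here is a small Python simulator of a base-stock policy under censored demand, lost sales, and positive lead time L (the cost parameters, demand model, and all names are illustrative assumptions, not the paper's setup). The convexity of the long-run average cost in the base-stock level x is the structural property the learning algorithm exploits.

```python
import random
from collections import deque

def run_base_stock(x, L, periods, demand, h=1.0, p=4.0, seed=0):
    """x: base-stock level; L: lead time; h: holding cost per unit;
    p: lost-sales penalty per unit.  Returns average cost per period."""
    rng = random.Random(seed)
    on_hand = x
    pipeline = deque([0] * L)          # orders placed, not yet arrived
    total = 0.0
    for _ in range(periods):
        on_hand += pipeline.popleft()  # order placed L periods ago arrives
        d = demand(rng)
        sales = min(d, on_hand)        # we observe sales, not demand
        lost = d - sales               # censored: lost demand is unseen
        on_hand -= sales
        # Order up to x, counting inventory position = on-hand + pipeline.
        pipeline.append(max(0, x - on_hand - sum(pipeline)))
        total += h * on_hand + p * lost
    return total / periods

demand = lambda rng: rng.randint(0, 6)
for x in (4, 8, 12, 16):
    print(x, round(run_base_stock(x, L=2, periods=10000, demand=demand), 2))
```

Sweeping x as in the usage loop traces out the (roughly convex) average-cost curve whose minimizer a learning algorithm must find without knowing the demand distribution.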
  5. Bouajjani, A.; Holík, L.; Wu, Z. (Ed.)
    The expanding role of reinforcement learning (RL) in safety-critical system design has promoted 𝜔-automata as a way to express learning requirements—often non-Markovian—with greater ease of expression and interpretation than scalar reward signals. When 𝜔-automata were first proposed in model-free RL, deterministic Rabin acceptance conditions were used in an attempt to provide a direct translation from 𝜔-automata to finite state “reward” machines defined over the same automaton structure (a memoryless reward translation). While these initial attempts to provide faithful, memoryless reward translations for Rabin acceptance conditions remained unsuccessful, translations were discovered for other acceptance conditions such as suitable, limit-deterministic Büchi acceptance or more generally, good-for-MDP Büchi acceptance conditions. Yet, the question “whether a memoryless translation of Rabin conditions to scalar rewards exists” remained unresolved. This paper presents an impossibility result implying that any attempt to use Rabin automata directly (without extra memory) for model-free RL is bound to fail. To establish this result, we show a link between a class of automata enabling memoryless reward translation to closure properties of its accepting and rejecting infinity sets, and to the insight that both the property and its complement need to allow for positional strategies for such an approach to work. We believe that such impossibility results will provide foundations for the application of RL to safety-critical systems.
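    For contrast with the impossibility result, the following Python fragment sketches the kind of memoryless reward translation that does exist for good-for-MDPs Büchi conditions (a termination trick with parameter zeta; the wrapper interface and names are our assumptions): each accepting transition of the product MDP yields, with probability 1 − zeta, an immediate reward of 1 and episode termination, so that as zeta approaches 1 the optimal expected reward converges to the optimal probability of satisfying the Büchi objective. The paper shows that no analogous memoryless translation can exist for Rabin conditions.

```python
import random

def reward_translation_step(env_step, is_accepting, zeta, state, action,
                            rng=random):
    """Wraps one product-MDP step into a scalar-reward RL step.
    env_step(s, a) -> next product state;
    is_accepting(s, a, s2) -> True if the transition is accepting.
    Returns (next_state, reward, done) for a standard RL loop."""
    s2 = env_step(state, action)
    if is_accepting(state, action, s2) and rng.random() > zeta:
        return s2, 1.0, True     # reward 1, episode terminates
    return s2, 0.0, False        # no reward, keep going
```

Because the reward depends only on the current product transition and a coin flip, it is memoryless in exactly the sense the abstract discusses; the impossibility result says no such wrapper can be written for Rabin acceptance.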