Reinforcement Learning for Omega-Regular Specifications on Continuous-Time MDP
Continuous-time Markov decision processes (CTMDPs) are canonical models to express sequential decision-making in dense-time and stochastic environments. When the stochastic evolution of the environment is only available via sampling, model-free reinforcement learning (RL) is the algorithm of choice to compute an optimal decision sequence. RL, on the other hand, requires the learning objective to be encoded as scalar reward signals. Since doing such translations manually is both tedious and error-prone, a number of techniques have been proposed to translate high-level objectives (expressed in logic or automata formalisms) to scalar rewards for discrete-time Markov decision processes. Unfortunately, no such automatic translation exists for CTMDPs.
We consider CTMDP environments against learning objectives expressed as omega-regular languages. Omega-regular languages generalize regular languages to infinite-horizon specifications and can express properties given in the popular linear-time logic LTL. To accommodate the dense-time nature of CTMDPs, we consider two different semantics for omega-regular objectives: 1) satisfaction semantics, where the goal of the learner is to maximize the probability of spending positive time in the good states, and 2) expectation semantics, where the goal of the learner is to optimize the long-run expected average time spent in the “good states” of the automaton. We present an approach enabling correct translation to scalar reward signals that can be readily used by off-the-shelf RL algorithms for CTMDPs. We demonstrate the effectiveness of the proposed algorithms by evaluating them on popular CTMDP benchmarks with omega-regular objectives.
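The expectation semantics above can be made concrete with a small simulation: in a CTMDP, sojourn times in states are exponentially distributed, so the long-run expected average time spent in "good" product states can be estimated by accumulating dwell times along a sampled trajectory under a fixed policy. The following is a minimal sketch on a hypothetical two-state toy model; the rates, transitions, and trivial policy are illustrative assumptions, not taken from the paper.

```python
import random

# Hypothetical toy CTMDP: exponential sojourn rates and deterministic
# successors per (state, action) pair. GOOD marks accepting product states.
RATES = {("s0", "a"): 2.0, ("s0", "b"): 0.5, ("s1", "a"): 1.0, ("s1", "b"): 1.0}
NEXT = {("s0", "a"): "s1", ("s0", "b"): "s0", ("s1", "a"): "s0", ("s1", "b"): "s1"}
GOOD = {"s1"}

def policy(state):
    return "a"  # a trivial stationary policy, for illustration only

def average_good_time(horizon_jumps=100_000, seed=0):
    """Estimate the long-run fraction of time spent in GOOD states."""
    rng = random.Random(seed)
    state, total, good = "s0", 0.0, 0.0
    for _ in range(horizon_jumps):
        act = policy(state)
        dwell = rng.expovariate(RATES[(state, act)])  # exponential sojourn time
        total += dwell
        if state in GOOD:
            good += dwell
        state = NEXT[(state, act)]
    return good / total

# Under policy "a" the chain alternates s0 (mean dwell 0.5) and s1 (mean
# dwell 1.0), so the estimate should be close to 1.0 / 1.5 = 2/3.
print(round(average_good_time(), 2))
```

Estimating this quantity is exactly what the expectation semantics asks an RL agent to optimize, here simply evaluated for one fixed policy.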
Award ID(s): 2146563
NSF-PAR ID: 10526022
Publisher / Repository: Proceedings of the Thirty-Third International Conference on Automated Planning and Scheduling (ICAPS 2023)
Date Published:
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this

The expanding role of reinforcement learning (RL) in safety-critical system design has promoted ω-automata as a way to express learning requirements, often non-Markovian, with greater ease of expression and interpretation than scalar reward signals. However, real-world sequential decision-making situations often involve multiple, potentially conflicting, objectives. Two dominant approaches to express relative preferences over multiple objectives are: (1) weighted preference, where the decision maker provides scalar weights for the various objectives, and (2) lexicographic preference, where the decision maker provides an order over the objectives such that any amount of satisfaction of a higher-ordered objective is preferable to any amount of a lower-ordered one. In this article, we study and develop RL algorithms to compute optimal strategies in Markov decision processes against multiple ω-regular objectives under weighted and lexicographic preferences. We provide a translation from multiple ω-regular objectives to a scalar reward signal that is both faithful (maximising reward means maximising the probability of achieving the objectives under the corresponding preference) and effective (RL quickly converges to optimal strategies). We have implemented the translations in a formal reinforcement learning tool, Mungojerrie, and we present an experimental evaluation of our technique on benchmark learning problems.
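The two preference schemes can be stated concretely over per-objective value vectors (for example, satisfaction probabilities of k ω-regular objectives). This is a sketch of the preference relations themselves, not the paper's reward translation; the example values and weights are hypothetical.

```python
def weighted_value(values, weights):
    """Scalarize objective values with decision-maker-supplied weights."""
    return sum(w * v for w, v in zip(weights, values))

def lex_better(a, b):
    """True if vector a is lexicographically preferred to b, with
    objectives ordered from most to least important."""
    for x, y in zip(a, b):
        if x != y:
            return x > y
    return False  # equal vectors: neither is strictly preferred

# Weighted: 0.7 * 0.9 + 0.3 * 0.4 = 0.75
print(round(weighted_value([0.9, 0.4], [0.7, 0.3]), 2))
# Lexicographic: the first objective dominates, so (0.9, 0.1) beats (0.8, 1.0)
print(lex_better([0.9, 0.1], [0.8, 1.0]))
```

The lexicographic example highlights the key contrast: under weights (0.7, 0.3) the second vector would win, while lexicographic preference ignores the lower-ordered objective entirely whenever the higher-ordered one differs.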
Sankaranarayanan, S.; Sharygina, N. (Eds.) Mungojerrie is an extensible tool that provides a framework to translate linear-time objectives into rewards for reinforcement learning (RL). The tool provides convergent RL algorithms for stochastic games, reference implementations of existing reward translations for omega-regular objectives, and an internal probabilistic model checker for omega-regular objectives. This functionality is modular and operates on shared data structures, which enables fast development of new translation techniques. Mungojerrie supports finite models specified in PRISM and omega-automata specified in the HOA format, with an integrated command-line interface to external linear temporal logic translators. Mungojerrie is distributed with a set of benchmarks for omega-regular objectives in RL.

Omega-regular properties, specified using linear-time temporal logic or various forms of omega-automata, find increasing use in specifying the objectives of reinforcement learning (RL). The key problem that arises is that of a faithful and effective translation of the objective into a scalar reward for model-free RL. A recent approach exploits Büchi automata with restricted nondeterminism to reduce the search for an optimal policy for an omega-regular property to that for a simple reachability objective. A possible drawback of this translation is that reachability rewards are sparse, being reaped only at the end of each episode. Another approach reduces the search for an optimal policy to an optimization problem with two interdependent discount parameters. While this approach provides denser rewards than the reduction to reachability, it is not easily mapped to off-the-shelf RL algorithms. We propose a reward scheme that reduces the search for an optimal policy to an optimization problem with a single discount parameter, produces dense rewards, and is compatible with off-the-shelf RL algorithms. Finally, we report an experimental comparison of these and other reward schemes for model-free RL with omega-regular objectives.
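To illustrate the dense-reward idea (a sketch of the general flavor, not necessarily the paper's exact scheme): pay a reward on every visit to an accepting state of the product automaton and learn with a single discount factor using standard tabular Q-learning. The toy two-state product MDP, rewards, and hyperparameters below are all hypothetical.

```python
import random

ACTIONS = ("stay", "go")
ACCEPTING = {1}  # accepting states of the hypothetical toy product

def step(s, a):
    """Deterministic toy transitions; reward on reaching an accepting state."""
    ns = s if a == "stay" else 1 - s
    return ns, (1.0 if ns in ACCEPTING else 0.0)

def q_learn(steps=20_000, gamma=0.95, alpha=0.2, eps=0.2, seed=1):
    """Standard tabular Q-learning with a single discount factor gamma."""
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in (0, 1) for a in ACTIONS}
    s = 0
    for _ in range(steps):
        if rng.random() < eps:  # epsilon-greedy exploration
            a = rng.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda x: Q[(s, x)])
        ns, r = step(s, a)
        target = r + gamma * max(Q[(ns, b)] for b in ACTIONS)
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s = ns
    return Q

Q = q_learn()
print(max(ACTIONS, key=lambda a: Q[(0, a)]))  # "go": head to the accepting state
print(max(ACTIONS, key=lambda a: Q[(1, a)]))  # "stay": remain accepting
```

Because a reward is available on every accepting visit rather than only at episode end, the learner receives frequent feedback, which is the "denser rewards" contrast the abstract draws with the reachability reduction.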

Translating Omega-Regular Specifications to Average Objectives for Model-Free Reinforcement Learning. Piotr Faliszewski; Viviana Mascardi (Eds.) Recent success in reinforcement learning (RL) has brought renewed attention to the design of reward functions by which agent behavior is reinforced or deterred. Manually designing reward functions is tedious and error-prone. An alternative approach is to specify a formal, unambiguous logical requirement, which is automatically translated into a reward function to be learned from. Omega-regular languages, of which Linear Temporal Logic (LTL) is a subset, are a natural choice for specifying such requirements due to their use in verification and synthesis. However, current techniques based on omega-regular languages learn in an episodic manner, whereby the environment is periodically reset to an initial state during learning. In some settings, this assumption is challenging or impossible to satisfy. Instead, in the continuing setting the agent explores the environment without resets over a single lifetime. This is a more natural setting for reasoning about omega-regular specifications defined over infinite traces of agent behavior. Optimizing the average reward instead of the usual discounted reward is more natural in this case, due to the infinite-horizon objective that poses challenges to the convergence of discounted RL solutions. We restrict our attention to the omega-regular languages that correspond to absolute liveness specifications. These specifications cannot be invalidated by any finite prefix of agent behavior, in accordance with the spirit of a continuing problem. We propose a translation from absolute liveness omega-regular languages to an average-reward objective for RL. Our reduction can be done on-the-fly, without full knowledge of the environment, thereby enabling the use of model-free RL algorithms.
Additionally, we propose a reward structure that enables RL without episodic resetting in communicating MDPs, unlike previous approaches. We demonstrate empirically with various benchmarks that our proposed method of using average-reward RL for continuing tasks defined by omega-regular specifications is more effective than competing approaches that leverage discounted RL.
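The continuing, no-reset setting described above pairs naturally with average-reward RL methods. As an illustration (a generic differential Q-learning sketch on a hypothetical two-state communicating MDP, not the paper's construction), the agent earns reward 1 for each step spent in an accepting product state and learns a gain-optimal policy over a single unbroken lifetime.

```python
import random

ACTIONS = ("stay", "go")
ACCEPTING = {1}  # accepting states of the hypothetical toy product

def step(s, a):
    """Deterministic toy transitions on a communicating 2-state MDP."""
    ns = s if a == "stay" else 1 - s
    return ns, (1.0 if ns in ACCEPTING else 0.0)

def diff_q_learn(steps=50_000, alpha=0.1, eta=0.05, eps=0.2, seed=3):
    """Generic differential Q-learning: maintains a running estimate rbar
    of the average reward alongside relative action values Q."""
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in (0, 1) for a in ACTIONS}
    rbar, s = 0.0, 0
    for _ in range(steps):
        if rng.random() < eps:  # epsilon-greedy exploration, never reset
            a = rng.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda x: Q[(s, x)])
        ns, r = step(s, a)
        delta = r - rbar + max(Q[(ns, b)] for b in ACTIONS) - Q[(s, a)]
        Q[(s, a)] += alpha * delta
        rbar += eta * alpha * delta  # average-reward estimate drifts toward the gain
        s = ns
    return Q, rbar

Q, rbar = diff_q_learn()
print(max(ACTIONS, key=lambda a: Q[(1, a)]))  # "stay": accepting forever
```

The optimal gain here is 1 reward per step (stay in the accepting state forever), and the learned rbar estimate should drift toward it without any episode boundary ever occurring.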

Bouajjani, A.; Holík, L.; Wu, Z. (Eds.) The expanding role of reinforcement learning (RL) in safety-critical system design has promoted omega-automata as a way to express learning requirements, often non-Markovian, with greater ease of expression and interpretation than scalar reward signals. When ω-automata were first proposed in model-free RL, deterministic Rabin acceptance conditions were used in an attempt to provide a direct translation from omega-automata to finite-state “reward” machines defined over the same automaton structure (a memoryless reward translation). While these initial attempts to provide faithful, memoryless reward translations for Rabin acceptance conditions remained unsuccessful, translations were discovered for other acceptance conditions, such as suitable limit-deterministic Büchi acceptance or, more generally, good-for-MDP Büchi acceptance conditions. Yet the question of whether a memoryless translation of Rabin conditions to scalar rewards exists remained unresolved. This paper presents an impossibility result implying that any attempt to use Rabin automata directly (without extra memory) for model-free RL is bound to fail. To establish this result, we link the class of automata enabling memoryless reward translation to closure properties of their accepting and rejecting infinity sets, and to the insight that both the property and its complement need to allow for positional strategies for such an approach to work. We believe that such impossibility results will provide foundations for the application of RL to safety-critical systems.
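The Rabin acceptance condition at the heart of this impossibility result is easy to state concretely: a run is accepting iff some pair (Fin_i, Inf_i) exists such that the set of states visited infinitely often avoids Fin_i and intersects Inf_i. A minimal sketch, for ultimately periodic ("lasso") runs where the infinitely-visited set is known; state names are illustrative.

```python
def rabin_accepts(pairs, inf_states):
    """Check the Rabin condition given the states visited infinitely often.

    pairs: list of (fin, inf) frozensets of automaton states.
    inf_states: frozenset of states occurring infinitely often in the run.
    Accepting iff some pair avoids its fin set and meets its inf set.
    """
    return any(not (inf_states & fin) and (inf_states & inf)
               for fin, inf in pairs)

# One hypothetical pair: never loop through "bad", loop through "good".
pairs = [(frozenset({"bad"}), frozenset({"good"}))]
print(rabin_accepts(pairs, frozenset({"good", "idle"})))  # True
print(rabin_accepts(pairs, frozenset({"good", "bad"})))   # False
```

The second query shows why Rabin conditions resist memoryless reward translation: visiting "good" infinitely often is not enough on its own, because a single state in the fin set visited infinitely often cancels the pair, so acceptance depends on the whole infinity set rather than on rewards accumulated state by state.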