Omega-Regular Reward Machines

Reinforcement learning (RL) is a powerful approach for training agents to perform tasks, but designing an appropriate reward mechanism is critical to its success. In many cases, however, the complexity of the learning objectives goes beyond the capabilities of the Markovian assumption, necessitating a more sophisticated reward mechanism. Reward machines and ω-regular languages are two formalisms used to express non-Markovian rewards for quantitative and qualitative objectives, respectively. This paper introduces ω-regular reward machines, which integrate reward machines with ω-regular languages to enable an expressive and effective reward mechanism for RL. We present a model-free RL algorithm to compute ε-optimal strategies against ω-regular reward machines and evaluate the effectiveness of the proposed algorithm through experiments.

NSF-PAR ID: 10526016
Publisher / Repository: ECAI 2023 (Open Access by IOS Press)
Location: Kraków, Poland
Sponsoring Org: National Science Foundation

More Like this

The expanding role of reinforcement learning (RL) in safety-critical system design has promoted ω-automata as a way to express learning requirements, often non-Markovian ones, with greater ease of expression and interpretation than scalar reward signals. However, real-world sequential decision-making situations often involve multiple, potentially conflicting, objectives. Two dominant approaches to express relative preferences over multiple objectives are: (1) weighted preference, where the decision maker provides scalar weights for the various objectives, and (2) lexicographic preference, where the decision maker provides an order over the objectives such that any amount of satisfaction of a higher-ordered objective is preferable to any amount of a lower-ordered one. In this article, we study and develop RL algorithms to compute optimal strategies in Markov decision processes against multiple ω-regular objectives under weighted and lexicographic preferences. We provide a translation from multiple ω-regular objectives to a scalar reward signal that is both faithful (maximising reward means maximising the probability of achieving the objectives under the corresponding preference) and effective (RL quickly converges to optimal strategies). We have implemented the translations in a formal reinforcement learning tool, Mungojerrie, and we present an experimental evaluation of our technique on benchmark learning problems.
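The two preference types described above amount to two different ways of comparing vectors of per-objective values. A minimal sketch (the function names and setup are ours, not the article's):

```python
# Illustrative only: comparing multi-objective outcomes under the two
# preference types. "values" is a hypothetical vector of per-objective
# satisfaction values, e.g. probabilities of achieving each objective.

def weighted_value(values, weights):
    """Weighted preference: scalarize per-objective values with given weights."""
    return sum(w * v for w, v in zip(weights, values))

def lex_better(a, b):
    """Lexicographic preference: a is preferred to b iff, at the first
    objective on which they differ, a achieves strictly more."""
    for va, vb in zip(a, b):
        if va != vb:
            return va > vb
    return False  # equal vectors: neither is strictly preferred
```

Under the lexicographic ordering, `[0.9, 0.1]` beats `[0.8, 0.99]`: no amount of the second objective compensates for less of the first.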
Continuous-time Markov decision processes (CTMDPs) are canonical models to express sequential decision-making in dense-time, stochastic environments. When the stochastic evolution of the environment is only available via sampling, model-free reinforcement learning (RL) is the algorithm of choice to compute optimal decision sequences. RL, on the other hand, requires the learning objective to be encoded as scalar reward signals. Since performing such translations manually is both tedious and error-prone, a number of techniques have been proposed to translate high-level objectives (expressed in logic or automata formalisms) to scalar rewards for discrete-time Markov decision processes. Unfortunately, no automatic translation exists for CTMDPs. We consider CTMDP environments against learning objectives expressed as omega-regular languages. Omega-regular languages generalize regular languages to infinite-horizon specifications and can express properties given in the popular linear-time logic LTL. To accommodate the dense-time nature of CTMDPs, we consider two different semantics of omega-regular objectives: 1) satisfaction semantics, where the goal of the learner is to maximize the probability of spending positive time in the good states, and 2) expectation semantics, where the goal of the learner is to optimize the long-run expected average time spent in the "good states" of the automaton. We present an approach enabling correct translation to scalar reward signals that can be readily used by off-the-shelf RL algorithms for CTMDPs. We demonstrate the effectiveness of the proposed algorithms by evaluating them on popular CTMDP benchmarks with omega-regular objectives.
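To make the expectation semantics concrete, the quantity being optimized, the long-run average fraction of time spent in good states, can be estimated by sampling exponential holding times. A toy sketch for a two-state continuous-time chain (our own illustration, not the paper's translation):

```python
import random

# Hypothetical two-state continuous-time Markov chain that alternates
# between a "good" and a "bad" state, with exponentially distributed
# holding times. We estimate the long-run fraction of time spent in the
# good state, i.e. the quantity targeted by the expectation semantics.

def simulate_avg_good_time(rate_out_good=1.0, rate_out_bad=2.0,
                           horizon=10_000.0, seed=0):
    rng = random.Random(seed)
    t, good_time, in_good = 0.0, 0.0, True
    while t < horizon:
        rate = rate_out_good if in_good else rate_out_bad
        dwell = rng.expovariate(rate)    # exponential holding time
        dwell = min(dwell, horizon - t)  # truncate at the horizon
        if in_good:
            good_time += dwell
        t += dwell
        in_good = not in_good            # alternate between the two states
    return good_time / horizon
```

With mean dwell times 1 and 1/2, the estimate approaches the stationary fraction 1 / (1 + 1/2) = 2/3.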

Linear temporal logic (LTL) and ω-regular objectives (a superset of LTL) have seen recent use as a way to express non-Markovian objectives in reinforcement learning. We introduce a model-based probably approximately correct (PAC) learning algorithm for ω-regular objectives in Markov decision processes (MDPs). As part of the development of our algorithm, we introduce the ε-recurrence time: a measure of the speed at which a policy converges to the satisfaction of the ω-regular objective in the limit. We prove that our algorithm only requires a polynomial number of samples in the relevant parameters, and perform experiments which confirm our theory.
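A related, simpler quantity gives some intuition for recurrence times: for a fixed policy, the induced Markov chain visits its accepting states with some expected return time, which can be estimated empirically. This sketch is our illustration only; the paper's ε-recurrence time is a formally defined measure, not this estimator.

```python
import random

# Illustrative only: estimate the mean return time to an accepting state
# of a fixed policy's Markov chain. "transition" maps each state to a list
# of (successor, probability) pairs.

def mean_return_time(transition, accepting_state, episodes=2000, seed=1):
    rng = random.Random(seed)
    total_steps = 0
    for _ in range(episodes):
        state, steps = accepting_state, 0
        while True:
            states, probs = zip(*transition[state])
            state = rng.choices(states, weights=probs)[0]
            steps += 1
            if state == accepting_state:
                break
        total_steps += steps
    return total_steps / episodes
```

For the chain below, returns to state 0 take 1 step with probability 1/2 and 2 steps otherwise, so the mean return time is 1.5.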

Sankaranarayanan, S.; Sharygina, N. (Eds.) Mungojerrie is an extensible tool that provides a framework to translate linear-time objectives into rewards for reinforcement learning (RL). The tool provides convergent RL algorithms for stochastic games, reference implementations of existing reward translations for omega-regular objectives, and an internal probabilistic model checker for omega-regular objectives. This functionality is modular and operates on shared data structures, which enables fast development of new translation techniques. Mungojerrie supports finite models specified in PRISM and omega-automata specified in the HOA format, with an integrated command-line interface to external linear temporal logic translators. Mungojerrie is distributed with a set of benchmarks for omega-regular objectives in RL.

Translating Omega-Regular Specifications to Average Objectives for Model-Free Reinforcement Learning

Piotr Faliszewski; Viviana Mascardi (Eds.)

Recent success in reinforcement learning (RL) has brought renewed attention to the design of reward functions by which agent behavior is reinforced or deterred. Manually designing reward functions is tedious and error-prone. An alternative approach is to specify a formal, unambiguous logic requirement, which is automatically translated into a reward function to be learned from. Omega-regular languages, of which Linear Temporal Logic (LTL) is a subset, are a natural choice for specifying such requirements due to their use in verification and synthesis. However, current techniques based on omega-regular languages learn in an episodic manner, whereby the environment is periodically reset to an initial state during learning. In some settings, this assumption is challenging or impossible to satisfy. Instead, in the continuing setting, the agent explores the environment without resets over a single lifetime. This is a more natural setting for reasoning about omega-regular specifications defined over infinite traces of agent behavior. Optimizing the average reward instead of the usual discounted reward is more natural in this case, since the infinite-horizon objective poses challenges to the convergence of discounted RL solutions. We restrict our attention to the omega-regular languages that correspond to absolute liveness specifications. These specifications cannot be invalidated by any finite prefix of agent behavior, in accordance with the spirit of a continuing problem. We propose a translation from absolute liveness omega-regular languages to an average-reward objective for RL. Our reduction can be done on the fly, without full knowledge of the environment, thereby enabling the use of model-free RL algorithms.
Additionally, we propose a reward structure that enables RL without episodic resetting in communicating MDPs, unlike previous approaches. We demonstrate empirically, on various benchmarks, that our proposed method of using average-reward RL for continuing tasks defined by omega-regular specifications is more effective than competing approaches that leverage discounted RL.
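The average-reward, no-reset setting described above is typically handled with algorithms such as tabular differential Q-learning, which maintains a running estimate of the gain (the long-run average reward) instead of discounting. A minimal sketch on a toy two-state communicating MDP of our own making; this illustrates the continuing setting generically, not the paper's specific reduction:

```python
import random

# Hedged sketch of tabular average-reward (differential) Q-learning on a
# toy deterministic MDP: states {0, 1}, actions {0, 1}; action 1 moves to
# state 1, and taking action 1 in state 1 yields reward 1. The agent runs
# over a single lifetime, with no episodic resets.

def differential_q_learning(steps=50_000, alpha=0.05, eta=0.05, seed=0):
    rng = random.Random(seed)

    def env_step(s, a):
        r = 1.0 if (s == 1 and a == 1) else 0.0
        return r, (1 if a == 1 else 0)

    q = [[0.0, 0.0], [0.0, 0.0]]
    avg_r = 0.0  # running estimate of the gain (average reward per step)
    s = 0
    for _ in range(steps):
        # epsilon-greedy action selection (10% exploration)
        if rng.random() < 0.1:
            a = rng.randrange(2)
        else:
            a = max((0, 1), key=lambda x: q[s][x])
        r, s2 = env_step(s, a)
        delta = r - avg_r + max(q[s2]) - q[s][a]  # differential TD error
        q[s][a] += alpha * delta
        avg_r += eta * alpha * delta              # gain tracks the TD error
        s = s2
    return avg_r, q
```

Here the optimal continuing behavior is to reach state 1 and take action 1 forever, for a gain of 1 reward per step, which the gain estimate approaches.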