Reinforcement learning (RL) is a powerful approach for training agents to perform tasks, but designing an appropriate reward mechanism is critical to its success. However, in many cases, the complexity of the learning objectives goes beyond the capabilities of the Markovian assumption, necessitating a more sophisticated reward mechanism. Reward machines and ω-regular languages are two formalisms used to express non-Markovian rewards for quantitative and qualitative objectives, respectively. This paper introduces ω-regular reward machines, which integrate reward machines with ω-regular languages to enable an expressive and effective reward mechanism for RL. We present a model-free RL algorithm to compute ε-optimal strategies against ω-regular reward machines and evaluate the effectiveness of the proposed algorithm through experiments.
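To make the notion of a non-Markovian reward concrete, the following is a minimal sketch of an ordinary (finite-trace) reward machine, assuming hypothetical labels `coffee` and `office` and hypothetical state names `u0`, `u1`, `u2`. It does not implement the ω-regular extension or the learning algorithm described above; it only shows how a reward can depend on the history of observed labels rather than on the current label alone.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class RewardMachine:
    """A finite-state machine that emits rewards as it reads labels."""
    initial: str
    # Maps (machine state, observed label) -> (next machine state, reward).
    delta: Dict[Tuple[str, str], Tuple[str, float]]
    state: str = field(init=False)

    def __post_init__(self) -> None:
        self.state = self.initial

    def step(self, label: str) -> float:
        # Unlisted (state, label) pairs leave the state unchanged, reward 0.
        next_state, reward = self.delta.get((self.state, label), (self.state, 0.0))
        self.state = next_state
        return reward


# Hypothetical task: reward 1 for reaching the office after picking up coffee.
rm = RewardMachine(
    initial="u0",
    delta={
        ("u0", "coffee"): ("u1", 0.0),
        ("u1", "office"): ("u2", 1.0),
    },
)

trace = ["office", "coffee", "hall", "office", "office"]
rewards = [rm.step(label) for label in trace]
print(rewards)
```

The same label `office` yields different rewards at different points of the trace, which is exactly what a reward function over the current state alone cannot express.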
Multi-objective ω-Regular Reinforcement Learning
The expanding role of reinforcement learning (RL) in safety-critical system design has promoted ω-automata as a way to express learning requirements (often non-Markovian) with greater ease of expression and interpretation than scalar reward signals. However, real-world sequential decision making situations often involve multiple, potentially conflicting, objectives. Two dominant approaches to express relative preferences over multiple objectives are: (1) weighted preference, where the decision maker provides scalar weights for various objectives, and (2) lexicographic preference, where the decision maker provides an order over the objectives such that any amount of satisfaction of a higher-ordered objective is preferable to any amount of a lower-ordered one. In this article, we study and develop RL algorithms to compute optimal strategies in Markov decision processes against multiple ω-regular objectives under weighted and lexicographic preferences. We provide a translation from multiple ω-regular objectives to a scalar reward signal that is both faithful (maximising reward means maximising probability of achieving the objectives under the corresponding preference) and effective (RL quickly converges to optimal strategies). We have implemented the translations in a formal reinforcement learning tool, Mungojerrie, and we present an experimental evaluation of our technique on benchmark learning problems.
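The difference between the two preference models can be illustrated with a small sketch. It assumes each strategy is summarised by a tuple of satisfaction probabilities, one per objective, listed from highest to lowest priority; the strategies, weights, and function names are hypothetical, and the code is not the translation to scalar rewards described in the abstract.

```python
from typing import Sequence


def weighted_value(probs: Sequence[float], weights: Sequence[float]) -> float:
    """Scalarise satisfaction probabilities with decision-maker weights."""
    if len(probs) != len(weights):
        raise ValueError("one weight is required per objective")
    return sum(w * p for w, p in zip(weights, probs))


def lexicographic_better(a: Sequence[float], b: Sequence[float],
                         tol: float = 1e-9) -> bool:
    """Return True if `a` is strictly preferred to `b` lexicographically.

    Objectives are compared in priority order; the first objective on which
    the two strategies differ (beyond `tol`) decides the comparison.
    """
    for x, y in zip(a, b):
        if x > y + tol:
            return True
        if x < y - tol:
            return False
    return False  # Equal on every objective: neither is strictly better.


# Hypothetical satisfaction probabilities for (objective 1, objective 2).
strategy_a = (0.90, 0.10)
strategy_b = (0.89, 0.99)
weights = (0.5, 0.5)

prefers_b_weighted = weighted_value(strategy_b, weights) > weighted_value(strategy_a, weights)
prefers_a_lex = lexicographic_better(strategy_a, strategy_b)
print(prefers_b_weighted, prefers_a_lex)
```

With equal weights, strategy B wins because its large gain on the second objective outweighs a small loss on the first. Under the lexicographic order, strategy A wins because no amount of the second objective compensates for a lower probability on the first.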
- Award ID(s): 2009022
- PAR ID: 10528761
- Publisher / Repository: ACM
- Date Published:
- Journal Name: Formal Aspects of Computing
- Volume: 35
- Issue: 2
- ISSN: 0934-5043
- Page Range / eLocation ID: 1 to 24
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
-
Huisman, M.; Păsăreanu, C.; Zhan, N. (Ed.) We study the problem of finding optimal strategies in Markov decision processes with lexicographic ω-regular objectives, which are ordered collections of ordinary ω-regular objectives. The goal is to compute strategies that maximise the probability of satisfaction of the first ω-regular objective; subject to that, the strategy should also maximise the probability of satisfaction of the second ω-regular objective; then the third and so forth. For instance, one may want to guarantee critical requirements first, functional ones second and only then focus on the non-functional ones. We show how to harness the classic off-the-shelf model-free reinforcement learning techniques to solve this problem and evaluate their performance on four case studies.
-
Continuous-time Markov decision processes (CTMDPs) are canonical models to express sequential decision-making under dense-time and stochastic environments. When the stochastic evolution of the environment is only available via sampling, model-free reinforcement learning (RL) is the algorithm-of-choice to compute optimal decision sequences. RL, on the other hand, requires the learning objective to be encoded as scalar reward signals. Since doing such translations manually is both tedious and error-prone, a number of techniques have been proposed to translate high-level objectives (expressed in logic or automata formalism) to scalar rewards for discrete-time Markov decision processes. Unfortunately, no automatic translation exists for CTMDPs. We consider CTMDP environments against the learning objectives expressed as omega-regular languages. Omega-regular languages generalize regular languages to infinite-horizon specifications and can express properties given in popular linear-time logic LTL. To accommodate the dense-time nature of CTMDPs, we consider two different semantics of omega-regular objectives: 1) satisfaction semantics where the goal of the learner is to maximize the probability of spending positive time in the good states, and 2) expectation semantics where the goal of the learner is to optimize the long-run expected average time spent in the “good states” of the automaton. We present an approach enabling correct translation to scalar reward signals that can be readily used by off-the-shelf RL algorithms for CTMDPs. We demonstrate the effectiveness of the proposed algorithms by evaluating them on some popular CTMDP benchmarks with omega-regular objectives.
-
Regular decision processes (RDPs) are a subclass of non-Markovian decision processes where the transition and reward functions are guarded by some regular property of the past (a lookback). While RDPs enable intuitive and succinct representation of non-Markovian decision processes, their expressive power coincides with finite-state Markov decision processes (MDPs). We introduce omega-regular decision processes (ODPs) where the non-Markovian aspect of the transition and reward functions is extended to an ω-regular lookahead over the system evolution. Semantically, these lookaheads can be considered as promises made by the decision maker or the learning agent about her future behavior. In particular, we assume that if the promised lookaheads are not fulfilled, then the decision maker receives a payoff of ⊥ (the least desirable payoff), overriding any rewards collected by the decision maker. We enable optimization and learning for ODPs under the discounted-reward objective by reducing them to lexicographic optimization and learning over finite MDPs. We present experimental results demonstrating the effectiveness of the proposed reduction.
-
Endriss, Ulle; Melo, Francisco (Ed.) Alternating-time temporal logic (ATL) extends branching time logic by enabling quantification over paths that result from the strategic choices made by multiple agents in various coalitions within the system. While classical temporal logics express properties of “closed” systems, ATL can express properties of “open” systems resulting from interactions among several agents. Reinforcement learning (RL) is a sampling-based approach to decision-making where learning agents, guided by a scalar reward function, discover optimal policies through repeated interactions with the environment. The challenge of translating high-level objectives into scalar rewards for RL has garnered increased interest, particularly following the success of model-free RL algorithms. This paper presents an approach for deploying model-free RL to verify multi-agent systems against ATL specifications. The key contribution of this paper is a verification procedure for model-free RL of quantitative and non-nested classic ATL properties, based on Q-learning, demonstrated on a natural subclass of non-nested ATL formulas.