skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: A PAC Learning Algorithm for LTL and Omega-Regular Objectives in MDPs
Linear temporal logic (LTL) and ω-regular objectives—a su- perset of LTL—have seen recent use as a way to express non-Markovian objectives in reinforcement learning. We in- troduce a model-based probably approximately correct (PAC) learning algorithm for ω-regular objectives in Markov deci- sion processes (MDPs). As part of the development of our algorithm, we introduce the ε-recurrence time: a measure of the speed at which a policy converges to the satisfaction of the ω-regular objective in the limit. We prove that our algo- rithm only requires a polynomial number of samples in the relevant parameters, and perform experiments which confirm our theory.  more » « less
Award ID(s):
2146563 2009022
PAR ID:
10526017
Author(s) / Creator(s):
; ;
Publisher / Repository:
The Thirty-Eighth AAAI Conference on Artificial Intelligence (AAAI-24)
Date Published:
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Goal-conditioned reinforcement learning (RL) is a powerful approach for learning general-purpose skills by reaching diverse goals. However, it has limitations when it comes to task-conditioned policies, where goals are specified by temporally extended instructions written in the Linear Temporal Logic (LTL) formal language. Existing approaches for finding LTL-satisfying policies rely on sampling a large set of LTL instructions during training to adapt to unseen tasks at inference time. However, these approaches do not guarantee generalization to out-of-distribution LTL objectives, which may have increased complexity. In this paper, we propose a novel approach to address this challenge. We show that simple goal-conditioned RL agents can be instructed to follow arbitrary LTL specifications without additional training over the LTL task space. Unlike existing approaches that focus on LTL specifications expressible as regular expressions, our technique is unrestricted and generalizes to ω-regular expressions. Experiment results demonstrate the effectiveness of our approach in adapting goal-conditioned RL agents to satisfy complex temporal logic task specifications zero-shot. 
    more » « less
  2. Reinforcement learning (RL) is a powerful approach for training agents to perform tasks, but designing an appropriate re- ward mechanism is critical to its success. However, in many cases, the complexity of the learning objectives goes beyond the capabili- ties of the Markovian assumption, necessitating a more sophisticated reward mechanism. Reward machines and ω-regular languages are two formalisms used to express non-Markovian rewards for quantita- tive and qualitative objectives, respectively. This paper introduces ω- regular reward machines, which integrate reward machines with ω- regular languages to enable an expressive and effective reward mech- anism for RL. We present a model-free RL algorithm to compute ε-optimal strategies against ω-regular reward machines and evaluate the effectiveness of the proposed algorithm through experiments. 
    more » « less
  3. The expanding role of reinforcement learning (RL) in safety-critical system design has promoted ω-automata as a way to express learning requirements—often non-Markovian—with greater ease of expression and interpretation than scalar reward signals. However, real-world sequential decision making situations often involve multiple, potentially conflicting, objectives. Two dominant approaches to express relative preferences over multiple objectives are: (1)weighted preference, where the decision maker provides scalar weights for various objectives, and (2)lexicographic preference, where the decision maker provides an order over the objectives such that any amount of satisfaction of a higher-ordered objective is preferable to any amount of a lower-ordered one. In this article, we study and develop RL algorithms to compute optimal strategies in Markov decision processes against multiple ω-regular objectives under weighted and lexicographic preferences. We provide a translation from multiple ω-regular objectives to a scalar reward signal that is bothfaithful(maximising reward means maximising probability of achieving the objectives under the corresponding preference) andeffective(RL quickly converges to optimal strategies). We have implemented the translations in a formal reinforcement learning tool,Mungojerrie, and we present an experimental evaluation of our technique on benchmark learning problems. 
    more » « less
  4. Huisman, M.; Păsăreanu, C.; Zhan, N. (Ed.)
    We study the problem of finding optimal strategies in Markov decision processes with lexicographic ω-regular objectives, which are ordered collections of ordinary ω-regular objectives. The goal is to compute strategies that maximise the probability of satisfaction of the first 𝜔-regular objective; subject to that, the strategy should also maximise the probability of satisfaction of the second ω-regular objective; then the third and so forth. For instance, one may want to guarantee critical requirements first, functional ones second and only then focus on the non-functional ones. We show how to harness the classic off-the-shelf model-free reinforcement learning techniques to solve this problem and evaluate their performance on four case studies. 
    more » « less
  5. Continuous-time Markov decision processes (CTMDPs) are canonical models to express sequential decision-making under dense-time and stochastic environments. When the stochastic evolution of the environment is only available via sampling, model-free reinforcement learning (RL) is the algorithm-of-choice to compute optimal decision sequence. RL, on the other hand, requires the learning objective to be encoded as scalar reward signals. Since doing such transla- tions manually is both tedious and error-prone, a number of techniques have been proposed to translate high-level objec- tives (expressed in logic or automata formalism) to scalar re- wards for discrete-time Markov decision processes. Unfortu- nately, no automatic translation exists for CTMDPs. We consider CTMDP environments against the learning objectives expressed as omega-regular languages. Omega- regular languages generalize regular languages to infinite- horizon specifications and can express properties given in popular linear-time logic LTL. To accommodate the dense- time nature of CTMDPs, we consider two different semantics of omega-regular objectives: 1) satisfaction semantics where the goal of the learner is to maximize the probability of spend- ing positive time in the good states, and 2) expectation seman- tics where the goal of the learner is to optimize the long-run expected average time spent in the “good states” of the au- tomaton. We present an approach enabling correct translation to scalar reward signals that can be readily used by off-the- shelf RL algorithms for CTMDPs. We demonstrate the effec- tiveness of the proposed algorithms by evaluating it on some popular CTMDP benchmarks with omega-regular objectives. 
    more » « less