Regular decision processes (RDPs) are a subclass of non-Markovian decision processes where the transition and reward functions are guarded by some regular property of the past (a lookback). While RDPs enable an intuitive and succinct representation of non-Markovian decision processes, their expressive power coincides with that of finite-state Markov decision processes (MDPs). We introduce omega-regular decision processes (ODPs), where the non-Markovian aspect of the transition and reward functions is extended to an ω-regular lookahead over the system evolution. Semantically, these lookaheads can be considered as promises made by the decision maker or the learning agent about her future behavior. In particular, we assume that if the promised lookaheads are not fulfilled, then the decision maker receives a payoff of ⊥ (the least desirable payoff), overriding any rewards collected by the decision maker. We enable optimization and learning for ODPs under the discounted-reward objective by reducing them to lexicographic optimization and learning over finite MDPs. We present experimental results demonstrating the effectiveness of the proposed reduction.
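The ⊥-override semantics described above can be sketched as a payoff function over a finite trajectory prefix. All names here are illustrative, not the paper's notation; ⊥ is modeled as negative infinity so it dominates any discounted sum.

```python
from math import inf

BOTTOM = -inf  # stand-in for the least desirable payoff ⊥


def odp_payoff(rewards, promises_fulfilled, gamma=0.9):
    """Discounted-reward payoff of a trajectory prefix, overridden
    by ⊥ when any promised lookahead is violated.

    rewards: per-step scalar rewards.
    promises_fulfilled: per-step booleans indicating whether the
    lookahead promised at that step held (hypothetical encoding).
    """
    if not all(promises_fulfilled):
        # A broken promise overrides all collected rewards.
        return BOTTOM
    return sum(gamma ** t * r for t, r in enumerate(rewards))
```

With `gamma=0.5`, a trajectory with rewards `[1, 1]` and all promises kept yields `1 + 0.5 = 1.5`; if any promise is broken, the payoff collapses to ⊥ regardless of the rewards.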

Linear temporal logic (LTL) and ω-regular objectives—a superset of LTL—have seen recent use as a way to express non-Markovian objectives in reinforcement learning. We introduce a model-based probably approximately correct (PAC) learning algorithm for ω-regular objectives in Markov decision processes (MDPs). As part of the development of our algorithm, we introduce the ε-recurrence time: a measure of the speed at which a policy converges to the satisfaction of the ω-regular objective in the limit. We prove that our algorithm requires only a polynomial number of samples in the relevant parameters, and perform experiments that confirm our theory.

We present a modular approach to reinforcement learning (RL) in environments consisting of simpler components evolving in parallel. A monolithic view of such modular environments may be prohibitively large to learn, or may require unrealizable communication between the components in the form of a centralized controller. Our proposed approach is based on the assume-guarantee paradigm, where the optimal control for the individual components is synthesized in isolation by making assumptions about the behaviors of neighboring components and providing guarantees about their own behavior. We express these assume-guarantee contracts as regular languages and provide automatic translations to scalar rewards to be used in RL. By combining local probabilities of satisfaction for each component, we provide a lower bound on the probability of satisfaction of the complete system. By solving a Markov game for each component, RL can produce a controller for each component that maximizes this lower bound. The controller utilizes the information it receives through communication, observations, and any knowledge of a coarse model of other agents. We experimentally demonstrate the efficiency of the proposed approach on a variety of case studies.
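One simple way to combine local satisfaction probabilities into a bound on the whole system, as the abstract describes, is a product over components; this is a hedged sketch of that combination step only (the paper's exact rule may differ), with illustrative names.

```python
def system_lower_bound(local_probs):
    """Conservative lower bound on the probability that the complete
    system satisfies its specification, given per-component
    probabilities of satisfying their local assume-guarantee
    contracts. A product is the natural combination when each local
    contract must hold for the global guarantee (illustrative form).
    """
    bound = 1.0
    for p in local_probs:
        bound *= p
    return bound
```

For example, two components whose contracts hold with probabilities 0.9 and 0.8 give a system-level lower bound of 0.72; maximizing each local probability in isolation therefore pushes up the global bound.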

Reinforcement learning (RL) is a powerful approach for training agents to perform tasks, but designing an appropriate reward mechanism is critical to its success. However, in many cases, the complexity of the learning objectives goes beyond the capabilities of the Markovian assumption, necessitating a more sophisticated reward mechanism. Reward machines and ω-regular languages are two formalisms used to express non-Markovian rewards for quantitative and qualitative objectives, respectively. This paper introduces ω-regular reward machines, which integrate reward machines with ω-regular languages to enable an expressive and effective reward mechanism for RL. We present a model-free RL algorithm to compute ε-optimal strategies against ω-regular reward machines and evaluate the effectiveness of the proposed algorithm through experiments.
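For readers unfamiliar with the reward-machine formalism mentioned above, it can be sketched as a finite automaton whose transitions, driven by labels observed in the environment, emit scalar rewards. The class and transition table below are a minimal illustration, not the paper's API.

```python
class RewardMachine:
    """Minimal reward machine: a finite-state transducer mapping
    observed labels to scalar rewards (illustrative sketch)."""

    def __init__(self, delta, init):
        # delta: (state, label) -> (next_state, reward)
        self.delta = delta
        self.state = init

    def step(self, label):
        """Advance on one observed label and return the emitted reward."""
        self.state, reward = self.delta[(self.state, label)]
        return reward


# Example: reward 1 the first time 'goal' is observed, 0 afterwards --
# a non-Markovian reward, since it depends on the history of labels.
delta = {
    ('u0', 'goal'):  ('u1', 1.0),
    ('u0', 'other'): ('u0', 0.0),
    ('u1', 'goal'):  ('u1', 0.0),
    ('u1', 'other'): ('u1', 0.0),
}
```

In learning, such a machine typically runs in product with the MDP, so the agent's effective state is the pair (MDP state, machine state).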

Enea, C.; Lal, A. (Eds.) The difficulty of manually specifying reward functions has led to an interest in using linear temporal logic (LTL) to express objectives for reinforcement learning (RL). However, LTL has the downside that it is sensitive to small perturbations in the transition probabilities, which prevents probably approximately correct (PAC) learning without additional assumptions. Time discounting provides a way of removing this sensitivity while retaining the high expressivity of the logic. We study the use of discounted LTL for policy synthesis in Markov decision processes with unknown transition probabilities, and show how to reduce discounted LTL to discounted-sum reward via a reward machine when all discount factors are identical.

Huisman, M.; Păsăreanu, C.; Zhan, N. (Eds.) We study the problem of finding optimal strategies in Markov decision processes with lexicographic ω-regular objectives, which are ordered collections of ordinary ω-regular objectives. The goal is to compute strategies that maximise the probability of satisfaction of the first ω-regular objective; subject to that, the strategy should also maximise the probability of satisfaction of the second ω-regular objective; then the third, and so forth. For instance, one may want to guarantee critical requirements first, functional ones second, and only then focus on the non-functional ones. We show how to harness the classic off-the-shelf model-free reinforcement learning techniques to solve this problem and evaluate their performance on four case studies.
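The core of the lexicographic ordering described above is how two vectors of satisfaction probabilities are compared: the first objective dominates, ties fall through to the next. This sketch shows only that comparison (with a hypothetical tolerance `eps` for ties); the paper wraps such tie-breaking inside reinforcement learning.

```python
def lex_better(v, w, eps=1e-9):
    """Return True iff value vector v is lexicographically better
    than w: strictly better on the first objective where they differ
    by more than eps (illustrative tie-breaking core)."""
    for a, b in zip(v, w):
        if a > b + eps:
            return True
        if b > a + eps:
            return False
    return False  # equal (up to eps) on all objectives
```

For example, a strategy achieving probabilities `[0.95, 0.0]` is preferred to one achieving `[0.9, 1.0]`: the critical first objective outweighs any gain on later ones.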

Silva, A.; Leino, K.R.M. (Eds.) We study reinforcement learning for the optimal control of Branching Markov Decision Processes (BMDPs), a natural extension of (multi-type) Branching Markov Chains (BMCs). The state of a (discrete-time) BMC is a collection of entities of various types that, while spawning other entities, generate a payoff. In comparison with BMCs, where the evolution of each entity of the same type follows the same probabilistic pattern, BMDPs allow an external controller to pick from a range of options. This permits us to study the best/worst behaviour of the system. We generalise model-free reinforcement learning techniques to compute an optimal control strategy of an unknown BMDP in the limit. We present results of an implementation that demonstrates the practicality of the approach.
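The branching dynamics described above can be sketched as one synchronous generation step: every entity, under the action chosen for its type, yields a payoff and spawns a multiset of offspring types. This is a hedged toy model with invented names, not the paper's formalism.

```python
import random


def bmdp_generation(entities, policy, spawn, rng=random.Random(0)):
    """Simulate one generation of a toy branching MDP.

    entities: list of entity types currently alive.
    policy: type -> chosen action (the controller's decision).
    spawn: (type, action) -> (payoff, list of possible offspring
           lists, sampled uniformly here for illustration).
    Returns the total payoff of the generation and the next
    generation's entities.
    """
    payoff, next_entities = 0.0, []
    for t in entities:
        action = policy[t]
        reward, offspring_options = spawn[(t, action)]
        payoff += reward
        next_entities.extend(rng.choice(offspring_options))
    return payoff, next_entities
```

For instance, a type-A entity whose chosen action yields payoff 1 and deterministically spawns one A and one B produces the next generation `['A', 'B']`; iterating this step gives the population-level evolution the controller optimizes over.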