Integrity constraints such as functional dependencies (FD) and multivalued dependencies (MVD)
are fundamental in database schema design. Likewise, probabilistic conditional independences (CI)
are crucial for reasoning about multivariate probability distributions. The implication problem
studies whether a set of constraints (antecedents) implies another constraint (consequent), and
has been investigated in both the database and the AI literature, under the assumption that all
constraints hold exactly. However, many applications today consider constraints that hold only
approximately. In this paper we define an approximate implication as a linear inequality between
the degree of satisfaction of the antecedents and consequent, and we study the relaxation problem:
when does an exact implication relax to an approximate implication? We use information theory
to define the degree of satisfaction, and prove several results. First, we show that any implication
from a set of data dependencies (MVDs+FDs) can be relaxed to a simple linear inequality with
a factor at most quadratic in the number of variables; when the consequent is an FD, the factor
can be reduced to 1. Second, we prove that there exists an implication between CIs that does not
admit any relaxation; however, we prove that every implication between CIs relaxes “in the limit”.
Finally, we show that the implication problem for differential constraints in market basket analysis
also admits a relaxation with a factor equal to 1. Our results recover, and sometimes extend, several
previously known results about the implication problem: implication of MVDs can be checked by
considering only 2-tuple relations, and the implication of differential constraints for frequent item
sets can be checked by considering only databases containing a single transaction.
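The relaxation framework can be made concrete in a short derivation (a sketch; the exact normalization and the quadratic bound are as stated in the abstract, while the symbols $h$, $\sigma_i$, $\tau$, $\lambda$ are notation chosen here for illustration):

```latex
% Degree of satisfaction via information measures (sketch):
% an FD X -> Y holds exactly iff H(Y | X) = 0, and an MVD / CI
% (Y independent of Z given X) holds exactly iff I(Y; Z | X) = 0.
%
% Exact implication: (sigma_1 = 0) and ... and (sigma_k = 0) implies tau = 0.
% Its approximate relaxation with factor lambda is the linear inequality
\[
  h(\tau) \;\le\; \lambda \sum_{i=1}^{k} h(\sigma_i),
\]
% where h(sigma) denotes H(Y | X) for an FD and I(Y; Z | X) for an MVD/CI,
% and the results above give lambda at most quadratic in the number of
% variables for MVDs+FDs, improving to lambda = 1 when tau is an FD.
```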
Model-Free Reinforcement Learning for Lexicographic Omega-Regular Objectives
We study the problem of finding optimal strategies in Markov decision processes with lexicographic ω-regular objectives, which are ordered collections of ordinary ω-regular objectives. The goal is to compute strategies that maximise the probability of satisfaction of the first ω-regular objective; subject to that, the strategy should also maximise the probability of satisfaction of the second ω-regular objective; then the third, and so forth. For instance, one may want to guarantee critical requirements first, functional ones second, and only then focus on the non-functional ones. We show how to harness classic off-the-shelf model-free reinforcement learning techniques to solve this problem and evaluate their performance on four case studies.
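The lexicographic ordering on satisfaction probabilities can be sketched as follows (the probability values, function name, and tolerance parameter are illustrative assumptions; the paper's contribution is computing such strategies model-free for ω-regular objectives, not this comparison itself):

```python
# Sketch: compare two strategies by their vectors of satisfaction
# probabilities, one entry per omega-regular objective, in priority order.
# The tolerance `eps` is an illustrative choice, not from the paper.

def lex_better(p, q, eps=1e-9):
    """Return True if probability vector p is lexicographically better
    than q: strictly higher probability on some objective, with
    (near-)equal probabilities on every higher-priority objective."""
    for a, b in zip(p, q):
        if a > b + eps:
            return True
        if b > a + eps:
            return False
    return False  # equal on every objective

# Example: strategy A sacrifices lower-priority objectives to win the
# first (critical) one, so it is lexicographically preferred.
a = [0.99, 0.70, 0.10]
b = [0.95, 0.90, 0.90]
print(lex_better(a, b))  # prints True: the critical objective dominates
```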
 Award ID(s):
 2009022
 NSF-PAR ID:
 10329426
 Editor(s):
 Huisman, M.; Păsăreanu, C.; Zhan, N.
 Date Published:
 Journal Name:
 Formal Methods (FM 2021)
 Page Range / eLocation ID:
 142-159
 Format(s):
 Medium: X
 Sponsoring Org:
 National Science Foundation
More Like this


For a graph G on n vertices, naively sampling the position of a random walk at time t requires work Ω(t). We desire local access algorithms supporting position_G(t) queries, which return the position of a random walk from some fixed start vertex s at time t, where the joint distribution of returned positions is 1/poly(n)-close to that of a uniformly random walk in ℓ1 distance. We first give an algorithm for local access to random walks on a given undirected d-regular graph with Õ(√n/(1−λ)) runtime per query, where λ is the second-largest eigenvalue of the random walk matrix of the graph in absolute value. Since random d-regular graphs G(n, d) are expanders with high probability, this gives an Õ(√n) algorithm for a graph drawn from G(n, d) whp, which improves on the naive method for small numbers of queries. We then prove that no algorithm with sub-constant error given probe access to an input d-regular graph can have runtime better than Ω(√n/log n) per query in expectation when the input graph is drawn from G(n, d), obtaining a nearly matching lower bound. We further show an Ω(n^{1/4}) runtime-per-query lower bound even with an oblivious adversary (i.e., when the query sequence is fixed in advance). We then show that for families of graphs with additional group-theoretic structure, dramatically better results can be achieved. We give local access to walks on small-degree abelian Cayley graphs, including cycles and hypercubes, with runtime polylog(n) per query. This also allows for efficient local access to walks on polylog-degree expanders. We show that our techniques apply to graphs with high degree by extending our results to graphs constructed using the tensor product (giving fast local access to walks on degree-n^ε graphs for any ε ∈ (0, 1]) and the Cartesian product.
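On a cycle the group structure makes the direct-sampling idea concrete: the position after t steps depends only on the number of clockwise steps, which is Binomial(t, 1/2), so position_G(t) can be drawn in one shot instead of simulating all t steps. The sketch below illustrates this on the n-cycle with function names of my own; the binomial draw is written as t coin flips for simplicity, whereas a true polylog(n)-time sampler would be needed for the paper's bound.

```python
import random

def naive_position(n, s, t, rng):
    """Simulate every step of a simple walk on the n-cycle: Omega(t) work."""
    pos = s
    for _ in range(t):
        pos = (pos + rng.choice((-1, 1))) % n
    return pos

def direct_position(n, s, t, rng):
    """Sample the same distribution via one binomial draw: with k
    clockwise steps out of t, the displacement is k - (t - k) = 2k - t.
    (The sum below is a stand-in for a fast Binomial(t, 1/2) sampler.)"""
    k = sum(rng.random() < 0.5 for _ in range(t))
    return (s + 2 * k - t) % n

rng = random.Random(0)
# Sanity check: both samplers respect the parity constraint of a
# non-lazy walk: after t steps the position differs from s by t mod 2.
for t in (1, 2, 7, 10):
    assert (naive_position(8, 0, t, rng) - t) % 2 == 0
    assert (direct_position(8, 0, t, rng) - t) % 2 == 0
```

The general abelian Cayley case in the paper is more involved (several generators, and accuracy guarantees in ℓ1 distance), but the collapse of a long step sequence into a small number of counts is the same phenomenon.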

Integrity constraints such as functional dependencies (FD) and multivalued dependencies (MVD) are fundamental in database schema design. Likewise, probabilistic conditional independences (CI) are crucial for reasoning about multivariate probability distributions. The implication problem studies whether a set of constraints (antecedents) implies another constraint (consequent), and has been investigated in both the database and the AI literature, under the assumption that all constraints hold exactly. However, many applications today consider constraints that hold only approximately. In this paper we define an approximate implication as a linear inequality between the degree of satisfaction of the antecedents and consequent, and we study the relaxation problem: when does an exact implication relax to an approximate implication? We use information theory to define the degree of satisfaction, and prove several results. First, we show that any implication from a set of data dependencies (MVDs+FDs) can be relaxed to a simple linear inequality with a factor at most quadratic in the number of variables; when the consequent is an FD, the factor can be reduced to 1. Second, we prove that there exists an implication between CIs that does not admit any relaxation; however, we prove that every implication between CIs relaxes "in the limit". Then, we show that the implication problem for differential constraints in market basket analysis also admits a relaxation with a factor equal to 1. Finally, we show how some of the results in the paper can be derived using the I-measure theory, which relates information-theoretic measures and set theory. Our results recover, and sometimes extend, previously known results about the implication problem: the implication of MVDs and FDs can be checked by considering only 2-tuple relations.

With the increasing adoption of smart home devices, users rely on device automation to control their homes. This automation commonly comes in the form of smart home routines, an abstraction available via major vendors. Yet, questions remain about how a system should best handle conflicts in which different routines access the same devices simultaneously. In particular, among the myriad ways a smart home system could handle conflicts, which of them are currently utilized by existing systems, and which ones result in the highest user satisfaction? We investigate the first question via a survey of existing literature and find a set of conditions, modifications, and system strategies related to handling conflicts. We answer the second question via a scenario-based Mechanical Turk survey of users interested in owning smart home devices and current smart home device owners (N=197). We find that: (i) there is no context-agnostic strategy that always results in high user satisfaction, and (ii) users' personal values frequently form the basis for shaping their expectations of how routines should execute.
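One concrete form of the conflicts studied, two routines commanding the same device to different states in overlapping time windows, can be sketched as follows (the `Routine` data model and all device names are my invention for illustration, not an API from the paper or any vendor):

```python
# Sketch: detect routines that command the same device, to different
# states, while their execution windows overlap.  The data shapes here
# are illustrative assumptions, not from the study.
from dataclasses import dataclass

@dataclass
class Routine:
    name: str
    actions: dict   # device name -> commanded state
    window: tuple   # (start, end) in minutes since midnight

def conflicts(r1, r2):
    """Return the devices both routines command to different states
    during overlapping execution windows."""
    overlap = r1.window[0] < r2.window[1] and r2.window[0] < r1.window[1]
    if not overlap:
        return []
    return [d for d in r1.actions
            if d in r2.actions and r1.actions[d] != r2.actions[d]]

night = Routine("night", {"thermostat": 65, "porch_light": "on"}, (1320, 1380))
movie = Routine("movie", {"thermostat": 72, "tv": "on"}, (1350, 1470))
print(conflicts(night, movie))  # prints ['thermostat']
```

Detection like this is only the starting point; the study's findings concern which resolution strategy (e.g. first-come-first-served, priority, user prompt) users actually find satisfactory once such a conflict is found.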

Election control considers the problem of an adversary who attempts to tamper with a voting process, in order to either ensure that their favored candidate wins (constructive control) or another candidate loses (destructive control). As online social networks have become significant sources of information for potential voters, a new tool in an attacker's arsenal is to effect control by harnessing social influence, for example, by spreading fake news and other forms of misinformation through online social media. We consider the computational problem of election control via social influence, studying the conditions under which finding good adversarial strategies is computationally feasible. We consider two objectives for the adversary in both the constructive and destructive control settings: probability and margin of victory (POV and MOV, respectively). We present several strong negative results, showing, for example, that the problem of maximizing POV is inapproximable for any constant factor. On the other hand, we present approximation algorithms which provide somewhat weaker approximation guarantees, such as bicriteria approximations for the POV objective and constant-factor approximations for MOV. Finally, we present mixed integer programming formulations for these problems. Experimental results show that our approximation algorithms often find near-optimal control strategies, indicating that election control through social influence is a salient threat to election integrity.
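The margin-of-victory objective can be illustrated on toy vote totals (in the paper the adversary optimizes this quantity, or the probability of victory, in expectation over the randomness of the influence cascade; the deterministic computation below and all names in it are illustrative only):

```python
# Sketch: margin of victory (MOV) for a favored candidate on toy vote
# totals.  A constructive adversary wants this maximized for their
# candidate; a destructive adversary wants it driven negative for a rival.

def margin_of_victory(votes, favored):
    """Favored candidate's votes minus the best rival's votes.
    Positive => favored wins by that margin; negative => trails by it."""
    best_rival = max(v for c, v in votes.items() if c != favored)
    return votes[favored] - best_rival

votes = {"A": 480, "B": 455, "C": 65}
print(margin_of_victory(votes, "A"))  # prints 25: A beats B by 25 votes
print(margin_of_victory(votes, "C"))  # prints -415: C trails the leader
```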