
A barrier certificate, defined over the states of a dynamical system, is a real-valued function whose zero level set characterizes an inductively verifiable state invariant separating reachable states from unsafe ones. When combined with powerful decision procedures, such as sum-of-squares (SOS) programming or satisfiability modulo theories (SMT) solvers, barrier certificates enable an automated deductive verification approach to safety. The barrier certificate approach has been extended to refute LTL and ω-regular specifications by separating consecutive transitions of the corresponding ω-automata in the hope of denying all accepting runs. Unsurprisingly, such tactics are bound to be conservative, as refutation of recurrence properties requires reasoning about the well-foundedness of the transitive closure of the transition relation. This paper introduces the notion of closure certificates as a natural extension of barrier certificates from state invariants to transition invariants. We augment these definitions with SOS- and SMT-based characterizations for automating the search for closure certificates and demonstrate their effectiveness over some case studies.

Free, publicly accessible full text available May 14, 2025.
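For reference, the textbook-style conditions for a barrier certificate B of a discrete-time system x' = f(x), with initial set X_0 and unsafe set X_u, can be sketched as follows (this is an illustrative standard formulation, not the paper's exact definitions):

```latex
% Illustrative conditions: B certifies no trajectory from X_0 reaches X_u.
\begin{align*}
  B(x) &\le 0 && \forall x \in X_0 && \text{(initial states)} \\
  B(x) &> 0  && \forall x \in X_u && \text{(unsafe states)} \\
  B(x) \le 0 \;\Rightarrow\; B(f(x)) &\le 0 && \forall x && \text{(inductiveness)}
\end{align*}
```

The zero sublevel set of B is then an inductive invariant containing all reachable states and disjoint from X_u; SOS or SMT solvers can discharge the three conditions automatically for suitable system classes.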

Regular decision processes (RDPs) are a subclass of non-Markovian decision processes where the transition and reward functions are guarded by some regular property of the past (a lookback). While RDPs enable intuitive and succinct representation of non-Markovian decision processes, their expressive power coincides with finite-state Markov decision processes (MDPs). We introduce omega-regular decision processes (ODPs), where the non-Markovian aspect of the transition and reward functions is extended to an ω-regular lookahead over the system evolution. Semantically, these lookaheads can be considered as promises made by the decision maker or the learning agent about her future behavior. In particular, we assume that if the promised lookaheads are not fulfilled, then the decision maker receives a payoff of ⊥ (the least desirable payoff), overriding any rewards collected by the decision maker. We enable optimization and learning for ODPs under the discounted-reward objective by reducing them to lexicographic optimization and learning over finite MDPs. We present experimental results demonstrating the effectiveness of the proposed reduction.

Free, publicly accessible full text available April 20, 2025.
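The ⊥-overriding payoff described above can be sketched minimally as follows (the names `BOTTOM` and `run_payoff` are illustrative, not the paper's notation):

```python
# Hedged sketch of the payoff rule: an unfulfilled promise overrides any
# collected reward with the least desirable payoff.

BOTTOM = float("-inf")  # stand-in for the least desirable payoff

def run_payoff(promise_fulfilled: bool, discounted_reward: float) -> float:
    # If the promised omega-regular lookahead is violated, the whole
    # run pays BOTTOM regardless of the rewards collected along it.
    return discounted_reward if promise_fulfilled else BOTTOM

# A run that breaks its promise is worse than any promise-keeping run.
assert run_payoff(False, 100.0) < run_payoff(True, -5.0)
```

This ordering (promise fulfillment first, discounted reward second) is what makes the reduction to lexicographic optimization over finite MDPs natural.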

Linear temporal logic (LTL) and ω-regular objectives, a superset of LTL, have seen recent use as a way to express non-Markovian objectives in reinforcement learning. We introduce a model-based probably approximately correct (PAC) learning algorithm for ω-regular objectives in Markov decision processes (MDPs). As part of the development of our algorithm, we introduce the ε-recurrence time: a measure of the speed at which a policy converges to the satisfaction of the ω-regular objective in the limit. We prove that our algorithm only requires a polynomial number of samples in the relevant parameters, and perform experiments which confirm our theory.

Free, publicly accessible full text available February 20, 2025.

Notions of transition invariants and closure certificates have seen recent use in the formal verification of controlled dynamical systems against ω-regular properties. The existing approaches face limitations in two directions. First, they require a closed-form mathematical expression representing the model of the system. Such an expression may be difficult to find, too complex to be of any use, or unavailable due to security or privacy constraints. Second, finding such invariants typically relies on optimization techniques such as sum-of-squares (SOS) programming or satisfiability modulo theories (SMT) solvers. This restricts the classes of systems that can be formally verified. To address these drawbacks, we introduce a notion of neural closure certificates. We present a data-driven algorithm that trains a neural network to represent a closure certificate. Our approach is formally correct under some mild assumptions, i.e., one is able to formally show that the unknown system satisfies the ω-regular property of interest if a neural closure certificate can be computed. Finally, we demonstrate the efficacy of our approach with relevant case studies.

Free, publicly accessible full text available February 20, 2025.

We present a modular approach to reinforcement learning (RL) in environments consisting of simpler components evolving in parallel. A monolithic view of such modular environments may be prohibitively large to learn, or may require unrealizable communication between the components in the form of a centralized controller. Our proposed approach is based on the assume-guarantee paradigm, where the optimal control for the individual components is synthesized in isolation by making assumptions about the behaviors of neighboring components, and providing guarantees about their own behavior. We express these assume-guarantee contracts as regular languages and provide automatic translations to scalar rewards to be used in RL. By combining local probabilities of satisfaction for each component, we provide a lower bound on the probability of satisfaction of the complete system. By solving a Markov game for each component, RL can produce a controller for each component that maximizes this lower bound. The controller utilizes the information it receives through communication, observations, and any knowledge of a coarse model of other agents. We experimentally demonstrate the efficiency of the proposed approach on a variety of case studies.

Free, publicly accessible full text available February 20, 2025.

Reinforcement learning (RL) is a powerful approach for training agents to perform tasks, but designing an appropriate reward mechanism is critical to its success. However, in many cases, the complexity of the learning objectives goes beyond the capabilities of the Markovian assumption, necessitating a more sophisticated reward mechanism. Reward machines and ω-regular languages are two formalisms used to express non-Markovian rewards for quantitative and qualitative objectives, respectively. This paper introduces ω-regular reward machines, which integrate reward machines with ω-regular languages to enable an expressive and effective reward mechanism for RL. We present a model-free RL algorithm to compute ε-optimal strategies against ω-regular reward machines and evaluate the effectiveness of the proposed algorithm through experiments.
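A plain reward machine, as referenced above, can be sketched as a finite-state transducer over labels: the reward for a step depends on the machine's state, i.e., on the history of observations. The toy machine below (our illustration, not an example from the paper) rewards each occurrence of label "a" eventually followed by "b":

```python
# transitions[(state, label)] = (next_state, reward)
transitions = {
    ("u0", "a"): ("u1", 0.0),  # saw "a"; wait for a matching "b"
    ("u0", "b"): ("u0", 0.0),
    ("u1", "a"): ("u1", 0.0),
    ("u1", "b"): ("u0", 1.0),  # reward: "a" was eventually followed by "b"
}

def total_reward(labels, start="u0"):
    """Accumulate the non-Markovian reward of a finite label sequence."""
    state, total = start, 0.0
    for lab in labels:
        state, r = transitions[(state, lab)]
        total += r
    return total

print(total_reward(["a", "b", "b", "a", "a", "b"]))  # -> 2.0
```

An ω-regular reward machine, per the abstract, additionally couples such quantitative rewards with an ω-regular (qualitative) acceptance condition over infinite runs.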

Continuous-time Markov decision processes (CTMDPs) are canonical models to express sequential decision-making under dense-time and stochastic environments. When the stochastic evolution of the environment is only available via sampling, model-free reinforcement learning (RL) is the algorithm of choice to compute an optimal decision sequence. RL, on the other hand, requires the learning objective to be encoded as scalar reward signals. Since doing such translations manually is both tedious and error-prone, a number of techniques have been proposed to translate high-level objectives (expressed in logic or automata formalisms) to scalar rewards for discrete-time Markov decision processes. Unfortunately, no automatic translation exists for CTMDPs. We consider CTMDP environments against learning objectives expressed as omega-regular languages. Omega-regular languages generalize regular languages to infinite-horizon specifications and can express properties given in the popular linear-time logic LTL. To accommodate the dense-time nature of CTMDPs, we consider two different semantics of omega-regular objectives: 1) satisfaction semantics, where the goal of the learner is to maximize the probability of spending positive time in the good states, and 2) expectation semantics, where the goal of the learner is to optimize the long-run expected average time spent in the "good states" of the automaton. We present an approach enabling correct translation to scalar reward signals that can be readily used by off-the-shelf RL algorithms for CTMDPs. We demonstrate the effectiveness of the proposed algorithms by evaluating them on some popular CTMDP benchmarks with omega-regular objectives.
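The expectation semantics above can be illustrated on a single sampled trajectory: estimate the fraction of (dense) time spent in good automaton states. This is an illustrative sketch (the trajectory encoding and function name are ours, not the paper's algorithm):

```python
# A sampled CTMDP trajectory is encoded as (state, dwell_time) pairs,
# where dwell_time is the continuous time spent in that state.

def avg_time_in_good(trajectory, good_states):
    """Empirical fraction of elapsed time spent in good states."""
    total = sum(dt for _, dt in trajectory)
    good = sum(dt for s, dt in trajectory if s in good_states)
    return good / total if total > 0 else 0.0

traj = [("a", 2.0), ("g", 3.0), ("a", 1.0), ("g", 4.0)]
print(avg_time_in_good(traj, {"g"}))  # -> 0.7
```

Under satisfaction semantics, by contrast, the learner only needs this quantity to be positive with maximal probability, rather than maximized in expectation.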

A basic assumption of traditional reinforcement learning is that the value of a reward does not change once it is received by an agent. The present work forgoes this assumption and considers the situation where the value of a reward decays proportionally to the time elapsed since it was obtained. Emphasizing the inflection point occurring at the time of payment, we use the term asset to refer to a reward that is currently in the possession of an agent. Adopting this language, we initiate the study of depreciating assets within the framework of infinite-horizon quantitative optimization. In particular, we propose a notion of asset depreciation, inspired by classical exponential discounting, where the value of an asset is scaled by a fixed discount factor at each time step after it is obtained by the agent. We formulate an equational characterization of optimality in this context, establish that optimal values and policies can be computed efficiently, and develop a model-free reinforcement learning approach to obtain optimal policies.
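The depreciation notion described above can be made concrete in a few lines: an asset of value r obtained at step t is worth r * gamma**(T - t) at a later step T. This is a minimal sketch under that reading of the abstract (function and variable names are illustrative):

```python
def wealth_at(rewards, gamma, T):
    """Total depreciated value, at step T, of rewards received at steps 0..T.

    rewards[t] is the payment received at step t; each payment is scaled
    by the discount factor gamma once per elapsed step after receipt.
    """
    return sum(r * gamma ** (T - t) for t, r in enumerate(rewards[: T + 1]))

# An asset of value 10 received at step 0 with gamma = 0.5 is worth
# 10 * 0.5**2 = 2.5 two steps later.
print(wealth_at([10, 0, 0], 0.5, 2))  # -> 2.5
```

Note the contrast with classical discounting, which scales a reward by how far in the *future* it is received; here the scaling depends on how long ago the asset was *obtained*.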

Enea, C.; Lal, A. (Eds.) The difficulty of manually specifying reward functions has led to an interest in using linear temporal logic (LTL) to express objectives for reinforcement learning (RL). However, LTL has the downside that it is sensitive to small perturbations in the transition probabilities, which prevents probably approximately correct (PAC) learning without additional assumptions. Time discounting provides a way of removing this sensitivity, while retaining the high expressivity of the logic. We study the use of discounted LTL for policy synthesis in Markov decision processes with unknown transition probabilities, and show how to reduce discounted LTL to discounted-sum reward via a reward machine when all discount factors are identical.

The successes of reinforcement learning in recent years are underpinned by the characterization of suitable reward functions. However, in settings where such rewards are non-intuitive, difficult to define, or otherwise error-prone in their definition, it is useful to instead learn the reward signal from expert demonstrations. This is the crux of inverse reinforcement learning (IRL). While eliciting learning requirements in the form of scalar reward signals has been shown to be effective, such representations lack explainability and lead to opaque learning. We aim to mitigate this situation by presenting a novel IRL method for eliciting declarative learning requirements in the form of a popular formal logic, Linear Temporal Logic (LTL), from a set of traces given by the expert policy.