Sankaranarayanan, S.; Sharygina, N. (Eds.)
Mungojerrie is an extensible tool that provides a framework to translate linear-time objectives into rewards for reinforcement learning (RL). The tool provides convergent RL algorithms for stochastic games, reference implementations of existing reward translations for omega-regular objectives, and an internal probabilistic model checker for omega-regular objectives. This functionality is modular and operates on shared data structures, which enables fast development of new translation techniques. Mungojerrie supports finite models specified in PRISM and omega-automata specified in the HOA format, with an integrated command-line interface to external linear temporal logic translators. Mungojerrie is distributed with a set of benchmarks for omega-regular objectives in RL.
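A common ingredient in reward translations like these is running the environment in lockstep with an automaton that monitors the objective. A minimal Python sketch of that product construction, with all function names hypothetical (this is not Mungojerrie's API):

```python
# Hypothetical sketch: stepping the product of an MDP and a deterministic
# automaton that monitors an omega-regular objective.

def product_step(mdp_step, automaton_delta, label, state, q, action):
    """Advance the environment and the monitoring automaton in lockstep.

    mdp_step(state, action)     -> next environment state (assumed)
    automaton_delta(q, letter)  -> next automaton state (assumed)
    label(state)                -> atomic propositions holding in a state
    """
    next_state = mdp_step(state, action)
    next_q = automaton_delta(q, label(next_state))
    return next_state, next_q

# Toy example: two-state environment; the automaton remembers whether
# the proposition "goal" has ever been observed.
mdp_step = lambda s, a: (s + a) % 2          # action 1 toggles the state
label = lambda s: frozenset({"goal"}) if s == 1 else frozenset()
delta = lambda q, letter: 1 if (q == 1 or "goal" in letter) else 0

s, q = 0, 0
s, q = product_step(mdp_step, delta, label, s, q, action=1)
print(s, q)  # -> 1 1
```

The reward translation then assigns scalar rewards to transitions of this product rather than of the original MDP.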

Koyejo, S.; Mohamed, S.; Agarwal, A.; Belgrave, D.; Cho, K.; Oh, A. (Eds.)
Recursion is the fundamental paradigm to finitely describe potentially infinite objects. As state-of-the-art reinforcement learning (RL) algorithms cannot directly reason about recursion, they must rely on the practitioner's ingenuity in designing a suitable "flat" representation of the environment. The resulting manual feature constructions and approximations are cumbersome and error-prone; their lack of transparency hampers scalability. To overcome these challenges, we develop RL algorithms capable of computing optimal policies in environments described as a collection of Markov decision processes (MDPs) that can recursively invoke one another. Each constituent MDP is characterized by several entry and exit points that correspond to input and output values of these invocations. These recursive MDPs (or RMDPs) are expressively equivalent to probabilistic pushdown systems (with the call stack playing the role of the pushdown stack), and can model probabilistic programs with recursive procedural calls. We introduce Recursive Q-learning, a model-free RL algorithm for RMDPs, and prove that it converges for finite, single-exit and deterministic multi-exit RMDPs under mild assumptions.
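Recursive Q-learning extends the standard tabular Q-learning update to the call/return structure of RMDPs. As a point of reference, a minimal sketch of the flat base update it builds on (not the recursive variant itself):

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step: move Q[s, a] toward the bootstrapped
    target r + gamma * max_b Q[s_next, b]."""
    target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q

Q = defaultdict(float)
# Single transition in a toy MDP: from state 0, action "go" yields reward 1.
Q = q_update(Q, s=0, a="go", r=1.0, s_next=1, actions=["go", "stay"])
print(Q[(0, "go")])  # -> 0.1  (alpha * reward, since Q started at zero)
```

In the recursive setting, the value of invoking a sub-MDP additionally depends on which exit the invocation eventually takes, which is what the paper's algorithm accounts for.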

Bouajjani, A.; Holík, L.; Wu, Z. (Eds.)
The expanding role of reinforcement learning (RL) in safety-critical system design has promoted omega-automata as a way to express learning requirements—often non-Markovian—with greater ease of expression and interpretation than scalar reward signals. When omega-automata were first proposed in model-free RL, deterministic Rabin acceptance conditions were used in an attempt to provide a direct translation from omega-automata to finite-state "reward" machines defined over the same automaton structure (a memoryless reward translation). While these initial attempts to provide faithful, memoryless reward translations for Rabin acceptance conditions remained unsuccessful, translations were discovered for other acceptance conditions, such as suitable limit-deterministic Büchi acceptance or, more generally, good-for-MDP Büchi acceptance conditions. Yet the question of whether a memoryless translation of Rabin conditions to scalar rewards exists remained unresolved. This paper presents an impossibility result implying that any attempt to use Rabin automata directly (without extra memory) for model-free RL is bound to fail. To establish this result, we show a link between the class of automata enabling memoryless reward translation and closure properties of its accepting and rejecting infinity sets, together with the insight that both the property and its complement need to allow for positional strategies for such an approach to work. We believe that such impossibility results will provide foundations for the application of RL to safety-critical systems.

Bouajjani, A.; Holík, L.; Wu, Z. (Eds.)
When omega-regular objectives were first proposed in model-free reinforcement learning (RL) for controlling MDPs, deterministic Rabin automata were used in an attempt to provide a direct translation from their transitions to scalar values. While these translations failed, it has turned out that it is possible to repair them by using good-for-MDPs (GFM) Büchi automata instead. These are nondeterministic Büchi automata with a restricted type of nondeterminism, albeit not as restricted as in good-for-games automata. Indeed, deterministic Rabin automata have a fairly straightforward translation to such GFM automata, which is bilinear in the number of states and pairs. Interestingly, the same cannot be said for deterministic Streett automata: a translation to nondeterministic Rabin or Büchi automata comes at an exponential cost, even without requiring the target automaton to be good-for-MDPs. Do we have to pay more than that to obtain a good-for-MDPs automaton? The surprising answer is that we have to pay significantly less when we instead extend the good-for-MDPs property to alternating automata: like the nondeterministic GFM automata obtained from deterministic Rabin automata, the alternating good-for-MDPs automata we produce from deterministic Streett automata are bilinear in the size of the deterministic automaton and its index. They can therefore be exponentially more succinct than the minimal nondeterministic Büchi automaton.

Groote, J.F.; Huisman, M. (Eds.)
Reinforcement learning is a successful explore-and-exploit approach, where a controller tries to learn how to navigate an unknown environment. The principal approach is for an intelligent agent to learn how to maximise expected rewards. But what happens if the objective refers to non-terminating systems? We obviously cannot wait until an infinite amount of time has passed, assess the success, and update. So what can we do? This talk will tell.

Translating Omega-Regular Specifications to Average Objectives for Model-Free Reinforcement Learning
Faliszewski, P.; Mascardi, V. (Eds.)
Recent success in reinforcement learning (RL) has brought renewed attention to the design of reward functions by which agent behavior is reinforced or deterred. Manually designing reward functions is tedious and error-prone. An alternative approach is to specify a formal, unambiguous logic requirement, which is automatically translated into a reward function to be learned from. Omega-regular languages, of which Linear Temporal Logic (LTL) is a subset, are a natural choice for specifying such requirements due to their use in verification and synthesis. However, current techniques based on omega-regular languages learn in an episodic manner, whereby the environment is periodically reset to an initial state during learning. In some settings, this assumption is challenging or impossible to satisfy. Instead, in the continuing setting the agent explores the environment without resets over a single lifetime. This is a more natural setting for reasoning about omega-regular specifications defined over infinite traces of agent behavior. Optimizing the average reward instead of the usual discounted reward is more natural in this case due to the infinite-horizon objective, which poses challenges to the convergence of discounted RL solutions. We restrict our attention to the omega-regular languages that correspond to absolute liveness specifications. These specifications cannot be invalidated by any finite prefix of agent behavior, in accordance with the spirit of a continuing problem. We propose a translation from absolute liveness omega-regular languages to an average reward objective for RL. Our reduction can be done on the fly, without full knowledge of the environment, thereby enabling the use of model-free RL algorithms. Additionally, we propose a reward structure that enables RL without episodic resetting in communicating MDPs, unlike previous approaches. We demonstrate empirically with various benchmarks that our proposed method of using average reward RL for continuing tasks defined by omega-regular specifications is more effective than competing approaches that leverage discounted RL.
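Average-reward RL replaces the discounted target with a differential one: the temporal-difference error is computed against a running estimate of the long-run average reward. A minimal sketch of such a differential Q-learning update (a generic textbook form, not the paper's specific algorithm):

```python
from collections import defaultdict

def differential_q_update(Q, rho, s, a, r, s_next, actions,
                          alpha=0.1, beta=0.01):
    """One average-reward (differential) Q-learning step.

    rho is the running estimate of the long-run average reward; the TD
    error uses r - rho in place of a discounted target.
    """
    best_next = max(Q[(s_next, b)] for b in actions)
    delta = r - rho + best_next - Q[(s, a)]
    Q[(s, a)] += alpha * delta
    rho += beta * delta          # update the average-reward estimate
    return Q, rho

Q = defaultdict(float)
Q, rho = differential_q_update(Q, rho=0.0, s=0, a=0, r=1.0, s_next=0,
                               actions=[0, 1])
print(round(Q[(0, 0)], 3), round(rho, 3))  # -> 0.1 0.01
```

Because no discount factor appears, the learned values reflect long-run rates rather than geometrically weighted sums, which matches the continuing, reset-free setting described above.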

Huisman, M.; Păsăreanu, C.; Zhan, N. (Eds.)
We study the problem of finding optimal strategies in Markov decision processes with lexicographic ω-regular objectives, which are ordered collections of ordinary ω-regular objectives. The goal is to compute strategies that maximise the probability of satisfaction of the first ω-regular objective; subject to that, the strategy should also maximise the probability of satisfaction of the second ω-regular objective; then the third, and so forth. For instance, one may want to guarantee critical requirements first, functional ones second, and only then focus on the non-functional ones. We show how to harness classic off-the-shelf model-free reinforcement learning techniques to solve this problem and evaluate their performance on four case studies.
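The lexicographic preference order described here compares vectors of satisfaction probabilities objective by objective. A small illustrative sketch of that comparison (names and tolerance are mine, not from the paper):

```python
def lex_better(p, q, eps=1e-9):
    """Return True if probability vector p is lexicographically preferred
    to q: compare on the first objective; break ties on the second; etc."""
    for x, y in zip(p, q):
        if x > y + eps:
            return True
        if x < y - eps:
            return False
    return False  # equal on all objectives

# Strategy A satisfies the critical (first) objective with higher
# probability, so it is preferred even though B dominates on the rest.
A = (0.9, 0.2, 0.5)
B = (0.8, 0.9, 0.9)
print(lex_better(A, B))  # -> True
```

The learning problem is then to find a strategy whose probability vector is maximal in this order, rather than maximal in a single scalarised objective.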

Silva, A.; Leino, K.R.M. (Eds.)
We study reinforcement learning for the optimal control of Branching Markov Decision Processes (BMDPs), a natural extension of (multi-type) Branching Markov Chains (BMCs). The state of a (discrete-time) BMC is a collection of entities of various types that, while spawning other entities, generate a payoff. In comparison with BMCs, where the evolution of each entity of the same type follows the same probabilistic pattern, BMDPs allow an external controller to pick from a range of options. This permits us to study the best and worst behaviour of the system. We generalise model-free reinforcement learning techniques to compute an optimal control strategy of an unknown BMDP in the limit. We present results of an implementation that demonstrate the practicality of the approach.
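To make the branching dynamics concrete: in one step every entity contributes a payoff and is replaced by a (possibly empty) multiset of children drawn according to its type. A toy sketch of a single BMC step, with all names and the spawn table invented for illustration:

```python
import random

def bmc_step(population, spawn, payoff, rng):
    """One step of a (multi-type) branching Markov chain sketch: each
    entity contributes payoff[type] and is replaced by a uniformly
    chosen child multiset from spawn[type]. (Illustrative only.)"""
    reward = sum(payoff[t] for t in population)
    next_pop = []
    for t in population:
        next_pop.extend(spawn[t][rng.randrange(len(spawn[t]))])
    return next_pop, reward

spawn = {"A": [["A", "B"], []],   # type A either splits into A+B or dies
         "B": [["B"]]}            # type B persists
payoff = {"A": 2.0, "B": 1.0}
pop, r = bmc_step(["A"], spawn, payoff, random.Random(1))
print(r)  # -> 2.0
```

In a BMDP, the controller would additionally choose among several such spawn distributions per type, and the learning problem is to pick these choices to optimise the accumulated payoff.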

Omega-regular properties—specified using linear-time temporal logic or various forms of omega-automata—find increasing use in specifying the objectives of reinforcement learning (RL). The key problem that arises is that of faithfully and effectively translating the objective into a scalar reward for model-free RL. A recent approach exploits Büchi automata with restricted nondeterminism to reduce the search for an optimal policy for an ω-regular property to the search for a simple reachability objective. A possible drawback of this translation is that reachability rewards are sparse, being reaped only at the end of each episode. Another approach reduces the search for an optimal policy to an optimization problem with two interdependent discount parameters. While this approach provides denser rewards than the reduction to reachability, it is not easily mapped to off-the-shelf RL algorithms. We propose a reward scheme that reduces the search for an optimal policy to an optimization problem with a single discount parameter, produces dense rewards, and is compatible with off-the-shelf RL algorithms. Finally, we report an experimental comparison of these and other reward schemes for model-free RL with omega-regular objectives.
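To illustrate why the reachability-based translation yields sparse rewards: in that style of scheme, each accepting transition gives the run a small chance of jumping to a terminal sink where the single unit of reward is reaped, and the run earns nothing otherwise. A loose, hedged sketch of one such step (parameter names and the exact mechanism are illustrative, not the precise published scheme):

```python
import random

def translated_step(env_step, is_accepting, state, action, rng, zeta=0.99):
    """Reachability-style reward translation, sketched: after an accepting
    transition, with probability 1 - zeta the run moves to a terminal sink
    and receives reward 1; otherwise it continues with reward 0."""
    next_state = env_step(state, action)
    if is_accepting(state, action, next_state) and rng.random() > zeta:
        return "SINK", 1.0, True     # episode ends; reward reaped at the end
    return next_state, 0.0, False

# Toy run: every transition is accepting, but reward is still rare (sparse).
s2, r, done = translated_step(lambda s, a: s + 1, lambda *t: True,
                              0, 0, random.Random(0))
print(s2, r, done)  # -> 1 0.0 False
```

The paper's contribution is a scheme that keeps a single discount parameter while spreading reward densely along the run, avoiding this end-of-episode sparsity.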