Sankaranarayanan, S.; Sharygina, N. (Eds.)
Mungojerrie is an extensible tool that provides a framework to translate linear-time objectives into rewards for reinforcement learning (RL). The tool provides convergent RL algorithms for stochastic games, reference implementations of existing reward translations for omega-regular objectives, and an internal probabilistic model checker for omega-regular objectives. This functionality is modular and operates on shared data structures, which enables fast development of new translation techniques. Mungojerrie supports finite models specified in PRISM and omega-automata specified in the HOA format, with an integrated command-line interface to external linear temporal logic translators. Mungojerrie is distributed with a set of benchmarks for omega-regular objectives in RL.
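The core idea of such a reward translation can be illustrated with a minimal sketch (all names and interfaces here are illustrative, not Mungojerrie's actual API): a deterministic Büchi automaton is wrapped as a "reward machine" that emits reward 1 exactly on accepting transitions, so runs satisfying an objective like "infinitely often a" accrue reward infinitely often.

```python
# Illustrative sketch only; Mungojerrie's real interfaces differ.
class BuchiRewardMachine:
    """Emits reward 1.0 on accepting transitions, 0.0 otherwise."""

    def __init__(self, delta, accepting, start):
        self.delta = delta          # (state, label) -> next state
        self.accepting = accepting  # set of accepting (state, label) pairs
        self.state = start

    def step(self, label):
        """Advance on one observation label and return the scalar reward."""
        reward = 1.0 if (self.state, label) in self.accepting else 0.0
        self.state = self.delta[(self.state, label)]
        return reward

# One-state automaton for "infinitely often a": accepting on reading 'a'.
delta = {(0, 'a'): 0, (0, 'b'): 0}
rm = BuchiRewardMachine(delta, accepting={(0, 'a')}, start=0)
rewards = [rm.step(label) for label in ['a', 'b', 'a']]
# rewards == [1.0, 0.0, 1.0]
```

An RL agent maximizing the accumulated reward from such a machine is steered toward runs that take accepting transitions repeatedly.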

Koyejo, S.; Mohamed, S.; Agarwal, A.; Belgrave, D.; Cho, K.; Oh, A. (Eds.)
Recursion is the fundamental paradigm for finitely describing potentially infinite objects. As state-of-the-art reinforcement learning (RL) algorithms cannot directly reason about recursion, they must rely on the practitioner's ingenuity in designing a suitable "flat" representation of the environment. The resulting manual feature constructions and approximations are cumbersome and error-prone; their lack of transparency hampers scalability. To overcome these challenges, we develop RL algorithms capable of computing optimal policies in environments described as a collection of Markov decision processes (MDPs) that can recursively invoke one another. Each constituent MDP is characterized by several entry and exit points that correspond to input and output values of these invocations. These recursive MDPs (or RMDPs) are expressively equivalent to probabilistic pushdown systems (with the call stack playing the role of the pushdown stack) and can model probabilistic programs with recursive procedure calls. We introduce Recursive Q-learning, a model-free RL algorithm for RMDPs, and prove that it converges for finite, single-exit and deterministic multi-exit RMDPs under mild assumptions.
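The RMDP structure described above can be sketched with a small interpreter (names are illustrative, and the sketch is deterministic and policy-driven for simplicity): each component has an entry state, and an action's outcome either moves within the component, calls another component (pushing a return state onto the call stack), or exits (popping the stack).

```python
# Hedged sketch of an RMDP-style environment; not the paper's implementation.
def run_rmdp(components, root, policy, max_steps=100):
    """Execute a deterministic RMDP under a fixed policy.

    components: name -> {state: {action: outcome}}, where an outcome is
      ('move', next_state), ('call', component, return_state), or ('exit',).
    """
    stack = []                       # frames: (component, state to resume at)
    comp, state = root, 'entry'
    for _ in range(max_steps):
        outcome = components[comp][state][policy(comp, state)]
        if outcome[0] == 'move':
            state = outcome[1]
        elif outcome[0] == 'call':
            stack.append((comp, outcome[2]))  # resume here after the call
            comp, state = outcome[1], 'entry'
        else:                        # 'exit': pop the call stack
            if not stack:
                return 'done'
            comp, state = stack.pop()
    return 'timeout'

# Component A calls component B once, then exits.
components = {
    'A': {'entry': {'go': ('call', 'B', 'after')},
          'after': {'go': ('exit',)}},
    'B': {'entry': {'go': ('exit',)}},
}
result = run_rmdp(components, 'A', lambda comp, state: 'go')
# result == 'done'
```

The unbounded call stack is exactly what gives RMDPs pushdown expressiveness and what a "flat" MDP encoding must approximate away.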

Bouajjani, A.; Holík, L.; Wu, Z. (Eds.)
The expanding role of reinforcement learning (RL) in safety-critical system design has promoted omega-automata as a way to express learning requirements, often non-Markovian ones, with greater ease of expression and interpretation than scalar reward signals. When omega-automata were first proposed in model-free RL, deterministic Rabin acceptance conditions were used in an attempt to provide a direct translation from omega-automata to finite-state "reward" machines defined over the same automaton structure (a memoryless reward translation). While these initial attempts to provide faithful, memoryless reward translations for Rabin acceptance conditions remained unsuccessful, translations were discovered for other acceptance conditions, such as suitable limit-deterministic Büchi acceptance or, more generally, good-for-MDP Büchi acceptance conditions. Yet the question of whether a memoryless translation of Rabin conditions to scalar rewards exists remained unresolved. This paper presents an impossibility result implying that any attempt to use Rabin automata directly (without extra memory) for model-free RL is bound to fail. To establish this result, we link the class of automata that enable memoryless reward translation to closure properties of their accepting and rejecting infinity sets, and to the insight that both the property and its complement need to allow for positional strategies for such an approach to work. We believe that such impossibility results will provide foundations for the application of RL to safety-critical systems.
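The closure properties involved can be made concrete in a small sketch (an illustration constructed here, not taken from the paper): a Rabin condition accepts a run whose infinity set avoids some E_i and meets the matching F_i, and the family of accepted infinity sets need not be closed under union, the kind of structural obstacle that stands in the way of a memoryless reward translation.

```python
def rabin_accepts(inf_set, pairs):
    """A run with infinity set inf_set satisfies a Rabin condition
    {(E_i, F_i)} iff, for some pair i, inf_set avoids E_i and meets F_i."""
    return any(not (inf_set & E) and bool(inf_set & F) for E, F in pairs)

# Two Rabin pairs over the alphabet {a, b}.
pairs = [({'b'}, {'a'}), ({'a'}, {'b'})]

# {a} and {b} are each accepted, yet their union {a, b} is rejected:
# the accepted infinity sets are not closed under union.
results = (rabin_accepts({'a'}, pairs),
           rabin_accepts({'b'}, pairs),
           rabin_accepts({'a', 'b'}, pairs))
# results == (True, True, False)
```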

Bouajjani, A.; Holík, L.; Wu, Z. (Eds.)
When omega-regular objectives were first proposed in model-free reinforcement learning (RL) for controlling MDPs, deterministic Rabin automata were used in an attempt to provide a direct translation from their transitions to scalar values. While these translations failed, it has turned out that they can be repaired by using good-for-MDPs (GFM) Büchi automata instead. These are nondeterministic Büchi automata with a restricted type of nondeterminism, albeit not as restricted as in good-for-games automata. Indeed, deterministic Rabin automata have a fairly straightforward translation to such GFM automata, which is bilinear in the number of states and pairs. Interestingly, the same cannot be said for deterministic Streett automata: a translation to nondeterministic Rabin or Büchi automata comes at an exponential cost, even without requiring the target automaton to be good for MDPs. Do we have to pay more than that to obtain a good-for-MDPs automaton? The surprising answer is that we have to pay significantly less when we instead extend the good-for-MDPs property to alternating automata: like the nondeterministic GFM automata obtained from deterministic Rabin automata, the alternating good-for-MDPs automata we produce from deterministic Streett automata are bilinear in the size of the deterministic automaton and its index. They can therefore be exponentially more succinct than the minimal nondeterministic Büchi automaton.

Groote, J.F.; Huisman, M. (Eds.)
Reinforcement learning is a successful explore-and-exploit approach, in which a controller tries to learn how to navigate an unknown environment. The principal approach is for an intelligent agent to learn how to maximise expected rewards. But what happens if the objective refers to non-terminating systems? We can obviously not wait until an infinite amount of time has passed, assess the success, and update. But what can we do? This talk will tell.

This paper investigates the use of model-free reinforcement learning to compute the optimal value in two-player stochastic games with parity objectives. In this setting, two decision makers, player Min and player Max, compete on a finite game arena (a stochastic game graph with unknown but fixed probability distributions) to minimize and maximize, respectively, the probability of satisfying a parity objective. We give a reduction from stochastic parity games to a family of stochastic reachability games with a parameter ε, such that the value of a stochastic parity game equals the limit of the values of the corresponding simple stochastic games as the parameter ε tends to 0. Since this reduction does not require knowledge of the probabilistic transition structure of the underlying game arena, model-free reinforcement learning algorithms, such as minimax Q-learning, can be used to approximate the value and mutual best-response strategies for both players in the underlying stochastic parity game. We also present a streamlined reduction from 1½-player parity games to reachability games that avoids recourse to nondeterminism. Finally, we report on the experimental evaluation of both reductions.
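When the arena is known, the reachability games targeted by such a reduction can be solved by a simple minimax value iteration; the sketch below (illustrative names, not the paper's implementation) computes, at each Max, Min, or chance node, the probability of reaching the goal.

```python
def reachability_values(succ, owner, goal, iters=100):
    """Value iteration for a simple stochastic reachability game.

    succ:  state -> list of successor states
    owner: state -> 'max' (player Max), 'min' (player Min), or
           'avg' (chance node with uniform branching)
    """
    v = {s: 0.0 for s in succ}
    v[goal] = 1.0
    for _ in range(iters):
        for s in succ:
            if s == goal or not succ[s]:
                continue  # the goal and sinks keep their values
            vals = [v[t] for t in succ[s]]
            if owner[s] == 'max':
                v[s] = max(vals)
            elif owner[s] == 'min':
                v[s] = min(vals)
            else:
                v[s] = sum(vals) / len(vals)
    return v

# Max chooses at s between a fair coin flip (chance node c) and a
# Min-controlled node m that leads only to the losing sink.
succ = {'s': ['c', 'm'], 'c': ['goal', 'sink'], 'm': ['sink'],
        'goal': [], 'sink': []}
owner = {'s': 'max', 'c': 'avg', 'm': 'min', 'goal': 'max', 'sink': 'max'}
v = reachability_values(succ, owner, 'goal')
# v['s'] == 0.5
```

Model-free minimax Q-learning approximates the same fixed point from sampled transitions alone, without access to `succ`'s probabilities.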