Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher.
Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?
Some links on this page may take you to nonfederal websites. Their policies may differ from this site.

A basic assumption of traditional reinforcement learning is that the value of a reward does not change once it is received by an agent. The present work forgoes this assumption and considers the situation where the value of a reward decays proportionally to the time elapsed since it was obtained. Emphasizing the inflection point occurring at the time of payment, we use the term asset to refer to a reward that is currently in the possession of an agent. Adopting this language, we initiate the study of depreciating assets within the framework of infinitehorizon quantitative optimization. In particular, we propose a notion of asset depreciation, inspired by classical exponential discounting, where the value of an asset is scaled by a fixed discount factor at each time step after it is obtained by the agent. We formulate an equational characterization of optimality in this context, establish that optimal values and policies can be computed efficiently, and develop a modelfree reinforcement learning approach to obtain optimal policies.more » « lessFree, publiclyaccessible full text available May 30, 2024

The successes of reinforcement learning in recent years are underpinned by the characterization of suitable reward functions. However, in settings where such rewards are nonintuitive, difficult to define, or otherwise errorprone in their definition, it is useful to instead learn the reward signal from expert demonstrations. This is the crux of inverse reinforcement learning (IRL). While eliciting learning requirements in the form of scalar reward signals has been shown to be effective, such representations lack explainability and lead to opaque learning. We aim to mitigate this situation by presenting a novel IRL method for eliciting declarative learning requirements in the form of a popular formal logicLinear Temporal Logic (LTL)from a set of traces given by the expert policy.more » « lessFree, publiclyaccessible full text available May 30, 2024

Sankaranarayanan, S. ; Sharygina, N. (Ed.)Mungojerrie is an extensible tool that provides a framework to translate lineartime objectives into reward for reinforcement learning (RL). The tool provides convergent RL algorithms for stochastic games, reference implementations of existing reward translations for omegaregular objectives, and an internal probabilistic model checker for omegaregular objectives. This functionality is modular and operates on shared data structures, which enables fast development of new translation techniques. Mungojerrie supports finite models specified in PRISM and omegaautomata specified in the HOA format, with an integrated command line interface to external linear temporal logic translators. Mungojerrie is distributed with a set of benchmarks for omegaregular objectives in RL.more » « lessFree, publiclyaccessible full text available April 22, 2024

Chechik, M. ; Katoen, JP. ; Leucker, M. (Ed.)Efficient verification algorithms for neural networks often depend on various abstract domains such as intervals, zonotopes, and linear star sets. The choice of the abstract domain presents an expressiveness vs. scalability tradeoff: simpler domains are less precise but yield faster algorithms. This paper investigates the octatope abstract domain in the context of neural net verification. Octatopes are affine transformations of ndimensional octagons—sets of unittwovariableperinequality (UTVPI) constraints. Octatopes generalize the idea of zonotopes which can be viewed as an affine transformation of a box. On the other hand, octatopes can be considered as a restriction of linear star set, which are affine transformations of arbitrary HPolytopes. This distinction places octatopes firmly between zonotopes and star sets in their expressive power, but what about the efficiency of decision procedures? An important analysis problem for neural networks is the exact range computation problem that asks to compute the exact set of possible outputs given a set of possible inputs. For this, three computational procedures are needed: 1) optimization of a linear cost function; 2) affine mapping; and 3) overapproximating the intersection with a halfspace. While zonotopes allow an efficient solution for these approaches, star sets solves these procedures via linear programming. We show that these operations are faster for octatopes than the more expressive linear star sets. For octatopes, we reduce these problems to mincost flow problems, which can be solved in strongly polynomial time using the OutofKilter algorithm. Evaluating exact range computation on several ACAS Xu neural network benchmarks, we find that octatopes show promise as a practical abstract domain for neural network verification.more » « lessFree, publiclyaccessible full text available March 3, 2024

Koyejo, S ; Mohamed, S. ; Agarwal, A. ; Belgrave, D. ; Cho, K. ; Oh, A. (Ed.)Recursion is the fundamental paradigm to finitely describe potentially infinite objects. As stateoftheart reinforcement learning (RL) algorithms cannot directly reason about recursion, they must rely on the practitioner's ingenuity in designing a suitable "flat" representation of the environment. The resulting manual feature constructions and approximations are cumbersome and errorprone; their lack of transparency hampers scalability. To overcome these challenges, we develop RL algorithms capable of computing optimal policies in environments described as a collection of Markov decision processes (MDPs) that can recursively invoke one another. Each constituent MDP is characterized by several entry and exit points that correspond to input and output values of these invocations. These recursive MDPs (or RMDPs) are expressively equivalent to probabilistic pushdown systems (with callstack playing the role of the pushdown stack), and can model probabilistic programs with recursive procedural calls. We introduce Recursive Qlearninga modelfree RL algorithm for RMDPsand prove that it converges for finite, singleexit and deterministic multiexit RMDPs under mild assumptions.more » « less

Bouajjani, A ; Holík, L. ; Wu, Z. (Ed.)This paper presents an optimization based framework to automate system repair against omegaregular properties. In the proposed formalization of optimal repair, the systems are represented as Kripke structures, the properties as omegaregular languages, and the repair space as repair machines—weighted omegaregular transducers equipped with Büchi conditions—that rewrite strings and associate a cost sequence to these rewritings. To translate the resulting costsequences to easily interpretable payoffs, we consider several aggregator functions to map cost sequences to numbers—including limit superior, supremum, discountedsum, and averagesum—to define quantitative cost semantics. The problem of optimal repair, then, is to determine whether traces from a given system can be rewritten to satisfy an omegaregular property when the allowed cost is bounded by a given threshold. We also consider the dual challenge of impair verification that assumes that the rewritings are resolved adversarially under some given cost restriction, and asks to decide if all traces of the system satisfy the specification irrespective of the rewritings. With a negative result to the impair verification problem, we study the problem of designing a minimal mask of the Kripke structure such that the resulting traces satisfy the specifications despite the thresholdbounded impairment. We dub this problem as the mask synthesis problem. This paper presents automatatheoretic solutions to repair synthesis, impair verification, and mask synthesis problem for limit superior, supremum, discountedsum, and averagesum cost semantics.more » « less

Bouajjani, A. ; Holík, L. ; Wu, Z. (Ed.)The expanding role of reinforcement learning (RL) in safetycritical system design has promoted omegaautomata as a way to express learning requirements—often nonMarkovian—with greater ease of expression and interpretation than scalar reward signals. When 𝜔automata were first proposed in modelfree RL, deterministic Rabin acceptance conditions were used in an attempt to provide a direct translation from omegaautomata to finite state “reward” machines defined over the same automaton structure (a memoryless reward translation). While these initial attempts to provide faithful, memoryless reward translations for Rabin acceptance conditions remained unsuccessful, translations were discovered for other acceptance conditions such as suitable, limitdeterministic Buechi acceptance or more generally, goodforMDP Buechi acceptance conditions. Yet, the question “whether a memoryless translation of Rabin conditions to scalar rewards exists” remained unresolved. This paper presents an impossibility result implying that any attempt to use Rabin automata directly (without extra memory) for modelfree RL is bound to fail. To establish this result, we show a link between a class of automata enabling memoryless reward translation to closure properties of its accepting and rejecting infinity sets, and to the insight that both the property and its complement need to allow for positional strategies for such an approach to work. We believe that such impossibility results will provide foundations for the application of RL to safetycritical systems.more » « less

Groote, J.F. ; Huisman, M. (Ed.)Reinforcement learning is a successful exploreandexploit approach, where a controller tries to learn how to navigate an unknown environment. The principle approach is for an intelligent agent to learn how to maximise expected rewards. But what happens if the objective refers to nonterminating systems? We can obviously not wait until an infinite amount of time has passed, assess the success, and update. But what can we do? This talk will tell.more » « less