We study model-free reinforcement learning (RL) algorithms for infinite-horizon average-reward Markov decision processes (MDPs), which are more appropriate for applications involving continuing operations not divided into episodes. In contrast to episodic/discounted MDPs, the theoretical understanding of model-free RL algorithms is relatively inadequate for the average-reward setting. In this paper, we consider both the online setting and the setting with access to a simulator. We develop computationally efficient model-free algorithms that achieve sharper guarantees on regret/sample complexity than existing results. In the online setting, we design an algorithm, UCB-AVG, based on an optimistic variant of variance-reduced Q-learning. We show that UCB-AVG achieves a regret bound $\widetilde{O}(S^5A^2\,sp(h^*)\sqrt{T})$ after $T$ steps, where $S\times A$ is the size of the state-action space and $sp(h^*)$ is the span of the optimal bias function. Our result provides the first computationally efficient model-free algorithm that achieves the optimal dependence on $T$ (up to log factors) for weakly communicating MDPs, which is necessary for low regret. In contrast, prior results are either suboptimal in $T$ or require strong assumptions of ergodicity or uniform mixing of MDPs. In the simulator setting, we adapt the idea of UCB-AVG to develop a model-free algorithm that finds an $\epsilon$-optimal policy with sample complexity $\widetilde{O}(SA\,sp^2(h^*)\epsilon^{-2} + S^2A\,sp(h^*)\epsilon^{-1})$. This sample complexity is near-optimal for weakly communicating MDPs, in view of the minimax lower bound $\Omega(SA\,sp(h^*)\epsilon^{-2})$. Existing work mainly focuses on ergodic MDPs, and the results typically depend on $t_{mix}$, the worst-case mixing time induced by a policy. We remark that the diameter $D$ and mixing time $t_{mix}$ are both lower bounded by $sp(h^*)$, and $t_{mix}$ can be arbitrarily large for certain MDPs.
On the technical side, our approach integrates two key ideas: learning a $\gamma$-discounted MDP as an approximation, and leveraging reference-advantage decomposition for variance reduction in optimistic Q-learning. As recognized in prior work, a naive approximation by discounted MDPs results in suboptimal guarantees. A distinguishing feature of our method is maintaining estimates of the value difference between state pairs to provide a sharper bound on the variance of the reference advantage. We also crucially use a careful choice of the discount factor $\gamma$ to balance the approximation error due to discounting against the statistical learning error, and we are able to maintain a good-quality reference value function with $O(SA)$ space complexity.
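To make the reference-advantage idea concrete, here is a minimal tabular sketch of one optimistic Q-learning update with a reference value function. It is illustrative only, not the paper's exact algorithm: the bonus form and the constant `c` are assumptions, and the value target splits $V(s')$ into a reference part (estimated by a running average over all visits of $(s,a)$) plus a small advantage part whose variance shrinks as $V$ approaches $V_{\mathrm{ref}}$.

```python
import numpy as np

def ra_q_update(Q, V, V_ref, ref_mean, counts, s, a, r, s_next, gamma, c=1.0):
    """One reference-advantage Q-learning update (illustrative sketch).

    ref_mean[s, a] is a running estimate of E[V_ref(s') | s, a], averaged
    over all visits of (s, a), so its statistical error decays quickly;
    the remaining advantage term V(s') - V_ref(s') has low variance once
    V is close to V_ref.
    """
    counts[s, a] += 1
    n = counts[s, a]
    alpha = 1.0 / n                                # visit-based step size
    ref_mean[s, a] += (V_ref[s_next] - ref_mean[s, a]) / n
    bonus = c * np.sqrt(np.log(n + 1) / n)         # optimism bonus (assumed form)
    advantage = V[s_next] - V_ref[s_next]
    target = r + gamma * (advantage + ref_mean[s, a]) + bonus
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target
    V[s] = Q[s].max()                              # greedy value estimate
```

In the paper's setting the reference function is itself learned and fixed once accurate enough, which is what permits the $O(SA)$ space complexity noted above.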
Translating Omega-Regular Specifications to Average Objectives for Model-Free Reinforcement Learning
Recent success in reinforcement learning (RL) has brought renewed attention to the design of reward functions by which agent behavior is reinforced or deterred. Manually designing reward functions is tedious and error-prone. An alternative approach is to specify a formal, unambiguous logical requirement, which is automatically translated into a reward function to be learned from. Omega-regular languages, of which Linear Temporal Logic (LTL) is a subset, are a natural choice for specifying such requirements due to their use in verification and synthesis. However, current techniques based on omega-regular languages learn in an episodic manner, whereby the environment is periodically reset to an initial state during learning. In some settings, this assumption is challenging or impossible to satisfy. Instead, in the continuing setting the agent explores the environment without resets over a single lifetime. This is a more natural setting for reasoning about omega-regular specifications defined over infinite traces of agent behavior. Optimizing the average reward instead of the usual discounted reward is more natural in this case due to the infinite-horizon objective, which poses challenges to the convergence of discounted RL solutions.
We restrict our attention to the omega-regular languages which correspond to absolute liveness specifications. These specifications cannot be invalidated by any finite prefix of agent behavior, in accordance with the spirit of a continuing problem. We propose a translation from absolute liveness omega-regular languages to an average-reward objective for RL. Our reduction can be done on-the-fly, without full knowledge of the environment, thereby enabling the use of model-free RL algorithms. Additionally, we propose a reward structure that enables RL without episodic resetting in communicating MDPs, unlike previous approaches. We demonstrate empirically on various benchmarks that our proposed method of using average-reward RL for continuing tasks defined by omega-regular specifications is more effective than competing approaches that leverage discounted RL.
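The on-the-fly flavor of such a reduction can be sketched as a small product construction: the learner tracks an automaton state alongside the environment and receives reward 1 whenever an accepting automaton state is entered, so maximizing the average reward maximizes the long-run frequency of acceptance. This is a simplified sketch with a deterministic automaton; the actual translation handles the Büchi automata appropriate for absolute liveness languages.

```python
def average_reward_wrapper(delta, accepting, q0):
    """On-the-fly product with a deterministic automaton (illustrative).

    delta: dict mapping (automaton_state, label) -> next automaton_state
    accepting: set of automaton states that yield reward 1.0 on entry

    No episodic resets are needed: the wrapper simply advances the
    automaton state on every environment label over a single lifetime.
    """
    state = {"q": q0}
    def reward_for(label):
        state["q"] = delta[(state["q"], label)]
        return 1.0 if state["q"] in accepting else 0.0
    return reward_for
```

For example, with an automaton for "infinitely often a" (two states, entering the accepting state on label `a`), the average reward of a trace equals the limiting frequency of `a`.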
NSF-PAR ID:
10329431
Editor(s):
Piotr Faliszewski; Viviana Mascardi
Date Published:
Journal Name:
Proc. of the 21st International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2022)
Page Range / eLocation ID:
732–741
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this


Omega-regular properties, specified using linear-time temporal logic or various forms of omega-automata, find increasing use in specifying the objectives of reinforcement learning (RL). The key problem that arises is that of faithful and effective translation of the objective into a scalar reward for model-free RL. A recent approach exploits Büchi automata with restricted nondeterminism to reduce the search for an optimal policy for an omega-regular property to that for a simple reachability objective. A possible drawback of this translation is that reachability rewards are sparse, being reaped only at the end of each episode. Another approach reduces the search for an optimal policy to an optimization problem with two interdependent discount parameters. While this approach provides denser rewards than the reduction to reachability, it is not easily mapped to off-the-shelf RL algorithms. We propose a reward scheme that reduces the search for an optimal policy to an optimization problem with a single discount parameter, produces dense rewards, and is compatible with off-the-shelf RL algorithms. Finally, we report an experimental comparison of these and other reward schemes for model-free RL with omega-regular objectives.
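To illustrate what a single-parameter dense scheme can look like (this is a hedged sketch, not necessarily the exact scheme of the cited work): accepting edges yield reward 1 − γ and apply discount γ, while all other edges yield reward 0 and are undiscounted, so the return of a run approaches 1 exactly when accepting edges occur infinitely often.

```python
def dense_reward(accepting_edge, gamma):
    """Illustrative single-parameter dense scheme: returns a pair
    (reward at this step, discount applied at this step)."""
    if accepting_edge:
        return 1.0 - gamma, gamma
    return 0.0, 1.0

def run_return(accepting_flags, gamma):
    """Total reward of a finite run prefix: tends to 1 when accepting
    edges keep occurring, and stays 0 if none occur."""
    total, disc = 0.0, 1.0
    for acc in accepting_flags:
        r, g = dense_reward(acc, gamma)
        total += disc * r
        disc *= g
    return total
```

Because reward is collected on every accepting edge rather than once per episode, the signal is dense and a standard off-the-shelf learner can consume it unchanged.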

Bouajjani, A.; Holík, L.; Wu, Z. (Eds.) This paper presents an optimization-based framework to automate system repair against omega-regular properties. In the proposed formalization of optimal repair, systems are represented as Kripke structures, properties as omega-regular languages, and the repair space as repair machines (weighted omega-regular transducers equipped with Büchi conditions) that rewrite strings and associate a cost sequence with these rewritings. To translate the resulting cost sequences into easily interpretable payoffs, we consider several aggregator functions that map cost sequences to numbers, including limit superior, supremum, discounted-sum, and average-sum, to define quantitative cost semantics. The problem of optimal repair, then, is to determine whether traces from a given system can be rewritten to satisfy an omega-regular property when the allowed cost is bounded by a given threshold. We also consider the dual challenge of impair verification, which assumes that the rewritings are resolved adversarially under a given cost restriction and asks whether all traces of the system satisfy the specification irrespective of the rewritings. Following a negative result for the impair verification problem, we study the problem of designing a minimal mask of the Kripke structure such that the resulting traces satisfy the specifications despite the threshold-bounded impairment. We dub this the mask synthesis problem. This paper presents automata-theoretic solutions to the repair synthesis, impair verification, and mask synthesis problems for the limit superior, supremum, discounted-sum, and average-sum cost semantics.
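The four aggregator semantics named above can be sketched on finite cost-sequence prefixes as follows. This is illustrative only: the actual semantics are defined over infinite sequences, and the limsup case assumes an eventually periodic sequence whose tail the prefix already exhibits.

```python
def aggregate(costs, kind, gamma=0.9):
    """Finite-prefix approximations of the aggregator functions
    (limit superior, supremum, discounted-sum, average-sum)."""
    if kind == "supremum":
        return max(costs)
    if kind == "limsup":
        # for an eventually periodic sequence, limsup = max over a tail
        return max(costs[len(costs) // 2:])
    if kind == "discounted-sum":
        return sum(c * gamma ** i for i, c in enumerate(costs))
    if kind == "average-sum":
        return sum(costs) / len(costs)
    raise ValueError(f"unknown aggregator: {kind}")
```

For instance, on the sequence 3, 1, 2, 1, 2, 1, 2, 1 the supremum is 3 while the limit superior is 2, since the initial cost 3 never recurs.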

Sankaranarayanan, S.; Sharygina, N. (Eds.) Mungojerrie is an extensible tool that provides a framework to translate linear-time objectives into rewards for reinforcement learning (RL). The tool provides convergent RL algorithms for stochastic games, reference implementations of existing reward translations for omega-regular objectives, and an internal probabilistic model checker for omega-regular objectives. This functionality is modular and operates on shared data structures, which enables fast development of new translation techniques. Mungojerrie supports finite models specified in PRISM and omega-automata specified in the HOA format, with an integrated command-line interface to external linear temporal logic translators. Mungojerrie is distributed with a set of benchmarks for omega-regular objectives in RL.

Recent reinforcement learning (RL) approaches have shown strong performance in complex domains such as Atari games, but are often highly sample-inefficient. A common approach to reducing interaction time with the environment is reward shaping, which involves carefully designing reward functions that provide the agent intermediate rewards for progress towards the goal. However, designing appropriate shaping rewards is known to be difficult as well as time-consuming. In this work, we address this problem by using natural language instructions to perform reward shaping. We propose the LanguagE-Action Reward Network (LEARN), a framework that maps free-form natural language instructions to intermediate rewards based on actions taken by the agent. These intermediate language-based rewards can seamlessly be integrated into any standard reinforcement learning algorithm. We experiment with Montezuma's Revenge from the Atari Learning Environment, a popular benchmark in RL. Our experiments on a diverse set of 15 tasks demonstrate that, for the same number of interactions with the environment, language-based rewards lead to successful completion of the task 60% more often on average, compared with learning without language.
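The "seamless integration" claim amounts to folding the intermediate language reward into the learning target of any standard algorithm. A minimal tabular sketch (LEARN itself produces `r_lang` with a trained network over instructions and agent actions, and the mixing weight `lam` is an assumption for illustration, not a parameter from the paper):

```python
import numpy as np

def shaped_q_update(Q, s, a, r_env, r_lang, s_next,
                    alpha=0.1, gamma=0.99, lam=0.5):
    """One tabular Q-learning step with an auxiliary language-based
    intermediate reward added to the environment reward."""
    target = (r_env + lam * r_lang) + gamma * Q[s_next].max()
    Q[s, a] += alpha * (target - Q[s, a])
    return Q[s, a]
```

Any learner that consumes a scalar reward (DQN, actor-critic, and so on) can be shaped the same way by replacing `r_env` with the mixed signal.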