The expanding role of reinforcement learning (RL) in safety-critical system design has promoted ω-automata as a way to express learning requirements—often non-Markovian—with greater ease of expression and interpretation than scalar reward signals. However, real-world sequential decision making situations often involve multiple, potentially conflicting, objectives. Two dominant approaches to express relative preferences over multiple objectives are: (1)weighted preference, where the decision maker provides scalar weights for various objectives, and (2)lexicographic preference, where the decision maker provides an order over the objectives such that any amount of satisfaction of a higher-ordered objective is preferable to any amount of a lower-ordered one. In this article, we study and develop RL algorithms to compute optimal strategies in Markov decision processes against multiple ω-regular objectives under weighted and lexicographic preferences. We provide a translation from multiple ω-regular objectives to a scalar reward signal that is bothfaithful(maximising reward means maximising probability of achieving the objectives under the corresponding preference) andeffective(RL quickly converges to optimal strategies). We have implemented the translations in a formal reinforcement learning tool,Mungojerrie, and we present an experimental evaluation of our technique on benchmark learning problems. 
                        more » 
                        « less   
                    
                            
                            Reinforcement Learning for Omega-Regular Specifications on Continuous-Time MDP
                        
                    
    
            Continuous-time Markov decision processes (CTMDPs) are canonical models to express sequential decision-making under dense-time and stochastic environments. When the stochastic evolution of the environment is only available via sampling, model-free reinforcement learning (RL) is the algorithm-of-choice to compute optimal decision sequence. RL, on the other hand, requires the learning objective to be encoded as scalar reward signals. Since doing such transla- tions manually is both tedious and error-prone, a number of techniques have been proposed to translate high-level objec- tives (expressed in logic or automata formalism) to scalar re- wards for discrete-time Markov decision processes. Unfortu- nately, no automatic translation exists for CTMDPs. We consider CTMDP environments against the learning objectives expressed as omega-regular languages. Omega- regular languages generalize regular languages to infinite- horizon specifications and can express properties given in popular linear-time logic LTL. To accommodate the dense- time nature of CTMDPs, we consider two different semantics of omega-regular objectives: 1) satisfaction semantics where the goal of the learner is to maximize the probability of spend- ing positive time in the good states, and 2) expectation seman- tics where the goal of the learner is to optimize the long-run expected average time spent in the “good states” of the au- tomaton. We present an approach enabling correct translation to scalar reward signals that can be readily used by off-the- shelf RL algorithms for CTMDPs. We demonstrate the effec- tiveness of the proposed algorithms by evaluating it on some popular CTMDP benchmarks with omega-regular objectives. 
        more » 
        « less   
        
    
                            - Award ID(s):
- 2146563
- PAR ID:
- 10526022
- Publisher / Repository:
- Proceedings of the Thirty-Third International Conference on Automated Planning and Scheduling (ICAPS 2023)
- Date Published:
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
- 
            
- 
            Sankaranarayanan, S.; Sharygina, N. (Ed.)Mungojerrie is an extensible tool that provides a framework to translate linear-time objectives into reward for reinforcement learning (RL). The tool provides convergent RL algorithms for stochastic games, reference implementations of existing reward translations for omega-regular objectives, and an internal probabilistic model checker for omega-regular objectives. This functionality is modular and operates on shared data structures, which enables fast development of new translation techniques. Mungojerrie supports finite models specified in PRISM and omega-automata specified in the HOA format, with an integrated command line interface to external linear temporal logic translators. Mungojerrie is distributed with a set of benchmarks for omega-regular objectives in RL.more » « less
- 
            null (Ed.)Omega-regular properties—specified using linear time temporal logic or various forms of omega-automata—find increasing use in specifying the objectives of reinforcement learning (RL). The key problem that arises is that of faithful and effective translation of the objective into a scalar reward for model-free RL. A recent approach exploits Büchi automata with restricted nondeterminism to reduce the search for an optimal policy for an Open image in new window-regular property to that for a simple reachability objective. A possible drawback of this translation is that reachability rewards are sparse, being reaped only at the end of each episode. Another approach reduces the search for an optimal policy to an optimization problem with two interdependent discount parameters. While this approach provides denser rewards than the reduction to reachability, it is not easily mapped to off-the-shelf RL algorithms. We propose a reward scheme that reduces the search for an optimal policy to an optimization problem with a single discount parameter that produces dense rewards and is compatible with off-the-shelf RL algorithms. Finally, we report an experimental comparison of these and other reward schemes for model-free RL with omega-regular objectives.more » « less
- 
            Translating Omega-Regular Specifications to Average Objectives for Model-Free Reinforcement LearningPiotr Faliszewski; Viviana Mascardi (Ed.)Recent success in reinforcement learning (RL) has brought renewed attention to the design of reward functions by which agent behavior is reinforced or deterred. Manually designing reward functions is tedious and error-prone. An alternative approach is to specify a formal, unambiguous logic requirement, which is automatically translated into a reward function to be learned from. Omega-regular languages, of which Linear Temporal Logic (LTL) is a subset, are a natural choice for specifying such requirements due to their use in verification and synthesis. However, current techniques based on omega-regular languages learn in an episodic manner whereby the environment is periodically reset to an initial state during learning. In some settings, this assumption is challenging or impossible to satisfy. Instead, in the continuing setting the agent explores the environment without resets over a single lifetime. This is a more natural setting for reasoning about omega-regular specifications defined over infinite traces of agent behavior. Optimizing the average reward instead of the usual discounted reward is more natural in this case due to the infinite-horizon objective that poses challenges to the convergence of discounted RL solutions. We restrict our attention to the omega-regular languages which correspond to absolute liveness specifications. These specifications cannot be invalidated by any finite prefix of agent behavior, in accordance with the spirit of a continuing problem. We propose a translation from absolute liveness omega-regular languages to an average reward objective for RL. Our reduction can be done on-the-fly, without full knowledge of the environment, thereby enabling the use of model-free RL algorithms. Additionally, we propose a reward structure that enables RL without episodic resetting in communicating MDPs, unlike previous approaches. We demonstrate empirically with various benchmarks that our proposed method of using average reward RL for continuing tasks defined by omega-regular specifications is more effective than competing approaches that leverage discounted RL.more » « less
- 
            Bouajjani, A.; Holík, L.; Wu, Z. (Ed.)The expanding role of reinforcement learning (RL) in safety-critical system design has promoted omega-automata as a way to express learning requirements—often non-Markovian—with greater ease of expression and interpretation than scalar reward signals. When 𝜔-automata were first proposed in model-free RL, deterministic Rabin acceptance conditions were used in an attempt to provide a direct translation from omega-automata to finite state “reward” machines defined over the same automaton structure (a memoryless reward translation). While these initial attempts to provide faithful, memoryless reward translations for Rabin acceptance conditions remained unsuccessful, translations were discovered for other acceptance conditions such as suitable, limit-deterministic Buechi acceptance or more generally, good-for-MDP Buechi acceptance conditions. Yet, the question “whether a memoryless translation of Rabin conditions to scalar rewards exists” remained unresolved. This paper presents an impossibility result implying that any attempt to use Rabin automata directly (without extra memory) for model-free RL is bound to fail. To establish this result, we show a link between a class of automata enabling memoryless reward translation to closure properties of its accepting and rejecting infinity sets, and to the insight that both the property and its complement need to allow for positional strategies for such an approach to work. We believe that such impossibility results will provide foundations for the application of RL to safety-critical systems.more » « less
 An official website of the United States government
An official website of the United States government 
				
			
 
                                    