In recent years, researchers have made significant progress in devising reinforcementlearning algorithms for optimizing linear temporal logic (LTL) objectives and LTLlike objectives.Despite these advancements, there are fundamental limitations to how well this problem can be solved. Previous studies have alluded to this fact but have not examined it in depth.In this paper, we address the tractability of reinforcement learning for general LTL objectives from a theoretical perspective.We formalize the problem under the probably approximately correct learning in Markov decision processes (PACMDP) framework, a standard framework for measuring sample complexity in reinforcement learning.In this formalization, we prove that the optimal policy for any LTL formula is PACMDPlearnable if and only if the formula is in the most limited class in the LTL hierarchy, consisting of formulas that are decidable within a finite horizon.Practically, our result implies that it is impossible for a reinforcementlearning algorithm to obtain a PACMDP guarantee on the performance of its learned policy after finitely many interactions with an unconstrained environment for LTL objectives that are not decidable within a finite horizon.
Policy Optimization with Linear Temporal Logic Constraints
We study the problem of policy optimization (PO) with linear temporal logic
(LTL) constraints. The language of LTL allows flexible description of tasks
that may be unnatural to encode as a scalar cost function. We consider
LTLconstrained PO as a systematic framework, decoupling task specification
from policy selection, and as an alternative to the standard of cost shaping.
With access to a generative model, we develop a modelbased approach that
enjoys a sample complexity analysis for guaranteeing both task satisfaction and
cost optimality (through a reduction to a reachability problem). Empirically,
our algorithm can achieve strong performance even in lowsample regimes.
more »
« less
 Award ID(s):
 1918839
 NSFPAR ID:
 10404359
 Date Published:
 Journal Name:
 36th Conference on Neural Information Processing Systems (NeurIPS 2022)
 Format(s):
 Medium: X
 Sponsoring Org:
 National Science Foundation
More Like this


Many systems are naturally modeled as Markov Decision Processes (MDPs), combining probabilities and strategic actions. Given a model of a system as an MDP and some logical specification of system behavior, the goal of synthesis is to find a policy that maximizes the probability of achieving this behavior. A popular choice for defining behaviors is Linear Temporal Logic (LTL). Policy synthesis on MDPs for properties specified in LTL has been well studied. LTL, however, is defined over infinite traces, while many properties of interest are inherently finite. Linear Temporal Logic over finite traces (LTLf ) has been used to express such properties, but no tools exist to solve policy synthesis for MDP behaviors given finitetrace properties. We present two algorithms for solving this synthesis problem: the first via reduction of LTLf to LTL and the second using native tools for LTLf . We compare the scalability of these two approaches for synthesis and show that the native approach offers better scalability compared to existing automaton generation tools for LTL.more » « less

Prior work on automatic control synthesis for cyberphysical systems under logical constraints has primarily focused on environmental disturbances or modeling uncertainties, however, the impact of deliberate and malicious attacks has been less studied. In this paper, we consider a discretetime dynamical system with a linear temporal logic (LTL) constraint in the presence of an adversary, which is modeled as a stochastic game. We assume that the adversary observes the control policy before choosing an attack strategy. We investigate two problems. In the first problem, we synthesize a robust control policy for the stochastic game that maximizes the probability of satisfying the LTL constraint. A value iteration based algorithm is proposed to compute the optimal control policy. In the second problem, we focus on a subclass of LTL constraints, which consist of an arbitrary LTL formula and an invariant constraint. We then investigate the problem of computing a control policy that minimizes the expected number of invariant constraint violations while maximizing the probability of satisfying the arbitrary LTL constraint. We characterize the optimality condition for the desired control policy. A policy iteration based algorithm is proposed to compute the control policy. We illustrate the proposed approaches using two numerical case studies.more » « less

Integrating Symbolic Planning and Reinforcement Learning for Following Temporal Logic SpecificationsTeaching a deep reinforcement learning (RL) agent to follow instructions in multitask environments is a challenging problem. We consider that user defines every task by a linear temporal logic (LTL) formula. However, some causal dependencies in complex environments may be unknown to the user in advance. Hence, when human user is specifying instructions, the robot cannot solve the tasks by simply following the given instructions. In this work, we propose a hierarchical reinforcement learning (HRL) framework in which a symbolic transition model is learned to efficiently produce highlevel plans that can guide the agent efficiently solve different tasks. Specifically, the symbolic transition model is learned by inductive logic programming (ILP) to capture logic rules of state transitions. By planning over the product of the symbolic transition model and the automaton derived from the LTL formula, the agent can resolve causal dependencies and break a causally complex problem down into a sequence of simpler lowlevel subtasks. We evaluate the proposed framework on three environments in both discrete and continuous domains, showing advantages over previous representative methods.more » « less

Although perception is an increasingly dominant portion of the overall computational cost for autonomous systems, only a fraction of the information perceived is likely to be relevant to the current task. To alleviate these perception costs, we develop a novel simultaneous perceptionâ€“action design framework wherein an agent senses only the taskrelevant information. This formulation differs from that of a partially observable Markov decision process, since the agent is free to synthesize not only its policy for action selection but also its beliefdependent observation function. The method enables the agent to balance its perception costs with those incurred by operating in its environment. To obtain a computationally tractable solution, we approximate the value function using a novel method of invariant finite belief sets, wherein the agent acts exclusively on a finite subset of the continuous belief space. We solve the approximate problem through value iteration in which a linear program is solved individually for each belief state in the set, in each iteration. Finally, we prove that the value functions, under an assumption on their structure, converge to their continuous statespace values as the sample density increases.more » « less