In recent years, researchers have made significant progress in devising reinforcementlearning algorithms for optimizing linear temporal logic (LTL) objectives and LTLlike objectives.Despite these advancements, there are fundamental limitations to how well this problem can be solved. Previous studies have alluded to this fact but have not examined it in depth.In this paper, we address the tractability of reinforcement learning for general LTL objectives from a theoretical perspective.We formalize the problem under the probably approximately correct learning in Markov decision processes (PACMDP) framework, a standard framework for measuring sample complexity in reinforcement learning.In this formalization, we prove that the optimal policy for any LTL formula is PACMDPlearnable if and only if the formula is in the most limited class in the LTL hierarchy, consisting of formulas that are decidable within a finite horizon.Practically, our result implies that it is impossible for a reinforcementlearning algorithm to obtain a PACMDP guarantee on the performance of its learned policy after finitely many interactions with an unconstrained environment for LTL objectives that are not decidable within a finite horizon.
more » « less Award ID(s):
 1918839
 NSFPAR ID:
 10404290
 Date Published:
 Journal Name:
 Proceedings of the ThirtyFirst International Joint Conference on Artificial Intelligence (IJCAI22)
 Page Range / eLocation ID:
 3650 to 3658
 Format(s):
 Medium: X
 Sponsoring Org:
 National Science Foundation
More Like this

In reinforcement learning, the classic objectives of maximizing discounted and finitehorizon cumulative rewards are PAClearnable: There are algorithms that learn a nearoptimal policy with high probability using a finite amount of samples and computation. In recent years, researchers have introduced objectives and corresponding reinforcementlearning algorithms beyond the classic cumulative rewards, such as objectives specified as linear temporal logic formulas. However, questions about the PAClearnability of these new objectives have remained open. This work demonstrates the PAClearnability of general reinforcementlearning objectives through sufficient conditions for PAClearnability in two analysis settings. In particular, for the analysis that considers only sample complexity, we prove that if an objective given as an oracle is uniformly continuous, then it is PAClearnable. Further, for the analysis that considers computational complexity, we prove that if an objective is computable, then it is PAClearnable. In other words, if a procedure computes successive approximations of the objective's value, then the objective is PAClearnable. We give three applications of our condition on objectives from the literature with previously unknown PAClearnability and prove that these objectives are PAClearnable. Overall, our result helps verify existing objectives' PAClearnability. Also, as some studied objectives that are not uniformly continuous have been shown to be not PAClearnable, our results could guide the design of new PAClearnable objectives.more » « less

Many systems are naturally modeled as Markov Decision Processes (MDPs), combining probabilities and strategic actions. Given a model of a system as an MDP and some logical specification of system behavior, the goal of synthesis is to find a policy that maximizes the probability of achieving this behavior. A popular choice for defining behaviors is Linear Temporal Logic (LTL). Policy synthesis on MDPs for properties specified in LTL has been well studied. LTL, however, is defined over infinite traces, while many properties of interest are inherently finite. Linear Temporal Logic over finite traces (LTLf ) has been used to express such properties, but no tools exist to solve policy synthesis for MDP behaviors given finitetrace properties. We present two algorithms for solving this synthesis problem: the first via reduction of LTLf to LTL and the second using native tools for LTLf . We compare the scalability of these two approaches for synthesis and show that the native approach offers better scalability compared to existing automaton generation tools for LTL.more » « less

Abstract Supervised machine learning via artificial neural network (ANN) has gained significant popularity for many geomechanics applications that involves multi‐phase flow and poromechanics. For unsaturated poromechanics problems, the multi‐physics nature and the complexity of the hydraulic laws make it difficult to design the optimal setup, architecture, and hyper‐parameters of the deep neural networks. This paper presents a meta‐modeling approach that utilizes deep reinforcement learning (DRL) to automatically discover optimal neural network settings that maximize a pre‐defined performance metric for the machine learning constitutive laws. This meta‐modeling framework is cast as a Markov Decision Process (MDP) with well‐defined states (subsets of states representing the proposed neural network (NN) settings), actions, and rewards. Following the selection rules, the artificial intelligence (AI) agent, represented in DRL via NN, self‐learns from taking a sequence of actions and receiving feedback signals (rewards) within the selection environment. By utilizing the Monte Carlo Tree Search (MCTS) to update the policy/value networks, the AI agent replaces the human modeler to handle the otherwise time‐consuming trial‐and‐error process that leads to the optimized choices of setup from a high‐dimensional parametric space. This approach is applied to generate two key constitutive laws for the unsaturated poromechanics problems: (1) the path‐dependent retention curve with distinctive wetting and drying paths. (2) The flow in the micropores, governed by an anisotropic permeability tensor. Numerical experiments have shown that the resultant ML‐generated material models can be integrated into a finite element (FE) solver to solve initial‐boundary‐value problems as replacements of the hand‐craft constitutive laws.

In this work, we consider the popular treebased search strategy within the framework of reinforcement learning, the Monte Carlo tree search (MCTS), in the context of the infinitehorizon discounted cost Markov decision process (MDP). Although MCTS is believed to provide an approximate value function for a given state with enough simulations, the claimed proof of this property is incomplete. This is because the variant of MCTS, the upper confidence bound for trees (UCT), analyzed in prior works, uses “logarithmic” bonus term for balancing exploration and exploitation within the treebased search, following the insights from stochastic multiarm bandit (MAB) literature. In effect, such an approach assumes that the regret of the underlying recursively dependent nonstationary MABs concentrates around their mean exponentially in the number of steps, which is unlikely to hold, even for stationary MABs. As the key contribution of this work, we establish polynomial concentration property of regret for a class of nonstationary MABs. This in turn establishes that the MCTS with appropriate polynomial rather than logarithmic bonus term in UCB has a claimed property. Interestingly enough, empirically successful approaches use a similar polynomial form of MCTS as suggested by our result. Using this as a building block, we argue that MCTS, combined with nearest neighbor supervised learning, acts as a “policy improvement” operator; that is, it iteratively improves value function approximation for all states because of combining with supervised learning, despite evaluating at only finitely many states. In effect, we establish that to learn an ε approximation of the value function with respect to [Formula: see text] norm, MCTS combined with nearest neighbor requires a sample size scaling as [Formula: see text], where d is the dimension of the state space. This is nearly optimal because of a minimax lower bound of [Formula: see text], suggesting the strength of the variant of MCTS we propose here and our resulting analysis.more » « less

Integrating Symbolic Planning and Reinforcement Learning for Following Temporal Logic SpecificationsTeaching a deep reinforcement learning (RL) agent to follow instructions in multitask environments is a challenging problem. We consider that user defines every task by a linear temporal logic (LTL) formula. However, some causal dependencies in complex environments may be unknown to the user in advance. Hence, when human user is specifying instructions, the robot cannot solve the tasks by simply following the given instructions. In this work, we propose a hierarchical reinforcement learning (HRL) framework in which a symbolic transition model is learned to efficiently produce highlevel plans that can guide the agent efficiently solve different tasks. Specifically, the symbolic transition model is learned by inductive logic programming (ILP) to capture logic rules of state transitions. By planning over the product of the symbolic transition model and the automaton derived from the LTL formula, the agent can resolve causal dependencies and break a causally complex problem down into a sequence of simpler lowlevel subtasks. We evaluate the proposed framework on three environments in both discrete and continuous domains, showing advantages over previous representative methods.more » « less