In recent years, researchers have made significant progress in devising reinforcement-learning algorithms for optimizing linear temporal logic (LTL) objectives and LTL-like objectives. Despite these advancements, there are fundamental limitations to how well this problem can be solved. Previous studies have alluded to this fact but have not examined it in depth. In this paper, we address the tractability of reinforcement learning for general LTL objectives from a theoretical perspective. We formalize the problem under the probably approximately correct learning in Markov decision processes (PAC-MDP) framework, a standard framework for measuring sample complexity in reinforcement learning. In this formalization, we prove that the optimal policy for any LTL formula is PAC-MDP-learnable if and only if the formula is in the most limited class in the LTL hierarchy, consisting of formulas that are decidable within a finite horizon. Practically, our result implies that for LTL objectives that are not decidable within a finite horizon, it is impossible for a reinforcement-learning algorithm to obtain a PAC-MDP guarantee on the performance of its learned policy after finitely many interactions with an unconstrained environment.
Computably Continuous Reinforcement-Learning Objectives are PAC-learnable
In reinforcement learning, the classic objectives of maximizing discounted
and finite-horizon cumulative rewards are PAC-learnable: there are algorithms
that learn a near-optimal policy with high probability using a finite number
of samples and a finite amount of computation. In recent years, researchers
have introduced objectives and corresponding reinforcement-learning
algorithms beyond the classic cumulative rewards, such as objectives
specified as linear temporal logic formulas. However, questions about the
PAC-learnability of these new objectives have remained open.
This work demonstrates the PAC-learnability of general reinforcement-learning
objectives through sufficient conditions for PAC-learnability in two analysis
settings. In particular, for the analysis that considers only sample
complexity, we prove that if an objective given as an oracle is uniformly
continuous, then it is PAC-learnable. Further, for the analysis that considers
computational complexity, we prove that if an objective is computable, then it
is PAC-learnable. In other words, if a procedure computes successive
approximations of the objective's value, then the objective is PAC-learnable.
We apply our conditions to three objectives from the literature whose
PAC-learnability was previously unknown and prove that these objectives are
PAC-learnable. Overall, our results help verify the PAC-learnability of
existing objectives. Also, as some studied objectives that are not uniformly
continuous have been shown to be not PAC-learnable, our results could guide
the design of new PAC-learnable objectives.
 Award ID(s):
 1918839
 NSF-PAR ID:
 10404354
 Date Published:
 Journal Name:
 arXiv.org
 ISSN:
 2331-8422
 Format(s):
 Medium: X
 Sponsoring Org:
 National Science Foundation
More Like this


We consider the question of speeding up classic graph algorithms with machine-learned predictions. In this model, algorithms are furnished with extra advice learned from past or similar instances. Given the additional information, we aim to improve upon the traditional worst-case runtime guarantees. Our contributions are the following: (i) We give a faster algorithm for minimum-weight bipartite matching via learned duals, improving the recent result by Dinitz, Im, Lavastida, Moseley and Vassilvitskii (NeurIPS, 2021); (ii) We extend the learned-dual approach to the single-source shortest path problem (with negative edge lengths), achieving an almost linear runtime given sufficiently accurate predictions, which improves upon the classic fastest algorithm due to Goldberg (SIAM J. Comput., 1995); (iii) We provide a general reduction-based framework for learning-based graph algorithms, leading to new algorithms for degree-constrained subgraph and minimum-cost 0-1 flow, based on reductions to bipartite matching and the shortest path problem. Finally, we give a set of general learnability theorems, showing that the predictions required by our algorithms can be efficiently learned in a PAC fashion.
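The learned-dual idea for shortest paths builds on classic potential reweighting: if predicted potentials y are feasible, every reduced edge cost w(u, v) + y[u] - y[v] is nonnegative, so Dijkstra applies even with negative original weights, and true distances are recovered afterwards. The sketch below shows only this reweighting step with perfect predictions; the graph, names, and the bare assertion standing in for a repair step are hypothetical (the actual algorithms also handle inaccurate predictions):

```python
# Sketch of dual/potential reweighting with predicted potentials.
import heapq

def dijkstra(adj, src):
    """Standard Dijkstra over a dict {u: [(v, w), ...]} with w >= 0."""
    dist = {v: float("inf") for v in adj}
    dist[src] = 0.0
    pq = [(0.0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist[u]:
            continue
        for v, w in adj[u]:
            if d + w < dist[v]:
                dist[v] = d + w
                heapq.heappush(pq, (dist[v], v))
    return dist

def shortest_paths_with_potentials(adj, y, src):
    # Reweight every edge; feasible predictions make all weights >= 0.
    reduced = {u: [(v, w + y[u] - y[v]) for v, w in nbrs]
               for u, nbrs in adj.items()}
    assert all(w >= 0 for nbrs in reduced.values() for _, w in nbrs), \
        "infeasible predictions would require a repair step"
    d = dijkstra(reduced, src)
    # Undo the reweighting to recover distances in the original graph.
    return {v: d[v] - y[src] + y[v] for v in adj}

# Tiny hypothetical instance with one negative edge; using the true
# distances as "perfect predictions" of the potentials.
adj = {"s": [("a", 2), ("b", 5)], "a": [("b", -3)], "b": []}
y = {"s": 0, "a": 2, "b": -1}
dist = shortest_paths_with_potentials(adj, y, "s")
```

With perfect potentials all reduced costs collapse to zero along shortest paths, which is why accurate predictions translate into near-Dijkstra running times.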

How quickly can a given class of concepts be learned from examples? It is common to measure the performance of a supervised machine-learning algorithm by plotting its “learning curve”, that is, the decay of the error rate as a function of the number of training examples. However, the classical theoretical framework for understanding learnability, the PAC model of Vapnik-Chervonenkis and Valiant, does not explain the behavior of learning curves: the distribution-free PAC model of learning can only bound the upper envelope of the learning curves over all possible data distributions. This does not match the practice of machine learning, where the data source is typically fixed in any given scenario, while the learner may choose the number of training examples on the basis of factors such as computational resources and desired accuracy. In this paper, we study an alternative learning model that better captures such practical aspects of machine learning but still gives rise to a complete theory of the learnable in the spirit of the PAC model. More precisely, we consider the problem of universal learning, which aims to understand the performance of learning algorithms on every data distribution, but without requiring uniformity over the distribution. The main result of this paper is a remarkable trichotomy: there are only three possible rates of universal learning. More precisely, we show that the learning curves of any given concept class decay at either an exponential, linear, or arbitrarily slow rate. Moreover, each of these cases is completely characterized by appropriate combinatorial parameters, and we exhibit optimal learning algorithms that achieve the best possible rate in each case. For concreteness, we consider in this paper only the realizable case, though analogous results are expected to extend to more general learning scenarios.
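The notion of a learning curve over a single fixed distribution can be illustrated with a small simulation: an empirical-risk-minimizing learner for thresholds on [0, 1] under the uniform distribution, with error averaged over repeated trials at each sample size. The class, distribution, and parameters below are illustrative only, and no claim is made about which of the three universal rates this particular example exhibits:

```python
# Sketch: empirically estimating a learning curve for one fixed
# distribution (the universal-learning view), rather than the
# distribution-free worst case.
import random

def erm_threshold(sample):
    """Smallest threshold consistent with the labeled points (realizable)."""
    positives = [x for x, y in sample if y == 1]
    return min(positives) if positives else 1.0

def error_rate(learned, true_threshold):
    # Under the uniform distribution on [0, 1], the error region is the
    # interval between the learned and true thresholds.
    return abs(learned - true_threshold)

def learning_curve(ns, true_threshold=0.5, trials=200, seed=0):
    rng = random.Random(seed)
    curve = []
    for n in ns:
        errs = []
        for _ in range(trials):
            sample = [(x, int(x >= true_threshold))
                      for x in (rng.random() for _ in range(n))]
            errs.append(error_rate(erm_threshold(sample), true_threshold))
        curve.append(sum(errs) / trials)
    return curve

curve = learning_curve([10, 40, 160])
```

Plotting `curve` against the sample sizes gives exactly the per-distribution decay that the trichotomy classifies into exponential, linear, or arbitrarily slow rates.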

Omega-regular properties, specified using linear-time temporal logic or various forms of omega-automata, find increasing use in specifying the objectives of reinforcement learning (RL). The key problem that arises is that of faithful and effective translation of the objective into a scalar reward for model-free RL. A recent approach exploits Büchi automata with restricted nondeterminism to reduce the search for an optimal policy for an omega-regular property to that for a simple reachability objective. A possible drawback of this translation is that reachability rewards are sparse, being reaped only at the end of each episode. Another approach reduces the search for an optimal policy to an optimization problem with two interdependent discount parameters. While this approach provides denser rewards than the reduction to reachability, it is not easily mapped to off-the-shelf RL algorithms. We propose a reward scheme that reduces the search for an optimal policy to an optimization problem with a single discount parameter that produces dense rewards and is compatible with off-the-shelf RL algorithms. Finally, we report an experimental comparison of these and other reward schemes for model-free RL with omega-regular objectives.
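One generic way to see how Büchi-style acceptance can be folded into a scalar reward with a single discount parameter is the following illustration (a sketch of the general idea, not the paper's exact construction): pay reward 1 - gamma and apply discount gamma only on accepting steps, so a run's return is 1 - gamma^k after k accepting visits and tends to 1 exactly when accepting steps occur infinitely often. All names and the example runs are hypothetical:

```python
# Illustration: translating acceptance of a run into a discounted return
# with one discount parameter.
def discounted_return(accepting_flags, gamma):
    """Return of a run under the illustrative scheme: reward 1 - gamma and
    discount gamma on accepting steps; reward 0 and no decay elsewhere."""
    value, discount = 0.0, 1.0
    for accepting in accepting_flags:
        if accepting:
            value += discount * (1 - gamma)
            discount *= gamma
    return value

gamma = 0.99
always_accepting = [True] * 2000                        # value 1 - gamma**2000
rarely_accepting = [i % 100 == 0 for i in range(2000)]  # 20 accepting visits
never_accepting = [False] * 2000                        # value exactly 0
v_always = discounted_return(always_accepting, gamma)
v_rare = discounted_return(rarely_accepting, gamma)
v_never = discounted_return(never_accepting, gamma)
```

Because reward accrues on every accepting visit rather than only at episode end, this style of scheme yields denser feedback than a plain reachability reduction.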

About 25 years ago, it came to light that a single combinatorial property determines an important dividing line in both model theory (NIP) and machine learning (PAC-learnability). The following years saw a fruitful exchange of ideas between PAC-learning and the model theory of NIP structures. In this article, we point out a new and similar connection between model theory and machine learning, this time developing a correspondence between stability and learnability in various settings of online learning. In particular, this gives many new examples of mathematically interesting classes which are learnable in the online setting.
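A simple finite instance of the online-learning side of such correspondences is the mistake-bound model: for a finite concept class, the classic Halving algorithm predicts by majority vote of the still-consistent concepts and makes at most log2(|H|) mistakes on any realizable sequence. The concept class (thresholds on a small integer domain) and the example stream below are hypothetical illustrations:

```python
# Sketch: the Halving algorithm in the online mistake-bound model.
import math

def halving_run(hypotheses, stream, target):
    """Run Halving on a stream labeled by `target`; return the mistake count."""
    version_space = list(hypotheses)
    mistakes = 0
    for x in stream:
        # Predict by majority vote of the consistent hypotheses.
        votes = sum(h(x) for h in version_space)
        prediction = 1 if 2 * votes > len(version_space) else 0
        truth = target(x)
        if prediction != truth:
            mistakes += 1
        # Keep only hypotheses consistent with the revealed label.
        version_space = [h for h in version_space if h(x) == truth]
    return mistakes

# Hypothetical finite class: thresholds t on {0, ..., 16}.
hypotheses = [lambda x, t=t: int(x >= t) for t in range(17)]
target = hypotheses[9]
stream = list(range(16)) * 2
mistakes = halving_run(hypotheses, stream, target)
bound = math.floor(math.log2(len(hypotheses)))  # Halving: at most log2|H|
```

Each mistake at least halves the version space, which is the counting argument behind the log2(|H|) bound; infinite classes require the richer stability-style parameters the article discusses.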