
Title: PAC-Bayes control: learning policies that provably generalize to novel environments

Our goal is to learn control policies for robots that provably generalize well to novel environments given a dataset of example environments. The key technical idea behind our approach is to leverage tools from generalization theory in machine learning by exploiting a precise analogy (which we present in the form of a reduction) between generalization of control policies to novel environments and generalization of hypotheses in the supervised learning setting. In particular, we utilize the probably approximately correct (PAC)-Bayes framework, which allows us to obtain upper bounds that hold with high probability on the expected cost of (stochastic) control policies across novel environments. We propose policy learning algorithms that explicitly seek to minimize this upper bound. The corresponding optimization problem can be solved using convex optimization (relative entropy programming in particular) in the setting where we are optimizing over a finite policy space. In the more general setting of continuously parameterized policies (e.g., neural network policies), we minimize this upper bound using stochastic gradient descent. We present simulated results of our approach applied to learning (1) reactive obstacle avoidance policies and (2) neural network-based grasping policies. We also present hardware results for the Parrot Swing drone navigating through different obstacle environments. Our examples demonstrate the potential of our approach to provide strong generalization guarantees for robotic systems with continuous state and action spaces, complicated (e.g., nonlinear) dynamics, rich sensory inputs (e.g., depth images), and neural network-based policies.
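For reference, and not necessarily in the exact form used in the paper, a standard PAC-Bayes bound of the McAllester/Maurer type reads as follows: for costs taking values in [0, 1], a data-independent "prior" distribution P_0 over policies, and any "posterior" distribution P, with probability at least 1 - \delta over the draw of the N training environments,

\[ C(P) \;\le\; \hat{C}_S(P) \;+\; \sqrt{\frac{\mathrm{KL}(P \,\|\, P_0) + \ln\!\left(\tfrac{2\sqrt{N}}{\delta}\right)}{2N}}, \]

where C(P) is the expected cost of the stochastic policy on a novel environment and \hat{C}_S(P) is its empirical cost on the training environments. Minimizing the right-hand side over P is the kind of upper-bound minimization described above: over a finite policy space the trade-off between empirical cost and the KL term can be posed as a relative entropy program, and for continuously parameterized policy distributions it can be attacked with stochastic gradient descent.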

Authors:
Publication Date:
NSF-PAR ID:
10196346
Journal Name:
The International Journal of Robotics Research
Volume:
40
Issue:
2-3
Page Range or eLocation-ID:
p. 574-593
ISSN:
0278-3649
Publisher:
SAGE Publications
Sponsoring Org:
National Science Foundation
More Like this
  1. We consider the periodic review dynamic pricing and inventory control problem with fixed ordering cost. Demand is random and price dependent, and unsatisfied demand is backlogged. With complete demand information, the celebrated (s, S, p) policy is proved to be optimal, where s and S are the reorder point and order-up-to level for the ordering strategy, and p, a function of the on-hand inventory level, characterizes the pricing strategy. In this paper, we consider incomplete demand information and develop online learning algorithms whose average profit approaches that of the optimal (s, S, p) policy with a tight [Formula: see text] regret rate. A number of salient features differentiate our work from the existing online learning research in the operations management (OM) literature. First, computing the optimal (s, S, p) policy requires solving a dynamic program (DP) over multiple periods involving unknown quantities, which is different from the majority of learning problems in OM that only require solving single-period optimization problems. It is hence challenging to establish stability results through DP recursions, which we accomplish by proving uniform convergence of the profit-to-go function. The necessity of analyzing action-dependent state transitions over multiple periods resembles the reinforcement learning setting and is considerably more difficult than existing bandit learning problems. Second, the pricing function p is infinite-dimensional, and learning it is much more challenging than learning a finite number of parameters, as in existing research. The demand-price relationship is estimated based on an upper confidence bound, but the confidence interval cannot be explicitly calculated due to the complexity of the DP recursion. Finally, because of the multiperiod nature of (s, S, p) policies, the actual distribution of the randomness in demand plays an important role in determining the optimal pricing strategy p, which is unknown to the learner a priori. In this paper, the demand randomness is approximated by an empirical distribution constructed using dependent samples, and a novel Wasserstein metric-based argument is employed to prove convergence of the empirical distribution. This paper was accepted by J. George Shanthikumar, big data analytics. (A minimal illustrative sketch of a single-period (s, S, p) decision appears after this list.)
  2. Wallach, H (Ed.)
    We study the problem of programmatic reinforcement learning, in which policies are represented as short programs in a symbolic language. Programmatic policies can be more interpretable, generalizable, and amenable to formal verification than neural policies; however, designing rigorous learning approaches for such policies remains a challenge. Our approach to this challenge, a meta-algorithm called PROPEL, is based on three insights. First, we view our learning task as optimization in policy space, modulo the constraint that the desired policy has a programmatic representation, and solve this optimization problem using a form of mirror descent that takes a gradient step into the unconstrained policy space and then projects back onto the constrained space. Second, we view the unconstrained policy space as mixing neural and programmatic representations, which enables employing state-of-the-art deep policy gradient approaches. Third, we cast the projection step as program synthesis via imitation learning, and exploit contemporary combinatorial methods for this task. We present theoretical convergence results for PROPEL and empirically evaluate the approach in three continuous control domains. The experiments show that PROPEL can significantly outperform state-of-the-art approaches for learning programmatic policies. (A toy sketch of this gradient-step-then-project structure appears after this list.)
  3. We present a sampling-based framework for feedback motion planning of legged robots. Our framework is based on switching between limit cycles at a fixed instant of the motion, the Poincaré section (e.g., apex or touchdown), by finding overlaps between the regions of attraction (ROAs) of two limit cycles. First, we assume a candidate orbital Lyapunov function (OLF) and define an ROA at the Poincaré section. Next, we solve multiple trajectory optimization problems, one for each sampled initial condition on the ROA, to minimize an energy metric subject to exponential convergence of the OLF between two steps. The result is a table of control actions and the corresponding initial conditions at the Poincaré section. Then we develop a control policy for each control action as a function of the initial condition using deep neural networks. The control policy is validated by testing on initial conditions sampled on the ROAs of randomly chosen limit cycles. Finally, the rapidly exploring random tree (RRT) algorithm is adopted to plan transitions between the limit cycles using the ROAs. The approach is demonstrated on a hopper model to achieve velocity and height transitions between steps. (A small sketch of fitting a policy to such a table appears after this list.)
  4. We study multi-agent reinforcement learning (MARL) in a stochastic network of agents. The objective is to find localized policies that maximize the (discounted) global reward. In general, scalability is a challenge in this setting because the size of the global state/action space can be exponential in the number of agents. Scalable algorithms are only known in cases where dependencies are static, fixed, and local, e.g., between neighbors in a fixed, time-invariant underlying graph. In this work, we propose a Scalable Actor Critic framework that applies in settings where the dependencies can be non-local and stochastic, and provide a finite-time error bound that shows how the convergence rate depends on the speed of information spread in the network. Additionally, as a byproduct of our analysis, we obtain novel finite-time convergence results for a general stochastic approximation scheme and for temporal difference learning with state aggregation, which apply beyond the setting of MARL in networked systems.
  5. This work concerns the asymptotic behavior of solutions to a (strictly) subcritical fluid model for a data communication network, where file sizes are generally distributed and the network operates under a fair bandwidth-sharing policy. Here we consider fair bandwidth-sharing policies that are a slight generalization of the α-fair policies introduced by Mo and Walrand [Mo J, Walrand J (2000) Fair end-to-end window-based congestion control. IEEE/ACM Trans. Networks 8(5):556–567.] (the basic α-fair allocation is recalled in the sketch following this list). Since the year 2000, it has been a standing problem to prove stability of the data communication network model of Massoulié and Roberts [Massoulié L, Roberts J (2000) Bandwidth sharing and admission control for elastic traffic. Telecommunication Systems 15(1):185–201.], with general file sizes and operating under fair bandwidth-sharing policies, when the offered load is less than capacity (subcritical conditions). A crucial step in an approach to this problem is to prove stability of subcritical fluid model solutions. In 2012, Paganini et al. [Paganini F, Tang A, Ferragut A, Andrew LLH (2012) Network stability under alpha fair bandwidth allocation with general file size distribution. IEEE Trans. Automatic Control 57(3):579–591.] introduced a Lyapunov function for this purpose and gave an argument assuming that fluid model solutions are sufficiently smooth in time and space that they are strong solutions of a partial differential equation, and assuming that no fluid level on any route touches zero before all route levels reach zero. The aim of the current paper is to prove stability of the subcritical fluid model without these strong assumptions. Starting with a slight generalization of the Lyapunov function proposed by Paganini et al., and assuming that each component of the initial state of a measure-valued fluid model solution, as well as the file size distributions, have no atoms and have finite first moments, we prove absolute continuity in time of the composition of the Lyapunov function with any subcritical fluid model solution and describe the associated density. We use this to prove that the Lyapunov function composed with such a subcritical fluid model solution converges to zero as time goes to infinity. This implies that each component of the measure-valued fluid model solution converges vaguely on [Formula: see text] to the zero measure as time goes to infinity. Under the further assumptions that the file size distributions have finite pth moments for some p > 1 and that each component of the initial state of the fluid model solution has a finite pth moment, it is proved that the fluid model solution reaches the measure with all components equal to the zero measure in finite time, and that the time to reach this zero state has a uniform bound for all fluid model solutions having a uniform bound on the initial total mass and on the pth moment of each component of the initial state. In contrast to the analysis of Paganini et al., we do not need their strong smoothness assumptions on fluid model solutions, and we rigorously treat the realistic but singular situation in which the fluid level on some routes becomes zero while other route levels remain positive.
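For item 1 above, the following is a minimal, purely illustrative Python sketch of how an (s, S, p) policy acts over a few review periods; the demand model, pricing rule, and all numerical values are hypothetical placeholders and are not taken from the cited paper.

import random

def s_S_p_decision(inventory, s, S, pricing_fn):
    """One-period (s, S, p) decision: order up to S whenever the inventory
    position is at or below the reorder point s (paying the fixed ordering
    cost), then set a price that depends on the post-ordering inventory."""
    order_qty = S - inventory if inventory <= s else 0
    price = pricing_fn(inventory + order_qty)
    return order_qty, price

def example_pricing_fn(inventory):
    # Hypothetical pricing rule: charge less when inventory is high.
    return max(2.0, 10.0 - 0.05 * inventory)

def simulate_period(inventory, s, S, fixed_cost=20.0, unit_cost=3.0):
    order_qty, price = s_S_p_decision(inventory, s, S, example_pricing_fn)
    # Hypothetical price-dependent random demand; unsatisfied demand backlogs.
    demand = max(0, int(random.gauss(50.0 - 3.0 * price, 5.0)))
    on_hand = inventory + order_qty
    revenue = price * min(max(on_hand, 0), demand)
    profit = revenue - unit_cost * order_qty - (fixed_cost if order_qty > 0 else 0.0)
    return on_hand - demand, profit  # inventory may go negative (backlog)

if __name__ == "__main__":
    inv, total = 10, 0.0
    for _ in range(5):
        inv, profit = simulate_period(inv, s=15, S=60)
        total += profit
    print("ending inventory:", inv, "cumulative profit:", round(total, 2))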
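For item 2 above, this toy Python sketch only illustrates the overall shape of a PROPEL-style update, namely an unconstrained gradient step followed by a projection back into a restricted policy class. The quadratic objective and the grid-rounding "projection" are deliberately simplified stand-ins for deep policy gradients and for program synthesis via imitation learning, and none of the names below come from the cited paper.

import numpy as np

def toy_policy_objective_grad(theta, target):
    # Gradient of the toy objective -||theta - target||^2 / 2,
    # standing in for a policy gradient estimate.
    return -(theta - target)

def project_to_programmatic(theta, grid=0.5):
    # Stand-in for the program-synthesis/imitation projection step:
    # snap each parameter to the nearest point on a coarse grid.
    return np.round(theta / grid) * grid

def propel_like_loop(theta0, target, steps=10, lr=1.0):
    theta = project_to_programmatic(np.asarray(theta0, dtype=float))
    for _ in range(steps):
        # 1) Unconstrained ("neural") improvement step; with lr=1.0 on this
        #    toy objective it lands exactly on the target.
        lifted = theta + lr * toy_policy_objective_grad(theta, target)
        # 2) Projection back onto the restricted ("programmatic") class.
        theta = project_to_programmatic(lifted)
    return theta

if __name__ == "__main__":
    # Converges to the grid point nearest the (unreachable) target.
    print(propel_like_loop(theta0=[0.0, 0.0], target=np.array([1.3, -0.8])))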
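For item 3 above, this sketch illustrates just one stage of the pipeline: fitting a neural-network policy to a table of (initial condition, control action) pairs. The table here is synthetic, the feature and target choices are hypothetical, and scikit-learn is assumed to be available; it is not the cited implementation.

import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Hypothetical "table": 200 sampled initial conditions at the Poincaré section
# (e.g., apex height and velocity) with a made-up control action for each,
# standing in for the outputs of the per-sample trajectory optimizations.
X = rng.uniform(low=[0.8, 0.5], high=[1.2, 2.0], size=(200, 2))
y = 0.3 * X[:, 0] + 0.1 * X[:, 1] ** 2 + 0.01 * rng.standard_normal(200)

# Fit a small MLP policy: initial condition -> control action.
policy = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=5000, random_state=0)
policy.fit(X, y)

# Validate on fresh initial conditions sampled from the same (box-shaped) region.
X_test = rng.uniform(low=[0.8, 0.5], high=[1.2, 2.0], size=(20, 2))
print("predicted control actions:", np.round(policy.predict(X_test[:5]), 3))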
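For item 5 above, the weighted α-fair allocation of Mo and Walrand, in the per-flow-rate form commonly used for this bandwidth-sharing model, can be recalled as follows; the slight generalization considered in the paper may differ in details. With n_r flows on route r, each receiving rate x_r, link capacities C_j, weights κ_r > 0, and α > 0, the allocation chooses x to solve

\[ \max_{x \ge 0} \; \sum_r \kappa_r \, n_r \, \frac{x_r^{1-\alpha}}{1-\alpha} \quad \text{subject to} \quad \sum_{r :\, j \in r} n_r x_r \le C_j \;\; \text{for every link } j, \]

with the α = 1 case interpreted as maximizing \sum_r \kappa_r n_r \log x_r (proportional fairness); α → ∞ approaches max-min fairness.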