Our goal is to learn control policies for robots that provably generalize well to novel environments given a dataset of example environments. The key technical idea behind our approach is to leverage tools from generalization theory in machine learning by exploiting a precise analogy (which we present in the form of a reduction) between generalization of control policies to novel environments and generalization of hypotheses in the supervised learning setting. In particular, we utilize the probably approximately correct (PAC)-Bayes framework, which allows us to obtain upper bounds that hold with high probability on the expected cost of (stochastic) control policies across novel environments. We propose policy learning algorithms that explicitly seek to minimize this upper bound. The corresponding optimization problem can be solved using convex optimization (relative entropy programming in particular) in the setting where we are optimizing over a finite policy space. In the more general setting of continuously parameterized policies (e.g., neural network policies), we minimize this upper bound using stochastic gradient descent. We present simulated results of our approach applied to learning (1) reactive obstacle avoidance policies and (2) neural network-based grasping policies. We also present hardware results for the Parrot Swing drone navigating through different obstacle …
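The finite-policy-space setting described above can be pictured with a small sketch: a McAllester-style PAC-Bayes bound over a finite policy set, minimized by gradient descent on softmax logits. This is a toy stand-in, not the paper's relative entropy program; the function names, cost values, and the finite-difference gradient are illustrative assumptions.

```python
import numpy as np

def pac_bayes_bound(posterior, prior, emp_costs, n_envs, delta=0.01):
    """McAllester-style PAC-Bayes upper bound on the expected cost of a
    stochastic policy (costs assumed to lie in [0, 1]).

    posterior, prior: probability vectors over a finite policy set.
    emp_costs: each policy's average cost on the n_envs training environments.
    """
    kl = float(np.sum(posterior * np.log(posterior / prior)))
    emp = float(posterior @ emp_costs)
    slack = np.sqrt((kl + np.log(2 * np.sqrt(n_envs) / delta)) / (2 * n_envs))
    return emp + slack

def minimize_bound(prior, emp_costs, n_envs, steps=300, lr=1.0, eps=1e-5):
    """Minimize the bound over the posterior by gradient descent on softmax
    logits -- a simple stand-in for relative entropy programming / SGD."""
    softmax = lambda t: np.exp(t - t.max()) / np.exp(t - t.max()).sum()
    f = lambda t: pac_bayes_bound(softmax(t), prior, emp_costs, n_envs)
    theta = np.zeros_like(prior)
    for _ in range(steps):
        # numerical gradient, for brevity
        grad = np.array([(f(theta + eps * e) - f(theta - eps * e)) / (2 * eps)
                         for e in np.eye(len(theta))])
        theta = theta - lr * grad
    return softmax(theta)
```

Minimizing the bound trades off empirical cost against KL divergence from the prior: the optimized posterior shifts mass toward low-cost policies only as far as the KL penalty allows.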
 Publication Date:
 NSF-PAR ID: 10196346
 Journal Name: The International Journal of Robotics Research
 Volume: 40
 Issue: 2-3
 Page Range or eLocation ID: p. 574-593
 ISSN: 0278-3649
 Publisher: SAGE Publications
 Sponsoring Org: National Science Foundation
More Like this

We consider the periodic review dynamic pricing and inventory control problem with fixed ordering cost. Demand is random and price dependent, and unsatisfied demand is backlogged. With complete demand information, the celebrated (s, S, p) policy is proved to be optimal, where s and S are the reorder point and order-up-to level for the ordering strategy, and p, a function of on-hand inventory level, characterizes the pricing strategy. In this paper, we consider incomplete demand information and develop online learning algorithms whose average profit approaches that of the optimal (s, S, p) policy with a tight [Formula: see text] regret …
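The policy structure in this abstract can be sketched in a few lines: reorder up to S whenever inventory falls below s, pay a fixed ordering cost, and set the price as a function of on-hand inventory. The Gaussian price-dependent demand, the cost parameters, and the function names below are illustrative assumptions, not the paper's demand model.

```python
import random

def simulate_sSp(s, S, price_fn, horizon, fixed_cost=10.0, unit_cost=1.0,
                 holding=0.5, backlog_pen=2.0, seed=0):
    """One sample path of an (s, S, p)-style policy (toy model).

    price_fn maps on-hand inventory to a price; demand is a crude
    price-dependent Gaussian stand-in. Unsatisfied demand is backlogged
    (negative inventory) and penalized.
    """
    rng = random.Random(seed)
    x, profit = S, 0.0
    for _ in range(horizon):
        if x < s:                                    # reorder point triggered
            profit -= fixed_cost + unit_cost * (S - x)
            x = S                                    # order up to S
        p = price_fn(x)                              # pricing strategy p(x)
        demand = max(0, int(rng.gauss(10.0 - 0.5 * p, 2.0)))
        profit += p * demand                         # backlogged demand is still sold
        x -= demand
        profit -= holding * max(x, 0) + backlog_pen * max(-x, 0)
    return profit
```

With incomplete demand information, a learning algorithm would tune (s, S) and the pricing function from observed sales rather than take them as given.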

Wallach, H (Ed.) We study the problem of programmatic reinforcement learning, in which policies are represented as short programs in a symbolic language. Programmatic policies can be more interpretable, generalizable, and amenable to formal verification than neural policies; however, designing rigorous learning approaches for such policies remains a challenge. Our approach to this challenge, a meta-algorithm called PROPEL, is based on three insights. First, we view our learning task as optimization in policy space, modulo the constraint that the desired policy has a programmatic representation, and solve this optimization problem using a form of mirror descent that takes a gradient step into the unconstrained policy …
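The mirror-descent insight can be pictured as "step in the unconstrained space, then project back onto the programmatic class." The sketch below treats policies as parameter vectors and uses rounding to a coarse grid as a stand-in for the projection; in PROPEL itself the projection is performed via program synthesis and imitation, and every name here is illustrative.

```python
import numpy as np

def propel_step(prog_policy, grad_estimate, project, lr=0.1):
    """One PROPEL-style update (illustrative sketch).

    grad_estimate: returns a policy-gradient estimate at the current policy.
    project: maps an unconstrained policy vector back onto the
             programmatic class (stand-in for program synthesis).
    """
    unconstrained = prog_policy - lr * grad_estimate(prog_policy)  # free step
    return project(unconstrained)                                  # project back

# Toy usage: quadratic objective, "programmatic" = values on a 0.1 grid.
policy = propel_step(np.array([0.5]), lambda p: 2 * p, lambda v: np.round(v, 1))
```

The key point is that the gradient step may leave the programmatic class entirely; only the projection restores a policy with a program representation.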

We present a sampling-based framework for feedback motion planning of legged robots. Our framework is based on switching between limit cycles at a fixed instance of motion, the Poincaré section (e.g., apex or touchdown), by finding overlaps between the regions of attraction (ROA) of two limit cycles. First, we assume a candidate orbital Lyapunov function (OLF) and define a ROA at the Poincaré section. Next, we solve multiple trajectory optimization problems, one for each sampled initial condition on the ROA, to minimize an energy metric subject to the exponential convergence of the OLF between two steps. The result is a …
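The switching condition in this framework boils down to a state on the Poincaré section lying inside both regions of attraction at once. A minimal sketch, assuming quadratic candidate OLFs with sublevel-set ROAs (the level values and the sampling grid are illustrative choices):

```python
import numpy as np

def roa_overlap(V_a, c_a, V_b, c_b, samples):
    """Return sampled states on the Poincaré section that lie in both
    ROAs, i.e. candidate switching states between the two limit cycles.

    V_a, V_b: candidate orbital Lyapunov functions; c_a, c_b: level-set
    values defining each ROA (toy stand-in for the sampling framework).
    """
    return [x for x in samples if V_a(x) <= c_a and V_b(x) <= c_b]
```

In the framework described above, each retained sample would additionally be validated by a trajectory optimization enforcing exponential decrease of the OLF between steps.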

We study multi-agent reinforcement learning (MARL) in a stochastic network of agents. The objective is to find localized policies that maximize the (discounted) global reward. In general, scalability is a challenge in this setting because the size of the global state/action space can be exponential in the number of agents. Scalable algorithms are only known in cases where dependencies are static, fixed, and local, e.g., between neighbors in a fixed, time-invariant underlying graph. In this work, we propose a Scalable Actor Critic framework that applies in settings where the dependencies can be non-local and stochastic, and provide a finite-time error …
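The notion of a localized policy can be made concrete with a tiny sketch: each agent acts only on the states in its neighborhood, so per-agent computation stays bounded rather than scaling with the exponential global state space. The neighborhood structure and policy form below are illustrative, not the paper's framework.

```python
def localized_actions(states, policies, neighbors):
    """Compute each agent's action from only its local neighborhood of
    the global state (toy illustration of localized policies).

    neighbors[i] lists the agents whose states agent i may observe; in
    the stochastic-dependency setting this set could itself be random.
    """
    return [policies[i](tuple(states[j] for j in neighbors[i]))
            for i in range(len(policies))]
```

Each policy's input size depends only on its neighborhood, not on the total number of agents.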

This work concerns the asymptotic behavior of solutions to a (strictly) subcritical fluid model for a data communication network, where file sizes are generally distributed and the network operates under a fair bandwidth-sharing policy. Here we consider fair bandwidth-sharing policies that are a slight generalization of the α-fair policies introduced by Mo and Walrand [Mo J, Walrand J (2000) Fair end-to-end window-based congestion control. IEEE/ACM Trans. Networks 8(5):556–567.]. Since the year 2000, it has been a standing problem to prove stability of the data communications network model of Massoulié and Roberts [Massoulié L, Roberts J (2000) Bandwidth sharing …