
Title: Learning Exploration Strategies to Solve Real-World Marble Runs
Tasks involving locally unstable or discontinuous dynamics (such as bifurcations and collisions) remain challenging in robotics, because small variations in the environment can have a significant impact on task outcomes. For such tasks, learning a robust deterministic policy is difficult. We focus on structuring exploration with multiple stochastic policies based on a mixture of experts (MoE) policy representation that can be efficiently adapted. The MoE policy is composed of stochastic sub-policies that allow exploration of multiple distinct regions of the action space (or strategies) and a high-level selection policy to guide exploration towards the most promising regions. We develop a robot system to evaluate our approach in a real-world physical problem-solving domain. After training the MoE policy in simulation, online learning in the real world demonstrates efficient adaptation within just a few dozen attempts, with a minimal sim2real gap. Our results confirm that representing multiple strategies promotes efficient adaptation in new environments, and strategies learned under different dynamics can still provide useful information about where to look for good strategies.
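A minimal sketch of how such an MoE policy could be structured: a high-level selection policy samples a sub-policy index, and the chosen stochastic sub-policy samples an action from its own region of the action space. The class, the Gaussian sub-policies, and the bandit-style adaptation rule below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

class MoEPolicy:
    """Hypothetical mixture-of-experts policy: K stochastic sub-policies plus
    a high-level selection policy over them."""

    def __init__(self, sub_means, sub_stds, selection_logits):
        # Each sub-policy k explores a distinct region of the action space
        # (a "strategy") via its own mean and standard deviation.
        self.sub_means = np.asarray(sub_means, dtype=float)   # (K, action_dim)
        self.sub_stds = np.asarray(sub_stds, dtype=float)     # (K, action_dim)
        self.selection_logits = np.asarray(selection_logits, dtype=float)  # (K,)

    def sample(self, rng):
        # High-level selection policy: softmax over sub-policy logits.
        logits = self.selection_logits
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        k = rng.choice(len(probs), p=probs)
        # Low-level stochastic sub-policy: Gaussian exploration around its mean.
        action = rng.normal(self.sub_means[k], self.sub_stds[k])
        return k, action

    def update_selection(self, k, reward, lr=0.1):
        # Online adaptation: reinforce the selected strategy's logit by the
        # observed reward (a simple bandit-style rule, assumed for illustration).
        self.selection_logits[k] += lr * reward
```

Updating only the selection logits after each real-world attempt keeps per-attempt adaptation cheap, which is one plausible way to read the abstract's "few dozen attempts" regime.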
Award ID(s):
1849287
NSF-PAR ID:
10496012
Author(s) / Creator(s):
Publisher / Repository:
IEEE
Date Published:
Journal Name:
Proceedings IEEE International Conference on Robotics and Automation
ISSN:
1050-4729
Page Range / eLocation ID:
7243 to 7249
Format(s):
Medium: X
Location:
London, United Kingdom
Sponsoring Org:
National Science Foundation
More Like this
  1. We develop an approach to improve the learning capabilities of robotic systems by combining learned predictive models with experience-based state-action policy mappings. Predictive models provide an understanding of the task and the dynamics, while experience-based (model-free) policy mappings encode favorable actions that override planned actions. We refer to our approach of systematically combining model-based and model-free learning methods as hybrid learning. Our approach efficiently learns motor skills and improves the performance of predictive models and experience-based policies. Moreover, our approach enables policies (both model-based and model-free) to be updated using any off-policy reinforcement learning method. We derive a deterministic method of hybrid learning by optimally switching between learning modalities. We adapt our method to a stochastic variation that relaxes some of the key assumptions in the original derivation. Our deterministic and stochastic variations are tested on a variety of robot control benchmark tasks in simulation as well as a hardware manipulation task. We extend our approach for use with imitation learning methods, where experience is provided through demonstrations, and we test the expanded capability with a real-world pick-and-place task. The results show that our method is capable of improving the performance and sample efficiency of learning motor skills in a variety of experimental domains. 
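A hedged sketch of the switching idea in item 1: a model-based planner proposes an action, and the experience-based (model-free) policy overrides it when its learned value estimate is higher, with a stochastic variation that samples the modality. The function names and the Q-value switching criterion are assumptions for illustration, not the authors' derivation.

```python
import numpy as np

def hybrid_action(state, plan_action, policy_action, q_value):
    """Deterministic switch between learning modalities."""
    a_plan = plan_action(state)      # from the learned predictive model
    a_policy = policy_action(state)  # from the experience-based mapping
    # Override the planned action whenever experience predicts it is worse.
    if q_value(state, a_policy) > q_value(state, a_plan):
        return a_policy
    return a_plan

def stochastic_hybrid_action(state, plan_action, policy_action, q_value,
                             temperature=1.0, rng=None):
    """Stochastic variation: sample a modality in proportion to its value."""
    rng = rng or np.random.default_rng()
    candidates = [plan_action(state), policy_action(state)]
    values = np.array([q_value(state, a) for a in candidates])
    probs = np.exp((values - values.max()) / temperature)
    probs /= probs.sum()
    return candidates[rng.choice(len(candidates), p=probs)]
```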
  2. Humans leverage the dynamics of the environment and their own bodies to accomplish challenging tasks such as grasping an object while walking past it or pushing off a wall to turn a corner. Such tasks often involve switching dynamics as the robot makes and breaks contact. Learning these dynamics is a challenging problem and prone to model inaccuracies, especially near contact regions. In this work, we present a framework for learning composite dynamical behaviors from expert demonstrations. We learn a switching linear dynamical model, with contacts encoded in the switching conditions, as a close approximation of our system dynamics. We then use discrete-time LQR as the differentiable policy class for data-efficient learning of control, developing a control strategy that operates over multiple dynamical modes and takes discontinuities due to contact into account. In addition to predicting interactions with the environment, our policy effectively reacts to inaccurate predictions such as unanticipated contacts. Through simulation and real-world experiments, we demonstrate generalization of learned behaviors to different scenarios and robustness to model inaccuracies during execution.
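The discrete-time LQR policy class in item 2 can be sketched as a backward Riccati recursion applied per dynamical mode of the switching linear model. The mode-selection interface, horizon, and cost matrices below are illustrative assumptions, not the paper's learned quantities.

```python
import numpy as np

def lqr_gains(A, B, Q, R, horizon):
    """Finite-horizon discrete-time LQR via backward Riccati recursion.
    Returns time-ordered feedback gains K_t for the law u_t = -K_t x_t."""
    P = Q.copy()
    gains = []
    for _ in range(horizon):
        # K = (R + B'PB)^{-1} B'PA, then propagate the cost-to-go backward.
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K)
        gains.append(K)
    return gains[::-1]  # computed backward in time, so reverse

def switching_lqr_control(x, modes, mode_of):
    """Pick the active linear mode (e.g., contact vs. free space) from the
    learned switching conditions, then apply that mode's first LQR gain."""
    A, B, Q, R = modes[mode_of(x)]
    K0 = lqr_gains(A, B, Q, R, horizon=20)[0]
    return -K0 @ x
```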
  3. We present a strategy for simulation-to-real transfer, which builds on recent advances in robot skill decomposition. Rather than focusing on minimizing the simulation–reality gap, we propose a method that exploits hierarchy and online adaptation to increase the sample efficiency and robustness of existing simulation-to-real approaches. Instead of learning a unique policy for each desired robotic task, we learn a diverse set of skills and their variations, and embed those skill variations in a continuously parameterized space. We then interpolate, search, and plan in this space to find a transferable policy which solves more complex, high-level tasks by combining low-level skills and their variations. In this work, we first characterize the behavior of this learned skill space by experimenting with several techniques for composing pre-learned latent skills. We then discuss an algorithm which allows our method to perform long-horizon tasks never seen in simulation by intelligently sequencing short-horizon latent skills. Our algorithm adapts to unseen tasks online by repeatedly choosing new skills from the latent space, using live sensor data and simulation to predict which latent skill will perform best next in the real world. Importantly, our method learns to control a real robot in joint-space to achieve these high-level tasks with little or no on-robot time, despite the fact that the low-level policies may not be perfectly transferable from simulation to reality, and that the low-level skills were not trained on any examples of high-level tasks. In addition to our results indicating a lower sample complexity for families of tasks, we believe that our method provides a promising template for combining learning-based methods with proven classical robotics algorithms such as model-predictive control.
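One way to picture the online skill sequencing in item 3: repeatedly sample candidate points in the latent skill space, score each candidate's short-horizon rollout in simulation from the live observation, and execute the best one on the robot. The helpers (`simulate_rollout`, `decode_skill`, `task_score`) and the random-candidate search are hypothetical stand-ins for the paper's components.

```python
import numpy as np

def choose_next_skill(obs, skill_candidates, simulate_rollout, task_score):
    """Search the continuously parameterized skill space by scoring each
    candidate's predicted short-horizon rollout from the live observation."""
    best_z, best_score = None, -np.inf
    for z in skill_candidates:              # points in the latent skill space
        predicted_traj = simulate_rollout(obs, z)
        score = task_score(predicted_traj)  # progress toward the high-level task
        if score > best_score:
            best_z, best_score = z, score
    return best_z

def run_long_horizon_task(env, decode_skill, simulate_rollout, task_score,
                          latent_dim=4, n_candidates=64, n_steps=10, rng=None):
    """Sequence short-horizon latent skills to attempt a long-horizon task."""
    rng = rng or np.random.default_rng()
    obs = env.reset()
    for _ in range(n_steps):
        candidates = rng.normal(size=(n_candidates, latent_dim))
        z = choose_next_skill(obs, candidates, simulate_rollout, task_score)
        obs = decode_skill(env, obs, z)     # execute the chosen low-level skill
    return obs
```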
  4. Training robotic policies in simulation suffers from the sim-to-real gap, as simulated dynamics can differ from real-world dynamics. Past work has tackled this problem through domain randomization and online system identification. The former is sensitive to the manually specified training distribution of dynamics parameters and can result in behaviors that are overly conservative. The latter requires learning policies that concurrently perform the task and generate useful trajectories for system identification. In this work, we propose and analyze a framework for learning exploration policies that explicitly perform task-oriented exploration actions to identify task-relevant system parameters. These parameters are then used by model-based trajectory optimization algorithms to perform the task in the real world. We instantiate the framework in simulation with the Linear Quadratic Regulator as well as in the real world with pouring and object-dragging tasks. Experiments show that task-oriented exploration helps model-based policies adapt to systems with initially unknown parameters, and it leads to better task performance than task-agnostic exploration.
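The explore-then-identify loop in item 4, sketched for a linear setting: an exploration policy excites the dynamics, a least-squares fit recovers the task-relevant parameters (A, B), and the identified model is then handed to a model-based controller such as LQR. The environment interface (a `step` that returns the next state) and the purely linear model are simplifying assumptions.

```python
import numpy as np

def identify_dynamics(states, actions):
    """Least-squares fit of x_{t+1} = A x_t + B u_t from one exploration rollout."""
    X, U, Xn = states[:-1], actions, states[1:]
    Z = np.hstack([X, U])                      # regressors [x_t, u_t]
    theta, *_ = np.linalg.lstsq(Z, Xn, rcond=None)
    n = X.shape[1]
    A, B = theta[:n].T, theta[n:].T            # theta stacks [A^T; B^T]
    return A, B

def explore_then_identify(env, explore_policy, horizon=50):
    """Run the task-oriented exploration policy, then fit the dynamics."""
    states, actions = [env.reset()], []
    for _ in range(horizon):
        u = explore_policy(states[-1])         # task-oriented exploration action
        x_next = env.step(u)                   # assumed to return the next state
        states.append(x_next)
        actions.append(u)
    A, B = identify_dynamics(np.array(states), np.array(actions))
    return A, B    # handed to model-based trajectory optimization (e.g., LQR)
```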
  5. Access control is often reported to be “profoundly broken” in real-world practices due to prevalent policy misconfigurations introduced by system administrators (sysadmins). Given the dynamics of resource and data sharing, access control policies need to be continuously updated. Unfortunately, to err is human: sysadmins often make mistakes such as over-granting privileges when changing access control policies. With today’s limited tooling support for continuous validation, such mistakes can stay unnoticed for a long time until eventually being exploited by attackers, causing catastrophic security incidents. We present P-DIFF, a practical tool for monitoring access control behavior to help sysadmins detect unintended access control policy changes early and perform postmortem forensic analysis upon security attacks. P-DIFF continuously monitors access logs and infers access control policies from them. To handle the challenge of policy evolution, we devise a novel time-changing decision tree to effectively represent access control policy changes, coupled with a new learning algorithm to infer the tree from access logs. P-DIFF provides sysadmins with the inferred policies and detected changes to assist the following two tasks: (1) validating whether the access control changes are intended or not; (2) pinpointing the historical changes responsible for a given security attack. We evaluate P-DIFF with a variety of datasets collected from five real-world systems, including two from industrial companies. P-DIFF can detect 86%–100% of access control policy changes with an average precision of 89%. For forensic analysis, P-DIFF can pinpoint the root-cause change that permits the target access in 85%–98% of the evaluated cases.
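A simplified reconstruction of the time-changing decision tree idea in item 5: each leaf stores a time-segmented access decision, so a detected policy change is simply the start of a new segment. This is inferred from the abstract alone and is not P-DIFF's actual data structure.

```python
from dataclasses import dataclass, field

@dataclass
class TimeChangingLeaf:
    """One leaf of a hypothetical time-changing decision tree: a sequence of
    (start_time, decision) segments inferred from access logs."""
    segments: list = field(default_factory=list)

    def record(self, t, allowed):
        # Open a new segment only when the observed decision flips; each flip
        # is exactly the policy-change event a sysadmin should review.
        if not self.segments or self.segments[-1][1] != allowed:
            self.segments.append((t, allowed))

    def decision_at(self, t):
        # Replay segments to recover the inferred policy at time t.
        current = None
        for start, allowed in self.segments:
            if start <= t:
                current = allowed
        return current

    def changes(self):
        # Every segment after the first is a detected policy change.
        return self.segments[1:]
```

A leaf built from an access log then supports both tasks the abstract names: `decision_at(t)` validates the inferred policy at a point in time, and `changes()` lists candidate root-cause changes for forensic analysis.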