
Title: Actor-Critic PAC Robust Policy Search
This work studies an approach for computing provably robust control laws for robotic systems operating in uncertain environments. We develop an actor-critic style policy search algorithm based on the idea of minimizing an upper confidence bound on the negative expected advantage of a control policy at each policy update iteration. This new algorithm is a reformulation of Probably-Approximately-Correct Robust Policy Search (PROPS) and, unlike PROPS, allows for both step-based evaluation and step-based sampling strategies in policy parameter space, enabled by the use of Generalized Advantage Estimation and Generalized Exploration. As a result, the new algorithm is more data efficient and is expected to compute higher quality policies faster. We empirically evaluate the algorithm in simulation on a challenging robot navigation task using a high-fidelity deep stochastic model of an agile ground vehicle and compare its performance to the original trajectory-based PROPS algorithm.
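The Generalized Advantage Estimation step mentioned above can be illustrated with a short sketch. This is not the paper's implementation; the function below merely computes the standard GAE estimator (discount `gamma`, smoothing parameter `lam`) from a trajectory of rewards and value estimates, assuming a terminal value of zero:

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation: an exponentially weighted sum of
    TD residuals delta_t = r_t + gamma * V(s_{t+1}) - V(s_t), computed
    backwards over the trajectory. The value after the last step is 0."""
    T = len(rewards)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        next_v = values[t + 1] if t + 1 < T else 0.0
        delta = rewards[t] + gamma * next_v - values[t]
        running = delta + gamma * lam * running  # accumulate discounted residuals
        adv[t] = running
    return adv
```

With `gamma = lam = 1` and zero value estimates, the advantage reduces to the undiscounted return-to-go, which is a quick sanity check for the recursion.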
Journal Name: ICRA 2019
Sponsoring Org: National Science Foundation
More Like this
  1. The increasing reliance on robust data-driven decision-making across many domains has made it necessary for data management systems to manage many thousands to millions of versions of datasets, acquired or constructed at various stages of analysis pipelines over time. Delta encoding is an effective and widely used solution to compactly store a large number of datasets that simultaneously exploits redundancies across them and keeps the average retrieval cost of reconstructing any dataset low. However, supporting any kind of rich retrieval or querying functionality, beyond single dataset checkout, is challenging in such storage engines. In this paper, we initiate a systematic study of this problem and present DEX, a novel stand-alone delta-oriented execution engine, whose goal is to take advantage of the already computed deltas between the datasets for efficient query processing. In this work, we study how to execute checkout, intersection, union and t-threshold queries over record-based files; we show that processing of even these basic queries leads to many new and unexplored challenges and trade-offs. Starting from a query plan that confines query execution to a small set of deltas, we introduce new transformation rules based on the algebraic properties of the deltas that allow us to explore the search space of alternative plans. For the case of checkout, we present a dynamic programming algorithm to efficiently select the optimal query plan under our cost model, while we design efficient heuristics to select effective plans that vastly outperform the base checkout-then-query approach for other queries. A key characteristic of our query execution methods is that the computational cost is primarily dependent on the size and the number of deltas in the expression (typically small), and not the input dataset versions (which can be very large). We have implemented a DEX prototype on top of git, a widely used version control system.
We present an extensive experimental evaluation on synthetic data with diverse characteristics, which shows that our methods perform exceedingly well compared to the baseline.
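The core checkout operation described above — reconstructing a version by composing stored deltas — can be sketched in a few lines. This is a minimal illustration only, not DEX's actual delta representation; here a delta is assumed to be a simple record-level map of `put` (add/overwrite) and `del` (remove) operations keyed by record id:

```python
def apply_delta(version, delta):
    """Apply one record-level delta: 'del' removes records by key,
    'put' adds or overwrites records."""
    out = dict(version)
    for key in delta.get("del", []):
        out.pop(key, None)
    out.update(delta.get("put", {}))
    return out

def checkout(base, deltas):
    """Reconstruct a dataset version by composing a chain of deltas
    onto a materialized base version, in order."""
    version = base
    for delta in deltas:
        version = apply_delta(version, delta)
    return version
```

The cost of `checkout` depends on the number and size of the deltas in the chain rather than on the full version sizes, which is the property the abstract highlights.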
  2. Prior work on automatic control synthesis for cyberphysical systems under logical constraints has primarily focused on environmental disturbances or modeling uncertainties; however, the impact of deliberate and malicious attacks has been less studied. In this paper, we consider a discrete-time dynamical system with a linear temporal logic (LTL) constraint in the presence of an adversary, which is modeled as a stochastic game. We assume that the adversary observes the control policy before choosing an attack strategy. We investigate two problems. In the first problem, we synthesize a robust control policy for the stochastic game that maximizes the probability of satisfying the LTL constraint. A value iteration based algorithm is proposed to compute the optimal control policy. In the second problem, we focus on a subclass of LTL constraints, which consist of an arbitrary LTL formula and an invariant constraint. We then investigate the problem of computing a control policy that minimizes the expected number of invariant constraint violations while maximizing the probability of satisfying the arbitrary LTL constraint. We characterize the optimality condition for the desired control policy. A policy iteration based algorithm is proposed to compute the control policy. We illustrate the proposed approaches using two numerical case studies.
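The max-min flavor of the value iteration mentioned above can be sketched for the simplest LTL objective, reachability of an accepting set. This is an illustrative toy, not the paper's algorithm: the transition encoding `P[s][a][d]` (a probability vector over next states for control action `a` and adversary action `d`) is an assumption made here for compactness:

```python
import numpy as np

def game_value_iteration(P, accepting, tol=1e-8, max_iters=10000):
    """Value iteration for a reachability stochastic game: the controller
    maximizes, and the adversary minimizes, the probability of reaching
    an accepting state. P[s][a][d] is a probability vector over states."""
    n = len(P)
    V = np.array([1.0 if s in accepting else 0.0 for s in range(n)])
    for _ in range(max_iters):
        newV = V.copy()
        for s in range(n):
            if s in accepting:
                continue  # accepting states keep value 1
            # controller picks the best action against the worst adversary reply
            newV[s] = max(min(np.dot(p, V) for p in P[s][a])
                          for a in range(len(P[s])))
        if np.max(np.abs(newV - V)) < tol:
            return newV
        V = newV
    return V
```

In the small example below, the controller's second action denies the adversary any way to divert the system from the accepting state, so the game value at state 0 is 1.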
  3. Policy gradient methods have become popular in multi-agent reinforcement learning, but they suffer from high variance due to the presence of environmental stochasticity and exploring agents (i.e., non-stationarity), which is potentially worsened by the difficulty in credit assignment. As a result, there is a need for a method that is not only capable of efficiently solving the above two problems but also robust enough to solve a variety of tasks. To this end, we propose a new multi-agent policy gradient method, called Robust Local Advantage (ROLA) Actor-Critic. ROLA allows each agent to learn an individual action-value function as a local critic, while ameliorating environment non-stationarity via a novel centralized training approach based on a centralized critic. By using this local critic, each agent calculates a baseline to reduce variance on its policy gradient estimation, which results in an expected advantage action-value over other agents' choices that implicitly improves credit assignment. We evaluate ROLA across diverse benchmarks and show its robustness and effectiveness over a number of state-of-the-art multi-agent policy gradient algorithms.
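The variance-reduction idea in the abstract — subtracting a baseline built from an agent's local critic — can be written out in one hypothetical helper. This is not ROLA's actual update rule; it only shows the generic counterfactual-style advantage, where the baseline is the agent's policy-weighted expectation of its own local action values:

```python
import numpy as np

def local_advantage(q_values, policy_probs, action):
    """Advantage of the chosen action relative to a local-critic baseline:
    A(s, a) = Q(s, a) - sum_b pi(b | s) * Q(s, b).
    Subtracting this expectation lowers policy-gradient variance
    without biasing the gradient."""
    baseline = float(np.dot(policy_probs, q_values))
    return float(q_values[action]) - baseline
```

Because the baseline does not depend on the sampled action, the resulting policy-gradient estimator remains unbiased while its variance is reduced.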
  4. Mining algorithms for relationship-based access control policies produce policies composed of relationship-based patterns that justify the input authorizations according to a given system graph. The correct functioning of a policy mining algorithm is typically tested based on experimental evaluations, in each of which the miner is presented with a set of authorizations and a system graph, and is expected to produce the corresponding ground truth policy. In this paper, we propose formal properties that must exist between the system graph and the ground truth policy in an evaluation test so that the miner is challenged to produce the exact ground truth policy. We show that failure to verify these properties in the experiment leads to inadequate evaluation, i.e., not truly testing whether the miner can handle the complexity of the ground truth policy. We also argue that following these properties would provide a computational advantage in the evaluations. We propose algorithms to identify and correct violations of these properties in system graphs. We also present our observations regarding these properties and their enforcement using a set of experimental studies.
  5. Purpose

    This paper presents a method to search for the worst‐case configuration leading to the highest RF exposure for a multiconfiguration implantable fixation system under MRI.


    Methods

    A two‐step method combining an artificial neural network and a genetic algorithm is developed to achieve this purpose. In the first step, the level of RF exposure in terms of peak 1‐g and/or 10‐g averaged specific absorption rate (SAR1g/10g), related to the multiconfiguration system, is predicted using an artificial neural network. A genetic algorithm is then used to search for the worst‐case configuration of this multidimensional nonlinear problem within both the enumerated discrete sample space and generalized continuous sample space. As an example, a generic plate system with a total of 576 configurations is used for both 1.5T and 3T MRI systems.


    Results

    The presented method can effectively identify the worst‐case configuration and accurately predict the SAR1g/10g with no more than 20% of the samples in the studied discrete sample space, and can even predict the worst case in the generalized continuous sample space. The worst‐case prediction error in the generalized continuous sample space is less than 1.6% for SAR1g and less than 1.3% for SAR10g compared with the simulation results.


    Conclusions

    The combination of an artificial neural network with a genetic algorithm is a robust technique to determine the worst‐case RF exposure level for a multiconfiguration system, and only needs a small amount of training data from the entire system.
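The genetic-algorithm search stage described above can be sketched compactly. Everything here is a hedged stand-in: the `surrogate` callable plays the role of the trained neural-network SAR predictor, and configurations are assumed (for illustration only) to be encoded as real vectors in [0, 1]^n:

```python
import random

def genetic_search(surrogate, n_dims, pop=30, gens=40, seed=0):
    """Minimal elitist genetic algorithm that maximizes a surrogate
    fitness function over configurations in [0, 1]^n_dims."""
    rng = random.Random(seed)
    population = [[rng.random() for _ in range(n_dims)] for _ in range(pop)]
    for _ in range(gens):
        population.sort(key=surrogate, reverse=True)
        parents = population[: pop // 2]          # keep the fitter half
        children = []
        while len(children) < pop - len(parents):
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_dims) if n_dims > 1 else 0
            child = a[:cut] + b[cut:]             # one-point crossover
            i = rng.randrange(n_dims)             # single-gene mutation
            child[i] = min(1.0, max(0.0, child[i] + rng.gauss(0.0, 0.1)))
            children.append(child)
        population = parents + children
    return max(population, key=surrogate)
```

Because the surviving parents are carried over unchanged, the best fitness found is monotonically non-decreasing across generations, which mirrors how the two-step method only needs a small number of expensive simulations to train the surrogate.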