
Title: Actor-Critic PAC Robust Policy Search
This work studies an approach for computing provably robust control laws for robotic systems operating in uncertain environments. We develop an actor-critic style policy search algorithm based on the idea of minimizing an upper confidence bound on the negative expected advantage of a control policy at each policy update iteration. This new algorithm is a reformulation of Probably-Approximately-Correct Robust Policy Search (PROPS) and, unlike PROPS, allows for both step-based evaluation and step-based sampling strategies in policy parameter space, enabled by the use of Generalized Advantage Estimation and Generalized Exploration. As a result, the new algorithm is more data efficient and is expected to compute higher quality policies faster. We empirically evaluate the algorithm in simulation on a challenging robot navigation task using a high-fidelity deep stochastic model of an agile ground vehicle and compare its performance to the original trajectory-based PROPS algorithm.
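The Generalized Advantage Estimation step mentioned above can be illustrated with a short sketch. This is not the paper's implementation; the function below merely computes the standard GAE estimator (discount `gamma`, smoothing parameter `lam`) from a trajectory of rewards and value estimates, assuming a terminal value of zero:

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation: an exponentially weighted sum of
    TD residuals delta_t = r_t + gamma * V(s_{t+1}) - V(s_t), computed
    backwards over the trajectory. The value after the last step is 0."""
    T = len(rewards)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        next_v = values[t + 1] if t + 1 < T else 0.0
        delta = rewards[t] + gamma * next_v - values[t]
        running = delta + gamma * lam * running  # accumulate discounted residuals
        adv[t] = running
    return adv
```

With `gamma = lam = 1` and zero value estimates, the advantage reduces to the undiscounted return-to-go, which is a quick sanity check for the recursion.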
Journal Name: ICRA 2019
Sponsoring Org: National Science Foundation
More Like this
  1. The increasing reliance on robust data-driven decision-making across many domains has made it necessary for data management systems to manage many thousands to millions of versions of datasets, acquired or constructed at various stages of analysis pipelines over time. Delta encoding is an effective and widely used solution to compactly store a large number of datasets that simultaneously exploits redundancies across them and keeps the average retrieval cost of reconstructing any dataset low. However, supporting any kind of rich retrieval or querying functionality, beyond single dataset checkout, is challenging in such storage engines. In this paper, we initiate a systematic study of this problem and present DEX, a novel stand-alone delta-oriented execution engine, whose goal is to take advantage of the already computed deltas between the datasets for efficient query processing. In this work, we study how to execute checkout, intersection, union and t-threshold queries over record-based files; we show that processing of even these basic queries leads to many new and unexplored challenges and trade-offs. Starting from a query plan that confines query execution to a small set of deltas, we introduce new transformation rules based on the algebraic properties of the deltas that allow us to explore the search space of alternative plans. For the case of checkout, we present a dynamic programming algorithm to efficiently select the optimal query plan under our cost model, while we design efficient heuristics to select effective plans that vastly outperform the base checkout-then-query approach for other queries. A key characteristic of our query execution methods is that the computational cost is primarily dependent on the size and the number of deltas in the expression (typically small), and not the input dataset versions (which can be very large). We have implemented a DEX prototype on top of git, a widely used version control system.
We present an extensive experimental evaluation on synthetic data with diverse characteristics, which shows that our methods perform exceedingly well compared to the baseline.
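The core checkout operation described above — reconstructing a version by composing stored deltas — can be sketched in a few lines. This is a minimal illustration only, not DEX's actual delta representation; here a delta is assumed to be a simple record-level map of `put` (add/overwrite) and `del` (remove) operations keyed by record id:

```python
def apply_delta(version, delta):
    """Apply one record-level delta: 'del' removes records by key,
    'put' adds or overwrites records."""
    out = dict(version)
    for key in delta.get("del", []):
        out.pop(key, None)
    out.update(delta.get("put", {}))
    return out

def checkout(base, deltas):
    """Reconstruct a dataset version by composing a chain of deltas
    onto a materialized base version, in order."""
    version = base
    for delta in deltas:
        version = apply_delta(version, delta)
    return version
```

The cost of `checkout` depends on the number and size of the deltas in the chain rather than on the full version sizes, which is the property the abstract highlights.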
  2. Prior work on automatic control synthesis for cyberphysical systems under logical constraints has primarily focused on environmental disturbances or modeling uncertainties; however, the impact of deliberate and malicious attacks has been less studied. In this paper, we consider a discrete-time dynamical system with a linear temporal logic (LTL) constraint in the presence of an adversary, which is modeled as a stochastic game. We assume that the adversary observes the control policy before choosing an attack strategy. We investigate two problems. In the first problem, we synthesize a robust control policy for the stochastic game that maximizes the probability of satisfying the LTL constraint. A value iteration based algorithm is proposed to compute the optimal control policy. In the second problem, we focus on a subclass of LTL constraints, which consist of an arbitrary LTL formula and an invariant constraint. We then investigate the problem of computing a control policy that minimizes the expected number of invariant constraint violations while maximizing the probability of satisfying the arbitrary LTL constraint. We characterize the optimality condition for the desired control policy. A policy iteration based algorithm is proposed to compute the control policy. We illustrate the proposed approaches using two numerical case studies.
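The max-min flavor of the value iteration mentioned above can be sketched for the simplest LTL objective, reachability of an accepting set. This is an illustrative toy, not the paper's algorithm: the transition encoding `P[s][a][d]` (a probability vector over next states for control action `a` and adversary action `d`) is an assumption made here for compactness:

```python
import numpy as np

def game_value_iteration(P, accepting, tol=1e-8, max_iters=10000):
    """Value iteration for a reachability stochastic game: the controller
    maximizes, and the adversary minimizes, the probability of reaching
    an accepting state. P[s][a][d] is a probability vector over states."""
    n = len(P)
    V = np.array([1.0 if s in accepting else 0.0 for s in range(n)])
    for _ in range(max_iters):
        newV = V.copy()
        for s in range(n):
            if s in accepting:
                continue  # accepting states keep value 1
            # controller picks the best action against the worst adversary reply
            newV[s] = max(min(np.dot(p, V) for p in P[s][a])
                          for a in range(len(P[s])))
        if np.max(np.abs(newV - V)) < tol:
            return newV
        V = newV
    return V
```

In the small example below, the controller's second action denies the adversary any way to divert the system from the accepting state, so the game value at state 0 is 1.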
  3. Policy gradient methods have become popular in multi-agent reinforcement learning, but they suffer from high variance due to the presence of environmental stochasticity and exploring agents (i.e., non-stationarity), which is potentially worsened by the difficulty in credit assignment. As a result, there is a need for a method that is not only capable of efficiently solving the above two problems but also robust enough to solve a variety of tasks. To this end, we propose a new multi-agent policy gradient method, called Robust Local Advantage (ROLA) Actor-Critic. ROLA allows each agent to learn an individual action-value function as a local critic, while ameliorating environment non-stationarity via a novel centralized training approach based on a centralized critic. By using this local critic, each agent calculates a baseline to reduce variance on its policy gradient estimation, which results in an expected advantage action-value over other agents' choices that implicitly improves credit assignment. We evaluate ROLA across diverse benchmarks and show its robustness and effectiveness over a number of state-of-the-art multi-agent policy gradient algorithms.
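The variance-reduction idea in the abstract — subtracting a baseline built from an agent's local critic — can be written out in one hypothetical helper. This is not ROLA's actual update rule; it only shows the generic counterfactual-style advantage, where the baseline is the agent's policy-weighted expectation of its own local action values:

```python
import numpy as np

def local_advantage(q_values, policy_probs, action):
    """Advantage of the chosen action relative to a local-critic baseline:
    A(s, a) = Q(s, a) - sum_b pi(b | s) * Q(s, b).
    Subtracting this expectation lowers policy-gradient variance
    without biasing the gradient."""
    baseline = float(np.dot(policy_probs, q_values))
    return float(q_values[action]) - baseline
```

Because the baseline does not depend on the sampled action, the resulting policy-gradient estimator remains unbiased while its variance is reduced.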
  4. Mining algorithms for relationship-based access control policies produce policies composed of relationship-based patterns that justify the input authorizations according to a given system graph. The correct functioning of a policy mining algorithm is typically tested based on experimental evaluations, in each of which the miner is presented with a set of authorizations and a system graph, and is expected to produce the corresponding ground truth policy. In this paper, we propose formal properties that must exist between the system graph and the ground truth policy in an evaluation test so that the miner is challenged to produce the exact ground truth policy. We show that failure to verify these properties in the experiment leads to inadequate evaluation, i.e., not truly testing whether the miner can handle the complexity of the ground truth policy. We also argue that following these properties would provide a computational advantage in the evaluations. We propose algorithms to identify and correct violations of these properties in system graphs. We also present our observations regarding these properties and their enforcement using a set of experimental studies.
  5. Purpose

    This paper presents a method to search for the worst‐case configuration leading to the highest RF exposure for a multiconfiguration implantable fixation system under MRI.


    Methods

    A two‐step method combining an artificial neural network and a genetic algorithm is developed to achieve this purpose. In the first step, the level of RF exposure in terms of peak 1‐g and/or 10‐g averaged specific absorption rate (SAR1g/10g), related to the multiconfiguration system, is predicted using an artificial neural network. A genetic algorithm is then used to search for the worst‐case configuration of this multidimensional nonlinear problem within both the enumerated discrete sample space and generalized continuous sample space. As an example, a generic plate system with a total of 576 configurations is used for both 1.5T and 3T MRI systems.


    Results

    The presented method can effectively identify the worst‐case configuration and accurately predict the SAR1g/10g with no more than 20% of the samples in the studied discrete sample space, and can even predict the worst case in the generalized continuous sample space. The worst‐case prediction error in the generalized continuous sample space is less than 1.6% for SAR1g and less than 1.3% for SAR10g compared with the simulation results.


    Conclusions

    The combination of an artificial neural network with a genetic algorithm is a robust technique to determine the worst‐case RF exposure level for a multiconfiguration system, and only needs a small amount of training data from the entire system.
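The genetic-algorithm search stage described above can be sketched compactly. Everything here is a hedged stand-in: the `surrogate` callable plays the role of the trained neural-network SAR predictor, and configurations are assumed (for illustration only) to be encoded as real vectors in [0, 1]^n:

```python
import random

def genetic_search(surrogate, n_dims, pop=30, gens=40, seed=0):
    """Minimal elitist genetic algorithm that maximizes a surrogate
    fitness function over configurations in [0, 1]^n_dims."""
    rng = random.Random(seed)
    population = [[rng.random() for _ in range(n_dims)] for _ in range(pop)]
    for _ in range(gens):
        population.sort(key=surrogate, reverse=True)
        parents = population[: pop // 2]          # keep the fitter half
        children = []
        while len(children) < pop - len(parents):
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_dims) if n_dims > 1 else 0
            child = a[:cut] + b[cut:]             # one-point crossover
            i = rng.randrange(n_dims)             # single-gene mutation
            child[i] = min(1.0, max(0.0, child[i] + rng.gauss(0.0, 0.1)))
            children.append(child)
        population = parents + children
    return max(population, key=surrogate)
```

Because the surviving parents are carried over unchanged, the best fitness found is monotonically non-decreasing across generations, which mirrors how the two-step method only needs a small number of expensive simulations to train the surrogate.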