skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Revisiting Design Choices in Proximal Policy Optimization
Proximal Policy Optimization (PPO) is a popular deep policy gradient algorithm. In standard implementations, PPO regularizes policy updates with clipped probability ratios, and parameterizes policies with either continuous Gaussian distributions or discrete Softmax distributions. These design choices are widely accepted, and motivated by empirical performance comparisons on MuJoCo and Atari benchmarks. We revisit these practices outside the regime of current benchmarks, and expose three failure modes of standard PPO. We explain why standard design choices are problematic in these cases, and show that alternative choices of surrogate objectives and policy parameterizations can prevent the failure modes. We hope that our work serves as a reminder that many algorithmic design choices in reinforcement learning are tied to specific simulation environments. We should not implicitly accept these choices as a standard part of a more general algorithm.  more » « less
Award ID(s):
1764033
PAR ID:
10249249
Author(s) / Creator(s):
; ;
Date Published:
Journal Name:
ArXivorg
Volume:
arXiv:2009.10897
ISSN:
2331-8422
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Actor-critic methods, like Twin Delayed Deep Deterministic Policy Gradient (TD3), depend on basic noise-based exploration, which can result in less than optimal policy convergence. In this study, we introduce Monte Carlo Beam Search (MCBS), a new hybrid method that combines beam search and Monte Carlo rollouts with TD3 to improve exploration and action selection. MCBS produces several candidate actions around the policy's output and assesses them through short-horizon rollouts, enabling the agent to make better-informed choices. We test MCBS across various continuous-control benchmarks, including HalfCheetah-v4, Walker2d-v5, and Swimmer-v5, showing enhanced sample efficiency and performance compared to standard TD3 and other baseline methods like SAC, PPO, and A2C. Our findings emphasize MCBS's capability to enhance policy learning through structured look-ahead search while ensuring computational efficiency. Additionally, we offer a detailed analysis of crucial hyperparameters, such as beam width and rollout depth, and explore adaptive strategies to optimize MCBS for complex control tasks. Our method shows a higher convergence rate across different environments compared to TD3, SAC, PPO, and A2C. For instance, we achieved 90% of the maximum achievable reward within around 200 thousand timesteps compared to 400 thousand timesteps for the second-best method. 
    more » « less
  2. Landing a multi-rotor uncrewed aerial vehicle (UAV) on a moving target in the presence of partial observability, due to factors such as sensor failure or noise, represents an outstanding challenge that requires integrative techniques in robotics and machine learning. In this paper, we propose embedding a long short-term memory (LSTM) network into a variation of proximal policy optimization (PPO) architecture, termed robust policy optimization (RPO), to address this issue. The proposed algorithm is a deep reinforcement learning approach that utilizes recurrent neural networks (RNNs) as a memory component. Leveraging the end-to-end learning capability of deep reinforcement learning, the RPO-LSTM algorithm learns the optimal control policy without the need for feature engineering. Through a series of simulation-based studies, we demonstrate the superior effectiveness and practicality of our approach compared to the state-of-the-art proximal policy optimization (PPO) and the classical control method Lee-EKF, particularly in scenarios with partial observability. The empirical results reveal that RPO-LSTM significantly outperforms competing reinforcement learning algorithms, achieving up to 74% more successful landings than Lee-EKF and 50% more than PPO in flicker scenarios, maintaining robust performance in noisy environments and in the most challenging conditions that combine flicker and noise. These findings underscore the potential of RPO-LSTM in solving the problem of UAV landing on moving targets amid various degrees of sensor impairment and environmental interference. 
    more » « less
  3. III, Hal Daumé; Singh, Aarti (Ed.)
    Studying the set of exact solutions of a system of polynomial equations largely depends on a single iterative algorithm, known as Buchberger’s algorithm. Optimized versions of this algorithm are crucial for many computer algebra systems (e.g., Mathematica, Maple, Sage). We introduce a new approach to Buchberger’s algorithm that uses reinforcement learning agents to perform S-pair selection, a key step in the algorithm. We then study how the difficulty of the problem depends on the choices of domain and distribution of polynomials, about which little is known. Finally, we train a policy model using proximal policy optimization (PPO) to learn S-pair selection strategies for random systems of binomial equations. In certain domains, the trained model outperforms state-of-the-art selection heuristics in total number of polynomial additions performed, which provides a proof-of-concept that recent developments in machine learning have the potential to improve performance of algorithms in symbolic computation. 
    more » « less
  4. Serverless Function-As-A-Service (FaaS) is an emerging cloud computing paradigm that frees application developers from infrastructure management tasks such as resource provisioning and scaling. To reduce the tail latency of functions and improve resource utilization, recent research has been focused on applying online learning algorithms such as reinforcement learning (RL) to manage resources. Compared to existing heuristics-based resource management approaches, RL-based approaches eliminate humans in the loop and avoid the painstaking generation of heuristics. In this paper, we show that the state-of-The-Art single-Agent RL algorithm (S-RL) suffers up to 4.6x higher function tail latency degradation on multi-Tenant serverless FaaS platforms and is unable to converge during training. We then propose and implement a customized multi-Agent RL algorithm based on Proximal Policy Optimization, i.e., multi-Agent PPO (MA-PPO). We show that in multi-Tenant environments, MA-PPO enables each agent to be trained until convergence and provides online performance comparable to S-RL in single-Tenant cases with less than 10% degradation. Besides, MA-PPO provides a 4.4x improvement in S-RL performance (in terms of function tail latency) in multi-Tenant cases. 
    more » « less
  5. The actual failure times of individual components are usually unavailable in many applications. Instead, only aggregate failure-time data are collected by actual users, due to technical and/or economic reasons. When dealing with such data for reliability estimation, practitioners often face the challenges of selecting the underlying failure-time distributions and the corresponding statistical inference methods. So far, only the exponential, normal, gamma and inverse Gaussian distributions have been used in analyzing aggregate failure-time data, due to these distributions having closed-form expressions for such data. However, the limited choices of probability distributions cannot satisfy extensive needs in a variety of engineering applications. PHase-type (PH) distributions are robust and flexible in modeling failure-time data, as they can mimic a large collection of probability distributions of non-negative random variables arbitrarily closely by adjusting the model structures. In this article, PH distributions are utilized, for the first time, in reliability estimation based on aggregate failure-time data. A Maximum Likelihood Estimation (MLE) method and a Bayesian alternative are developed. For the MLE method, an Expectation-Maximization algorithm is developed for parameter estimation, and the corresponding Fisher information is used to construct the confidence intervals for the quantities of interest. For the Bayesian method, a procedure for performing point and interval estimation is also introduced. Numerical examples show that the proposed PH-based reliability estimation methods are quite flexible and alleviate the burden of selecting a probability distribution when the underlying failure-time distribution is general or even unknown. 
    more » « less