Title: A Robust Strategy for UAV Autonomous Landing on a Moving Platform under Partial Observability
Landing a multi-rotor uncrewed aerial vehicle (UAV) on a moving target under partial observability, due to factors such as sensor failure or noise, represents an outstanding challenge that requires integrative techniques from robotics and machine learning. In this paper, we propose embedding a long short-term memory (LSTM) network into a variation of the proximal policy optimization (PPO) architecture, termed robust policy optimization (RPO), to address this issue. The proposed algorithm is a deep reinforcement learning approach that uses a recurrent neural network (RNN) as a memory component. Leveraging the end-to-end learning capability of deep reinforcement learning, the RPO-LSTM algorithm learns the optimal control policy without the need for feature engineering. Through a series of simulation-based studies, we demonstrate the effectiveness and practicality of our approach compared to the state-of-the-art PPO and the classical control method Lee-EKF, particularly in scenarios with partial observability. The empirical results reveal that RPO-LSTM significantly outperforms the competing algorithms, achieving up to 74% more successful landings than Lee-EKF and 50% more than PPO in flicker scenarios, and maintaining robust performance in noisy environments as well as in the most challenging conditions that combine flicker and noise. These findings underscore the potential of RPO-LSTM for solving the problem of UAV landing on moving targets amid various degrees of sensor impairment and environmental interference.
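For a concrete picture of the recurrent policy described above, the following is a minimal PyTorch sketch of an LSTM-backed Gaussian actor in the spirit of RPO-LSTM. The class name, layer sizes, and the rpo_alpha mean-perturbation width are illustrative assumptions, not the paper's reported architecture; perturbing the Gaussian mean with uniform noise follows the general RPO recipe rather than this paper's exact design.

```python
import torch
import torch.nn as nn

class RecurrentGaussianActor(nn.Module):
    """Sketch of an LSTM-backed Gaussian policy (hypothetical sizes)."""
    def __init__(self, obs_dim, act_dim, hidden=128, rpo_alpha=0.5):
        super().__init__()
        # LSTM carries memory across observations, which is what lets the
        # policy act sensibly when measurements flicker or are noisy
        self.lstm = nn.LSTM(obs_dim, hidden, batch_first=True)
        self.mu = nn.Linear(hidden, act_dim)
        self.log_std = nn.Parameter(torch.zeros(act_dim))
        self.rpo_alpha = rpo_alpha  # assumed width of the uniform mean perturbation

    def forward(self, obs_seq, hx=None):
        # obs_seq: (batch, time, obs_dim); hx is the recurrent state
        out, hx = self.lstm(obs_seq, hx)
        mean = self.mu(out)
        # RPO-style robustness: perturb the action mean with uniform noise
        noise = torch.empty_like(mean).uniform_(-self.rpo_alpha, self.rpo_alpha)
        dist = torch.distributions.Normal(mean + noise, self.log_std.exp())
        return dist, hx
```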
Award ID(s):
2138206
PAR ID:
10586339
Author(s) / Creator(s):
; ;
Publisher / Repository:
MDPI
Date Published:
Journal Name:
Drones
Volume:
8
Issue:
6
ISSN:
2504-446X
Page Range / eLocation ID:
232
Subject(s) / Keyword(s):
deep reinforcement learning; unmanned aerial vehicles; partial observability; model-free control
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. In urban environments, tall buildings or structures can obstruct the direct channel link between a base station (BS) and an Internet-of-Things device (IoTD) for wireless communication. Unmanned aerial vehicles (UAVs) with a mounted reconfigurable intelligent surface (RIS), denoted UAV-RIS, have been introduced in recent works to enhance system throughput capacity by acting as a relay node between the BS and the IoTDs in wireless access networks. Uncoordinated UAVs or RIS phase-shift elements make unnecessary adjustments that can significantly degrade signal transmission to IoTDs in the area. The concept of age of information (AoI) has been proposed in wireless network research to characterize the freshness of received update messages. To minimize the average sum of AoI (ASoA) in the network, two model-free deep reinforcement learning (DRL) approaches, off-policy Deep Q-Network (DQN) and on-policy Proximal Policy Optimization (PPO), are developed to solve the problem by jointly optimizing the RIS phase shift, the location of the UAV-RIS, and the IoTD transmission scheduling for large-scale IoT wireless networks. Analysis of loss functions and extensive simulations are performed to compare the stability and convergence of the two algorithms. The results reveal the superiority of the on-policy approach, PPO, over the off-policy approach, DQN, in terms of stability, convergence speed, and performance under diverse environment settings.
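As a concrete illustration of the AoI metric, the sketch below computes the average sum of AoI (ASoA) under a given delivery schedule, assuming the common discrete-time model in which each device's age grows by one per slot and resets to one when a fresh update arrives. The function name and reset convention are assumptions, not taken from the cited paper.

```python
import numpy as np

def average_sum_aoi(deliveries, num_devices, horizon):
    """Average sum of AoI over a horizon (a minimal sketch).

    deliveries: set of (t, device) pairs where a fresh update is received.
    Assumes age grows by 1 per slot and resets to 1 on delivery.
    """
    aoi = np.ones(num_devices)
    total = 0.0
    for t in range(horizon):
        for d in range(num_devices):
            aoi[d] = 1.0 if (t, d) in deliveries else aoi[d] + 1.0
        total += aoi.sum()  # sum of ages across devices at slot t
    return total / horizon
```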
  2.
    Proximal Policy Optimization (PPO) is a popular deep policy gradient algorithm. In standard implementations, PPO regularizes policy updates with clipped probability ratios, and parameterizes policies with either continuous Gaussian distributions or discrete Softmax distributions. These design choices are widely accepted, and motivated by empirical performance comparisons on MuJoCo and Atari benchmarks. We revisit these practices outside the regime of current benchmarks, and expose three failure modes of standard PPO. We explain why standard design choices are problematic in these cases, and show that alternative choices of surrogate objectives and policy parameterizations can prevent the failure modes. We hope that our work serves as a reminder that many algorithmic design choices in reinforcement learning are tied to specific simulation environments. We should not implicitly accept these choices as a standard part of a more general algorithm. 
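The clipped-probability-ratio update that this paper re-examines is the standard PPO surrogate; a minimal PyTorch sketch follows. The eps=0.2 clipping range is the common default, not a value taken from the paper.

```python
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, eps=0.2):
    """Standard PPO clipped surrogate objective (negated for gradient descent)."""
    # probability ratio pi_new(a|s) / pi_old(a|s), computed in log space
    ratio = torch.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # pessimistic bound: take the elementwise minimum of the two surrogates
    return -torch.min(unclipped, clipped).mean()
```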
    Reinforcement learning in partially observable domains is challenging due to the lack of observable state information. Thankfully, learning offline in a simulator with such state information is often possible. In particular, we propose a method for partially observable reinforcement learning that uses a fully observable policy (which we call a state expert) during training to improve performance. Based on Soft Actor-Critic (SAC), our agent balances performing actions similar to the state expert and getting high returns under partial observability. Our approach can leverage the fully observable policy for exploration and for parts of the domain that are fully observable, while still being able to learn under partial observability. On six robotics domains, our method outperforms pure imitation, pure reinforcement learning, the sequential or parallel combination of both types, and a recent state-of-the-art method in the same setting. A successful policy transfer to a physical robot in a manipulation task from pixels shows our approach's practicality in learning interesting policies under partial observability.
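The abstract does not specify how the agent balances imitating the state expert against maximizing return; one plausible reading, sketched below, is a weighted sum of the SAC actor objective and a behavioral-cloning term toward the expert's action. The beta weight, the MSE imitation term, and the function signature are all assumptions for illustration, not the paper's method.

```python
import torch
import torch.nn.functional as F

def actor_loss_with_expert(dist, critic, obs, expert_action, alpha=0.2, beta=0.5):
    """Hypothetical blend of the SAC actor loss with state-expert imitation.

    dist: policy distribution from the partially observable actor
    critic: callable Q(obs, action) -> value estimate
    expert_action: action the fully observable policy would take
    beta: assumed weight trading off imitation against return
    """
    action = dist.rsample()  # reparameterized sample, keeps gradients flowing
    # standard SAC actor term: entropy-regularized value maximization
    sac_term = (alpha * dist.log_prob(action).sum(-1) - critic(obs, action)).mean()
    # behavioral-cloning penalty pulling the policy toward the state expert
    imitation_term = F.mse_loss(action, expert_action)
    return sac_term + beta * imitation_term
```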
  4. Billard, A.; Asfour, T.; Khatib, O. (Ed.)
    Underwater navigation presents several challenges, including unstructured unknown environments, a lack of reliable localization systems (e.g., GPS), and poor visibility. Furthermore, good-quality obstacle-detection sensors for underwater robots are scarce and costly, and many sensors, such as RGB-D cameras and LiDAR, work only in air. To enable reliable mapless underwater navigation despite these challenges, we propose a low-cost end-to-end navigation system, based on a monocular camera and a fixed single-beam echo-sounder, that efficiently navigates an underwater robot to waypoints while avoiding nearby obstacles. Our proposed method is based on Proximal Policy Optimization (PPO), which takes as input the current relative goal information, estimated depth images, echo-sounder readings, and previously executed actions, and outputs 3D robot actions on a normalized scale. End-to-end training was done in simulation, where we adopted domain randomization (varying underwater conditions and visibility) to learn a policy robust to noise and changes in visibility conditions. Experiments in simulation and the real world demonstrated that our proposed method successfully and resiliently navigates a low-cost underwater robot in unknown underwater environments. The implementation is publicly available at https://github.com/dartmouthrobotics/deeprl-uw-robot-navigation.
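To make the domain-randomization step concrete, the sketch below samples per-episode underwater conditions at simulator reset; all parameter names and ranges are hypothetical placeholders rather than the paper's actual randomization settings.

```python
import random

def sample_underwater_conditions():
    """Sample one episode's randomized underwater conditions (illustrative only)."""
    return {
        "water_turbidity": random.uniform(0.1, 0.9),   # degrades estimated depth images
        "visibility_m": random.uniform(1.0, 10.0),     # max usable camera range
        "echo_noise_std": random.uniform(0.0, 0.3),    # additive echo-sounder noise
        "current_strength": random.uniform(0.0, 0.5),  # external disturbance on actions
    }

# at each episode reset, apply the sampled conditions to the simulator
conditions = sample_underwater_conditions()
```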