We propose the Deterministic Sequencing of Exploration and Exploitation (DSEE) algorithm, which interleaves exploration and exploitation epochs, for model-based RL problems that aim to simultaneously learn the system model, i.e., a Markov decision process (MDP), and the associated optimal policy. During exploration, DSEE explores the environment and updates the estimates of the expected rewards and transition probabilities. During exploitation, the latest estimates of the expected rewards and transition probabilities are used to compute a policy that is robust with high probability. We design the lengths of the exploration and exploitation epochs so that the cumulative regret grows as a sub-linear function of time.
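To make the interleaving concrete, the sketch below shows one way a deterministic exploration/exploitation schedule of this kind could be laid out. The epoch lengths, the `growth` factor, and the function name are illustrative assumptions, not the schedule specified in the paper.

```python
# Minimal sketch of a DSEE-style deterministic schedule (illustrative only;
# the paper chooses the epoch lengths to guarantee sub-linear regret).

def dsee_schedule(horizon, explore_len=10, growth=2.0):
    """Yield ('explore', length) / ('exploit', length) epochs until the horizon is used up."""
    t, exploit_len = 0, float(explore_len)
    while t < horizon:
        # Exploration epoch: refine estimates of rewards and transition probabilities.
        yield ("explore", min(explore_len, horizon - t))
        t += explore_len
        if t >= horizon:
            break
        # Exploitation epoch: run the latest robust policy; its length grows so that
        # exploration becomes a vanishing fraction of time (the source of sub-linearity).
        length = min(int(exploit_len), horizon - t)
        yield ("exploit", length)
        t += length
        exploit_len *= growth

# Example: list the epochs of a 500-step run.
print(list(dsee_schedule(500)))
```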
Learning-Augmented Model-Based Planning for Visual Exploration
We consider the problem of time-limited robotic exploration in previously unseen environments, where the robot must explore within a predefined time budget. We propose a novel exploration approach based on learning-augmented model-based planning. We generate a set of subgoals associated with frontiers on the current map and derive a Bellman equation for exploration over these subgoals. Visual sensing and advances in semantic mapping of indoor scenes are exploited to train a deep convolutional neural network that estimates two properties of each frontier: the expected unobserved area beyond it and the expected number of time steps (discretized actions) required to explore it. The proposed model-based planner is guaranteed to explore the whole scene if time permits. We thoroughly evaluate our approach on a large-scale pseudo-realistic indoor dataset (Matterport3D) with the Habitat simulator, comparing against classical and more recent RL-based exploration methods. Our approach surpasses greedy strategies by 2.1% and RL-based exploration methods by 8.4% in terms of coverage.
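As a rough illustration of the planning step, the sketch below applies a Bellman-style recursion over frontier subgoals under a time budget. The inputs `area`, `explore_time`, and `travel_time` stand in for the quantities the paper's CNN predicts (expected unobserved area and expected exploration time per frontier) plus a travel-cost estimate; the exact formulation in the paper may differ.

```python
from functools import lru_cache

# Hedged sketch: value of exploring from `current` with `remaining` time budget,
# defined by a Bellman-style backup over the not-yet-visited frontier subgoals.

def plan_frontiers(frontiers, area, explore_time, travel_time, budget):
    """frontiers: list of ids; area[f], explore_time[f]: per-frontier predictions;
    travel_time[(u, v)]: time to move between 'start'/frontiers; budget: total time."""

    @lru_cache(maxsize=None)
    def value(current, remaining, visited):
        best = 0.0
        for f in frontiers:
            if f in visited:
                continue
            cost = travel_time[(current, f)] + explore_time[f]
            if cost > remaining:
                continue  # this frontier no longer fits in the budget
            # Bellman backup: area gained at f plus the best continuation afterwards.
            best = max(best, area[f] + value(f, remaining - cost, visited | frozenset([f])))
        return best

    return value("start", budget, frozenset())

# Tiny example with two frontiers.
frontiers = ["f1", "f2"]
area = {"f1": 12.0, "f2": 20.0}
explore_time = {"f1": 3, "f2": 6}
travel = {("start", "f1"): 2, ("start", "f2"): 4, ("f1", "f2"): 3, ("f2", "f1"): 3}
print(plan_frontiers(frontiers, area, explore_time, travel, budget=10))
```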
- Award ID(s): 1925231
- PAR ID: 10480458
- Publisher / Repository: IEEE
- Date Published:
- Journal Name: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)
- Format(s): Medium: X
- Location: Detroit
- Sponsoring Org: National Science Foundation
More Like this
- Safe reinforcement learning is extremely challenging: not only must the agent explore an unknown environment, it must do so while ensuring no safety constraint violations. We formulate this safe reinforcement learning (RL) problem using the framework of a finite-horizon Constrained Markov Decision Process (CMDP) with an unknown transition probability function, where we model the safety requirements as constraints on the expected cumulative costs that must be satisfied during all episodes of learning. We propose a model-based safe RL algorithm that we call Doubly Optimistic and Pessimistic Exploration (DOPE), and show that it achieves an objective regret $$\tilde{O}(|\mathcal{S}|\sqrt{|\mathcal{A}| K})$$ without violating the safety constraints during learning, where $$|\mathcal{S}|$$ is the number of states, $$|\mathcal{A}|$$ is the number of actions, and $$K$$ is the number of learning episodes. Our key idea is to combine a reward bonus for exploration (optimism) with a conservative constraint (pessimism), in addition to the standard optimistic model-based exploration. DOPE is not only able to improve the objective regret bound, but also shows a significant empirical performance improvement as compared to earlier optimism-pessimism approaches. A sketch of the optimism-pessimism bonus appears after this list.
- We propose a model-based lifelong reinforcement-learning approach that estimates a hierarchical Bayesian posterior distilling the common structure shared across different tasks. The learned posterior, combined with a sample-based Bayesian exploration procedure, increases the sample efficiency of learning across a family of related tasks. We first derive an analysis of the relationship between the sample complexity and the initialization quality of the posterior in the finite MDP setting. We next scale the approach to continuous-state domains by introducing a Variational Bayesian Lifelong Reinforcement Learning algorithm that can be combined with recent model-based deep RL methods and that exhibits backward transfer. Experimental results on several challenging domains show that our algorithms achieve both better forward and backward transfer performance than state-of-the-art lifelong RL methods. A toy sketch of the sample-based Bayesian exploration idea appears after this list.
- Existing offline reinforcement learning (RL) methods face a few major challenges, particularly the distributional shift between the learned policy and the behavior policy. Offline Meta-RL is emerging as a promising approach to address these challenges, aiming to learn an informative meta-policy from a collection of tasks. Nevertheless, as shown in our empirical studies, offline Meta-RL can be outperformed by offline single-task RL methods on tasks with good-quality datasets, indicating that a right balance has to be delicately calibrated between "exploring" the out-of-distribution state-actions by following the meta-policy and "exploiting" the offline dataset by staying close to the behavior policy. Motivated by such empirical analysis, we propose model-based offline Meta-RL with regularized policy optimization (MerPO), which learns a meta-model for efficient task structure inference and an informative meta-policy for safe exploration of out-of-distribution state-actions. In particular, we devise a new meta-Regularized model-based Actor-Critic (RAC) method for within-task policy optimization, as a key building block of MerPO, using both conservative policy evaluation and regularized policy improvement; the intrinsic tradeoff therein is achieved by striking the right balance between two regularizers, one based on the behavior policy and the other on the meta-policy. We theoretically show that the learned policy offers guaranteed improvement over both the behavior policy and the meta-policy, thus ensuring performance improvement on new tasks via offline Meta-RL. Our experiments corroborate the superior performance of MerPO over existing offline Meta-RL methods. A sketch of the two-regularizer improvement step appears after this list.
- This paper addresses the problem of robotic exploration of unknown indoor environments under deadlines. Indoor exploration using mobile robots has typically focused on exploring the entire environment without considering deadlines. The objective of the prioritized exploration in this paper is to rapidly compute the geometric layout of an initially unknown environment by exploring key regions of the environment and returning to the home location within a deadline. This prioritized exploration is useful for time-critical and dangerous environments where rapid robot exploration can provide vital information for subsequent operations. For example, firefighters, for whom time is of the essence, can utilize the map generated by this robotic exploration to navigate a building on fire. In our previous work, we showed that a priority-based greedy algorithm can outperform a cost-based greedy algorithm for exploration under deadlines. This paper models the prioritized exploration problem as an Orienteering Problem (OP) and a Profitable Tour Problem (PTP) in an attempt to generate exploration strategies that can explore a greater percentage of the environment in a given amount of time. The paper presents simulation results on multiple graph-based and Gazebo environments. We found that in many cases the priority-based greedy algorithm performs on par with or better than the OP- and PTP-based algorithms, and we analyze the potential reasons for this counterintuitive result. A sketch of a priority-based greedy tour under a deadline appears after this list.
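For the DOPE entry above, a minimal sketch of the optimism/pessimism idea is shown below: rewards receive an exploration bonus while costs receive the same bonus as a conservative penalty, and planning proceeds on the adjusted empirical model. The Hoeffding-style bonus and the function name are assumptions for illustration; the paper's exact bonus term and algorithm differ in detail.

```python
import numpy as np

def doubly_adjusted_model(visit_counts, emp_reward, emp_cost, num_sa, episodes, delta=0.1):
    """Return (optimistic rewards, pessimistic costs) for planning on the empirical CMDP.

    visit_counts, emp_reward, emp_cost: arrays indexed by (state, action).
    """
    n = np.maximum(visit_counts, 1)
    # Hoeffding-style confidence width (illustrative; not the paper's exact bonus).
    bonus = np.sqrt(np.log(2.0 * num_sa * episodes / delta) / (2.0 * n))
    optimistic_reward = emp_reward + bonus   # optimism: encourages visiting uncertain pairs
    pessimistic_cost = emp_cost + bonus      # pessimism: keeps plans conservatively safe
    return optimistic_reward, pessimistic_cost

# Example with 3 states x 2 actions.
counts = np.array([[5, 1], [10, 2], [0, 7]])
r_hat = np.random.default_rng(0).random((3, 2))
c_hat = np.random.default_rng(1).random((3, 2))
print(doubly_adjusted_model(counts, r_hat, c_hat, num_sa=6, episodes=100))
```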
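For the lifelong RL entry, the toy below illustrates the flavor of sample-based Bayesian exploration with structure shared across tasks, using Bernoulli bandits instead of MDPs to keep it runnable: each task starts from a shared prior, acts by Thompson sampling, and its posterior counts are folded back into the shared prior. This is an analogy for intuition only, not the paper's hierarchical Bayesian model.

```python
import numpy as np

rng = np.random.default_rng(0)

def lifelong_thompson(task_probs, episodes=200):
    """task_probs: list of per-task arm success probabilities (one array per task)."""
    n_arms = len(task_probs[0])
    shared_a, shared_b = np.ones(n_arms), np.ones(n_arms)  # shared Beta pseudo-counts
    total = 0.0
    for p in task_probs:
        a, b = shared_a.copy(), shared_b.copy()            # initialize from the shared prior
        for _ in range(episodes):
            theta = rng.beta(a, b)                         # posterior sample (exploration)
            arm = int(np.argmax(theta))                    # act greedily in the sample
            reward = float(rng.random() < p[arm])
            a[arm] += reward
            b[arm] += 1.0 - reward
            total += reward
        # "Backward transfer": distill this task's evidence into the shared prior.
        shared_a += (a - shared_a) / len(task_probs)
        shared_b += (b - shared_b) / len(task_probs)
    return total

print(lifelong_thompson([np.array([0.3, 0.7]), np.array([0.35, 0.75])]))
```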
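For the MerPO entry, the snippet below sketches the two-regularizer policy improvement in a single-state, discrete-action toy: the improved policy maximizes expected value plus an entropy term while penalizing KL divergence to both the behavior policy and the meta-policy, which has the closed form computed here. The coefficients and the tabular setting are simplifications; the paper's RAC method works with function approximators.

```python
import numpy as np

def two_regularizer_improvement(q_values, behavior_pi, meta_pi, lam_b=1.0, lam_m=1.0, tau=1.0):
    """Maximize E_pi[Q] + tau*H(pi) - lam_b*KL(pi || behavior) - lam_m*KL(pi || meta).

    Closed form: pi propto exp((Q + lam_b*log(behavior) + lam_m*log(meta)) / (tau + lam_b + lam_m)).
    """
    logits = (q_values + lam_b * np.log(behavior_pi) + lam_m * np.log(meta_pi)) / (tau + lam_b + lam_m)
    pi = np.exp(logits - logits.max())      # subtract max for numerical stability
    return pi / pi.sum()

q = np.array([1.0, 0.2, -0.5])
behavior = np.array([0.6, 0.3, 0.1])        # policy that generated the offline data
meta = np.array([0.2, 0.5, 0.3])            # meta-policy shared across tasks
print(two_regularizer_improvement(q, behavior, meta))
```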
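For the prioritized-exploration entry, the following sketch shows one plausible priority-based greedy tour under a deadline with a return-home requirement, on a graph abstraction. The selection rule (priority per unit travel cost) and the data layout are assumptions for illustration, not the authors' algorithm or the OP/PTP formulations.

```python
def greedy_prioritized_tour(vertices, priority, travel_time, deadline, home="home"):
    """vertices: ids to visit (excluding home); priority[v]: reward for visiting v;
    travel_time[(u, v)]: symmetric travel costs; deadline: total time budget."""
    tour, current, elapsed = [home], home, 0.0
    remaining = set(vertices)
    while True:
        # Keep only vertices we can still visit and return home from before the deadline.
        feasible = [v for v in remaining
                    if elapsed + travel_time[(current, v)] + travel_time[(v, home)] <= deadline]
        if not feasible:
            break
        # One plausible rule: highest priority per unit of travel cost from here.
        nxt = max(feasible, key=lambda v: priority[v] / travel_time[(current, v)])
        elapsed += travel_time[(current, nxt)]
        tour.append(nxt)
        remaining.discard(nxt)
        current = nxt
    if current != home:
        elapsed += travel_time[(current, home)]
        tour.append(home)
    return tour, elapsed

# Tiny example on a 3-vertex graph.
tt = {("home", "a"): 2, ("a", "home"): 2, ("home", "b"): 4, ("b", "home"): 4,
      ("a", "b"): 3, ("b", "a"): 3}
print(greedy_prioritized_tour(["a", "b"], {"a": 5, "b": 9}, tt, deadline=9))
```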