In reinforcement learning (RL), the ability to utilize prior knowledge from previously solved tasks can allow agents to quickly solve new problems. In some cases, these new problems may be approximately solved by composing the solutions of previously solved primitive tasks (task composition). In other cases, prior knowledge can be used to adjust the reward function for a new problem in a way that leaves the optimal policy unchanged but enables quicker learning (reward shaping). In this work, we develop a general framework for reward shaping and task composition in entropy-regularized RL. To do so, we derive an exact relation connecting the optimal soft value functions for two entropy-regularized RL problems with different reward functions and dynamics. We show how the derived relation leads to a general result for reward shaping in entropy-regularized RL. We then generalize this approach to derive an exact relation connecting optimal value functions for the composition of multiple tasks in entropy-regularized RL. We validate these theoretical contributions with experiments showing that reward shaping and task composition lead to faster learning in various settings.
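As a hedged illustration of the composition side of this abstract: in entropy-regularized RL, optimal soft Q-functions for primitive tasks can be combined through a log-sum-exp, which then induces a Boltzmann policy for the composed task. The sketch below is a generic version of that idea, not the paper's exact relation; `compose_soft_q`, `soft_policy`, and `beta` are illustrative names.

```python
import numpy as np

# Illustrative sketch (not the paper's exact relation): log-sum-exp composition
# of two optimal soft Q-functions in entropy-regularized RL, followed by the
# Boltzmann policy a soft Q-function induces. Arrays have shape
# (n_states, n_actions); beta is the inverse temperature of the regularizer.

def compose_soft_q(q1: np.ndarray, q2: np.ndarray, beta: float = 1.0) -> np.ndarray:
    """Approximate soft Q-function for the 'union' of two primitive tasks."""
    return np.logaddexp(beta * q1, beta * q2) / beta

def soft_policy(q: np.ndarray, beta: float = 1.0) -> np.ndarray:
    """pi(a|s) proportional to exp(beta * Q(s, a))."""
    logits = beta * q
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)
```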
This content will become publicly available on April 11, 2026
Bootstrapped Reward Shaping
In reinforcement learning, especially in sparse-reward domains, many environment steps are required to observe reward information. In order to increase the frequency of such observations, potential-based reward shaping (PBRS) has been proposed as a method of providing a denser reward signal while leaving the optimal policy invariant. However, the required potential function must be carefully designed with task-dependent knowledge so as not to hinder training performance. In this work, we propose a bootstrapped method of reward shaping, termed BS-RS, in which the agent's current estimate of the state-value function acts as the potential function for PBRS. We provide convergence proofs for the tabular setting, give insights into training dynamics for deep RL, and show that the proposed method improves training speed in the Atari suite.
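A minimal sketch of the core idea, assuming a learned state-value network `value_fn` (names are illustrative, not the authors' code): the current value estimate serves as the PBRS potential, so the shaped reward is r + γV(s') − V(s).

```python
import torch

# Minimal sketch of bootstrapped potential-based shaping: the agent's current
# state-value estimate plays the role of the PBRS potential. Illustrative only;
# the paper's implementation details (update schedule, target networks, etc.)
# are not reproduced here.

def bootstrapped_shaped_reward(r: torch.Tensor,
                               s: torch.Tensor,
                               s_next: torch.Tensor,
                               value_fn: torch.nn.Module,
                               gamma: float = 0.99) -> torch.Tensor:
    """Shaped reward r + gamma * V(s') - V(s), with V = current value estimate."""
    with torch.no_grad():  # treat the potential as fixed for this transition
        phi = value_fn(s).squeeze(-1)
        phi_next = value_fn(s_next).squeeze(-1)
    return r + gamma * phi_next - phi
```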
- PAR ID: 10616109
- Publisher / Repository: Association for the Advancement of Artificial Intelligence
- Date Published:
- Journal Name: Proceedings of the AAAI Conference on Artificial Intelligence
- Volume: 39
- Issue: 15
- ISSN: 2159-5399
- Page Range / eLocation ID: 15302 to 15310
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
- In multiagent problems that require complex joint actions, reward shaping methods yield good behavior by incentivizing the agents' potentially valuable actions. However, reward shaping often requires access to the functional form of the reward function and the global state of the system. In this work, we introduce the Exploratory Gaussian Reward (EGR), a new reward model that creates optimistic stepping-stone rewards linking the agents' potentially good actions to the desired joint action. EGR models the system reward as a Gaussian process to leverage the inherent uncertainty in reward estimates, pushing agents to explore unobserved state space. In the tightly coupled rover coordination problem, we show that EGR significantly outperforms a neural network approximation baseline and is comparable to a system with access to the functional form of the global reward. Finally, we demonstrate how EGR improves performance over other reward shaping methods by forcing agents to explore and escape local optima. (A sketch of the optimistic GP-reward idea follows this list.)
- We propose a new complexity measure for Markov decision processes (MDPs), the maximum expected hitting cost (MEHC). This measure tightens the closely related notion of diameter by accounting for the reward structure. We show that this parameter replaces diameter in the upper bound on the optimal value span of an extended MDP, thus refining the associated upper bounds on the regret of several UCRL2-like algorithms. Furthermore, we show that potential-based reward shaping can induce equivalent reward functions with varying informativeness, as measured by MEHC. We further establish that shaping can reduce or increase MEHC by at most a factor of two in a large class of MDPs with finite MEHC and unsaturated optimal average rewards. (This bound is restated explicitly after the list.)
- This letter aims to develop a new learning-based flocking control framework that ensures collision-free behavior between agents. To achieve this goal, a leader-following flocking control based on a deep Q-network (DQN) is designed to comply with the three Reynolds' flocking rules. However, due to the inherent conflict between navigation attraction and inter-agent repulsion in the leader-following flocking scenario, there exists a potential risk of inter-agent collisions, particularly with limited training episodes. Failure to prevent such collisions not only incurs penalties during training but could also lead to damage when the proposed control framework is executed on hardware. To address this issue, a control barrier function (CBF) is incorporated into the learning strategy to ensure collision-free flocking behavior. Moreover, the proposed learning framework with CBF enhances training efficiency and reduces the complexity of reward function design and tuning. Simulation results demonstrate the effectiveness and benefits of the proposed learning methodology and control framework. (A CBF-style safety filter is sketched after the list.)
- Existing computer-analytic methods for microgrid systems, such as reinforcement learning (RL), suffer from a long-standing limitation: they rely on an empirically assumed reward function. To alleviate this limitation, we propose a multi-virtual-agent imitation learning (MAIL) approach to learn the dispatch policy under different power-supply interruption periods. Specifically, we utilize the idea of generative adversarial imitation learning to perform direct policy mapping instead of learning from manually designed reward functions. Multiple virtual agents are used to explore, in parallel, the relationship between uncertainties and corresponding actions in different microgrid environments. With the help of a deep neural network, the proposed MAIL approach enhances robustness by minimizing the maximum of the crossover discriminators to cover more interruption cases. Case studies show that the proposed MAIL approach can learn dispatch policies as well as the expert method and outperforms other existing RL methods. (An adversarial-imitation reward is sketched after the list.)
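For the first related record above (EGR), here is a minimal sketch of the optimistic Gaussian-process reward idea, assuming scikit-learn and an illustrative feature encoding; `fit_reward_model`, `optimistic_reward`, and the optimism weight `k` are assumptions, not the paper's API:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Hedged sketch: fit a GP to observed (feature, system-reward) pairs, then
# score candidate actions by mean + k * std so that uncertainty itself
# rewards exploration of unobserved state space.

def fit_reward_model(features: np.ndarray, rewards: np.ndarray) -> GaussianProcessRegressor:
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=1e-3)
    gp.fit(features, rewards)
    return gp

def optimistic_reward(gp: GaussianProcessRegressor, features: np.ndarray,
                      k: float = 1.0) -> np.ndarray:
    mean, std = gp.predict(features, return_std=True)
    return mean + k * std  # optimism in the face of uncertainty
```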
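For the second related record (MEHC), the factor-of-two claim can be written out explicitly. The notation below is assumed for illustration: κ(M) denotes the MEHC of MDP M, and M_Φ the same MDP under potential-based shaping (average-reward setting, so the shaped reward is r(s, a, s') + Φ(s') − Φ(s)).

```latex
% Restatement of the abstract's factor-of-two claim (notation assumed):
\[
  \tfrac{1}{2}\,\kappa(M) \;\le\; \kappa(M_\Phi) \;\le\; 2\,\kappa(M),
\]
% for MDPs with finite MEHC and unsaturated optimal average reward.
```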
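For the third related record, one simple way a CBF can gate a DQN's discrete actions is to mask any action whose predicted successor violates a discrete-time barrier condition. This is a simplified sketch under that assumption; the letter's actual formulation may differ, and `barrier`, `safe_argmax`, `d_min`, and `alpha` are illustrative names:

```python
import numpy as np

# Hedged sketch of a CBF-style safety filter on top of a DQN policy: discrete
# actions whose predicted next configuration would violate the barrier
# condition h(x') >= (1 - alpha) * h(x) are masked before the greedy argmax.

def barrier(positions: np.ndarray, d_min: float) -> float:
    """Smallest pairwise clearance margin: min_ij ||p_i - p_j||^2 - d_min^2."""
    n = positions.shape[0]
    vals = [np.sum((positions[i] - positions[j]) ** 2) - d_min ** 2
            for i in range(n) for j in range(i + 1, n)]
    return min(vals)

def safe_argmax(q_values, successors, h_now, d_min, alpha=0.5):
    """Greedy action among those satisfying the discrete-time CBF condition."""
    safe = [a for a, pos in enumerate(successors)
            if barrier(pos, d_min) >= (1.0 - alpha) * h_now]
    if not safe:  # fall back to the least-unsafe action
        return int(np.argmax([barrier(p, d_min) for p in successors]))
    return max(safe, key=lambda a: q_values[a])
```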
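For the fourth related record, the adversarial imitation signal that replaces a hand-designed reward can be sketched with standard GAIL-style losses; the multi-virtual-agent, min-max-over-discriminators machinery of MAIL is not reproduced, and `disc` is assumed to be any network mapping (state, action) features to a logit:

```python
import torch
import torch.nn.functional as F

# Hedged sketch: a discriminator is trained to separate expert (s, a) pairs
# from the policy's, and the policy receives -log(1 - D(s, a)) as a surrogate
# reward instead of a manually designed one.

def discriminator_loss(disc, expert_sa: torch.Tensor, policy_sa: torch.Tensor) -> torch.Tensor:
    expert_logits = disc(expert_sa)
    policy_logits = disc(policy_sa)
    return (F.binary_cross_entropy_with_logits(expert_logits, torch.ones_like(expert_logits))
            + F.binary_cross_entropy_with_logits(policy_logits, torch.zeros_like(policy_logits)))

def imitation_reward(disc, policy_sa: torch.Tensor) -> torch.Tensor:
    with torch.no_grad():
        return -F.logsigmoid(-disc(policy_sa))  # equals -log(1 - D), D = sigmoid(logit)
```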