skip to main content

Attention:

The NSF Public Access Repository (PAR) system and access will be unavailable from 11:00 PM ET on Thursday, January 16 until 2:00 AM ET on Friday, January 17 due to maintenance. We apologize for the inconvenience.


Title: A Human-Centered Data-Driven Planner-Actor-Critic Architecture via Logic Programming
Recent successes of Reinforcement Learning (RL) allow an agent to learn policies that surpass human experts but suffers from being time-hungry and data-hungry. By contrast, human learning is significantly faster because prior and general knowledge and multiple information resources are utilized. In this paper, we propose a Planner-Actor-Critic architecture for huMAN-centered planning and learning (PACMAN), where an agent uses its prior, high-level, deterministic symbolic knowledge to plan for goal-directed actions, and also integrates the Actor-Critic algorithm of RL to fine-tune its behavior towards both environmental rewards and human feedback. This work is the first unified framework where knowledge-based planning, RL, and human teaching jointly contribute to the policy learning of an agent. Our experiments demonstrate that PACMAN leads to a significant jump-start at the early stage of learning, converges rapidly and with small variance, and is robust to inconsistent, infrequent, and misleading feedback.  more » « less
Award ID(s):
1910794
PAR ID:
10169406
Author(s) / Creator(s):
; ; ;
Date Published:
Journal Name:
35th International Conference on Logic Programming
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Conventional reinforcement learning (RL) allows an agent to learn policies via environmental rewards only, with a long and slow learning curve, especially at the beginning stage. On the contrary, human learning is usually much faster because prior and general knowledge and multiple information resources are utilized. In this paper, we propose a PlannerActor-Critic architecture for huMAN-centered planning and learning (PACMAN), where an agent uses prior, high-level, deterministic symbolic knowledge to plan for goal-directed actions. PACMAN integrates Actor-Critic algorithm of RL to fine-tune its behavior towards both environmental rewards and human feedback. To the best our knowledge, This is the first unified framework where knowledge-based planning, RL, and human teaching jointly contribute to the policy learning of an agent. Our experiments demonstrate that PACMAN leads to a significant jump-start at the early stage of learning, converges rapidly and with small variance, and is robust to inconsistent, infrequent, and misleading feedback. 
    more » « less
  2. Reinforcement learning (RL) is mechanized to learn from experience. It solves the problem in sequential decisions by optimizing reward-punishment through experimentation of the distinct actions in an environment. Unlike supervised learning models, RL lacks static input-output mappings and the objective of minimization of a vector error. However, to find out an optimal strategy, it is crucial to learn both continuous feedback from training data and the offline rules of the experiences with no explicit dependence on online samples. In this paper, we present a study of a multi-agent RL framework which involves a Critic in semi-offline mode criticizing over an online Actor-Critic network, namely, Critic-over-Actor-Critic (CoAC) model, in finding optimal treatment plan of ICU patients as well as optimal strategy in a combative battle game. For further validation, we also examine the model in the adversarial assignment. 
    more » « less
  3. This paper describes how domain knowledge of power system operators can be integrated into reinforcement learning (RL) frameworks to effectively learn agents that control the grid's topology to prevent thermal cascading. Typical RL-based topology controllers fail to perform well due to the large search/optimization space. Here, we propose an actor-critic-based agent to address the problem's combinatorial nature and train the agent using the RL environment developed by RTE, the French TSO. To address the challenge of the large optimization space, a curriculum-based approach with reward tuning is incorporated into the training procedure by modifying the environment using network physics for enhanced agent learning. Further, a parallel training approach on multiple scenarios is employed to avoid biasing the agent to a few scenarios and make it robust to the natural variability in grid operations. Without these modifications to the training procedure, the RL agent failed for most test scenarios, illustrating the importance of properly integrating domain knowledge of physical systems for real-world RL learning. The agent was tested by RTE for the 2019 learning to run the power network challenge and was awarded the 2nd place in accuracy and 1st place in speed. The developed code is open-sourced for public use. Analysis of a simple system proves the enhancement in training RL-agents using the curriculum. 
    more » « less
  4. Living organisms learn on multiple time scales: evolutionary as well as individual-lifetime learning. These two learning modes are complementary: the innate phenotypes developed through evolution significantly influence lifetime learning. However, it is still unclear how these two learning methods interact and whether there is a benefit to part of the system being optimized on a different time scale using a population-based approach while the rest of it is trained on a different time-scale using an individualistic learning algorithm. In this work, we study the benefits of such a hybrid approach using an actor-critic framework where the critic part of an agent is optimized over evolutionary time based on its ability to train the actor part of an agent during its lifetime. Typically, critics are optimized on the same time-scale as the actor using the Bellman equation to represent long-term expected reward. We show that evolution can find a variety of different solutions that can still enable an actor to learn to perform a behavior during its lifetime. We also show that although the solutions found by evolution represent different functions, they all provide similar training signals during the lifetime. This suggests that learning on multiple time-scales can effectively simplify the overall optimization process in the actor-critic framework by finding one of many solutions that can still train an actor just as well. Furthermore, analysis of the evolved critics can yield additional possibilities for reinforcement learning beyond the Bellman equation. 
    more » « less
  5. We propose Adversarially Trained Actor Critic (ATAC), a new model-free algorithm for offline reinforcement learning (RL) under insufficient data coverage, based on the concept of relative pessimism. ATAC is designed as a two-player Stackelberg game: A policy actor competes against an adversarially trained value critic, who finds data-consistent scenarios where the actor is inferior to the data-collection behavior policy. We prove that, when the actor attains no regret in the two-player game, running ATAC produces a policy that provably 1) outperforms the behavior policy over a wide range of hyperparameters that control the degree of pessimism, and 2) competes with the best policy covered by data with appropriately chosen hyperparameters. Compared with existing works, notably our framework offers both theoretical guarantees for general function approximation and a deep RL implementation scalable to complex environments and large datasets. In the D4RL benchmark, ATAC consistently outperforms state-of-the-art offline RL algorithms on a range of continuous control tasks. 
    more » « less