Title: A Joint Planning and Learning Framework for Human-Aided Decision-Making
Conventional reinforcement learning (RL) allows an agent to learn policies from environmental rewards only, which makes learning long and slow, especially at the beginning. By contrast, human learning is usually much faster because prior and general knowledge and multiple information resources are utilized. In this paper, we propose a Planner-Actor-Critic architecture for huMAN-centered planning and learning (PACMAN), where an agent uses prior, high-level, deterministic symbolic knowledge to plan for goal-directed actions. PACMAN integrates the Actor-Critic algorithm of RL to fine-tune its behavior towards both environmental rewards and human feedback. To the best of our knowledge, this is the first unified framework where knowledge-based planning, RL, and human teaching jointly contribute to the policy learning of an agent. Our experiments demonstrate that PACMAN leads to a significant jump-start at the early stage of learning, converges rapidly and with small variance, and is robust to inconsistent, infrequent, and misleading feedback.
Award ID(s):
1910794
NSF-PAR ID:
10169459
Author(s) / Creator(s):
; ; ;
Date Published:
Journal Name:
AAAI Fall Symposium
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Recent successes of Reinforcement Learning (RL) allow an agent to learn policies that surpass human experts, but RL remains time-hungry and data-hungry. By contrast, human learning is significantly faster because prior and general knowledge and multiple information resources are utilized. In this paper, we propose a Planner-Actor-Critic architecture for huMAN-centered planning and learning (PACMAN), where an agent uses its prior, high-level, deterministic symbolic knowledge to plan for goal-directed actions, and also integrates the Actor-Critic algorithm of RL to fine-tune its behavior towards both environmental rewards and human feedback. This work is the first unified framework where knowledge-based planning, RL, and human teaching jointly contribute to the policy learning of an agent. Our experiments demonstrate that PACMAN leads to a significant jump-start at the early stage of learning, converges rapidly and with small variance, and is robust to inconsistent, infrequent, and misleading feedback.
  2. Symbolic planning models allow decision-making agents to sequence actions in arbitrary ways to achieve a variety of goals in dynamic domains. However, they are typically handcrafted and tend to require precise formulations that are not robust to human error. Reinforcement learning (RL) approaches do not require such models, and instead learn domain dynamics by exploring the environment and collecting rewards. However, RL approaches tend to require millions of episodes of experience and often learn policies that are not easily transferable to other tasks. In this paper, we address one aspect of the open problem of integrating these approaches: how can decision-making agents resolve discrepancies in their symbolic planning models while attempting to accomplish goals? We propose an integrated framework named SPOTTER that uses RL to augment and support ("spot") a planning agent by discovering new operators needed by the agent to accomplish goals that are initially unreachable for the agent. SPOTTER outperforms pure-RL approaches while also discovering transferable symbolic knowledge and does not require supervision, successful plan traces or any a priori knowledge about the missing planning operator. 
  3. Decision-making under uncertainty (DMU) is present in many important problems. An open challenge is DMU in non-stationary environments, where the dynamics of the environment can change over time. Reinforcement Learning (RL), a popular approach for DMU problems, learns a policy by interacting with a model of the environment offline. Unfortunately, if the environment changes, the policy can become stale and take sub-optimal actions, and relearning the policy for the updated environment takes time and computational effort. An alternative is online planning approaches such as Monte Carlo Tree Search (MCTS), which perform their computation at decision time. Given the current environment, MCTS plans using high-fidelity models to determine promising action trajectories. These models can be updated as soon as environmental changes are detected to immediately incorporate them into decision making. However, MCTS’s convergence can be slow for domains with large state-action spaces. In this paper, we present a novel hybrid decision-making approach that combines the strengths of RL and planning while mitigating their weaknesses. Our approach, called Policy Augmented MCTS (PA-MCTS), integrates a policy’s action-value estimates into MCTS, using the estimates to seed the action trajectories favored by the search. We hypothesize that PA-MCTS will converge more quickly than standard MCTS while making better decisions than the policy can make on its own when faced with non-stationary environments. We test our hypothesis by comparing PA-MCTS with pure MCTS and an RL agent applied to the classical CartPole environment. We find that PA-MCTS can achieve higher cumulative rewards than the policy in isolation under several environmental shifts while converging in significantly fewer iterations than pure MCTS.
  4. Circuit linearity calibration can represent a set of high-dimensional search problems if the observability is limited. For example, linearity calibration of digital-to-time converters (DTCs), an essential building block of modern digital phase-locked loops (DPLLs), is a high-dimensional search problem because the difficulty of measuring picosecond delays hinders prior methods that calibrate stage by stage. Moreover, a calibrated DTC can become nonlinear again due to changes in temperature (T) and power supply voltage (V). Prior work reports a deep reinforcement learning framework that is capable of performing DTC linearity calibration with nonlinear calibration banks; however, this prior work does not address maintaining calibration in the face of temperature and supply voltage variations. In this paper, we present a meta-reinforcement learning (RL) method that enables the RL agent to quickly adapt to a new environment when the temperature and/or voltage change. Inspired by Style Generative Adversarial Networks (StyleGANs), we propose to treat temperature and voltage changes as the styles of the circuits. In contrast to traditional methods that employ circuit sensors to detect changes in T and V, we utilize a machine learning (ML) sensor to implicitly infer a wide range of environmental changes. The style information from the ML sensor is subsequently injected into a small portion of the policy network, modulating its weights. As a proof of concept, we first designed a 5-bit DTC at the normal voltage (1 V) and normal temperature (27 °C) corner (NVNT) as the environment. The RL agent begins its training in the NVNT environment. Following this initial phase, the agent is then tasked with adapting to environments with different temperatures and supply voltages. Our results show that the proposed technique can reduce the Integral Non-Linearity (INL) to less than 0.5 LSB within 10,000 search steps in a changed environment. Compared with starting from a randomly initialized policy and from a trained policy, the proposed meta-RL approach takes 63% and 47% fewer steps to complete the linearity calibration, respectively. Our method is also applicable to the calibration of many other kinds of analog and RF circuits.
  5. Interactive reinforcement learning (IRL) agents use human feedback or instruction to help them learn in complex environments. Often, this feedback comes in the form of a discrete signal that is either positive or negative. While informative, this information can be difficult to generalize on its own. In this work, we explore how natural language advice can be used to provide a richer feedback signal to a reinforcement learning agent by extending policy shaping, a well-known IRL technique. Policy shaping usually employs a human feedback policy to help an agent learn more about how to achieve its goal. In our case, we replace this human feedback policy with a policy generated from natural language advice. We aim to examine whether the generated natural language reasoning helps a deep RL agent decide its actions successfully in any given environment. To this end, we design our model with three networks: the first is experience-driven, the second is the advice generator, and the third is advice-driven. While the experience-driven RL agent chooses its actions under the influence of the environmental reward, the advice-driven neural network, given the feedback that the advice generator produces for any new state, selects its actions to assist the RL agent in better policy shaping.