This content will become publicly available on July 1, 2024

Title: RLang: A Declarative Language for Describing Partial World Knowledge to Reinforcement Learning Agents
We introduce RLang, a domain-specific language (DSL) for communicating domain knowledge to an RL agent. Unlike existing RL DSLs that ground to single elements of a decision-making formalism (e.g., the reward function or policy), RLang can specify information about every element of a Markov decision process. We define precise syntax and grounding semantics for RLang, and provide a parser that grounds RLang programs to an algorithm-agnostic partial world model and policy that can be exploited by an RL agent. We provide a series of example RLang programs demonstrating how different RL methods can exploit the resulting knowledge, encompassing model-free and model-based tabular algorithms, policy gradient and value-based methods, hierarchical approaches, and deep methods.
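As a rough illustration of what "partial world model" can mean for a tabular agent, the sketch below stores whatever transition and reward facts are known and leaves everything else unspecified. It is not RLang syntax and not the RLang parser's API: `PartialWorldModel`, `add_effect`, `predict`, `seed_q_table`, and the state and action names are all hypothetical, chosen only to show one way an agent might exploit partial knowledge (here, by initializing Q-values from known rewards).

```python
from typing import Dict, Hashable, Iterable, Optional, Tuple

State = Hashable
Action = Hashable


class PartialWorldModel:
    """Hypothetical container for partial MDP knowledge (not RLang's API)."""

    def __init__(self) -> None:
        self._next_state: Dict[Tuple[State, Action], State] = {}
        self._reward: Dict[Tuple[State, Action], float] = {}

    def add_effect(
        self,
        state: State,
        action: Action,
        next_state: Optional[State] = None,
        reward: Optional[float] = None,
    ) -> None:
        """Record whatever is known about taking `action` in `state`."""
        if next_state is not None:
            self._next_state[(state, action)] = next_state
        if reward is not None:
            self._reward[(state, action)] = reward

    def predict(
        self, state: State, action: Action
    ) -> Tuple[Optional[State], Optional[float]]:
        """Return (next_state, reward); an entry is None where nothing is known."""
        key = (state, action)
        return self._next_state.get(key), self._reward.get(key)


def seed_q_table(
    model: PartialWorldModel,
    states: Iterable[State],
    actions: Iterable[Action],
    default: float = 0.0,
) -> Dict[Tuple[State, Action], float]:
    """Initialize Q-values from known rewards; unknown pairs get `default`."""
    action_list = list(actions)
    q: Dict[Tuple[State, Action], float] = {}
    for s in states:
        for a in action_list:
            _, reward = model.predict(s, a)
            q[(s, a)] = default if reward is None else reward
    return q


# Assumed toy domain: only one fact is known in advance.
model = PartialWorldModel()
model.add_effect("hallway", "enter_lava", next_state="lava", reward=-1.0)
q_table = seed_q_table(model, ["hallway", "lava"], ["enter_lava", "wait"])
```

A model-free learner could start from `q_table` and update it from experience as usual, while a model-based learner could also query `predict` for the transitions it already knows.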
Award ID(s):
1955361 1844960
Publisher / Repository:
Proceedings of the 40th International Conference on Machine Learning
Date Published:
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Perimeter metering control has long been an active research topic since well-defined relationships between network productivity and usage, that is, network macroscopic fundamental diagrams (MFDs), were shown to be capable of describing regional traffic dynamics. Numerous methods have been proposed to solve perimeter metering control problems, but these generally require knowledge of the MFDs or detailed equations that govern traffic dynamics. Recently, a study applied model-free deep reinforcement learning (Deep-RL) methods to two-region perimeter control and found comparable performances to the model predictive control scheme, particularly when uncertainty exists. However, the proposed methods therein provide very low initial performances during the learning process, which limits their applicability to real life scenarios. Furthermore, the methods may not be scalable to more complicated networks with larger state and action spaces. To combat these issues, this paper proposes to integrate the domain control knowledge (DCK) of congestion dynamics into the agent designs for improved learning and control performances. A novel agent is also developed that builds on the Bang-Bang control policy. Two types of DCK are then presented to provide knowledge-guided exploration strategies for the agents such that they can explore around the most rewarding part of the action spaces. The results from extensive numerical experiments on two- and three-region urban networks show that integrating DCK can (a) effectively improve learning and control performances for Deep-RL agents, (b) enhance the agents’ resilience against various types of environment uncertainties, and (c) mitigate the scalability issue for the agents. 
  2. Symbolic planning models allow decision-making agents to sequence actions in arbitrary ways to achieve a variety of goals in dynamic domains. However, they are typically handcrafted and tend to require precise formulations that are not robust to human error. Reinforcement learning (RL) approaches do not require such models, and instead learn domain dynamics by exploring the environment and collecting rewards. However, RL approaches tend to require millions of episodes of experience and often learn policies that are not easily transferable to other tasks. In this paper, we address one aspect of the open problem of integrating these approaches: how can decision-making agents resolve discrepancies in their symbolic planning models while attempting to accomplish goals? We propose an integrated framework named SPOTTER that uses RL to augment and support ("spot") a planning agent by discovering new operators needed by the agent to accomplish goals that are initially unreachable for the agent. SPOTTER outperforms pure-RL approaches while also discovering transferable symbolic knowledge and does not require supervision, successful plan traces or any a priori knowledge about the missing planning operator. 
  3. Decision-making under uncertainty (DMU) is present in many important problems. An open challenge is DMU in non-stationary environments, where the dynamics of the environment can change over time. Reinforcement Learning (RL), a popular approach for DMU problems, learns a policy by interacting with a model of the environment offline. Unfortunately, if the environment changes, the policy can become stale and take sub-optimal actions, and relearning the policy for the updated environment takes time and computational effort. An alternative is online planning approaches such as Monte Carlo Tree Search (MCTS), which perform their computation at decision time. Given the current environment, MCTS plans using high-fidelity models to determine promising action trajectories. These models can be updated as soon as environmental changes are detected to immediately incorporate them into decision making. However, MCTS’s convergence can be slow for domains with large state-action spaces. In this paper, we present a novel hybrid decision-making approach that combines the strengths of RL and planning while mitigating their weaknesses. Our approach, called Policy Augmented MCTS (PA-MCTS), integrates a policy’s action-value estimates into MCTS, using the estimates to seed the action trajectories favored by the search. We hypothesize that PA-MCTS will converge more quickly than standard MCTS while making better decisions than the policy can make on its own when faced with non-stationary environments. We test our hypothesis by comparing PA-MCTS with pure MCTS and an RL agent applied to the classical CartPole environment. We find that PA-MCTS can achieve higher cumulative rewards than the policy in isolation under several environmental shifts while converging in significantly fewer iterations than pure MCTS.
  4. Interactive reinforcement learning (IRL) agents use human feedback or instruction to help them learn in complex environments. Often, this feedback comes in the form of a discrete signal that is either positive or negative. While informative, this signal can be difficult to generalize on its own. In this work, we explore how natural language advice can be used to provide a richer feedback signal to a reinforcement learning agent by extending policy shaping, a well-known IRL technique. Policy shaping usually employs a human feedback policy to help an agent learn more about how to achieve its goal. In our case, we replace this human feedback policy with a policy generated from natural language advice. We aim to examine whether the generated natural language reasoning helps a deep RL agent choose its actions successfully in a given environment. To this end, we design our model with three networks: an experience-driven network, an advice generator, and an advice-driven network. While the experience-driven RL agent chooses its actions under the influence of the environmental reward, the advice-driven network uses the feedback produced by the advice generator for each new state to select actions that assist the RL agent in policy shaping.
  5. Conveying complex objectives to reinforcement learning (RL) agents can often be difficult, involving meticulous design of reward functions that are sufficiently informative yet easy enough to provide. Human-in-the-loop RL methods allow practitioners to instead interactively teach agents through tailored feedback; however, such approaches have been challenging to scale since human feedback is very expensive. In this work, we aim to make this process more sample- and feedback-efficient. We present an off-policy, interactive RL algorithm that capitalizes on the strengths of both feedback and off-policy learning. Specifically, we learn a reward model by actively querying a teacher’s preferences between two clips of behavior and use it to train an agent. To enable off-policy learning, we relabel all the agent’s past experience when its reward model changes. We additionally show that pre-training our agents with unsupervised exploration substantially increases the mileage of its queries. We demonstrate that our approach is capable of learning tasks of higher complexity than previously considered by human-in-the-loop methods, including a variety of locomotion and robotic manipulation skills. We also show that our method is able to utilize real-time human feedback to effectively prevent reward exploitation and learn new behaviors that are difficult to specify with standard reward functions.