This paper presents a framework for learning the reward function underlying high-level sequential tasks from demonstrations. The purpose of reward learning, in the context of learning from demonstration (LfD), is to generate policies that mimic the demonstrator's, thereby enabling imitation learning. We focus on a human-robot interaction (HRI) domain where the goal is to learn and model structured interactions between a human and a robot. Such interactions can be modeled as a partially observable Markov decision process (POMDP), where the partial observability stems from the uncertainty in how humans respond to different stimuli. The key challenge in finding a good policy in such a POMDP is determining the reward function the demonstrator was optimizing. Existing inverse reinforcement learning (IRL) methods for POMDPs are computationally very expensive, and the problem is not well understood. In contrast, IRL algorithms for Markov decision processes (MDPs) are well defined and computationally efficient. We propose an approach to learning reward functions for high-level sequential tasks from human demonstrations whose core idea is to reduce the underlying POMDP to an MDP and apply any efficient MDP-IRL algorithm. Our extensive experiments suggest that the reward functions learned this way generate POMDP policies that closely mimic those of the demonstrator.
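As a rough illustration of the kind of MDP-IRL plug-in such a reduction targets, the sketch below implements the classic linear-programming IRL formulation of Ng and Russell (2000) on a toy MDP; the function name, toy transition model, and all parameters are illustrative assumptions, not the paper's implementation.

```python
# A minimal LP-IRL sketch (Ng & Russell, 2000): recover a reward vector under
# which the expert's policy is optimal. Toy setup, not the paper's algorithm.
import numpy as np
from scipy.optimize import linprog

def lp_irl(P, expert_policy, gamma=0.9, l1=1.0, r_max=1.0):
    """P: (n_actions, n_states, n_states) transitions; expert_policy: ints."""
    n_actions, n_states, _ = P.shape
    n = n_states
    # Transition matrix under the expert policy, and (I - gamma P*)^-1.
    P_star = np.array([P[expert_policy[s], s] for s in range(n)])
    M = np.linalg.inv(np.eye(n) - gamma * P_star)

    # Decision variables x = [R, t, u]; maximize sum(t) - l1 * sum(u).
    c = np.concatenate([np.zeros(n), -np.ones(n), l1 * np.ones(n)])
    A_ub, b_ub = [], []
    for a in range(n_actions):
        D = (P_star - P[a]) @ M  # row s: advantage of expert action in state s
        for s in range(n):
            if a == expert_policy[s]:
                continue
            # t_s <= D[s] @ R   ->   -D[s] @ R + t_s <= 0
            row = np.zeros(3 * n); row[:n] = -D[s]; row[n + s] = 1.0
            A_ub.append(row); b_ub.append(0.0)
            # Expert action must be (weakly) optimal: D[s] @ R >= 0
            row = np.zeros(3 * n); row[:n] = -D[s]
            A_ub.append(row); b_ub.append(0.0)
    for s in range(n):  # u_s >= |R_s| for the L1 penalty
        row = np.zeros(3 * n); row[s] = 1.0; row[2 * n + s] = -1.0
        A_ub.append(row); b_ub.append(0.0)
        row = np.zeros(3 * n); row[s] = -1.0; row[2 * n + s] = -1.0
        A_ub.append(row); b_ub.append(0.0)

    bounds = [(-r_max, r_max)] * n + [(None, None)] * n + [(0, None)] * n
    res = linprog(c, A_ub=np.array(A_ub), b_ub=b_ub, bounds=bounds)
    return res.x[:n]

# Example: 2-state, 2-action toy problem where the expert always picks action 0.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.1, 0.9], [0.8, 0.2]]])
print(lp_irl(P, np.array([0, 0])))
```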
Sequential Causal Imitation Learning with Unobserved Confounders
"Monkey see monkey do" is an age-old adage, referring to naive imitation without a deep understanding of a system's underlying mechanics. Indeed, if a demonstrator has access to information unavailable to the imitator (monkey), such as a different set of sensors, then no matter how perfectly the imitator models its perceived environment (See), attempting to directly reproduce the demonstrator's behavior (Do) can lead to poor outcomes. Imitation learning in the presence of a mismatch between demonstrator and imitator has been studied in the literature under the rubric of causal imitation learning (Zhang et. al. 2020), but existing solutions are limited to single-stage decision-making. This paper investigates the problem of causal imitation learning in sequential settings, where the imitator must make multiple decisions per episode. We develop a graphical criterion that is both necessary and sufficient for determining the feasibility of causal imitation, providing conditions when an imitator can match a demonstrator's performance despite differing capabilities. Finally, we provide an efficient algorithm for determining imitability, and corroborate our theory with simulations.
- Award ID(s):
- 2011497
- PAR ID:
- 10317646
- Date Published:
- Journal Name:
- Advances in Neural Information Processing Systems
- ISSN:
- 1049-5258
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
Bossart, Janice L (Ed.) Variation in tropical forest management directly affects biodiversity and the provisioning of ecosystem services on a global scale, so it is necessary to compare forests under different conservation approaches, such as protected areas, payments for ecosystem services (PES) programs, and ecotourism, as well as forests lacking any formal conservation plan. To examine the effectiveness of specific conservation approaches, we examined differences in forest structure and tree recruitment, including canopy cover; canopy height; seedling, sapling, and adult tree density; and average and total diameter at breast height (DBH), across 78 plots in 18 forests across Costa Rica representing protected areas, private forests utilizing PES and/or ecotourism, and private forests not utilizing these economic incentives. The effectiveness of conservation approaches in providing suitable primate habitat was assessed by conducting broad primate census surveys across a subset of eight forests to determine species richness and group encounter rates for three primate species: the mantled howler monkey (Alouatta palliata), the Central American spider monkey (Ateles geoffroyi), and the white-faced capuchin monkey (Cebus imitator). Only canopy height differed significantly across the three approaches, with protected areas conserving the tallest, and likely oldest, forests. Canopy height was also significantly associated with the group encounter rate for both mantled howler and spider monkeys, but not for capuchins. The total group encounter rate for all three monkey species combined was higher in incentivized forests than in protected areas, driven by the capuchin and howler monkey group encounter rates. The group encounter rate for spider monkeys was higher in protected areas than in incentivized forests. Incentivized conservation (PES and ecotourism) and protected areas exemplify land management practices that can lead to variation in forest structure across a landscape; such variation not only protects primate communities but also supports the dietary ecologies of sympatric primate species.
-
Many methods in learning from demonstration assume that the demonstrator has knowledge of the full environment. In many scenarios, however, a demonstrator sees only part of the environment and continuously replans as they gather information. To plan new paths or to reconstruct the environment, we must consider the visibility constraints and replanning process of the demonstrator, which, to our knowledge, has not been done in previous work. We consider the problem of inferring obstacle configurations in a 2D environment from demonstrated paths for a point robot that is capable of seeing in any direction but not through obstacles. Given a set of survey points, which describe where the demonstrator obtains new information, and a candidate path, we construct a Constraint Satisfaction Problem (CSP) on a cell decomposition of the environment. We parameterize a set of obstacles corresponding to an assignment from the CSP and sample from the set to find valid environments. We show that there is a probabilistically complete, yet not entirely tractable, algorithm that can guarantee that novel paths in the space are unsafe or possibly safe. We also present an incomplete, but empirically successful, heuristic-guided algorithm that we apply in our experiments to 1) planning novel paths and 2) recovering a probabilistic representation of the environment.
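A minimal sketch of this flavor of CSP, on an invented 3x3 grid: each unseen cell becomes a boolean variable (obstacle or free), cells observed free from the survey points are constrained to stay free, and the demonstrator's detour forces at least one blocking obstacle; here we simply enumerate assignments. All constants are hypothetical, and the paper's cell decomposition and visibility constraints are substantially richer.

```python
# Toy obstacle-inference CSP on an invented 3x3 grid (not the paper's setup):
# enumerate boolean assignments (True = obstacle) consistent with observations.
from itertools import product

GRID = [(r, c) for r in range(3) for c in range(3)]
PATH = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2)]   # demonstrated path
SEEN_FREE = set(PATH) | {(1, 1)}                   # cells observed to be free
# Replanning at (1, 0) suggests the straight route was blocked: at least one
# of these cells must contain an obstacle.
BLOCKING = [(0, 1), (0, 2)]

def consistent(assign):
    if any(assign[c] for c in SEEN_FREE):   # guard: observed-free cells stay free
        return False
    if not any(assign[c] for c in BLOCKING):  # something forced the detour
        return False
    return True

free_cells = [c for c in GRID if c not in SEEN_FREE]
solutions = []
for bits in product([False, True], repeat=len(free_cells)):
    assign = {c: False for c in SEEN_FREE}
    assign.update(zip(free_cells, bits))
    if consistent(assign):
        solutions.append({c for c, v in assign.items() if v})

print(f"{len(solutions)} candidate obstacle configurations")
```

On this toy instance the enumeration finds six consistent obstacle configurations; sampling from such a set is what yields the probabilistic environment representation described above.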
-
We study the problem of imitation learning via inverse reinforcement learning, where the agent attempts to learn an expert's policy from a dataset of collected (state, action) tuples. We derive a new Robust model-based Offline Imitation Learning method (ROIL) that mitigates covariate shift by avoiding estimating the expert's occupancy frequency. In offline settings there is frequently insufficient data to reliably estimate the expert's occupancy frequency, and this leads to models that do not generalize well. Our proposed approach, ROIL, is guaranteed to recover the expert's occupancy frequency and is efficiently solvable as a linear program (LP). We demonstrate ROIL's ability to achieve minimal regret in large environments under covariate shift, such as when the state visitation frequency of the demonstrations does not come from the expert.
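For intuition about why occupancy frequencies make such problems efficiently solvable as an LP, here is a minimal sketch of the standard occupancy-measure linear program (the dual of MDP planning) on an invented two-state MDP; this is the generic machinery approaches like ROIL build on, not ROIL's actual formulation.

```python
# Occupancy-measure LP on a toy 2-state, 2-action MDP with invented numbers.
import numpy as np
from scipy.optimize import linprog

n_s, n_a, gamma = 2, 2, 0.9
P = np.array([[[0.8, 0.2], [0.1, 0.9]],   # P[s][a] = next-state distribution
              [[0.5, 0.5], [0.3, 0.7]]])
r = np.array([[1.0, 0.0], [0.0, 2.0]])    # r[s][a]
nu0 = np.array([0.5, 0.5])                # initial state distribution

# Variables u[s, a] >= 0 (flattened). Bellman-flow constraints:
# sum_a u(s', a) - gamma * sum_{s, a} P(s'|s, a) u(s, a) = nu0(s')
A_eq = np.zeros((n_s, n_s * n_a))
for s_next in range(n_s):
    for s in range(n_s):
        for a in range(n_a):
            A_eq[s_next, s * n_a + a] = (s == s_next) - gamma * P[s, a, s_next]

res = linprog(c=-r.flatten(), A_eq=A_eq, b_eq=nu0, bounds=(0, None))
u = res.x.reshape(n_s, n_a)
policy = u / u.sum(axis=1, keepdims=True)  # recover a policy from occupancy
print("occupancy:\n", u, "\npolicy:\n", policy)
```

The last line, recovering a policy directly from an occupancy measure, is precisely the step that occupancy-based imitation methods exploit.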
-
Existing computer analytic methods for the microgrid system, such as reinforcement learning (RL) methods, suffer from a long-standing problem: the reward function must be assumed empirically. To alleviate this limitation, we propose a multi-virtual-agent imitation learning (MAIL) approach to learn the dispatch policy under different power-supply-interrupted periods. Specifically, we utilize the idea of generative adversarial imitation learning to perform direct policy mapping instead of learning from manually designed reward functions. Multiple virtual agents are used to explore, in parallel, the relationship between uncertainties and the corresponding actions in different microgrid environments. With the help of a deep neural network, the proposed MAIL approach enhances robustness by minimizing the maximum of the crossover discriminators to cover more interrupted cases. Case studies show that the proposed MAIL approach can learn dispatch policies as well as the expert method and outperforms other existing RL methods.
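The generative adversarial core that MAIL builds on can be sketched briefly: a discriminator learns to separate expert (state, action) pairs from the policy's, and its output serves as a learned reward for the policy update. The network sizes, dimensions, and random batches below are toy PyTorch assumptions, not the paper's multi-agent implementation.

```python
# GAIL-style discriminator update: expert pairs labeled 1, policy pairs 0;
# -log(1 - D) serves as the learned reward. Toy tensors only.
import torch
import torch.nn as nn

disc = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 1))
opt = torch.optim.Adam(disc.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

def discriminator_step(expert_sa, policy_sa):
    """One adversarial update of the discriminator."""
    logits_e = disc(expert_sa)
    logits_p = disc(policy_sa)
    loss = bce(logits_e, torch.ones_like(logits_e)) + \
           bce(logits_p, torch.zeros_like(logits_p))
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

def learned_reward(policy_sa):
    """Surrogate reward for the policy update (higher = more expert-like)."""
    with torch.no_grad():
        d = torch.sigmoid(disc(policy_sa))
    return -torch.log(1.0 - d + 1e-8)

# Fake batches: 3-dim state + 1-dim action, standing in for dispatch decisions.
expert_sa = torch.randn(32, 4)
policy_sa = torch.randn(32, 4)
print(discriminator_step(expert_sa, policy_sa), learned_reward(policy_sa).mean())
```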