

Title: Bayesian Reinforcement Learning in Factored POMDPs
Model-based Bayesian Reinforcement Learning (BRL) provides a principled solution to the exploration-exploitation trade-off, but such methods typically assume a fully observable environment. The few Bayesian RL methods that are applicable in partially observable domains, such as the Bayes-Adaptive POMDP (BA-POMDP), scale poorly. To address this issue, we introduce the Factored BA-POMDP model (FBA-POMDP), a framework that is able to learn a compact model of the dynamics by exploiting the underlying structure of a POMDP. The FBA-POMDP framework casts the problem as a planning task, for which we adapt the Monte-Carlo Tree Search planning algorithm and develop a belief tracking method to approximate the joint posterior over the state and model variables. Our empirical results show that this method outperforms a number of BRL baselines and is able to learn efficiently when the factorization is known, as well as learn both the factorization and the model parameters simultaneously.
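As a rough illustration of the kind of belief tracking described above, the sketch below gives a generic particle-filter update over joint (state, model) hypotheses in Python. The step_fn and obs_fn callbacks are hypothetical stand-ins for a candidate factored model's transition and observation dynamics; this is a minimal sketch of the general technique under those assumptions, not the paper's exact FBA-POMDP procedure.

```python
import random

def update_belief(particles, action, observation, step_fn, obs_fn, n=1000):
    """One particle-filter update over joint (state, model) hypotheses.

    step_fn(model, state, action) -> next_state samples a transition from a
    candidate factored model; obs_fn(model, next_state, action, observation)
    returns the observation likelihood under that model. Both are supplied
    by the caller (hypothetical interfaces for this sketch).
    """
    weighted = []
    for state, model in particles:
        nxt = step_fn(model, state, action)          # simulate the dynamics
        w = obs_fn(model, nxt, action, observation)  # weight by the evidence
        if w > 0.0:
            weighted.append(((nxt, model), w))
    if not weighted:      # every particle is inconsistent with the observation
        return particles  # in practice one would reinvigorate the particle set
    candidates, weights = zip(*weighted)
    return random.choices(candidates, weights=weights, k=n)
```

Resampling by observation likelihood lets model hypotheses that explain the data well come to dominate the approximate posterior, which is what allows structure and parameters to be learned jointly with the hidden state.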
Award ID(s):
1734497
NSF-PAR ID:
10098819
Author(s) / Creator(s):
; ;
Date Published:
Journal Name:
Proceedings of the International Joint Conference on Autonomous Agents and Multiagent Systems
ISSN:
1548-8403
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. This paper presents a framework to learn the reward function underlying high-level sequential tasks from demonstrations. The purpose of reward learning, in the context of learning from demonstration (LfD), is to generate policies that mimic the demonstrator’s policies, thereby enabling imitation learning. We focus on a human-robot interaction (HRI) domain where the goal is to learn and model structured interactions between a human and a robot. Such interactions can be modeled as a partially observable Markov decision process (POMDP) where the partial observability is caused by uncertainties associated with the ways humans respond to different stimuli. The key challenge in finding a good policy in such a POMDP is determining the reward function that was observed by the demonstrator. Existing inverse reinforcement learning (IRL) methods for POMDPs are computationally very expensive and the problem is not well understood. In comparison, IRL algorithms for Markov decision processes (MDPs) are well defined and computationally efficient. We propose an approach to reward function learning for high-level sequential tasks from human demonstrations, where the core idea is to reduce the underlying POMDP to an MDP and apply any efficient MDP-IRL algorithm. Our extensive experiments suggest that the reward function learned this way generates POMDP policies that mimic the policies of the demonstrator well.
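As a loose illustration of the reduce-then-apply idea in the abstract above, the sketch below collapses each belief in a demonstration to its most likely hidden state, yielding MDP-style trajectories that any MDP-IRL routine can consume. The most-likely-state reduction and the mdp_irl callback interface are assumptions made for this example, not details taken from the paper.

```python
import numpy as np

def reduce_demos_for_mdp_irl(belief_trajectories, mdp_irl):
    """Turn POMDP demonstrations into MDP-style trajectories, then run IRL.

    belief_trajectories: list of trajectories, each a list of
        (belief_vector, action) pairs, where belief_vector is a
        distribution over the hidden states.
    mdp_irl: any callable mapping a list of (state, action) trajectories
        to a reward estimate (hypothetical interface for this sketch).
    """
    mdp_trajectories = []
    for traj in belief_trajectories:
        # Most-likely-state reduction: one possible way to map a belief
        # onto a discrete MDP state; other reductions are possible.
        reduced = [(int(np.argmax(belief)), action) for belief, action in traj]
        mdp_trajectories.append(reduced)
    return mdp_irl(mdp_trajectories)
```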
  2. Recent work has considered personalized route planning based on user profiles, but none of it accounts for human trust. We argue that human trust is an important factor to consider when planning routes for automated vehicles. This article presents a trust-based route-planning approach for automated vehicles. We formalize the human-vehicle interaction as a partially observable Markov decision process (POMDP) and model trust as a partially observable state variable of the POMDP, representing the human’s hidden mental state. We build data-driven models of human trust dynamics and takeover decisions, which are incorporated in the POMDP framework, using data collected from an online user study with 100 participants on the Amazon Mechanical Turk platform. We compute optimal routes for automated vehicles by solving for optimal policies of the POMDP and evaluate the resulting routes via human subject experiments with 22 participants on a driving simulator. The experimental results show that participants taking the trust-based route generally reported more positive responses in the after-driving survey than those taking the baseline (trust-free) route. In addition, we analyze the trade-offs between multiple planning objectives (e.g., trust, distance, energy consumption) via multi-objective optimization of the POMDP. We also identify a set of open issues and implications for real-world deployment of the proposed approach in automated vehicles.
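To make the trust-as-a-partially-observable-state-variable idea concrete, the sketch below performs a Bayesian belief update over three discrete trust levels. The transition matrix and takeover probabilities are made-up placeholders standing in for the data-driven models described in the abstract.

```python
import numpy as np

# Placeholder models: indices correspond to trust levels (low, medium, high).
TRUST_DYNAMICS = np.array([   # P(next trust | current trust) per road segment
    [0.80, 0.15, 0.05],
    [0.10, 0.80, 0.10],
    [0.05, 0.15, 0.80],
])
P_TAKEOVER = np.array([0.70, 0.40, 0.10])  # P(human takes over | trust level)

def update_trust_belief(belief, took_over):
    """Predict trust drift, then condition on the observed takeover decision."""
    predicted = belief @ TRUST_DYNAMICS
    likelihood = P_TAKEOVER if took_over else 1.0 - P_TAKEOVER
    posterior = predicted * likelihood
    return posterior / posterior.sum()

belief = np.array([1 / 3, 1 / 3, 1 / 3])               # initially uncertain
belief = update_trust_belief(belief, took_over=False)  # no takeover observed
```

In a route-planning POMDP of this kind, such belief updates would run per road segment, and the planner would favor routes whose expected value is highest under the current trust belief.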
  3. Representing and reasoning about uncertainty is crucial for autonomous agents acting in partially observable environments with noisy sensors. Partially observable Markov decision processes (POMDPs) serve as a general framework for representing problems in which uncertainty is an important factor. Online sample-based POMDP methods have emerged as efficient approaches to solving large POMDPs and have been shown to extend to continuous domains. However, these solutions struggle to find long-horizon plans in problems with significant uncertainty. Exploration heuristics can help guide planning, but many real-world settings contain significant task-irrelevant uncertainty that might distract from the task objective. In this paper, we propose STRUG, an online POMDP solver capable of handling domains that require long-horizon planning with significant task-relevant and task-irrelevant uncertainty. We demonstrate our solution on several temporally extended versions of toy POMDP problems as well as robotic manipulation of articulated objects using a neural perception frontend to construct a distribution of possible models. Our results show that STRUG outperforms the current sample-based online POMDP solvers on several tasks.
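The distinction between task-relevant and task-irrelevant uncertainty can be illustrated with a small helper that measures uncertainty only over the state variables that matter for the task. This is purely illustrative of the concept, under assumed variable names, and does not reflect STRUG's internals.

```python
from collections import Counter
from math import log

def task_relevant_entropy(particles, relevant_keys):
    """Entropy of a particle set projected onto task-relevant variables only.

    particles: list of dicts mapping variable name -> value
    relevant_keys: names of the variables that matter for the task;
        uncertainty in all other variables is deliberately ignored.
    """
    projected = Counter(tuple(p[k] for k in relevant_keys) for p in particles)
    n = len(particles)
    return -sum((c / n) * log(c / n) for c in projected.values())

# Example: only the drawer's open/closed state matters for the task;
# the handle's exact pose does not (variable names are hypothetical).
belief = [{"drawer": "closed", "handle_x": 0.31},
          {"drawer": "closed", "handle_x": 0.29},
          {"drawer": "open",   "handle_x": 0.30}]
print(task_relevant_entropy(belief, ["drawer"]))
```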
  4. Purpose: Personalized screening guidelines can be an effective strategy to prevent diabetic retinopathy (DR)-related vision loss. However, these strategies typically do not capture behavior-based factors such as a patient’s compliance or cost preferences. This study develops a mathematical model to identify screening policies that capture both DR progression and behavioral factors to provide personalized recommendations. Methods: A partially observable Markov decision process (POMDP) model is developed to provide personalized screening recommendations. For each patient, the model estimates the patient’s probability of having a sight-threatening diabetic eye disorder (STDED) yearly via Bayesian inference based on natural history, screening results, and compliance behavior. The model then determines a personalized, threshold-based recommendation for each patient annually, choosing among no action (NA), teleretinal imaging (TRI), and clinical screening (CS), based on the patient’s current probability of having STDED as well as the patient-specific preference between cost saving ($) and QALY gain. The framework is applied to a hypothetical cohort of 40-year-old African American male patients. Results: For the base population with TRI and CS compliance rates of 65% and 55% and equal preference for cost and QALY, NA is identified as the optimal recommendation when the patient’s probability of having STDED is less than 0.72%, TRI when the probability is in [0.72%, 2.09%], and CS when the probability is above 2.09%. Simulated against annual clinical screening, the model-based policy yields an average decrease of 7.07% in cost/QALY (95% CI: 6.93-7.23%) and of 15.05% in blindness prevalence over a patient’s lifetime (95% CI: 14.88-15.23%). For patients with equal preference for cost and QALY, the model identifies 6 different types of threshold-based policies (see Fig 1). For patients with a strong preference for QALY gain, CS-only policies increased prevalence by a factor of 19.2 (see Fig 2). Conclusions: The POMDP model is highly flexible and responsive in incorporating behavioral factors when providing personalized screening recommendations. As a decision support tool, providers can use this modeling framework to provide unique, catered recommendations.
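The abstract reports concrete thresholds for its base case, which the sketch below turns into a small threshold policy together with a generic Bayes update of P(STDED) after a negative screening result. The sensitivity/specificity values and the example prior are placeholders for this sketch, not parameters from the study.

```python
def recommend(p_stded, lo=0.0072, hi=0.0209):
    """Threshold policy for the base case quoted in the abstract:
    no action below 0.72%, teleretinal imaging up to 2.09%,
    clinical screening above that."""
    if p_stded < lo:
        return "NA"   # no action this year
    if p_stded <= hi:
        return "TRI"  # teleretinal imaging
    return "CS"       # clinical screening

def posterior_after_negative_test(prior, sensitivity, specificity):
    """Bayes update of P(STDED) given a negative screening result."""
    p_neg_given_disease = 1.0 - sensitivity
    p_neg_given_healthy = specificity
    numer = p_neg_given_disease * prior
    return numer / (numer + p_neg_given_healthy * (1.0 - prior))

# Placeholder test characteristics and prior, for illustration only.
p = posterior_after_negative_test(prior=0.015, sensitivity=0.80, specificity=0.90)
print(recommend(p))  # posterior drops below 0.72%, so the policy returns "NA"
```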
  5. To be responsive to dynamically changing real-world environments, an intelligent agent needs to perform complex sequential decision-making tasks that are often guided by commonsense knowledge. Previous work on this line of research led to the framework called interleaved commonsense reasoning and probabilistic planning (icorpp), which used P-log for representing commonsense knowledge and Markov Decision Processes (MDPs) or Partially Observable MDPs (POMDPs) for planning under uncertainty. A main limitation of icorpp is that its implementation requires non-trivial engineering effort to bridge the commonsense reasoning and probabilistic planning formalisms. In this paper, we present a unified framework to integrate icorpp’s reasoning and planning components. In particular, we extend the probabilistic action language pBC+ to express utility, belief states, and observations as in POMDP models. Inheriting the advantages of action languages, the new action language provides an elaboration-tolerant representation of POMDPs that reflects commonsense knowledge. This idea led to the design of the system pbcplus2pomdp, which compiles a pBC+ action description into a POMDP model that can be directly processed by off-the-shelf POMDP solvers to compute an optimal policy of the pBC+ action description. Our experiments show that it retains the advantages of icorpp while avoiding the manual effort of bridging the commonsense reasoner and the probabilistic planner.