Title: Eliciting User Preferences for Personalized Multi-Objective Decision Making through Comparative Feedback
In this work, we propose a multi-objective decision making framework that accommodates different user preferences over objectives, where preferences are learned via policy comparisons. Our model consists of a known Markov decision process with a vector-valued reward function, with each user having an unknown preference vector that expresses the relative importance of each objective. The goal is to efficiently compute a near-optimal policy for a given user. We consider two user feedback models. We first address the case where a user is provided with two policies and returns their preferred policy as feedback. We then move to a different user feedback model, where a user is instead provided with two small weighted sets of representative trajectories and selects the preferred one. In both cases, we suggest an algorithm that finds a nearly optimal policy for the user using a number of comparison queries that scales quasilinearly in the number of objectives.
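The core objects in the abstract can be illustrated with a minimal sketch (all names and numbers below are hypothetical, not the paper's algorithm): a user's unknown preference vector scalarizes each policy's vector-valued return, and a comparison query returns whichever of two policies the user prefers under that scalarization.

```python
# Illustrative sketch: policies are summarized by their vector-valued returns;
# a user with preference vector w prefers the policy whose scalarized value
# w . V is larger. The learner only observes answers to comparison queries.

def scalarize(w, v):
    """Scalarized value of a vector return v under preference weights w."""
    return sum(wi * vi for wi, vi in zip(w, v))

def compare(w, v_a, v_b):
    """Simulated comparison query: which of two policies does the user prefer?"""
    return "A" if scalarize(w, v_a) >= scalarize(w, v_b) else "B"

# A user who values objective 0 twice as much as objective 1.
w = (2.0, 1.0)
v_a = (3.0, 1.0)   # vector return of policy A
v_b = (1.0, 4.0)   # vector return of policy B
print(compare(w, v_a, v_b))
```

Each answer to such a query constrains the feasible region for the unknown weight vector, which is what lets the number of queries stay small in the number of objectives.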
Award ID(s): 2216899, 2212968
PAR ID: 10511443
Publisher / Repository: 37th Conference on Neural Information Processing Systems (NeurIPS 2023)
Journal Name: Advances in Neural Information Processing Systems
ISSN: 1049-5258
Sponsoring Org: National Science Foundation
More Like This
  1. In this paper, we study planning in stochastic systems, modeled as Markov decision processes (MDPs), with preferences over temporally extended goals. Prior work on temporal planning with preferences assumes that the user preferences form a total order, meaning that every pair of outcomes is comparable. In this work, we consider the case where the preferences over possible outcomes form a partial order rather than a total order. We first introduce a variant of the deterministic finite automaton, referred to as a preference DFA, for specifying the user's preferences over temporally extended goals. Drawing on order theory, we translate the preference DFA to a preference relation over policies for probabilistic planning in a labeled MDP. In this treatment, a most preferred policy induces a weak-stochastic nondominated probability distribution over the finite paths in the MDP. The proposed planning algorithm hinges on the construction of a multi-objective MDP. We prove that a weak-stochastic nondominated policy given the preference specification is Pareto-optimal in the constructed multi-objective MDP, and vice versa. Throughout the paper, we employ a running example to demonstrate the proposed preference specification and solution approaches. We show the efficacy of our algorithm on this example with detailed analysis, and then discuss possible future directions.
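The partial-order view above can be made concrete with a small sketch (hypothetical data, not the paper's weak-stochastic ordering): a vector of outcome values is nondominated if no other vector is at least as good in every component and strictly better in at least one.

```python
# Illustrative Pareto-nondominance check over finitely many outcome vectors,
# a stand-in for nondominated policies in a multi-objective MDP.

def dominates(u, v):
    """True if u is at least as good as v everywhere and strictly better somewhere."""
    return all(a >= b for a, b in zip(u, v)) and any(a > b for a, b in zip(u, v))

def pareto_front(vectors):
    """Keep only the nondominated vectors."""
    return [v for v in vectors if not any(dominates(u, v) for u in vectors)]

outcomes = [(0.9, 0.1), (0.5, 0.5), (0.4, 0.4), (0.2, 0.8)]
print(pareto_front(outcomes))
```

Under a partial order, several incomparable outcomes can survive this filter simultaneously, which is exactly why a single "best" policy need not exist.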
  2. This paper studies temporal planning in probabilistic environments, modeled as labeled Markov decision processes (MDPs), with user preferences over multiple temporal goals. Existing works reflect such preferences as a prioritized list of goals. This paper introduces a new specification language, termed prioritized qualitative choice linear temporal logic on finite traces, which augments linear temporal logic on finite traces with prioritized conjunction and ordered disjunction from prioritized qualitative choice logic. This language allows for succinctly specifying temporal objectives together with preferences for accomplishing each temporal task. The finite traces that describe the system's behaviors are ranked based on their dissatisfaction scores with respect to the formula. We propose a systematic translation from the new language to a weighted deterministic finite automaton. Utilizing this computational model, we formulate and solve the problem of computing an optimal policy that minimizes the expected dissatisfaction score given user preferences. We demonstrate the efficacy and applicability of the logic and the algorithm on several case studies with detailed analyses for each.
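The ordered-disjunction idea from qualitative choice logic can be sketched as follows (a simplified illustration, not the paper's exact semantics): the dissatisfaction score of a trace is the rank of the most preferred option it satisfies, with a maximal score if none holds.

```python
# Illustrative dissatisfaction score for an ordered disjunction of goals,
# listed from most to least preferred.

def dissatisfaction(ordered_options, satisfied):
    """0 if the first (most preferred) option holds, 1 if only the second does,
    and so on; len(ordered_options) if no option is satisfied."""
    for rank, option in enumerate(ordered_options):
        if option in satisfied:
            return rank
    return len(ordered_options)

# Prefer reaching goal 'a'; failing that, 'b'; failing that, 'c'.
options = ["a", "b", "c"]
print(dissatisfaction(options, {"a", "c"}))  # best option holds
print(dissatisfaction(options, {"c"}))       # only the last resort holds
print(dissatisfaction(options, set()))       # nothing holds
```

Ranking traces by such scores is what lets the planner minimize an expected dissatisfaction rather than a simple success probability.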
  3. The expanding role of reinforcement learning (RL) in safety-critical system design has promoted ω-automata as a way to express learning requirements—often non-Markovian—with greater ease of expression and interpretation than scalar reward signals. However, real-world sequential decision making situations often involve multiple, potentially conflicting, objectives. Two dominant approaches to express relative preferences over multiple objectives are: (1) weighted preference, where the decision maker provides scalar weights for various objectives, and (2) lexicographic preference, where the decision maker provides an order over the objectives such that any amount of satisfaction of a higher-ordered objective is preferable to any amount of a lower-ordered one. In this article, we study and develop RL algorithms to compute optimal strategies in Markov decision processes against multiple ω-regular objectives under weighted and lexicographic preferences. We provide a translation from multiple ω-regular objectives to a scalar reward signal that is both faithful (maximising reward means maximising the probability of achieving the objectives under the corresponding preference) and effective (RL quickly converges to optimal strategies). We have implemented the translations in a formal reinforcement learning tool, Mungojerrie, and we present an experimental evaluation of our technique on benchmark learning problems.
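The contrast between the two preference types above can be shown on toy value vectors (hypothetical numbers, not the paper's construction): a weighted user trades objectives off through scalar weights, while a lexicographic user lets any gain in a higher-ordered objective outweigh all lower-ordered ones.

```python
# Illustrative comparison of weighted vs. lexicographic preferences over
# vectors of objective values (listed from most to least important).

def weighted_better(weights, u, v):
    """Weighted preference: compare weighted sums of the objective values."""
    return sum(w * x for w, x in zip(weights, u)) >= sum(w * x for w, x in zip(weights, v))

def lex_better(u, v):
    """Lexicographic preference: Python's tuple comparison is exactly
    lexicographic, so the built-in ordering suffices here."""
    return tuple(u) >= tuple(v)

u, v = (0.8, 0.1), (0.7, 0.9)
print(weighted_better((1.0, 2.0), u, v))  # a weighted user may prefer v
print(lex_better(u, v))                   # a lexicographic user prefers u
```

The same pair of strategies can thus be ranked oppositely under the two preference models, which is why each needs its own reward translation.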
  4. Hillston, Jane; Soudjiani, Sadegh (Ed.)
    We study the problem of inferring the discount factor of an agent optimizing a discounted reward objective in a finite state Markov Decision Process (MDP). Discounted reward objectives are common in sequential optimization, reinforcement learning, and algorithmic game theory. The discount factor is an important parameter used in formulating the discounted reward. It captures the “time value” of the reward, i.e., how much reward at hand would equal a promised reward at a future time. Knowing an agent’s discount factor can provide valuable insights into their decision-making and help predict their preferences in previously unseen environments. However, pinpointing the exact value of the discount factor used by the agent is a challenging problem, and ad-hoc guesses are often incorrect. This paper focuses on the problem of computing the range of possible discount factors for a rational agent given their policy. A naive solution to this problem can be quite expensive. A classic result by Smallwood shows that the interval [0, 1) of possible discount factors can be partitioned into finitely many sub-intervals such that the optimal policy remains the same within each sub-interval. Furthermore, optimal policies for neighboring sub-intervals differ in a single state. We show how Smallwood’s result can be exploited to search for the discount factor intervals for which a given policy is optimal by reducing the search to polynomial root isolation. We extend the result to situations where the policy is suboptimal, but with a value function that is close to optimal. We develop numerical approaches to solve the discount factor elicitation problem and demonstrate the effectiveness of our algorithms through case studies.
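The interval structure described above can be seen on a toy two-action choice (an illustrative sketch under simplified assumptions, not the paper's algorithm): a myopic action pays 1 now and nothing after, while a patient action pays nothing now and 1 every step thereafter, worth γ/(1−γ). The optimal action is constant on each side of a single crossover point.

```python
# Illustrative Smallwood-style partition of [0, 1): for each discount factor
# gamma, exactly one action is optimal, and the interval splits at gamma = 0.5.

def value_myopic(gamma):
    """Immediate reward 1, nothing afterwards."""
    return 1.0

def value_patient(gamma):
    """Reward 0 now, then 1 every step: gamma + gamma**2 + ... = gamma/(1-gamma)."""
    return gamma / (1.0 - gamma)

def optimal_policy(gamma):
    return "myopic" if value_myopic(gamma) >= value_patient(gamma) else "patient"

print(optimal_policy(0.3))
print(optimal_policy(0.7))
```

Observing the agent act patiently therefore already confines its discount factor to the sub-interval (0.5, 1); finding such crossover points in general is where the reduction to polynomial root isolation enters.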
  5.
    Omega-regular properties—specified using linear time temporal logic or various forms of omega-automata—find increasing use in specifying the objectives of reinforcement learning (RL). The key problem that arises is that of faithful and effective translation of the objective into a scalar reward for model-free RL. A recent approach exploits Büchi automata with restricted nondeterminism to reduce the search for an optimal policy for an ω-regular property to that for a simple reachability objective. A possible drawback of this translation is that reachability rewards are sparse, being reaped only at the end of each episode. Another approach reduces the search for an optimal policy to an optimization problem with two interdependent discount parameters. While this approach provides denser rewards than the reduction to reachability, it is not easily mapped to off-the-shelf RL algorithms. We propose a reward scheme that reduces the search for an optimal policy to an optimization problem with a single discount parameter that produces dense rewards and is compatible with off-the-shelf RL algorithms. Finally, we report an experimental comparison of these and other reward schemes for model-free RL with omega-regular objectives.
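The sparse-versus-dense distinction above can be sketched schematically (a purely illustrative contrast, not the paper's reward construction): a reachability-style reward pays off once per episode, while a per-visit discounted reward gives feedback every time an accepting state is seen.

```python
# Schematic contrast between a sparse reachability reward and a dense
# per-visit reward on accepting states of a hypothetical automaton run.

def sparse_return(trace, accepting):
    """Reachability-style reward: 1 only if an accepting state is ever reached."""
    return 1.0 if any(s in accepting for s in trace) else 0.0

def dense_return(trace, accepting, gamma=0.9):
    """Per-visit discounted reward on accepting states: feedback arrives earlier
    and at every accepting visit, which eases credit assignment for RL."""
    return sum((gamma ** t) * (1.0 - gamma) for t, s in enumerate(trace) if s in accepting)

trace = ["s0", "acc", "s1", "acc"]
print(sparse_return(trace, {"acc"}))
print(round(dense_return(trace, {"acc"}), 4))
```

Dense schemes of this flavor are also what make the objective compatible with off-the-shelf RL algorithms, since standard discounted-return learners can consume the per-step signal directly.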