<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Multi-Objective Controller Synthesis with Uncertain Human Preferences</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>2022</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10328185</idno>
					<idno type="doi"></idno>
					<title level='j'>ACM/IEEE International Conference on Cyber-Physical Systems (ICCPS)</title>
<idno></idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>S. Chen</author><author>K. Boggess</author><author>D. Parker</author><author>L. Feng</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Complex real-world applications of cyber-physical systems give rise to the need for multi-objective controller synthesis, which concerns the problem of computing an optimal controller subject to multiple (possibly conicting) criteria. The relative importance of objectives is often specied by human decision-makers. However, there is inherent uncertainty in human preferences (e.g., due to artifacts resulting from dierent preference elicitation methods). In this paper, we formalize the notion of uncertain human preferences, and present a novel approach that accounts for this uncertainty in the context of multi-objective controller synthesis for Markov decision processes (MDPs). Our approach is based on mixed-integer linear programming and synthesizes an optimally permissive multistrategy that satises uncertain human preferences with respect to a multi-objective property. Experimental results on a range of large case studies show that the proposed approach is feasible and scalable across varying MDP model sizes and uncertainty levels of human preferences. Evaluation via an online user study also demonstrates the quality and benets of the synthesized controllers.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">INTRODUCTION</head><p>Controller synthesis-which oers automated techniques to synthesize controllers that satisfy certain properties-has been increasingly used in the design of cyber-physical systems (CPS), including applications such as semi-autonomous driving <ref type="bibr">[33]</ref>, robotic planning <ref type="bibr">[22]</ref>, and human-in-the-loop CPS control <ref type="bibr">[14]</ref>. Many complex real-world CPS applications give rise to the need for multi-objective controller synthesis, which computes an optimal controller subject to multiple (possibly conicting) criteria. Examples are synthesizing an optimal controller to maximize safety while minimizing fuel consumption for an automotive vehicle, or synthesizing an optimal robotic controller to minimize the mission completion time while minimizing the risk in disaster search and rescue. An optimal solution to multi-objective controller synthesis should account for the trade-o between multiple objective properties. There may not exist a single global solution that optimizes each individual objective property simultaneously. Instead, a set of Pareto optimal points can be computed: those for which no objective can be optimized further without worsening some other objectives.</p><p>For many applications that involve human decision-makers, they can be presented with these Pareto optimal solutions to decide which one to choose. Alternatively, humans can specify a priori their preferences about the relative importance of objectives, which are then used as weights in the multi-objective controller synthesis to compute an optimal solution based on the weighted sum of objectives. We can ask humans to assign objective weights directly; however, sometimes it can be dicult for them to come up with these values. As surveyed in <ref type="bibr">[26]</ref>, there exist many dierent approaches for eliciting human preferences, such as ranking, rating, and pairwise comparison. Various preference elicitation methods can yield dierent weight values as artifacts. Moreover, human preferences can evolve over time and vary across multiple users. Thus, there is inherent uncertainty in human preferences.</p><p>In this work, we study the problem of multi-objective controller synthesis with uncertain human preferences. To the best of our knowledge, this is the rst work that takes into account the uncertainty of human preferences in multi-objective controller synthesis. We address the following research challenges: How to mathematically represent the uncertainty in human preferences? How to account for uncertain human preferences in multi-objective controller synthesis? How to generate a succinct representation of the synthesis results? And how to evaluate the synthesized controllers? Specically, we focus on the modeling formalism of Markov decision processes (MDPs), which have been popularly applied for the controller synthesis of CPS that exhibit stochastic and nondeterministic behavior (e.g., robots <ref type="bibr">[22]</ref>, human-in-the-loop CPS <ref type="bibr">[14]</ref>). In recent years, theories and algorithms have been developed for the formal verication and controller synthesis of MDPs subject to multi-objective properties <ref type="bibr">[4,</ref><ref type="bibr">8,</ref><ref type="bibr">11,</ref><ref type="bibr">15,</ref><ref type="bibr">16,</ref><ref type="bibr">20]</ref>. However, none of the existing work takes into account the uncertainty in human preferences.</p><p>We formalize the notion of uncertain human preferences as an interval weight vector that comprises a convex set of weight vectors over objectives. Since each weight vector corresponds to some controller that optimizes the weighted sum of objectives, an interval weight vector would yield a set of controllers (i.e., MDP strategies). We adopt the notion of multi-strategy <ref type="bibr">[10]</ref> to succinctly represent a set of MDP strategies. A (deterministic, memoryless) multi-strategy species multiple possible actions in each MDP state. Thus, a multistrategy represents a set of compliant MDP strategies, each of which chooses an action that is allowed by the multi-strategy in each MDP state. We dene the soundness of a multi-strategy with respect to a multi-objective property, and an interval weight vector representing uncertain human preferences. We also quantify the permissivity of a multi-strategy by measuring the degree to which actions are allowed in (reachable) MDP states. A sound, permissive multistrategy can enable more exibility in CPS design and execution. For example, if an action in an MDP state becomes infeasible during the system execution (e.g., some robotic action cannot be executed due to an evolving and uncertain environment), then alternative actions allowed by the multi-strategy can be executed instead, still guaranteeing satisfaction of the human preferences.</p><p>We develop a mixed-integer linear programming (MILP) based approach to synthesize a sound, optimally permissive multi-strategy with respect to a multi-objective MDP property and uncertain human preferences. Our approach is inspired by <ref type="bibr">[10]</ref>, which presents an MILP-based method for synthesizing permissive strategies in stochastic games (of which MDPs are a special case). However, there are several key dierences in our encodings. First, we solve multi-objective optimization problems, while <ref type="bibr">[10]</ref> is for a single objective. Second, we have a dierent soundness denition for the multi-strategy and need to track the values of both lower and upper bounds of each objective, while <ref type="bibr">[10]</ref> only considers one direction. Lastly, we have a dierent denition of permissivity which only considers reachable states under a multi-strategy.</p><p>We evaluate the proposed approach on a range of large case studies. The experimental results show that our MILP-based approach is scalable to synthesize sound, optimally permissive multi-strategies for large models with more than 10 6 MDP states. Moreover, the results show that increasing the uncertainty of human preferences yields more permissive multi-strategies.</p><p>In addition, we evaluate the quality of synthesized controllers via an online user study with 100 participants using Amazon Mechanical Turk. The study results show that strategies synthesized based on human preferences are more favorable, perceived as more accurate, and lead to better user satisfaction, compared to arbitrary strategies. In addition, multi-strategies are perceived as more informative and satisfying than less permissive (single) strategies.</p><p>Contributions. We summarize the major contributions of this work as follows.</p><p>&#8226; We formalized the notion of uncertain human preferences, and developed an MILP-based approach to synthesize a sound, optimally permissive multi-strategy for a given multiobjective MDP property and uncertain human preferences.</p><p>&#8226; We implemented the proposed approach and evaluated it on a range of large case studies to demonstrate its feasibility and scalability.</p><p>&#8226; We designed and conducted an online user study to evaluate the quality and benets of the synthesized controllers.</p><p>Paper Organization. In the rest of the paper, we introduce some background about MDPs and multi-objective properties in Section 2, formalize uncertain human preferences in Section 3, develop the controller synthesis approach in Section 4, present experimental results in Section 5, describe the user study in Section 6, survey related work in Section 7, and draw conclusions in Section 8.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">BACKGROUND</head><p>In this section, we introduce the necessary background about MDPs and multi-objective properties.</p><p>A Markov decision process (MDP) is a tuple M = ((, B 0 , , X), where ( is a nite set of states, B 0 2 ( is an initial state, is a set of actions, and X : ( &#8677; ! Dist (() is a probabilistic transition function with Dist (() denoting the set of probability distributions over (. Each state B 2 ( has a set of enabled actions, given by U (B)</p><p>) and X (B 8 , 0 8 )(B 8+1 ) &gt; 0 for all 8 0. We say that a state B is reachable if there exists a nite path starting from B 0 and ending in B as the last state. Let FPaths (IPaths) denote the set of nite (innite) paths through M.</p><p>A strategy (also called a policy) is a function f : FPaths ! Dist ( ) that resolves the nondeterministic choice of actions in each state based on the execution history. A strategy f is deterministic if f (c) is a point distribution for all c, and randomized otherwise. A strategy f is memoryless if the action choice f (c) depends only on the last state of c. In this work, we focus on deterministic, memoryless strategies. 1 Thus, we can simplify the denition of strategy to a function f : ( ! . Let &#8963; M denote the set of all (deterministic, memoryless) strategies for M. A strategy f 2 &#8963; M induces a probability measure over IPaths, denoted by Pr f M , in the standard fashion <ref type="bibr">[21]</ref>.</p><p>A reward function of M takes the form A : ( &#8677; ! R. The total reward along an innite path c = B 0 0 0 B 1 0 1 . . . is given by A (c)</p><p>We say that M under strategy f satises a reward predicate [A ] &#8672;1 where &#8672; 2 { , &#63743;} is a relational operator and 1 is a rational reward bound, denoted</p><p>, where &#249;&#251; 8 2 {min, max}, aims to minimize and/or maximize = objectives of expected total rewards simultaneously. For the rest of the paper, we assume that the multi-objective property is of the form q = ([A 1 ] min , . . . , [A = ] min ). A maximizing objective [A 8 ] max can be converted to a minimizing objective by negating rewards. Checking q on MDP M yields a set of Pareto optimal points that lie on the boundary of the set of achievable values:</p><p>We say that a point</p><p>8 for all 8 and G 9 &lt; G &#8676; 9 for some 9. A multi-objective reward predicate</p><p>The set of achievable valuesfor q is convex <ref type="bibr">[16]</ref>.</p><p>Given a multi-objective property q = ([A 1 ] min , . . . , [A = ] min ) and a weight vector w 2 R = , the expected total weighted reward sum is 1 For the types of MDP properties considered in this work, there always exists a deterministic, memoryless strategy in the solution set <ref type="bibr">[15,</ref><ref type="bibr">16,</ref><ref type="bibr">29]</ref>.  </p><p>for any strategy f 2 &#8963; M . We say that a strategy f &#8676; is optimal with respect to q and w, if</p><p>The strategy f &#8676; also corresponds to a Pareto optimal point for q <ref type="bibr">[16]</ref>.</p><p>For simplicity, in this paper, we make the assumption that an MDP has a set of end states, which are reached with probability 1 under any strategy, and have zero reward and no outgoing transitions to other states. This simplies our analysis by ensuring that the expected total reward is always nite. A variety of useful objectives for real-world applications can be encoded under these restrictions, for example, minimizing the distance, time, or incurred risk to complete a navigation task for robotic planning, or maximizing the safety and driver trust to complete a trip for autonomous driving.</p><p>Example 2.1. Figure <ref type="figure">1</ref> shows a map for urban search and rescue missions taken from the RoboCup Rescue Simulation Competition <ref type="bibr">[1]</ref>. Consider a scenario where the robot aims to nd an optimal route satisfying two objectives: (1) minimizing the travel distance to reach the rescue location, and (2) minimizing the risk of bypassing re zones. We model the problem as an MDP where each road junction in the map is represented by an MDP state. In each state, the robot can move along the road with probability 0.9 and stay put with probability 0.1 due to noisy sensors. We dene two reward functions dist and risk to measure the distance (i.e., the number of road blocks navigated) and the risk (i.e., the number of re zones bypassed), respectively. Figure <ref type="figure">2</ref> shows the Pareto curve for the multi-objective property q = ([dist] min , [risk] min ). The convex set of achievable values for q includes any point on the Pareto curve and in the area above. There are three Pareto optimal points (A, B, C) corresponding to three deterministic, memoryless MDP strategies illustrated as purple, blue, and green routes in </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">UNCERTAIN HUMAN PREFERENCES 3.1 Formalization of Preferences</head><p>Preferences are often represented as weights reecting humans' opinions about the relative importance of dierent criteria in multiobjective optimization <ref type="bibr">[26]</ref>. Following this convention, we denote a preference over = objectives as a weight vector w = (F 1 , . . . , F = ) 2 R = where F 8 0 for 1 &#63743; 8 &#63743; = and &#213; = 8=1 F 8 = 1. Such weight vectors can be obtained by eliciting human preferences in dierent ways. A naive approach is to ask for direct human input of weight values for objectives; however, it may be dicult for humans to come up with these values in practice. A popular preference elicitation method is pairwise comparison <ref type="bibr">[34]</ref>, in which humans answer queries such as: "Do you prefer objective 8 or objective 9?" for each pair of objectives. We can then derive weights (e.g., via nding eigenvalues of pairwise comparison matrices) as described in <ref type="bibr">[3,</ref><ref type="bibr">9]</ref>. There are many other methods (e.g., Likert scaling, rating, ranking) for eliciting preferences weights, as surveyed in <ref type="bibr">[26]</ref>. Eliciting preferences from the same person using various methods can yield dierent weight vectors as artifacts. In addition, if the controller synthesis needs to account for multiple human decision-makers' opinions, then a range of weight vectors can be resulted from eliciting multiple humans' preferences.</p><p>In order to capture the inherent uncertainty of human preferences, we dene uncertain human preferences as an interval weight vector w</p><p>, where F 8 (F 8 ) is the lower (upper) weight bound for objective 8, and 0 &#63743; F 8 &#63743; F 8 &#63743; 1. We say that a weight vector w belongs to an interval weight vector w, denoted w 2 w, if F 8 &#63743; F 8 &#63743; F 8 for all 8. An interval weight vector comprises a convex set of weight vectors, providing a compact representation of uncertain human preferences.</p><p>Example 3.1. Suppose w = ([0.2, 0.7], [0.5, 0.9]). Figure <ref type="figure">3</ref> shows a geometrical interpretation of uncertain human preferences represented by w. We intersect each dashed line representing the lower or upper objective bounds with the solid line representing F 1 +F 2 = 1, and obtain a pair of weight vectors (0.2, 0.8) and (0.5, 0.5) corresponding to the extreme points of the feasible solution set. We highlight in red the range of all possible weight vectors that belong to w, representing uncertain human preferences. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Multi-Strategy for MDPs</head><p>Recall from Section 2 that an optimal MDP strategy f &#8676; with respect to a multi-objective property q and a weight vector w corresponds to a Pareto optimal point that optimizes the weighted sum of objectives. Thus, an interval weight vector w representing uncertain human preferences yields a set of Pareto optimal points and corresponding MDP strategies. We will use the notion of multi-strategy from permissive controller synthesis <ref type="bibr">[10]</ref> to succinctly represent a set of strategies as follows.</p><p>A (deterministic, memoryless) multi-strategy for MDP M is a function \ : ( ! 2 , dening a set of allowed actions \ (B) &#10003; U (B) in each state B 2 (. Let &#8677; M denote the set of all multi-strategies for M. We say that a (deterministic, memoryless) strategy f complies with multi-strategy \ , denoted f C \ , if f (B) 2 \ (B) for all states B 2 (. We require that \ (B) &lt; ; for any state B that is reachable under some strategy that complies with \ .</p><p>Given a reward predicate [A ] &#8672;1 , we say that multi-strategy \ is sound with respect to [A ] &#8672;1 if M, f |= [A ] &#8672;1 for every strategy f that complies with \ . We then say that a multi-strategy is sound for an uncertain set of human preferences if it is sound with respect to upper and lower bounds on each objective induced by a set of weight intervals. More precisely, given a multi-objective property q = ([A 1 ] min , . . . , [A = ] min ) and an interval weight vector w, we say that multi-strategy \ is sound with respect to q, w if it is sound with respect to</p><p>for all 8, where <ref type="figure"/>and<ref type="figure">w</ref> denotes the set of Pareto optimal points corresponding to w. The intuition is that, due to convexity, any weight vector w 2 w must correspond to a Pareto optimal point within a space bounded by extreme points ofw . Later we develop Algorithm 1 in Section 4 to compute values of 1 8 and 1 8 .</p><p>We quantify the permissivity of multi-strategy \ by measuring the degree of actions allowed in (reachable) MDP states. Let _(\</p><p>) be a penalty function where ( \ &#10003; ( is the set of reachable states under \ . We say that a sound multi-strategy</p><p>is sound with respect to q and w}.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">CONTROLLER SYNTHESIS APPROACH 4.1 Problem Statement</head><p>Given an MDP M = ((, B 0 , , X), a multi-objective property q = ([A 1 ] min , . . . , [A = ] min ), and an interval weight vector w representing uncertain human preferences, how can we synthesize an optimally permissive multi-strategy \ 2 &#8677; M that is sound with respect to q and w?</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">MILP-based Solution</head><p>We present a mixed-integer linear programming (MILP) based approach to solve the above problem. We use binary variables [ B,0 2 {0, 1} to encode whether a multi-strategy \ allows action 0 2 U (B) in state B 2 ( of MDP M. We use real-valued variables `8,B and a 8,B to represent the minimal and maximal expected total reward for the 8th objective from state B, under any strategy complying with \ . We set `8,B = a 8,B = 0 for any end states in the MDP. The MILP encoding is:</p><p>subject to 8B 2 ( :</p><p>81 &#63743; 8 &#63743; =, 8B 2 (, 80 2 U (B) :</p><p>81 &#63743; 8 &#63743; =, 8B 2 (, 80 2 U (B) :</p><p>where 2 is a large scaling constant<ref type="foot">foot_0</ref> and we let d (B)</p><p>The objective function (1a) minimizes the total number of disallowed actions in all states plus the sum of expected total rewards over all objectives in the initial state. The latter serves as a tiebreaker between solutions with the same permissivity, favoring tighter reward bounds.</p><p>Constraints (1b) and (1c) enforce that no action is allowed for B if it is unreachable from any other state under the multi-strategy, and at least one action should be allowed otherwise. For the initial state B 0 , we assume that there is always an allowed incoming transition, and &#213;</p><p>(C,0) 2d (B 0 ) [ C,0 = 1. Constraints (1d) and (1e) encode the recursion for expected rewards in each step, which are trivially satised x = (G 1 , . . . , G = ) Find a Pareto optimal point for q that corresponds to w 5:</p><p>if 1 8 G 8 then 7:</p><p>if 1 8 &#63743; G 8 then 10:</p><p>end for 13: end for 14: return b when [ B,0 = 0, that is, action 0 is disallowed in state B. Constraints (1f) and (1g) guarantee that, for each objective 8, the expected total reward in the initial state under the multi-strategy satises the lower and upper bounds 1 8 and 1 8 , which are precomputed using Algorithm 1.</p><p>Given an interval weight vector w representing uncertain human preferences, Algorithm 1 (line 2) rst nds the set of extreme points in w, denoted, . This can be done by applying standard methods for nding extreme points in a convex set <ref type="bibr">[35]</ref>. Next, for each weight vector w 2 , , Algorithm 1 (line 4) nds a Pareto optimal point x = (G 1 , . . . , G = ) for q, which yields the minimal expected total weighted reward sum under any strategy of the MDP M. Here, we apply the value iteration-based method in <ref type="bibr">[16]</ref> for the computation of Pareto optimal points. Finally, Algorithm 1 (line 3-13) loops through all weight vectors in , to determine the smallest lower bound 1 8 and the greatest upper bound 1 8 of the expected total reward for each objective 8.</p><p>Example 4.1. We apply the proposed approach to synthesize an optimally permissive multi-strategy for MDP M modeled in Example 2.1 that is sound with respect to q = ([dist] min , [risk] min ) and w = ([0.2, 0.7], [0.5, 0.9]). Following Example 3.1, w gives a convex set of weight vectors with two extreme points (0.5, 0.5) and (0.2, 0.8). We also nd out that weight vectors (0.5, 0.5) and (0.2, 0.8) correspond to Pareto optimal points B and C in Figure <ref type="figure">2</ref>, respectively. Thus, applying Algorithm 1 yields an interval vector b = ([6.66, 8.89], [0, 1.12]) for the expected total reward bounds over q.</p><p>The MILP encoding minimizes</p><p>8=1 (a 8,B 0 `8,B 0 ). We can select 2 = 1000 as the scaling factor constant in this example.</p><p>Constraints (1b) and (1c) are instantiated, for example, for the initial state B 0 as: Constraints (1d) and (1e) are instantiated, for example, for the rst objective [dist] min , state B 0 , and action west as:</p><p>Constraints (1f) and (1g) are instantiated, for example, for the rst objective [dist] min as: `1,B 0 6.66 and a 1,B 0 &#63743; 8.89.</p><p>The MILP encoding uses 15 binary variables to encode [ B,0 , 44 real-valued variables to encode `8,B and a 8,B , and a total number of 90 constraints. It takes less than 1 second to solve the MILP problem using the Gurobi optimization toolbox <ref type="bibr">[18]</ref>. The solution yields a multi-strategy as illustrated by the orange lines in Figure <ref type="figure">4</ref>. The synthesized multi-strategy is sound with respect to q and w. There are two strategies complying with the multi-strategy, corresponding to Pareto optimal points B and C in Figure <ref type="figure">2</ref>. The multi-strategy is also optimally permissive. Such a permissive multi-strategy could be useful in assisting humans' decision-making, by informing them about multiple allowable action choices in states. In addition, it oers exibility for the system execution. If the robot nds that certain action cannot be executed due to the evolving environment (e.g., re spreading), it may execute an alternative actions allowed by the multi-strategy while still guaranteeing soundness.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Correctness</head><p>The correctness of our proposed approach, with respect to the problem statement in Section 4.1, is stated below and the proof is given in the appendix. T 4.1. Let M be an MDP, q = ([A 1 ] min , . . . , [A = ] min ) be a multi-objective property and w be an interval weight vector representing uncertain human preferences. There is a sound, optimally permissive multi-strategy \ in M with respect to q and w whose permissive penalty is _(\ ), if and only if there is an optimal assignment to the MILP instance from (1a)-(1g) which satises  MILP solvers work incrementally to synthesize a series of sound multistrategies that are increasingly permissive. Therefore, we may stop early to accept a sound (but not necessarily optimally permissive) multi-strategy if computational resources are limited.</p><p>Prior to the MILP solution, we need to execute Algorithm 1, the most costly step of which is the computation of a Pareto optimal point in line 4. This is performed |, | times, where |, | is exponential in the number of objectives =. For each point, we compute a minimal weighted sum of expected total rewards for a given weight vector. This is done using the value iteration-based method of <ref type="bibr">[16]</ref>. Value iteration does not have a meaningful time complexity, but is faster and more scalable than linear programming-based techniques in practice.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">EXPERIMENTAL RESULTS</head><p>We have built a prototype implementation of the proposed approach, which uses the PRISM model checker <ref type="bibr">[23]</ref> for computing Pareto optimal results of multi-objective synthesis in MDPs, and the Gurobi optimization toolbox <ref type="bibr">[18]</ref> for solving MILP problems. The experiments were run on a laptop with a 2.8 GHz Quad-Core Intel Core i7 CPU and 16 GB RAM.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">Case Studies</head><p>We applied our approach to three large case studies. <ref type="foot">3</ref> . For each case study, we used two interval weight vectors representing preferences with dierent uncertainty levels.</p><p>The rst case study is adapted from <ref type="bibr">[14]</ref>, which considers the control of an unmanned aerial vehicle (UAV) that interacts with a human operator for road network surveillance, with a varying model parameter to count the operator's workload and fatigue level that may lead to degraded mission performance. The controller synthesis aims to balance two objectives of mission completion time and risk, based on the specied uncertain human preferences.</p><p>The second case study considers a task-graph scheduling problem inspired by <ref type="bibr">[28]</ref>. The controller synthesis aims to compute an optimal schedule for a set of dependent tasks based on human preferences of dierent processors, with a varying model parameter of the digital clock counter.</p><p>The third case study models a team formation protocol <ref type="bibr">[5]</ref> where a varying number of sensing agents cooperate to achieve certain tasks. The controller synthesis seeks to nd an optimal schedule for these agents to meet the objectives of completing dierent tasks based on human preferences.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">Results Analysis</head><p>Table <ref type="table">1</ref> shows experimental results for these case studies. For each case study, we report the size of the MDP models in terms of the number of states and transitions, the size of the resulting MILP problems in terms of the number of decision variables (binary and real-valued) and constraints, the runtime for solving the MILP, and the number of permissive states (i.e., those with more than one allowed actions) in the synthesized controllers. We set a time-out of one hour for solving the MILP.</p><p>Unsurprisingly, the size of MILP problems increases with larger MDP models. But the results demonstrate that our approach can scale to large case studies. For example, it takes less than one minute to solve the resulting MILP problem of "uav 20" model with 113,901 MDP states, which includes 119,897 binary variables, 455,604 real-valued variables, and 707,394 constraints in the MILP. In most cases of "uav" and "taskgraph", a sound, optimally permissive multi-strategy is synthesized within one minute. However, the MILP solver failed to produce a feasible solution before time-out for some "teamform" cases, despite smaller MDP models than "uav" and "taskgraph". In addition, we observed that increasing the uncertainty level of preferences (i.e., larger intervals) leads to synthesized controllers with larger numbers of permissive states.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">USER STUDY</head><p>We designed and conducted an online user study <ref type="foot">4</ref> to evaluate the synthesized controllers. We describe the study design in Section 6.1 and analyze the results in Section 6.2.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.1">Study Design</head><p>Participants. We recruited 100 individuals with a categorical age distribution of 6 <ref type="bibr">(18)</ref><ref type="bibr">(19)</ref><ref type="bibr">(20)</ref><ref type="bibr">(21)</ref><ref type="bibr">(22)</ref><ref type="bibr">(23)</ref><ref type="bibr">(24)</ref>; 58 <ref type="bibr">(25)</ref><ref type="bibr">(26)</ref><ref type="bibr">(27)</ref><ref type="bibr">(28)</ref><ref type="bibr">(29)</ref><ref type="bibr">(30)</ref><ref type="bibr">(31)</ref><ref type="bibr">(32)</ref><ref type="bibr">(33)</ref><ref type="bibr">(34)</ref>; 28 (35-49); 6 (50-64); and 1 (65+) using Amazon Mechanical Turk (AMT). To ensure data quality, our study recruitment criteria required that participants must be able to read English uently and had performed at least 50 tasks previously with an above 90% approval rate on AMT. In addition, we injected attention check questions periodically during the study and rejected any response that failed attention checks. Procedure. For each participant, we described the study purposes and asked them to consent to the study. After we asked about basic demographic information (e.g., age), the rest of the study consists of two phases: (i) eliciting human preferences, and (ii) evaluating the synthesized controllers.</p><p>First, we presented to each participant a grid map shown in Figure <ref type="figure">5</ref> and asked them to consider the planning problem for a robot to navigate from the start grid to the destination with three objectives: (1) minimizing the travel distance, (2) minimizing the risk encountered on route, and (3) maximizing the number of packages collected along the way. We used four dierent methods to elicit each participant's preferences over these objectives, including direct input of weight values (as illustrated in Figure <ref type="figure">6a</ref>), Likert scaling (Figure <ref type="figure">6b</ref>), pairwise comparison of objective names (Figure <ref type="figure">6c</ref>), and pairwise comparison of optimal routes for individual objectives (Figure <ref type="figure">6d</ref>). As described in Section 3, we can derive a weight vector over objectives from the results of each preference elicitation method. Thus, by aggregating these four weight vectors resulting from dierent elicitation methods, we obtained an interval weight vector to represent each participant's preferences.</p><p>Next, based on the elicited human preferences, we applied the proposed approach to synthesize optimal robotic controllers for three dierent grid maps (including Figure <ref type="figure">5</ref> and two other similar maps). We randomized the order of maps for dierent particiants   to counterbalance the ordering confound eect. For each map, we asked participants a set of questions to evaluate the synthesized controllers. We describe the evaluation design, including manipulated factors, dependent measures, and hypotheses as follows.</p><p>Manipulated factors and dependent measures. We performed a within-subject experiment in which all participants were exposed to all evaluation conditions. We manipulated two independent factors: preferences and permissivity for the controller synthesis. For each map, we rst presented a pair of MDP strategies (visualized as plans in the grid map) side by side: one is a sound strategy synthesized based on the elicited preferences, and the other is an arbitrary strategy unsound for preferences. Figure <ref type="figure">7</ref> shows the list of evaluation questions. We asked participants about their satisfaction and perceived accuracy of each plan. We also asked them to choose which plan they preferred.</p><p>Then, we presented side by side a strategy (visualized as a single route plan) and a multi-strategy (visualized as a possible multiple route plan), which are both synthesized on the elicited preferences but with dierent degrees of permissivity. We asked participants to compare the synthesized strategy and multi-strategy in terms of favor ("Which route do you like better?"), informativity ("Which route provides more information?"), and satisfaction ("Which route are you more satised with?"). The exact questionnaire can be found in Figure <ref type="figure">8</ref>. Hypotheses. We made the following hypotheses based on the two manipulated factors.</p><p>Comparing strategies synthesized based on the elicited preferences and arbitrary strategies:</p><p>&#8226; H1: Preference-based strategies are more favorable than unsound arbitrary strategies.</p><p>&#8226; H2: Preference-based strategies are perceived as more accurate than unsound arbitrary strategies.</p><p>&#8226; H3: Preference-based strategies yield better satisfaction than unsound arbitrary strategies.</p><p>Comparing strategies and multi-strategies synthesized based on the elicited preferences:</p><p>&#8226; H4: Multi-strategies are more favorable than strategies.</p><p>&#8226; H5: Multi-strategies are perceived as more informative than strategies.</p><p>&#8226; H6: Multi-strategies yield better satisfaction than strategies.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.2">Results Analysis</head><p>Comparing preference-based and arbitrary strategies. To evaluate hypothesis H1, we utilize a chi-squared test <ref type="bibr">[32]</ref> to prove the statistical signicance in the frequency of strategy selection, assuming an expected frequency of 50/50 to represent a random selection of strategies by users. We use an alpha value of 0.05 and thus retain a condence level of 95% for our hypotheses. We assume a null hypothesis that the user selection of strategies will be random. We nd that users favor preference-based strategies about 63% of the time overall (j 2 :U= 0.05, j 2 = 21.33, CritVal = 3.84, p&#63743;0.00001, Signicant.); they choose preference-based strategies over arbitrary strategies more often for all three maps (71%, 59%, 60%). Thus, the data supports H1.</p><p>To evaluate hypotheses H2 and H3, shown in Figure <ref type="figure">9</ref> we employ one-way repeated measures ANOVA tests <ref type="bibr">[32]</ref> to prove the statistical signicance of the mean of all responses between preferencebased strategies and arbitrary strategies. We use an alpha value of  0.05 and assume a null hypothesis that users will perceive preference accuracy and be satised with both strategies at a similar rate. We nd that users rated preference-based strategies as signicantly more accurate to their objective preferences (rANOVA:U= 0.05, F(1,598) = 74.71, p&#63743;0.00001, Signicant.). Figure <ref type="figure">9</ref> also shows that  users were signicantly more satised with preference-based strategies than another arbitrary strategy through the plan (rANOVA:U= 0.05, F(1,598) = 105.28, p&#63743;0.00001, Signicant.). Thus, the data supports H2 and H3. Comparing strategies and multi-strategies. We use chi-squared tests with an expected frequency of 50/50 and an alpha value of 0.05 to evaluate hypotheses H4, H5, and H6 with the study results shown in Figure <ref type="figure">10</ref>.</p><p>Column 1 (Overall Favor) of Figure <ref type="figure">10</ref> shows that users do not signicantly favor multi-strategies over less permissive strategies (j 2 :U= 0.05, j 2 = 0.12, CritVal = 3.84, p&#63743;0.729, Not Signicant.), only slightly favoring multi-strategies to a single strategy counterpart. Thus, the data rejects H4.</p><p>Column 2 (Informativity) of Figure <ref type="figure">10</ref> shows users agreed about 71% of the time that multi-strategies provided them more information (j 2 :U= 0.05, j 2 = 54.61, CritVal = 3.84, p&#63743;0.00001, Signicant.). Thus, the data supports H5.</p><p>Column 3 (Satisfaction) of Figure <ref type="figure">10</ref> shows users were more satised with multi-strategies 56% of the time (j 2 :U= 0.05, j 2 = 4.32, CritVal = 3.84, p&#63743;0.038, Signicant.). Thus, the data supports H6.</p><p>Summary. We accept all hypotheses except H4 based on the statistical analysis. The user study results show that it is benecial to synthesize strategies that account for human preferences. In addition, multi-strategies are more informative and yield better user satisfaction. However, sometimes less is more, participants do not always favor multi-strategies over strategies that are simpler to understand.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7">RELATED WORK</head><p>Human preferences. Mathematical models of human preferences have been studied broadly in the eld of social choice theory <ref type="bibr">[2]</ref>. There are many dierent representations of human preferences, for example, encoded as reward functions for robot trajectory planning <ref type="bibr">[31]</ref> and deep reinforcement learning <ref type="bibr">[6]</ref>, or specied using temporal logics <ref type="bibr">[25,</ref><ref type="bibr">27]</ref>. In the context of multi-objective optimization <ref type="bibr">[26]</ref>, preferences are represented as weights indicating the relative importance of objectives. Optimization methods can vary depending on when and how humans articulate their preferences. Humans can indicate their preferences a priori before running the optimization algorithm, they can progressively provide input during the optimization process, or they can select a posteriori a solution point from a set of Pareto optimal results. Our work considers a priori elicitation of human preferences represented as weights for multiple objectives. Multi-objective controller synthesis for MDPs. Multi-objective optimization has been well-studied in operation research and engineering <ref type="bibr">[26,</ref><ref type="bibr">30]</ref>. In recent years, multi-objective optimization for MDPs has been considered from a formal methods perspective <ref type="bibr">[4,</ref><ref type="bibr">8,</ref><ref type="bibr">11,</ref><ref type="bibr">15,</ref><ref type="bibr">16,</ref><ref type="bibr">20]</ref>, which presents theories and algorithms for verifying multi-objective properties, synthesizing strategies, and approximating Pareto curves. More recently, such techniques have been applied to multi-objective robot path planning <ref type="bibr">[24]</ref> and multi-objective controller synthesis for autonomous systems that account for human operators' workload and fatigue levels <ref type="bibr">[14]</ref>. However, existing work does not account for the uncertainty of human preferences in the relative importance of objectives.</p><p>There is a line of work (e.g., <ref type="bibr">[7,</ref><ref type="bibr">19,</ref><ref type="bibr">37]</ref>) considering uncertain MDPs where transition probabilities and rewards are represented as an uncertain set of parameters or intervals. Our work is dierent in the sense that we consider the uncertainty in human preferences of dierent objectives.</p><p>Our proposed approach is based on mixed-integer linear programming (MILP). There exist several MILP-based solutions to compute counterexamples and witnesses for MDPs <ref type="bibr">[12,</ref><ref type="bibr">13,</ref><ref type="bibr">17,</ref><ref type="bibr">36]</ref>. However, these methods are not directly applicable for controller synthesis which is a dierent problem. The most relevant work is <ref type="bibr">[10]</ref> that presents an MILP-based method for permissive controller synthesis of probabilistic systems. As discussed in Section 1, our approach is inspired by <ref type="bibr">[10]</ref> but has several key dierences (e.g., <ref type="bibr">[10]</ref> does not consider the controller soundness with respect to multi-objective properties and human preferences).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8">CONCLUSION</head><p>In this paper, we developed a novel approach that accounts for uncertain human preferences in the multi-objective controller synthesis for MDPs. The proposed MILP-based approach synthesizes a sound, optimally permissive multi-strategy with respect to a multiobjective property and an uncertain set of human preferences. We implemented and evaluated the proposed approach on three large case studies. Experimental results show that our approach can be successfully applied to synthesize sound, optimally permissive multi-strategies with varying MDP model size and uncertainty level of human preferences. In addition, we designed and conducted an online user study with 100 participants using Amazon Mechanical Turk, which shows statistically signicant results about user satisfaction of the synthesized controllers.</p><p>There are several directions to explore for possible future work. First, we will extend our approach for a richer set of multi-objective properties beyond expected total rewards, such as the temporal logic-based multi-objective properties considered in <ref type="bibr">[15]</ref>. Second, we will extend our approach for a variety of probabilistic models beyond MDPs, such as stochastic games and POMDPs. Last but not least, we will apply our approach to a wider range of real-world CPS applications (e.g., autonomous driving, smart cities).</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_0"><p>Constant 2 is chosen to be larger than the expected total reward for any objective, from any state and under any objective.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_1"><p>Files are available from: hps://www.prismmodelchecker.org/files/iccps22</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_2"><p>This user study has obtained the Institutional Review Board (IRB) approval.</p></note>
		</body>
		</text>
</TEI>
