<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Model-based motion planning in POMDPs with temporal logic specifications</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>07/18/2023</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10464617</idno>
					<idno type="doi">10.1080/01691864.2023.2226191</idno>
					<title level='j'>Advanced Robotics</title>
<idno>0169-1864</idno>
<biblScope unit="volume">37</biblScope>
<biblScope unit="issue">14</biblScope>					

					<author>Junchao Li</author><author>Mingyu Cai</author><author>Zhaoan Wang</author><author>Shaoping Xiao</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Partially observable Markov decision processes (POMDPs) have been used as mathematical models for sequential decision-making under uncertain and incomplete information. Since the state space is partially observable in a POMDP, the agent has to make a decision based on the integrated information over the past experiences of actions and observations. This study aims to solve probabilistic motion planning problems in which the agent is assigned a complex task under a partially observable environment. We employ linear temporal logic (LTL) to formulate the complex task and then convert it to a limit-deterministic generalized Büchi automaton (LDGBA). We reformulate the problem as finding an optimal policy on the product of POMDP and LDGBA based on model-checking techniques. This paper adopts and modifies two reinforcement learning (RL) approaches: value iteration and deep Q-learning. Both are model-based because the optimal policy is a function of belief states that need transition and observation probabilities to be updated. We illustrate the applicability of the proposed methods by addressing two simulations, including a grid-world problem with various sizes and a TurtleBot office path planning problem.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Markov decision processes (MDPs) <ref type="bibr">[1]</ref> have been widely applied in robotics motion planning, assuming the environment is fully observable. However, in some real-world applications, the agent may not have access to complete or reliable information about the state of the environment. Therefore, partially observable Markov decision processes (POMDPs) <ref type="bibr">[2]</ref> need to be adopted. POMDPs are particularly useful when sensors or perception systems are prone to errors or uncertainty. By incorporating the observation probability model into the POMDP framework, it is possible to quantify the effects of perception errors on the agent's decision-making performance. On the other hand, there has been increasing interest in considering complex tasks in robotics motion planning beyond simple go-to-goal missions, especially in uncertain and dynamic environments.</p><p>Reinforcement learning (RL) methods <ref type="bibr">[3]</ref> have been employed to solve robotics motion planning in partially observable environments. Model-based RL approaches assume that the agent knows the probabilistic characteristics of transition and observation. Therefore, a belief state can be updated via Bayesian theory <ref type="bibr">[2,</ref><ref type="bibr">4]</ref> to denote a probability distribution over all possible states. Consequently, a POMDP problem with the original state space becomes an MDP problem with a corresponding belief space. In this case, the policy is a function of belief states for action selection. One commonly used approach is the value iteration algorithm, which solves POMDP problems as a form of dynamic programming.</p><p>Early works aimed to determine exact value functions in the belief space for small POMDP problems, e.g. 
enumeration algorithms <ref type="bibr">[2,</ref><ref type="bibr">5]</ref>. Based on Sondik's one-pass algorithm <ref type="bibr">[5]</ref>, Cheng <ref type="bibr">[6]</ref> proposed a simple linear support algorithm, which started at a random belief state and generated a vector for its value function. Also, the witness algorithm <ref type="bibr">[7]</ref> simplified the POMDP problem by considering one action and one observation at a time. Furthermore, Zhang and Liu <ref type="bibr">[8]</ref> integrated the enumeration and witness algorithms and proposed a so-called incremental pruning algorithm. However, in large POMDP problems with high-dimensional state spaces, finding the optimal policy precisely in the belief space can be computationally intensive. Therefore, point-based value iteration (PBVI) algorithms <ref type="bibr">[9]</ref> were proposed for infinite-horizon POMDP problems to address this issue by approximating optimally reachable belief spaces. This approach updated value functions locally for a finite subset of the belief space. One of the point-based solvers <ref type="bibr">[10]</ref>, SARSOP (Successive Approximations of the Reachable Space under Optimal Policies) <ref type="bibr">[11,</ref><ref type="bibr">12]</ref>, could handle POMDP problems with large state spaces.</p><p>Another approach is employing Q-learning to learn the optimal policy on the belief state space of a POMDP. It shall be noted that Q-learning <ref type="bibr">[13]</ref> is a model-free approach that allows the agent to learn to act optimally in MDP problems. However, since the belief states need to be calculated from the transition and observation probabilities, this approach is still model-based. On the other hand, because the belief state spaces are usually continuous in POMDP problems, classical Q-learning algorithms such as tabular Q-learning <ref type="bibr">[13]</ref> are no longer applicable. 
Consequently, a deep Q-network (DQN) <ref type="bibr">[14]</ref> is employed in this paper to map POMDP belief states to state-action values. Some other works <ref type="bibr">[15,</ref><ref type="bibr">16]</ref> have utilized DQN to achieve the optimal policy from a sequence of observations. However, they were model-free approaches to POMDPs, which are not our focus in this study.</p><p>Many works have considered complex tasks other than simple go-to-goal missions in MDP problems by adopting formal languages <ref type="bibr">[17]</ref>, such as linear temporal logic (LTL), to formulate user-defined high-level specifications. The LTL formula is commonly converted to an &#969;-automaton over infinite words with a B&#252;chi or a Rabin acceptance condition <ref type="bibr">[18]</ref>. Consequently, robotics motion planning problems can be solved via control synthesis for a product of MDP and automaton with model checking <ref type="bibr">[17]</ref>. It has been shown that a limit-deterministic B&#252;chi automaton (LDBA) has more advantages than a deterministic Rabin automaton (DRA). Hahn et al. <ref type="bibr">[19]</ref> implemented LDBA and DRA in model-free RL with mild restrictions, respectively. They concluded that LDBA was more suitable for both qualitative and quantitative analysis of MDPs under all &#969;-regular objectives. In another work, Hasanbeig et al. <ref type="bibr">[20]</ref> stated that converting the LTL specification into an LDBA might result in a smaller automaton state space than a DRA. However, LDBA suffers from reward sparsity, which may greatly slow the RL's convergence <ref type="bibr">[21]</ref>.</p><p>Similarly, model checking provides a formal approach to verifying complex task objectives when solving POMDP problems with temporal logic constraints. Chatterjee et al. 
<ref type="bibr">[22]</ref> studied the undecidability of qualitative model checking for POMDP problems with infinite horizons. They proposed a finite-memory approach for the verification and synthesis of POMDPs with LTL constraints. It relied on exploring the entire belief space and was most suitable for problems with small state spaces. In other works <ref type="bibr">[23,</ref><ref type="bibr">24]</ref>, LTL specifications were converted to a DRA, and then a product of POMDP and DRA was constructed. Specifically, Sharan et al. <ref type="bibr">[23]</ref> employed finite state controllers to limit the policy search via value iteration methods. On the other hand, Bouton et al. <ref type="bibr">[24]</ref> utilized the approximate POMDP solver SARSOP <ref type="bibr">[11]</ref> to search for an optimal policy on the finite belief state space of the product POMDP. Finally, it shall be noted that Wang et al. <ref type="bibr">[25]</ref> employed LDBA and converted a POMDP to the corresponding belief MDP before building a product of belief MDP and LDBA. However, simulation examples are still expected to demonstrate the feasibility of that method.</p><p>The first contribution of this paper is formulating a complex task via LTL, converting it to a limit-deterministic generalized B&#252;chi automaton (LDGBA), and constructing a product of POMDP and LDGBA. Although the strategy of representing complex tasks by LTL and then an automaton is not new, most works focused on MDP instead of POMDP problems. In addition, this study employs LDGBA for the first time in POMDP problems, since it can lead to a smaller product state space <ref type="bibr">[19]</ref> than DRA <ref type="bibr">[23,</ref><ref type="bibr">24]</ref> and can address the sparsity of rewards caused by LDBA <ref type="bibr">[21]</ref>. 
The second contribution is reformulating the original problem of finding a policy that satisfies LTL specifications in a POMDP as finding an optimal policy that maximizes the collected reward on the corresponding product of POMDP and LDGBA. Therefore, model-based RL approaches can be employed. In addition to a modified PBVI solver, we propose convolutional neural network (CNN)-enhanced deep Q networks to approximate the state-action value functions at a given belief product state. The last contribution is redesigning the reward function and introducing a frontier set in LDGBA to record non-visited accepting sets so that the proposed methods can efficiently handle surveillance tasks.</p><p>Model-predictive control (MPC), rapidly-exploring random trees (RRT), and probabilistic roadmaps (PRM) are classical motion and trajectory planning algorithms. Recently, some works like <ref type="bibr">[26]</ref> utilized RRT to map a belief state to possible paths that connect the start state to a goal state for pathfinding. Therefore, RRT (and PRM) can address observation uncertainty. Although this approach could be extended to solve the product POMDP problems proposed in this study to handle complex tasks, no such works have been reported to the authors' best knowledge. Moreover, the belief state space in a POMDP is infinite; when model-based motion planning algorithms such as RRT and PRM search for paths in belief state spaces, they can be computationally expensive. On the other hand, an online MPC solver would be more appropriate when observation uncertainty is considered. However, the optimal control policy needs to be updated frequently based on new measurements. 
This method can be computationally intensive, too. Compared with MPC, RRT, and PRM, one of our approaches utilizes the SARSOP solver, which explores a belief tree on the optimally reachable belief space to improve computational efficiency.</p><p>The organization of this paper is as follows. Section 2 reviews POMDP problems and state-of-the-art solutions, including PBVI/SARSOP and deep Q-learning. Section 3 introduces LTL, LDGBA, and the product POMDP. Then, we redefine the POMDP problem with LTL specifications and describe the reward design and tracking-frontier function. Section 4 proposes the approaches to product POMDP problems and provides detailed algorithms. Finally, two simulations and their results are included in Section 5, followed by the conclusion and future works.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">POMDP and its solution</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">POMDP</head><p>The POMDP is a mathematical framework to model a problem in which the environment is not fully observable.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Definition 2.1 (POMDP):</head><p>Considering the action and observation uncertainties and the labels of states, a POMDP can be denoted by a tuple P = (S, A, T, s 0 , R, O, Z, &#947;, L), where:</p><p>&#8226; S = {s 1 , ..., s n } is a finite set of states.</p><p>&#8226; A = {a 1 , ..., a m } is a finite set of actions. Specifically, A(s) represents the set of available actions that can be taken by the agent at state s. At every time step during the learning, the agent is at a state s &#8712; S and selects an action a &#8712; A(s), which transits the agent to the next state s &#8242; &#8712; S. Since we consider the motion uncertainty, the probability of such a transition is T(s, a, s &#8242; ). After the agent reaches the next state s &#8242; , it can perceive an observation o &#8712; O(s &#8242; ) with probability Z(s &#8242; , a, o) to gather information about this state. On the other hand, after the transition, the agent also collects a reward given by the function R(s &#8242; ). The goal of the agent choosing a sequence of actions is to maximize its expected return, i.e. the accumulated rewards, starting from state s at the current time t = 0, as</p><p>U(s) = E [ &#8721; t&#8805;0 &#947; t R(s t ) ] (1)</p><p>where s t denotes the agent's state at time t. &#947; &#8712; [0, 1] is the discount factor to balance the importance between immediate and future rewards.</p><p>In model-based approaches to solving POMDPs, it is common to find an optimal policy in the belief state space instead of the state space S defined in the POMDP. A belief state, b &#8712; B where B = Dist(S) is the belief state space, represents the probability distribution over all possible states s &#8712; S. Specifically, b t (s) denotes the probability of the agent being at state s at time step t. The initial belief state, b 0 , depends on the agent's knowledge of its initial state s 0 . 
If the agent is aware of its initial state, b 0 (s 0 ) = 1 and b 0 (s) = 0 for s &#8800; s 0 ; otherwise, b 0 is a uniform probability distribution. If the agent's current belief state is b t (s), after taking action a t and obtaining observation o t+1 , the new belief state b t+1 (s) can be updated following Icarte et al. <ref type="bibr">[27]</ref>:</p><p>b t+1 (s &#8242; ) = Z(s &#8242; , a t , o t+1 ) &#8721; s&#8712;S T(s, a t , s &#8242; ) b t (s) / P(o t+1 | b t , a t ) (2)</p><p>Consequently, the belief state holds the experience of a complete history <ref type="bibr">[27]</ref>, and a policy can map the belief state b t &#8712; B to the action a t &#8712; A. Then, the expected return in (1) under a policy &#958; can be rewritten as</p><p>U &#958; (b) = E [ &#8721; t&#8805;0 &#947; t R(b t ) ] (3)</p><p>where R(b t ) = &#8721; s&#8712;S b t (s)R(s) is the reward that the agent can collect depending on the current belief state. Finally, the POMDP problem becomes finding the optimal policies in a corresponding MDP with the belief state space to maximize the expected return, i.e. &#958; * = argmax &#958; U &#958; (b).</p></div>
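The belief update in (2) can be sketched as a discrete Bayes filter. The following is a minimal illustration, not the authors' implementation; the arrays `T` and `Z` and the toy numbers are assumptions made for the example.

```python
import numpy as np

def belief_update(b, a, o, T, Z):
    """Discrete Bayes filter for a POMDP belief.

    b : (|S|,) current belief over states
    T : (|S|, |A|, |S|) transition probabilities T(s, a, s')
    Z : (|S|, |A|, |O|) observation probabilities Z(s', a, o)
    Returns b' proportional to Z(s', a, o) * sum_s T(s, a, s') b(s).
    """
    pred = b @ T[:, a, :]          # predicted next-state distribution
    post = Z[:, a, o] * pred       # weight by the observation likelihood
    norm = post.sum()              # this normalizer is P(o | b, a)
    if norm == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return post / norm

# Toy two-state, one-action, two-observation model (illustrative numbers):
T = np.zeros((2, 1, 2)); T[:, 0, :] = [[0.9, 0.1], [0.2, 0.8]]
Z = np.zeros((2, 1, 2)); Z[:, 0, :] = [[0.8, 0.2], [0.3, 0.7]]
b0 = np.array([0.5, 0.5])
b1 = belief_update(b0, a=0, o=0, T=T, Z=Z)   # posterior shifts toward state 0
```

Repeating this update along a history of actions and observations yields the belief trajectory that the policies discussed in this section act on.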
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">PBVI and deep Q-learning</head><p>Since the belief state space is infinite and most belief points are very unlikely to be reached, the PBVI algorithms <ref type="bibr">[9]</ref> prune away belief points by selecting the best action on every trajectory to generate the next belief point. Specifically, the PBVI algorithms introduce a set of alpha vectors V = {&#945; 0 , &#945; 1 , ..., &#945; m } to approximate the value functions as <ref type="bibr">[9]</ref>:</p><p>V(b) = max &#945;&#8712;V &#8721; s&#8712;S &#945;(s) b(s) (4)</p><p>It shall be noted that each alpha vector is an |S|-dimensional hyperplane constrained by the boundary of the belief state space B. A number of alpha vectors form the approximated value function V(b), which is a piecewise-linear and convex function <ref type="bibr">[9]</ref>. Only one alpha vector on each belief point is maintained during the value backup, which is defined as <ref type="bibr">[9]</ref></p><p>V &#8242; (b) = max a&#8712;A [ R(b) + &#947; &#8721; o&#8712;O P(o | b, a) V(b &#8242; ) ] (5)</p><p>where b &#8242; is the belief updated from b by (2) after taking action a and receiving observation o. In addition, the PBVI algorithms use a precision parameter, the difference between an upper bound and a lower bound of the value function, to guarantee policy convergence <ref type="bibr">[9]</ref>.</p><p>Another reported approach <ref type="bibr">[14]</ref> to POMDP problems is utilizing deep neural networks (DNNs) to approximate the state-action values. Mnih et al. <ref type="bibr">[16]</ref> first introduced deep Q-learning (DQN) to solve MDP problems. Their DQN architecture consists of two DNNs (called Q-networks), an evaluation Q-network Q e (s, a; &#952; e ) and a target Q-network Q t (s, a; &#952; t ), to keep the learning stable and effective. During the learning process, the &#491;-greedy exploration strategy <ref type="bibr">[3]</ref> and experience replay memory <ref type="bibr">[28]</ref> are used for action selection and the collection of training samples, respectively. Egorov <ref type="bibr">[14]</ref> leveraged DQN to solve POMDP problems by mapping POMDP beliefs to the state-action values. 
Instead of the model-free algorithm commonly used in MDP problems, his approach updates the belief state as the input feature to Q networks. Consequently, this approach is still model-based, and the derived policy is a function of belief states.</p></div>
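The piecewise-linear value approximation of (4), the maximum over alpha vectors of their inner product with the belief, can be sketched as follows. The alpha vectors below are made-up placeholders, not the output of an actual PBVI backup.

```python
import numpy as np

def value(b, alphas):
    """PBVI-style value estimate: V(b) = max over alpha of dot(alpha, b)."""
    return max(float(np.dot(alpha, b)) for alpha in alphas)

def best_alpha(b, alphas):
    """Alpha vector attaining the maximum at belief b (it defines the greedy action)."""
    return max(alphas, key=lambda alpha: float(np.dot(alpha, b)))

# Three placeholder alpha vectors over a two-state belief simplex:
alphas = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([0.6, 0.6])]
v_centre = value(np.array([0.5, 0.5]), alphas)   # the flat vector dominates here
v_corner = value(np.array([1.0, 0.0]), alphas)   # a corner vector dominates here
```

Because each alpha vector is linear in b, the pointwise maximum is the piecewise-linear convex surface described above; PBVI keeps only the vectors that dominate at some sampled belief point.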
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Linear temporal logic and product POMDP</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Linear temporal logic (LTL)</head><p>Linear temporal logic is a high-level language to describe user-specified tasks. Basically, an LTL specification can be formulated inductively by combining Boolean operators, such as negation (&#172;) and conjunction (&#8743;), with two temporal operators, next (&#9675;) and until (U). The formula &#9675;&#966; can be read as '&#966; is true at the next state,' while &#966; 1 U &#966; 2 reads '&#966; 2 is true at some future state, and &#966; 1 is true at each state until then.' Consequently, the syntax of an LTL formula is defined inductively as <ref type="bibr">[29]</ref></p><p>&#966; := true | a | &#966; 1 &#8743; &#966; 2 | &#172;&#966; | &#9675;&#966; | &#966; 1 U &#966; 2</p><p>where a &#8712; &#928; is an atomic proposition. Other common Boolean and temporal operators are derived as follows: or, &#966; 1 &#8744; &#966; 2 = &#172;(&#172;&#966; 1 &#8743; &#172;&#966; 2 ); eventually, &#9671;&#966; = true U &#966;; and always, &#9633;&#966; = &#172;&#9671;&#172;&#966;.</p><p>Let |= denote the satisfaction relation. The semantics of an LTL formula is interpreted over words, which are infinite sequences w = w 0 w 1 ... with w i &#8712; 2 &#928; for all i &#8805; 0; the full semantic definition can be found in <ref type="bibr">[29]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Limit-deterministic generalized B&#252;chi automaton (LDGBA)</head><p>Given an LTL formula that specifies a complex task, its satisfaction can be evaluated by an LDGBA <ref type="bibr">[30]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Definition 3.1 (LDGBA):</head><p>An LDGBA is a tuple A = (Q, &#931;, &#948;, q 0 , F), where Q is a finite set of states, &#931; = 2 &#928; is a finite alphabet, &#948; : Q &#215; (&#931; &#8746; {&#491;}) &#8594; 2 Q is a transition function, q 0 &#8712; Q is the initial state, and F = {F 1 , ..., F f } is the set of accepting sets. The state set Q is partitioned into a deterministic set Q D and a non-deterministic set Q N , i.e. Q = Q D &#8746; Q N , such that:</p><p>&#8226; The state transitions in Q D are total and restricted within it, i.e. |&#948;(q, &#945;)| = 1 and &#948;(q, &#945;) &#8838; Q D for every state q &#8712; Q D and &#945; &#8712; &#931;,</p><p>&#8226; The &#491;-transition is not allowed in the deterministic set, i.e. for any q &#8712; Q D , &#948;(q, &#491;) = &#8709;,</p><p>&#8226; The &#491;-transitions are only defined for state transitions from Q N to Q D , which do not consume the input alphabet, and</p><p>&#8226; The accepting sets are only in the deterministic set, i.e. F i &#8838; Q D for every i &#8712; {1, ..., f }.</p><p>A run of an LDGBA, subject to an input word w = w 0 w 1 ..., can be represented as q = q 0 q 1 ..., and inf(q) denotes the set of states appearing infinitely often in q. If inf(q) &#8745; F i &#8800; &#8709; for all i &#8712; {1, ..., f }, we say that q satisfies the LDGBA acceptance condition; in other words, the LDGBA accepts the word w. We recommend Owl <ref type="bibr">[18]</ref> to readers for more details about automaton generation.</p><p>Consequently, this study aims to solve the POMDP problems with LTL specifications defined below.</p><p>Problem 3.1: Given a POMDP with its belief state space B, which can be derived via (2), and a task expressed as an LTL formula, the objective is to find a policy &#958; * (b) that completes the task by satisfying the acceptance condition of the LTL-induced LDGBA.</p></div>
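The generalized B&#252;chi acceptance condition, that every accepting set is intersected infinitely often, can be checked concretely on a lasso-shaped run (a finite prefix followed by a cycle repeated forever). This is only an illustrative sketch; the state names and the accepting family below are assumptions, not taken from the paper's automata.

```python
def satisfies_gen_buchi(prefix, cycle, accepting_sets):
    """Generalized Buechi check on a lasso run: the states visited infinitely
    often are exactly those on the repeated cycle, and every accepting set
    F_i must intersect that set.  The finite prefix is irrelevant."""
    inf_states = set(cycle)
    return all(inf_states.intersection(f_i) for f_i in accepting_sets)

# Toy accepting family with two singleton sets:
F = [{"q0"}, {"q1"}]
accepted = satisfies_gen_buchi([], ["q0", "q1"], F)   # both sets hit infinitely often
rejected = satisfies_gen_buchi(["q1"], ["q0"], F)     # {"q1"} visited only finitely
```

This mirrors the definition above: a run alternating through both accepting sets is accepted, while a run that eventually stays in only one of them is not.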
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Product POMDP</head><p>We therefore introduce a framework for solving a POMDP problem by exploiting the fact that an LTL formula can be transformed into an LDGBA representing the task variables and safety constraints of the POMDP.</p><p>The problem of satisfying a given LTL objective &#966; in a POMDP P can be reduced to the problem of satisfying a repeated reachability (B&#252;chi) objective &#966; B in the product POMDP.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Definition 3.2 (Product POMDP):</head><p>The product POMDP P &#215; = P &#215; A of a POMDP P = (S, A, T, s 0 , R, O, Z, &#947;, L) and an LDGBA A = (Q, &#931;, &#948;, q 0 , F) is defined as follows, where:</p><p>&#8226; S &#215; = S &#215; Q is the finite set of labeled states, i.e. s &#215; = (s, q) &#8712; S &#215; where s &#8712; S and q &#8712; Q.</p><p>&#8226; The transition function is defined as T &#215; (s &#215; , a &#215; , s &#215;&#8242; ) = T(s, a, s &#8242; ) if q &#8242; = &#948;(q, l), l &#8712; L(s &#8242; ), and a &#8712; A; T &#215; (s &#215; , a &#215; , s &#215;&#8242; ) = 1 if a &#215; &#8712; {&#491;}, q &#8242; &#8712; &#948;(q, &#491;), and s &#8242; = s; and T &#215; (s &#215; , a &#215; , s &#215;&#8242; ) = 0 otherwise. (7)</p><p>&#8226; s &#215; 0 = (s 0 , q 0 ) &#8712; S &#215; is the initial state, where s 0 &#8712; S and q 0 &#8712; Q.</p><p>&#8226; The observation probability is inherited from the POMDP, i.e. Z &#215; ((s &#8242; , q &#8242; ), a &#215; , o) = Z(s &#8242; , a &#215; , o), where s &#8242; &#8712; S and a &#215; &#8712; A. On the other hand, if a &#215; &#8712; {&#491;}, the agent stays at the same s, i.e. s &#8242; = s, and the belief state is not updated.</p><p>A random path (s 0 , q 0 )(s 1 , q 1 ) ... of the product POMDP corresponds uniquely to the combination of a path s 0 s 1 ... of the POMDP and a path q 0 q 1 ... of the LDGBA. On the other hand, the belief product state, denoted as b &#215; &#8712; B &#215; where B &#215; = Dist(S &#215; ), represents the probability distribution over all possible product POMDP states. Similarly, the new belief product state can be updated by the formula derived from (2):</p><p>b &#215; t+1 (s &#215;&#8242; ) = Z(s &#8242; , a t , o t+1 ) &#8721; s&#215;&#8712;S&#215; T &#215; (s &#215; , a t , s &#215;&#8242; ) b &#215; t (s &#215; ) / P(o t+1 | b &#215; t , a t ) (10)</p><p>It is noted that the product POMDP shares the same observation space as the POMDP. The initial belief state is b &#215; 0 ((s, q 0 )) = b 0 (s), and b &#215; 0 vanishes on product states whose automaton component is not q 0 .</p><p>The expected return under a policy &#958; &#215; can be written, similar to (1), as U &#958;&#215; (b &#215; ) = E [ &#8721; t&#8805;0 &#947; t R(b &#215; t ) ], where the reward function remains as R(b &#215; t ) = &#8721; s&#215;&#8712;S&#215; b &#215; t (s &#215; ) R(s &#215; ).</p><p>The constructed product POMDP P &#215; can be interpreted as the POMDP P with the augmented state space to account for the temporal logic specifications represented by LDGBA A. All the feasible paths on P &#215; lie in the intersection of all the accessible paths over P and all words accepted by A. 
Specifically, a path &#963; &#958;&#215; = (s 0 , q 0 )(s 1 , q 1 ) ..., generated by a random policy &#958; &#215; on the product POMDP P &#215; , satisfies the given LTL specification if its automaton component satisfies the LDGBA acceptance condition.</p><p>The belief state b t (s) represents the probability distribution of the agent over POMDP states s &#8712; S given the history up to time t. b t+1 is then updated from the previous belief state b t , the executed action a t , and the resulting observation o t+1 by (2) for all s &#8242; &#8712; S. Based on that, b t &#8712; B also has the Markovian property. Therefore, finding the optimal policy as a function of belief states in a POMDP problem is equivalent to solving an MDP problem with a continuous belief state space. Moreover, according to previous studies on MDP problems with LTL specifications <ref type="bibr">[29,</ref><ref type="bibr">[31]</ref><ref type="bibr">[32]</ref><ref type="bibr">[33]</ref>, the optimal policy &#958; &#215;* (b &#215; ) on the product POMDP P &#215; is also the optimal policy &#958; * (b) on the POMDP P satisfying the LTL specifications. Consequently, Problem 3.1 can be reformulated as follows.</p><p>Problem 3.2: Given a product POMDP P &#215; = P &#215; A, defined in Section 3.3, of a POMDP P and an LDGBA A generated from LTL specifications &#966;, the corresponding belief state space can be derived via (10). The objective is to find a policy &#958; &#215;* (b &#215; ) such that the expected return is maximized.</p></div>
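Ignoring &#491;-transitions for simplicity, the product transition (7) moves the automaton on the label of the successor POMDP state and otherwise inherits the POMDP's transition probability. A minimal sketch, in which all state names, labels, and probabilities are made-up assumptions:

```python
def product_transition(T, delta, label, s, q, a, s2, q2):
    """Transition probability in the product POMDP (epsilon-transitions omitted):
    nonzero only when the automaton successor q2 matches delta(q, L(s2)),
    in which case the POMDP transition probability T(s, a, s2) is inherited."""
    if q2 == delta(q, label(s2)):
        return T.get((s, a, s2), 0.0)
    return 0.0

# Toy components (illustrative only):
T = {("s0", "go", "s1"): 1.0}                               # POMDP transitions
delta = lambda q, l: "q1" if (q == "q0" and l == "a") else q  # automaton step
label = lambda s: "a" if s == "s1" else ""                    # labeling function
p_match = product_transition(T, delta, label, "s0", "q0", "go", "s1", "q1")
p_mismatch = product_transition(T, delta, label, "s0", "q0", "go", "s1", "q0")
```

The mismatched automaton successor gets probability zero, which is how the product restricts the POMDP's paths to those consistent with the automaton.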
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4.">Reward redesign or tracking-frontier function</head><p>When the constructed LDGBA is directly applied to a product POMDP problem defined in Problem 3.2, the solver may fail to find a deterministic policy that satisfies the LTL specifications <ref type="bibr">[20,</ref><ref type="bibr">21]</ref>. For example, to complete the task that an agent shall visit states labeled with 'a' and 'b', one at a time, infinitely many times, the LTL formula can be written as &#966; = &#9633;&#9671;a &#8743; &#9633;&#9671;b.</p><p>Rabinizer 4 <ref type="bibr">[18]</ref> is used to convert the LTL formula into an LDGBA. We augment the accepting states in order to separate the 'a' and 'b' transitions. In addition, we simplify the automaton by only keeping single-labeled transitions, which is sufficient to explain this example. The full automaton after state augmentation is shown in Figure <ref type="figure">1</ref>, and the set of accepting sets is F = {{q 0 }, {q 1 }}. Theoretically, this automaton may accept a word of the form (b * a * ) &#969; , where * and &#969; indicate that the preceding character(s) repeat finitely and infinitely many times, respectively. However, since the typical reward design in (8) depends on the product states corresponding to the LDGBA accepting states, the agent may tend to keep visiting one of the labeled states infinitely many times to collect more rewards. Consequently, the specified task cannot be accomplished.</p><p>To address the above issue for surveillance tasks, we redesign the reward function as below by adding a constraint to (8) such that the agent can visit the accepting sets repeatedly:</p><p>R(s &#215;&#8242; ) = r if q &#8242; &#8712; F i for some F i &#8712; F and q &#8242; &#8800; q, and R(s &#215;&#8242; ) = 0 otherwise, (15)</p><p>where the condition q &#8800; q &#8242; prevents repeated transitions that lead to the same automaton accepting state by removing the rewards on the associated labeled POMDP states. After applying this constraint to the reward function, the derived optimal policy satisfies the desired task specification. 
We provide the simulation of this example with the model-based SARSOP solver on the POMDP environment in Section 5.1.</p><p>In addition, we introduce another approach that implements a frontier set T to keep track of non-visited accepting sets, based on the previous work by M. Cai <ref type="bibr">[32]</ref>. In most cases, T is initialized as F. If a state of one or more accepting sets has been visited, those sets are removed from the frontier set T . Mathematically, the tracking-frontier function T F is defined and updated as <ref type="bibr">[32]</ref>:</p><p>T F (q &#8242; , T ) = T \ F i if q &#8242; &#8712; F i ; otherwise, T is unchanged. (16)</p><p>Once the frontier set T becomes empty, it is reset as F if the specification requires infinite visits of all accepting sets. Consequently, the accepted words in Figure <ref type="figure">1</ref> can be (ab) &#969; and (ba) &#969; . It shall be noted that the visited LDGBA accepting sets are removed from the frontier set T , and at the same time, the rewards on the associated labeled states are disabled. Therefore, this prevents repeated visits to the same automaton state. Furthermore, the original definition of reward in (8) is redefined so that a reward is collected only when the visited accepting set is still in the frontier T .</p><p>The above tracking-frontier function T F can be revised according to different task specifications. For example, additional constraints can be added to T F so that it only removes the sets F i in order, forcing the agent to visit the labeled states in a specific sequence. This approach is not easily implemented directly in the SARSOP solver, but it is in DQN. Therefore, we only applied the frontier set to DQN in this study.</p></div>
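One plausible reading of the tracking-frontier update, remove every accepting set containing the newly visited automaton state and reset to the full family F once all sets have been visited, can be sketched as follows. This is a simplification of the function in [32], with illustrative state names:

```python
def update_frontier(q_next, frontier, accepting_sets):
    """Drop accepting sets hit by q_next; reset to the full family once the
    frontier empties, for specifications requiring infinitely many rounds."""
    frontier = [f_i for f_i in frontier if q_next not in f_i]
    if not frontier:
        frontier = [set(f_i) for f_i in accepting_sets]
    return frontier

# The two singleton accepting sets of the example in this section:
F = [{"q0"}, {"q1"}]
frontier = [set(f_i) for f_i in F]
after_q0 = update_frontier("q0", frontier, F)   # {"q0"} is removed
after_q1 = update_frontier("q1", after_q0, F)   # frontier empties and resets to F
```

Pairing this set with the reward function, rewarding only accepting sets still in the frontier, is what forces the agent to alternate between 'a' and 'b' instead of looping on one of them.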
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Problem solutions</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Point-based value iteration</head><p>A product POMDP can be viewed as a POMDP that simultaneously satisfies the task constraints provided by the LTL-induced automaton. The transitions in the product POMDP are restricted by the automaton transitions. Therefore, the proposed product POMDP problem can be solved by the methods that have been successfully applied to POMDP problems. It shall be noted that the state space is extended due to the product of POMDP and LDGBA.</p><p>One of our approaches uses SARSOP <ref type="bibr">[11]</ref> to approximate the optimal value functions on the product POMDP. To achieve this, SARSOP introduces a lower-bound target level L and an upper-bound target level U. A target gap size &#949; between L and U at b &#215; 0 is initialized. As the sampling process follows the belief tree T R , the target levels L and U are updated as L t and U t , where t is the depth of the node in the tree. The sampling path is terminated once the gap size between L t and U t for all the leaves in T R reaches &#947; -t &#949;, where &#947; is the discount factor <ref type="bibr">[11]</ref>.</p><p>To update the lower- and upper-bound target levels (L t and U t ) from L t-1 and U t-1 , the following equation is utilized first to calculate the lower bound of the optimal Q value <ref type="bibr">[11]</ref> once an action is selected:</p><p>Q(b &#215; , a &#215; ) = R(b &#215; ) + &#947; &#8721; o P(o | b &#215; , a &#215; ) V(b &#215;&#8242; ) (18)</p><p>where V, obtained by the fixed-action policy <ref type="bibr">[34]</ref>, is the lower bound of the optimal value function V * at b &#215;&#8242; . The upper bound of the optimal Q value, Q, is obtained similarly to (18), and its upper bound V can be acquired by sawtooth approximation <ref type="bibr">[34]</ref>. In addition, P(o|b &#215; , a &#215; ) is defined as in <ref type="bibr">[10]</ref>.</p><p>Then, Q is used to find an intermediate lower-bound target level L &#8242; as the maximum value between L t-1 and Q. 
Similarly, an intermediate upper-bound target level U &#8242; is the maximum value between U t-1 and Q + &#947; -t &#949;. Finally, the target level L t for the next belief node b &#215;&#8242; can be calculated as the level needed for Q to achieve its target level L &#8242; . Similarly, U t is acquired by computing Q with U &#8242; <ref type="bibr">[11]</ref>.</p><p>Next, the standard backup process for &#945;-vectors in &#372; is performed with the value function approximation. The last step is the pruning process <ref type="bibr">[11]</ref>, in which the sub-optimal belief nodes and &#945;-vectors are pruned away. Finally, a stricter dominance check, called &#948;-dominance <ref type="bibr">[11]</ref>, is used to eliminate the sub-optimal &#945;-vectors. Algorithm 1 describes the implementation of this approach to determine a set of &#945;-vectors, &#372;, which is used to approximate the optimal value functions of the product POMDP P &#215; : the set &#372; and the upper and lower bounds V and V of V * are initialized, with V set as the prediction of (5); the solver then repeatedly samples beliefs, backs up &#945;-vectors, and prunes the &#948;-dominated ones before returning &#372;.</p></div>
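The intermediate target-level bookkeeping described above can be transcribed literally as two max-operations. This is only a sketch of that step under the description given in the text, with `q_lower` and `q_upper` standing in for the Q-value bounds; the full sampling routine of SARSOP [11] is considerably more involved.

```python
def intermediate_targets(L_prev, U_prev, q_lower, q_upper, gamma, eps, t):
    """Intermediate target levels at tree depth t, as described in the text:
    L' = max(L_{t-1}, Q_lower) and U' = max(U_{t-1}, Q_upper + gamma**(-t) * eps)."""
    L_new = max(L_prev, q_lower)
    U_new = max(U_prev, q_upper + gamma ** (-t) * eps)
    return L_new, U_new

# Illustrative numbers only:
L1, U1 = intermediate_targets(L_prev=1.0, U_prev=2.0,
                              q_lower=1.5, q_upper=1.8,
                              gamma=0.9, eps=0.1, t=1)
```

As the depth t grows, the gamma**(-t) factor relaxes the gap the sampler must close, which is why deeper sampling paths terminate once the leaf gap reaches the scaled target.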
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Deep Q-learning</head><p>Another approach for solving product POMDP problems is deep Q-learning (DQN). This approach employs neural networks to map a belief product state to the corresponding state-action values, i.e. Q values, from which the agent chooses the best action. Compared to the SARSOP solver, DQN requires more computation time since it uses Monte Carlo simulation. However, it is easier to implement the tracking-frontier function T F (q, T ) of (16) in DQN than in the SARSOP solver. As stated in Section 3.4, the LTL-induced LDGBA with the frontier set T can handle more complex tasks without adding extra computational complexity <ref type="bibr">[32]</ref> to the originally derived automaton. In this study, the tracking-frontier function that records the visited and non-visited accepting sets in each round is implemented in DQN.</p><p>Figure <ref type="figure">2</ref> illustrates the Q networks' architecture, which is similar to a convolutional neural network (CNN). The Q networks consist of convolutional layers, flatten layers, and fully connected layers, with linear activation functions at the end to generate the outputs as the approximated Q values of individual actions corresponding to the input belief state. In this study, a considered POMDP has a two-dimensional state space, and the LTL-induced LDGBA state space provides an additional dimension. Consequently, a belief state of the product POMDP is a three-dimensional array as the input to the Q networks. Compared with fully connected artificial neural networks (ANNs), CNNs can automatically detect important features without any human supervision. In our Q-network architecture, there are no max-pooling layers because every belief point is important, and any missing information may cause a dramatic accuracy drop in the trained network.</p><p>In DQN, two identical Q networks are initialized: the evaluation network (Q e ) and the target network (Q t ). 
The target network is utilized for the next action selection and Q value prediction. The evaluation network is trained every M steps on a number (i.e. the batch size) of experiences (b × i , a × i , r i , b × i+1 ). After every K steps, the target network is updated by copying the weight coefficients of the evaluation network. Using two neural networks prevents the bootstrapping that occurs in a DQN with a single neural network. In the so-called experience replay, once the random sample set U(D) is collected, the new Q value is computed by the equation as in <ref type="bibr">[16]</ref>, using <ref type="formula">(16)</ref> and <ref type="bibr">(17)</ref>; the rewards r i are calculated by <ref type="bibr">(13)</ref>; the samples U(D) of length M are selected randomly; and the training loop repeats while the condition in <ref type="bibr">(20)</ref> holds.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>At the end of each training interval, Q e is trained on the set of new Q values Q new ; after every K steps (i.e. when i &gt; 0 and i % K = 0), the weights of Q e are passed to Q t . When training ends, the evaluation network Q e is saved.</p></div>
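The two-network schedule described above can be sketched with a tabular stand-in for the Q networks. This is hypothetical illustration code: dictionaries replace the real convolutional networks, and the batch tuples follow the (b, a, r, b′) notation of the text.

```python
# Toy stand-in for the evaluation/target network pair in DQN (hypothetical;
# dictionaries replace the real Q networks to illustrate the schedule only).

class TwoNetworkDQN:
    def __init__(self, n_actions, gamma=0.95, K=50):
        self.q_e = {}                # evaluation "network": belief -> Q values
        self.q_t = {}                # target "network"
        self.n_actions, self.gamma, self.K = n_actions, gamma, K
        self.step = 0

    def q(self, table, b):
        return table.get(b, [0.0] * self.n_actions)

    def targets(self, batch):
        # y_i = r_i + gamma * max_a' Q_t(b'_i, a'): the *target* network
        # predicts next-state values, which avoids bootstrapping off the
        # network currently being trained.
        return [r + self.gamma * max(self.q(self.q_t, b_next))
                for (b, a, r, b_next) in batch]

    def maybe_sync(self):
        # Every K steps, copy the evaluation weights into the target network.
        self.step += 1
        if self.step % self.K == 0:
            self.q_t = {k: list(v) for k, v in self.q_e.items()}

agent = TwoNetworkDQN(n_actions=4, gamma=0.9, K=2)
agent.q_e['b0'] = [1.0, 2.0, 3.0, 4.0]
agent.maybe_sync()                                 # step 1: no copy yet
y_before = agent.targets([('b', 0, 1.0, 'b0')])    # target net still at zero
agent.maybe_sync()                                 # step 2: weights copied
y_after = agent.targets([('b', 0, 1.0, 'b0')])
```

The point of the schedule is visible in the example: the target value only changes after the periodic weight copy, so the regression target stays fixed between syncs.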
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Simulation results</head><p>We evaluate our methodologies on two simulations with discrete POMDP state spaces. We first carry out the simulations over a partially observable grid world. The product POMDP models are generated in Python 3.9 with Rabinizer 4 <ref type="bibr">[35]</ref> and then solved by the Julia SARSOP solver <ref type="bibr">[11]</ref> and DQN (via Python), respectively. Furthermore, the grid-world simulation is scaled to different sizes of state and observation spaces to evaluate the scalability of the value iteration approach via SARSOP. Then, we apply the value iteration approach to a more realistic office scenario with a TurtleBot2 via PyBullet 3.0 <ref type="bibr">[36]</ref>. All simulations are completed on a desktop with a 3.20 GHz eight-core CPU and 32 GB of RAM. Some source codes and supplementary materials are provided. 1</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.">Grid-world simulation by SARSOP</head><p>Here we consider a grid-world workspace with a size of 10 × 10, as shown in Figure <ref type="figure">3</ref>. The labeled cells are marked with different colors representing different areas of interest. For example, 'a' and 'b' are labeled on the goal states that an agent may visit multiple times according to the specified task. 'c' is the label on the trapping states, from which an agent cannot escape once entered. 'B' represents the block states, i.e. obstacles, that the agent is not able to pass.</p><p>The agent, e.g. a mobile robot, in the grid world can take four actions at each state: up, left, down, and right. After taking each action, the agent moves toward the desired state with a probability of 0.9 and moves sideways with an equally weighted probability distribution otherwise. If the next state is outside the grid world or an obstacle, the agent stays at the current state. In addition, the agent can observe the current state with a probability of 0.9 and the adjacent states with a total probability of 0.1, uniformly distributed. The obstacles (i.e. the states labeled with 'B') cannot be observed. The discount factor γ is set as 0.95.</p><p>In the grid world, the agent is required to visit the states labeled 'a' and 'b' once at a time for infinitely many times, as stated in Section 3.4. The LTL formula is written as <ref type="bibr">(14)</ref>, and the induced LDGBA is shown in Figure <ref type="figure">1</ref>. The POMDP environment has 100 states, and its corresponding product POMDP consists of 300 states. The set of accepting sets of the derived LDGBA is F = {{q 0 }, {q 1 }}, and the standard rewards in (8) are assigned to the labeled state transitions to q 0 and q 1 .</p><p>We first use the SARSOP solver to obtain the optimal value functions of belief states and then derive the optimal policy. 
Figure <ref type="figure">4</ref>(a) illustrates an induced path from a start state, marked as a large purple solid circle. The agent moves straight to 'a', and the LDGBA state transits from q 0 to q 1 . In Figure 4(b), it can be seen that the agent keeps visiting state 'a' since q 1 has a recurrent transition by consuming the symbol 'a' in the LDGBA (Figure <ref type="figure">1</ref>). The agent maximizes the collected reward by visiting 'a' at q 1 infinitely many times, and therefore the specified task cannot be accomplished.</p><p>To address the above issue discussed in Section 3.4, we apply the redesigned reward in <ref type="bibr">(15)</ref> by assigning rewards only to the transitions that lead to the accepting states from a different automaton state. Then, after solving the problem again via SARSOP and obtaining the optimal policy, we generate another path, shown in Figure <ref type="figure">5</ref>. It illustrates a successful run in which the agent visits state 'a', then 'b', and keeps visiting them alternately for infinitely many times.</p><p>Starting from the initial state, shown in Figure <ref type="figure">5</ref>(a), the agent moves straight left to reach the area labeled 'a', which triggers the automaton transition from q 0 to q 1 . A bend in the route indicates that the agent pauses there for an extra time step due to the action uncertainty. Figure 5(b) shows the induced path on the POMDP at the automaton state q 1 . The agent leaves area 'a' and moves right, then up, to reach the green area labeled 'b'. Up to this point, the agent has successfully completed one round, and the automaton state transits back to q 0 . In Figure 5(c), the agent moves from 'b' to 'a' again to start another round.</p><p>We also tested the algorithm on the same grid-world problem with various workspace sizes to show the computational complexity. The simulation results are listed in Table <ref type="table">1</ref>. 
It shows the times in seconds used to learn the optimal policies.</p><p>A similar scalability study was performed in <ref type="bibr">[24]</ref>, in which the LTL formula was converted to a DRA. They investigated grid worlds with state spaces up to 10 × 10, and the simulation time for the 10 × 10 case is comparable to the one in Table <ref type="table">1</ref>.</p></div>
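The grid-world dynamics above (intended move with probability 0.9, each sideways move with 0.05, blocked moves keeping the agent in place) can be encoded as a short sketch. The helper names are ours, not taken from the paper's code.

```python
# Hypothetical encoding of the grid-world transition model described above:
# intended move with p = 0.9, each sideways move with p = 0.05, and any move
# off the grid or into an obstacle keeps the agent at the current state.
N = 10
ACTIONS = {'up': (-1, 0), 'down': (1, 0), 'left': (0, -1), 'right': (0, 1)}
SIDEWAYS = {'up': ('left', 'right'), 'down': ('left', 'right'),
            'left': ('up', 'down'), 'right': ('up', 'down')}

def transition(s, a, obstacles=frozenset()):
    """Return {next_state: probability} for state s = (row, col), action a."""
    dist = {}
    for move, p in [(a, 0.9)] + [(m, 0.05) for m in SIDEWAYS[a]]:
        dr, dc = ACTIONS[move]
        nxt = (s[0] + dr, s[1] + dc)
        if not (0 <= nxt[0] < N and 0 <= nxt[1] < N) or nxt in obstacles:
            nxt = s                          # blocked move: stay put
        dist[nxt] = dist.get(nxt, 0.0) + p
    return dist

dist = transition((0, 0), 'up')   # corner cell: 'up' and 'left' are blocked
```

In the corner example, both the intended move and one sideways move are blocked, so their probability mass collapses onto the current state, exactly as the text prescribes.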
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.">Grid-world simulation by DQN</head><p>We tested another model-based approach, using DQN with the tracking-frontier function, on the same grid-world problem. The input shape for the Q networks is (10 × 10 × 3). The dimensions correspond to the size of a belief product state, including the row and column numbers of the workspace in the POMDP and the number of automaton states of the LDGBA, respectively.</p><p>The Q network architecture is shown in Figure 2. In this study, two convolutional layers extract the features from the input belief state. One layer uses eight (4 × 4) filters with stride (2, 2), and the other uses sixteen (2 × 2) filters with stride (1, 1). The outputs are converted to a 1D array by a 'Flatten' layer. Then, two fully-connected layers with 128 and 64 neurons are used to predict the corresponding Q values. The Rectified Linear Unit (i.e. ReLU) is utilized as the activation function in the Q networks. The learning process includes 500 steps per episode for 8,000 episodes. The two Q networks, the evaluation network Q e and the target network Q t , are randomly initialized. The training batch size for the evaluation network is 32, and the target network is updated by copying the weight coefficients of Q e every 50 steps. Since DQN uses Monte Carlo simulation, it is computationally intensive. The computing time for this simulation is 49450.4041 seconds. The frontier set T is initialized as {{q 0 }, {q 1 }}. The rewards are assigned on the states labeled 'a' or 'b', resulting in the automaton transitions to q 1 or q 0 , respectively. Once the automaton state in an accepting set of T is visited, this accepting set is removed from T , and the corresponding reward is disabled too, as indicated in (16). After learning, the evaluation Q network can predict the optimal Q values for a given belief state. 
Consequently, the optimal policy can be derived. Figure <ref type="figure">6</ref> shows an induced path resulting in a successful run, in which the agent recurrently visits states 'a' and 'b' once at a time for infinitely many times. It should be noted that the generated path differs from the one in Figure <ref type="figure">5</ref> because of the action uncertainty.</p></div>
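The tracking-frontier mechanism used in this section can be sketched as below. This is hypothetical code, not the authors' implementation: the reward value and the reset-when-empty convention (so that "each round" can repeat indefinitely) are our reading of the reward redesign and the tracking-frontier function described above.

```python
# Sketch of a tracking-frontier step TF(q, T) with its reward gating
# (hypothetical; accepting sets follow the example F = {{q0}, {q1}} above,
# and resetting the frontier when it empties is our assumed convention).

F = [frozenset({'q0'}), frozenset({'q1'})]   # accepting sets of the LDGBA

def tf_step(q, frontier):
    """Return (reward, updated frontier) after visiting automaton state q."""
    for acc in frontier:
        if q in acc:
            remaining = [s for s in frontier if s is not acc]
            # A round completes once every accepting set has been visited;
            # the frontier then resets so the task can repeat indefinitely.
            return 1.0, (remaining if remaining else list(F))
    return 0.0, frontier

frontier = list(F)
r1, frontier = tf_step('q1', frontier)   # visit {q1}: rewarded, set removed
r2, frontier = tf_step('q1', frontier)   # revisit: reward now disabled
r3, frontier = tf_step('q0', frontier)   # completes the round, frontier resets
```

The second call shows why this fixes the failure mode of Section 5.1: looping on a single accepting set stops paying off, so the optimal policy must alternate between 'a' and 'b'.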
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3.">PyBullet TurtleBot simulations</head><p>This section considers an office environment generated in PyBullet 3.0 <ref type="bibr">[36]</ref>, a virtual robotic platform, as shown in Figure <ref type="figure">7</ref>. First, we map this virtual office environment to a grid world to learn the optimal policy. Then, the policy is applied to a TurtleBot2 in PyBullet, where we can implement the robot dynamics and a PID controller to ensure that the robot follows the path generated from the derived policy.</p><p>The environment has four office rooms denoted as 'a', 'b', 'c', and 'd', a storage room denoted as 'S', a printer's room denoted as 'Print', and a supply station for the robot to recharge denoted as 'Sply'. Two big windows (facing west) are located at offices 'a' and 'd'. In addition, the robot can also observe 'wall', 'hallway', and 'door'. The sensor and actuator uncertainties can be modeled as the observation and action probabilities in the POMDP abstraction, as discussed in <ref type="bibr">[37,</ref><ref type="bibr">38]</ref>. We also assume that the TurtleBot can successfully follow its navigation controller by moving in the intended direction with a probability of 0.9. Otherwise, the agent accidentally moves in any other direction with equal probability (a total probability of 0.1). Moving toward a wall keeps the robot at the same location.</p><p>The office space is divided into a 4 × 4 grid with 16 states, and the windows, walls, hallways, and doors can be sensed by the TurtleBot (i.e. the agent) through its embedded sensors. Given a specified task, we can formulate this office scenario as a product POMDP problem and apply the SARSOP solver to obtain the optimal policy. 
Two different observation settings are considered in this study.</p></div>
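Since both approaches are model-based, the belief over the 16 grid states is maintained with the standard Bayes filter, which consumes exactly the transition and action/observation probabilities in the abstraction above. The sketch below is generic; the two-state model at the end is illustrative only.

```python
# Standard POMDP belief update: b'(s') ∝ Z(s', a, o) * sum_s T(s, a, s') b(s).
# Generic sketch; the tiny two-state model below is illustrative only.

def belief_update(b, a, o, T, Z):
    n_s = len(b)
    bp = [Z[a][s2][o] * sum(T[a][s][s2] * b[s] for s in range(n_s))
          for s2 in range(n_s)]
    norm = sum(bp)
    # Normalize; if the observation has zero likelihood, keep the old belief.
    return [x / norm for x in bp] if norm > 0 else list(b)

T = [[[1.0, 0.0], [0.0, 1.0]]]       # one action, deterministic self-loops
Z = [[[0.9, 0.1], [0.1, 0.9]]]       # observation reveals the state w.p. 0.9
b_new = belief_update([0.5, 0.5], 0, 0, T, Z)
```

Starting from a uniform belief, one observation with 0.9 accuracy shifts the belief to (0.9, 0.1), which is why the four-directional setting below converges so much faster than the single-direction one.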
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3.1.">Observation of the surroundings</head><p>In this setting, the agent can observe the surrounding information of the current state in all four directions without a specific order. There are a total of 8 observations, and each state has only one observation. In Case 1, starting from an initial position, the TurtleBot visits the printer's room to collect the documents and then carries the documents to offices 'a' or 'c' recurrently. Meanwhile, the TurtleBot shall avoid the storage room. This task can be formulated via LTL as below, and the induced LDGBA with only single-label transitions has four automaton states.</p><p>Figure <ref type="figure">8</ref>(a) shows that, following the optimal policy, the TurtleBot starts from the end of the hallway (i.e. its initial position) and arrives at the printer's room first, as the red route indicates. The agent then leaves immediately after picking up the documents and heads to office room 'c', as the yellow route shows. Figure <ref type="figure">8(b)</ref> shows that the TurtleBot leaves office 'c' and continuously completes the second round. Since the TurtleBot can observe all four directions, the belief converges fast, and the agent can plan the path accordingly. The path shows that the TurtleBot efficiently completed the task, with small action uncertainty indicated by the to-and-fro part of the yellow path.</p><p>Case 2 requires the TurtleBot to visit the supply station right after it accomplishes the delivery task in Case 1, and the storage room must also be avoided during the entire task. The LTL formula of Case 2 in ( <ref type="formula">22</ref>) is extended from part of the formula (21) used in Case 1, denoted as ϕ 1 . 
Different from the task in Case 1, we only require the agent to complete one run in Case 2.</p><p>As shown in Figure <ref type="figure">9</ref>, the TurtleBot starts from office 'b' (as the initial state), moves to the print room, and delivers the documents to office 'a'. Then, it arrives at the 'Sply' station to complete this task. Two states on the red route are visited more than once because of the action uncertainty.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3.2.">Observation in a single direction</head><p>In another observation setting, the agent can only observe in one direction at each state it visits. The observation set is {'wall', 'hallway', 'door', 'window'}. Assuming each observation has the same chance of being detected by the agent's sensors, the observation probabilities can be calculated. For example, the agent can observe 'door', 'wall', and 'window' with probabilities of 0.25, 0.5, and 0.25, respectively, in office 'a'. It can be seen that the observation uncertainty is much higher than that in the previous observation setting. Consequently, the computation time is longer for the agent to learn the optimal policies, as shown in Table <ref type="table">2</ref>.</p><p>Compared with the path planned with all four-directional observations, a single observation element provides the agent with less sense of what the current state is, and the increased observation uncertainty makes optimal path planning more difficult. As Figures 10 and 11 show, the agent encountered difficulty in selecting suitable actions, especially around the 'hallway' states between the office rooms 'c' and 'd' and the printer's room, which appears as the bends and edges in the planned paths.</p></div>
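The single-direction observation probabilities can be derived mechanically from the four features surrounding a state. The small helper below is hypothetical, but the side list for office 'a' reproduces the worked example in the text.

```python
# Each of the four sides is sensed with equal chance, so P(o | s) is simply
# the fraction of sides of state s showing feature o (hypothetical helper;
# the side list for office 'a' matches the example in the text).
from collections import Counter

def obs_probs(sides):
    counts = Counter(sides)
    return {o: c / 4 for o, c in counts.items()}

office_a = obs_probs(['door', 'wall', 'wall', 'window'])
```

For office 'a' this yields 'door' with 0.25, 'wall' with 0.5, and 'window' with 0.25, matching the probabilities stated above.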
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Conclusions</head><p>This paper aims to solve POMDP problems with high-level and complex tasks that can be formulated via LTL and then converted to LDGBAs. Such a motion planning problem becomes equivalent to finding an optimal policy on the product of the POMDP and the induced LDGBA. Two model-based RL approaches are adopted to derive the optimal policy as a function of the belief state. One approach is based on the PBVI methods, and we utilize the SARSOP solver on the product POMDP to approximate the optimal value functions. In the other approach, we employ a CNN architecture for the Q networks in DQN to map the relationship between a belief state and its optimal state-action values (i.e. Q values). To address the issue that the LDGBA may fail to find a deterministic policy for some task specifications, we redesign the reward function and introduce a frontier-tracking function for the above-mentioned approaches, respectively. The simulations demonstrate that the agent can accomplish the specified tasks by following the derived optimal policies. In addition, we perform a scalability study on the value iteration approach. In contrast, DQN is more computationally intensive because many episodes are needed for convergence while updating the belief state at each step. However, with the implementation of the frontier-tracking function, DQN is adaptive to various task specifications. In future works, our proposed DQN approach can potentially be extended to a hybrid framework accommodating both model-free and model-based approaches, in which the belief state can be updated based on the approximation of transition and observation likelihoods. Furthermore, this DQN approach can also be combined with inverse reinforcement learning (IRL) in POMDPs to use the learned Q values as a feature to extract the rewards in problems with large state spaces. 
This approach has been explored in other works <ref type="bibr">[39,</ref><ref type="bibr">40]</ref> and can be further researched.</p><p>This study focuses on high-level motion planning to find the optimal policy for the agent to accomplish user-specified complex tasks. Although we use a virtual TurtleBot2 on PyBullet 3.0 to validate the derived policy, it would be interesting to incorporate the internal controllers of a robot in real-world applications. In that case, the discrete action needs to be mapped to the desired linear and angular velocities by considering the robot's dynamics and kinematics. The low-level controllers, such as a PID controller, can send the control signals to the actuators (e.g. the motors) to drive the wheels. At the same time, the observations captured by the agent's embedded sensors serve dual purposes in the POMDP framework. First, they provide feedback for the trajectory control between the states. Second, the observations can also be used to update the agent's belief state for the high-level decision-making process.</p><p>The proposed model-based approaches have limitations because the transition and observation probabilities must be provided to update the belief state. Model-free RL methods will be considered for POMDPs with LTL specifications in future works. In that case, the decision-making will depend on the history of observations and actions.</p></div>
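The mapping from a discrete action to velocity commands mentioned above can be sketched as a simple heading controller. This is a hypothetical illustration, not the paper's controller: the proportional gain, speed limit, and alignment threshold are all assumed values.

```python
# Hypothetical low-level mapping from a discrete grid action to (linear,
# angular) velocity targets: a proportional controller on the heading error
# sets the angular velocity; gains and speeds are illustrative assumptions.
import math

HEADINGS = {'up': math.pi / 2, 'down': -math.pi / 2,
            'left': math.pi, 'right': 0.0}

def velocity_command(action, current_heading, kp=1.5, v_max=0.3):
    """Return (linear, angular) velocity targets for the chosen action."""
    err = HEADINGS[action] - current_heading
    err = math.atan2(math.sin(err), math.cos(err))   # wrap to [-pi, pi]
    omega = kp * err
    # Drive forward only when roughly facing the intended direction.
    v = v_max if abs(err) < 0.3 else 0.0
    return v, omega
```

A full PID controller would add integral and derivative terms on the same wrapped heading error; the proportional term alone already conveys how the high-level action becomes wheel commands.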
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Note</head><p>1. <ref type="url">https://github.com/JunchaoLi001/LDGBA-Model_Checking</ref></p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Disclosure statement</head><p>No potential conflict of interest was reported by the author(s).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Funding</head><p>Li, Wang, and Xiao would like to thank the US Department of Education (ED #P116S210005) and the US National Science Foundation (NSF #2226936) for supporting this research. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Notes on contributors</head></div></body>
		</text>
</TEI>
