<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Beyond Black-Box Advice: Learning-Augmented Algorithms for MDPs with Q-Value Predictions</title></titleStmt>
			<publicationStmt>
				<publisher>NeurIPS</publisher>
				<date>02/13/2024</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10538721</idno>
					<idno type="doi"></idno>
					
					<author>Tongxin Li</author><author>Yiheng Lin</author><author>Shaolei Ren</author><author>Adam Wierman</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[We study the tradeoff between consistency and robustness in the context of a single-trajectory time-varying Markov Decision Process (MDP) with untrusted machine-learned advice. Our work departs from the typical approach of treating advice as coming from black-box sources by instead considering a setting where additional information about how the advice is generated is available. We prove a first-of-its-kind consistency and robustness tradeoff given Q-value advice under a general MDP model that includes both continuous and discrete state/action spaces. Our results highlight that utilizing Q-value advice enables dynamic pursuit of the better of machine-learned advice and a robust baseline, thus result in near-optimal performance guarantees, which provably improves what can be obtained solely with black-box advice.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Machine-learned predictions and hand-crafted algorithmic advice are both crucial in online decisionmaking problems, driving a growing interest in learning-augmented algorithms <ref type="bibr">[1,</ref><ref type="bibr">2]</ref> that exploit the benefits of predictions to improve the performance for typical problem instances while bounding the worst-case performance <ref type="bibr">[3,</ref><ref type="bibr">4]</ref>. To this point, the study of learning-augmented algorithms has primarily viewed machine-learned advice as potentially untrusted information generated by black-box models. Yet, in many real-world problems, additional knowledge of the machine learning models used to produce advice/predictions is often available and can potentially improve the performance of learning-augmented algorithms.</p><p>A notable example that motivates our work is the problem of minimizing costs (or maximizing rewards) in a single-trajectory Markov Decision Process (MDP). More concretely, a value-based machine-learned policy e &#8673; can be queried to provide suggested actions as advice to the agent at each step <ref type="bibr">[5]</ref><ref type="bibr">[6]</ref><ref type="bibr">[7]</ref>. Typically, the suggested actions are chosen to minimize (or maximize, in case of rewards) estimated cost-to-go functions (known as Q-value predictions) based on the current state.</p><p>Naturally, in addition to suggested actions, the Q-value function itself can also provide additional information (e.g., the long-term impact of choosing a certain action) potentially useful to the design of a learning-augmented algorithm. Thus, this leads to two different designs for learning-augmented algorithms in MDPs: black-box algorithms and grey-box algorithms. A learning-augmented algorithm using e &#8673; is black-box if e &#8673; provides only the suggested action e u to the learning-augmented algorithm, whereas it is value-based (a.k.a., grey-box) if e &#8673; provides an estimate of the Q-value function e Q (that also implicitly includes a suggested action e u obtained by minimizing e Q) to the learning-augmented algorithm.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Value-based policies e</head><p>&#8673; often perform well empirically in stationary environments in practice <ref type="bibr">[5,</ref><ref type="bibr">6]</ref>. However, they may not have performance guarantees in all environments and can perform poorly at times due to a variety of factors, such as non-stationary environments <ref type="bibr">[8]</ref><ref type="bibr">[9]</ref><ref type="bibr">[10]</ref><ref type="bibr">[11]</ref>, policy collapse <ref type="bibr">[12]</ref>, sample inefficiency <ref type="bibr">[13]</ref>, and/or when training data is biased <ref type="bibr">[14]</ref>. As a consequence, such policies often are referred to as "untrusted advice" in the literature on learning-augmented algorithms, where the notion of "untrusted" highlights the lack of performance guarantees. In contrast, recent studies in competitive online control <ref type="bibr">[15]</ref><ref type="bibr">[16]</ref><ref type="bibr">[17]</ref><ref type="bibr">[18]</ref><ref type="bibr">[19]</ref><ref type="bibr">[20]</ref><ref type="bibr">[21]</ref> have begun to focus on worst-case analysis and provide control policies &#8673; with strong performance guarantees even in adversarial settings, referred to as robustness, i.e., &#8673; provides "trusted advice." Typically, the goal of a learning-augmented online algorithm <ref type="bibr">[1,</ref><ref type="bibr">3]</ref> is to perform nearly as well as the untrusted advice when the machine learned policy performs well, a.k.a., achieve consistency, while also ensuring worst-case robustness. Combining the advice of an untrusted machine-learned policy e &#8673; and a robust policy &#8673; naturally leads to a tradeoff between consistency and robustness. In this paper, we explore this tradeoff in a time-varying MDP setting and seek to answer the following key question for learning-augmented online algorithms:</p><p>Can Q-value advice from an untrusted machine-learned policy, e &#8673;, in a grey-box scenario provide more benefits than the black-box action advice generated by e &#8673; in the context of consistency and robustness tradeoffs for MDPs?</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.1">Contributions</head><p>We answer the question above in the affirmative by presenting and analyzing a unified projectionbased learning-augmented online algorithm (PROjection Pursuit policy, simplified as PROP in Algorithm 1) that combines action feedback from a trusted, robust policy &#8673; with an untrusted ML policy e &#8673;. In addition to offering a consistency and robustness tradeoff for MDPs with black-box advice, our work moves beyond the black-box setting. Importantly, by considering the grey-box setting, the design of PROP demonstrates that the structural information of the untrusted machinelearned advice can be leveraged to determine the trust parameters dynamically, which would otherwise be challenging (if not impossible) in a black-box setting. To our best knowledge, PROP is the first-ofits-kind learning-augmented algorithm that applies to general MDP models, which allow continuous or discrete state and action spaces.</p><p>Our main results characterize the tradeoff between consistency and robustness for both black-box and grey-box settings in terms of the ratio of expectations, RoE, built upon the traditional consistency and robustness metrics in <ref type="bibr">[3,</ref><ref type="bibr">22,</ref><ref type="bibr">23,</ref><ref type="bibr">4]</ref> for the competitive ratio. We show in Theorem 5.2 that for the black-box setting, PROP is (1 + O(( <ref type="formula">1</ref>) ))-consistent and (ROB + O( ))-robust where 0 &#63743; &#63743; 1 is a hyper-parameter. Moreover, for the black-box setting, PROP cannot be both (1 + o( ))-consistent and (ROB + o(( <ref type="formula">1</ref>) ))-robust for any 0 &#63743; &#63743; 1 where is the diameter of the action space. In sharp contrast, by using a careful design of a robustness budget parameter in PROP with Q-value advice (grey-box setting), PROP is 1-consistent and (ROB + o(1))-robust.</p><p>Our result highlights the benefits of exploiting the additional information informed by the estimated Q-value functions, showing that the ratio of expectations can approach the better of the two policies e &#8673; and &#8673; for any single-trajectory time-varying, and even possibly adversarial environments -if the value-based policy e &#8673; is near-optimal, then the worst-case RoE(PROP) can approach 1 as governed by a consistency parameter; otherwise, RoE(PROP) can be bounded by the ratio of expectations of &#8673; subject to an additive term o(1) that decreases when the time horizon T increases.</p><p>A key technical contribution of our work is to provide the first quantitative characterization of the consistency and robustness tradeoff for a learning-augmented algorithm (PROP) in a general MDP model, under both standard black-box and novel grey-box settings. Importantly, PROP is able to leverage a broad class of robust policies, called Wasserstein robust policies, which generalize the well-known contraction principles that are satisfied by various robust policies <ref type="bibr">[24]</ref> and have been used to derive regrets for online control <ref type="bibr">[19,</ref><ref type="bibr">25]</ref>. A few concrete examples of Wasserstein robust policies applicable for PROP are provided in Table <ref type="table">1</ref>(Section 3.1).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.2">Related Work</head><p>Learning-Augmented Algorithms with Black-Box Advice. The concept of integrating black-box machine-learned guidance into online algorithms was initially introduced by <ref type="bibr">[26]</ref>. <ref type="bibr">[3]</ref> coined terms "robustness" and "consistency" with formal mathematical definitions based on the competitive ratio. Over the past few years, the consistency and robustness approach has gained widespread popularity and has been utilized to design online algorithms with black-box advice for various applications, including ski rental <ref type="bibr">[3,</ref><ref type="bibr">22,</ref><ref type="bibr">23]</ref>, caching <ref type="bibr">[27]</ref><ref type="bibr">[28]</ref><ref type="bibr">[29]</ref>, bipartite matching <ref type="bibr">[30]</ref>, online covering <ref type="bibr">[31,</ref><ref type="bibr">32]</ref>, convex body chasing <ref type="bibr">[4]</ref>, nonlinear quadratic control <ref type="bibr">[33]</ref>. The prior studies on learning-enhanced algorithms have mainly focused on creating meta-strategies that combine online algorithms with black-box predictions, and typically require manual setting of a trust hyper-parameter to balance consistency and robustness. A more recent learning-augmented algorithm in <ref type="bibr">[33]</ref> investigated the balance between competitiveness and stability in nonlinear control in a black-box setting. However, this work limits the robust policy to a linear quadratic regulator and does not provide a theoretical basis for the selection of the trust parameters. <ref type="bibr">[34]</ref> generalized the black-box advice setting by considering distributional advice.</p><p>Online Control and Optimization with Structural Information. Despite the lack of a systematic analysis, recent studies have explored the usage of structural information in online control and optimization problems. Closest to our work, <ref type="bibr">[7]</ref> considered a related setting where the Q-value function is available as advice, and shows that such information can be utilized to reduce regret in a tabular MDP model. In contrast, our analysis applies to more general models that allow continuous state/action spaces. In <ref type="bibr">[17]</ref>, the dynamical model and the predictions of disturbances in a linear control system are shown to be useful in achieving a near-optimal consistency and robustness tradeoff. The predictive optimization problem solved by MPC <ref type="bibr">[35,</ref><ref type="bibr">36,</ref><ref type="bibr">16,</ref><ref type="bibr">37]</ref> can be regarded as a special realization of grey-box advice, where an approximated cost-to-go function is constructed from structural information that includes the (predicted) dynamical model, costs, and disturbances.</p><p>MDP with External Feedback. Feedback from external sources such as control baselines <ref type="bibr">[38,</ref><ref type="bibr">39]</ref>, visual explanations <ref type="bibr">[40]</ref>, and human experts <ref type="bibr">[41]</ref><ref type="bibr">[42]</ref><ref type="bibr">[43]</ref> is often available in MDP. This external feedback can be beneficial for various purposes, such as ensuring safety <ref type="bibr">[44]</ref>, reducing variance <ref type="bibr">[38]</ref>, training human-like chatbots <ref type="bibr">[41]</ref>, and enhancing overall trustworthiness <ref type="bibr">[45]</ref>, among others. The use of control priors has been proposed by <ref type="bibr">[38]</ref> as a way to guarantee the Lyapunov stability of the training process in reinforcement learning. They used the Temporal-Difference method to tune a coefficient that combines a RL policy and a control prior, but without providing a theoretical foundation. Another related area is transfer learning in RL, where external Q-value advice from previous tasks can be adapted and utilized in new tasks. Previous research has shown that this approach can outperform an agnostic initialization of Q, but these results are solely based on empirical observations and lack theoretical support <ref type="bibr">[46]</ref><ref type="bibr">[47]</ref><ref type="bibr">[48]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Problem Setting</head><p>We consider a finite-horizon, single-trajectory, time-varying MDP with T discrete time steps. The state space X is a subset of a normed vector space embedded with a norm k &#8226; k X . The actions are chosen from a convex and compact set U in a normed vector space characterized by some norm k &#8226; k U . Notably, U can represent either continuous actions or the probability distributions used when choosing actions from a finite set. <ref type="foot">1</ref> The diameter of the action space U is denoted by := max u2U kuk U . Denote [T ] := {0, . . . , T 1}. For each time step t 2 [T ], let P t : X &#8677; U ! P X be the transition probability, where P X is a set of probability measures on X . We consider time-varying costs c t : X &#8677; U ! R + , while rewards can be treated similarly by adding a negative sign. An initial state x 0 2 X is fixed. This MDP model is compactly represented by MDP(X , U , T, P, c).</p><p>The goal of a policy in this MDP setting is to minimize the total cost over all T steps. The policy agent has no access to the full MDP. At each time step t 2 [T ], only the incurred cost value c t (x t , u t ) and the next state x t+1 &#8672; P t (&#8226;|x t , u t ) are revealed to the agent after playing an action u t 2 U. We denote a policy by &#8673; = (&#8673; t : t 2 [T ]) where each &#8673; t : X ! U chooses an action u t when observing x t at step t 2 [T ]. Note that our results can be generalized to the setting when &#8673; t is stochastic and outputs a probability distribution on U . Given MDP(X , U, T, P, c), we consider an optimization with time-varying costs and transition dynamics. Thus, our goal is to find a policy &#8673; that minimizes the following expected total cost:</p><p>where the randomness in E P,&#8673; is from the transition dynamics P = (P t : t 2 [T ]) and the policy &#8673; = (&#8673; t : t 2 [T ]). We focus our analysis on the expected dynamic regret and the ratio of expectations, defined below, as the performance metrics for our policy design.</p><p>Definition 1 (Expected dynamic regret). Given MDP(X , U , T, P, c), the (expected) dynamic regret of a policy &#8673; = (&#8673; t : t 2 [T ]) is defined as the difference between the expected cost induced by the policy &#8673;, J(&#8673;) in (1), and the optimal expected cost J ? := inf &#8673; J(&#8673;), i.e., DR(&#8673;) := J(&#8673;) J ? .</p><p>Dynamic regret is a more general (and often more challenging to analyze) measure than classical static regret, which has been mostly used for stationary environments <ref type="bibr">[49,</ref><ref type="bibr">50]</ref>. The following definition of the ratio of expectations <ref type="bibr">[51,</ref><ref type="bibr">52]</ref> will be used as an alternative performance metric in our main results.</p><p>Definition 2 (Ratio of expectations). Given MDP(X , U , T, P, c), the ratio of expectations of a policy &#8673; = (&#8673; t : t 2 [T ]) is defined as RoE(&#8673;) := J(&#8673;)/J ? where J(&#8673;) and J ? are the same as in Definition 1.</p><p>Dynamic regret and the ratio of expectations defined above also depend on the error of the untrusted ML advice; we make this more explicit in Section 3.2. Next, we state the following continuity assumption, which is standard in MDPs with continuous action and state spaces <ref type="bibr">[53]</ref><ref type="bibr">[54]</ref><ref type="bibr">[55]</ref>. Note that our analysis can be readily adapted to general H&#246;lder continuous costs with minimal modifications.</p><p>Assumption 1 (Lipschitz costs). For any time step t 2 [T ], the cost function c t :</p><p>x 2 X , and u 2 U. Our objective is to achieve a balance between the worst-case guarantees on cost minimization in terms of dynamic regret provided by a robust policy, &#8673;, and the average-case performance of a valued-based policy, e &#8673;, in the context of MDP(X , U , T, P, c). In particular, we denote by ROB 1 a ratio of expectation bound of the robust policy &#8673; such that the worst case RoE(&#8673;) &#63743; ROB. In the learningaugmented algorithms literature, these two goals are referred to as consistency and robustness <ref type="bibr">[3,</ref><ref type="bibr">1]</ref>. Informally, robustness refers to the goal of ensuring worst-case guarantees on cost minimization comparable to those provided by &#8673; and consistency refers to ensuring performance nearly as good as e &#8673; when e &#8673; performs well (e.g., when the instance is not adversarial). Learning-augmented algorithms seek to achieve consistency and robustness by combining &#8673; and e &#8673;, as illustrated in Figure <ref type="figure">1</ref>. Our focus in this work is to design robust and consistent algorithms for two types of advice: black-box advice and grey-box advice. The type of advice that is nearly always the focus in the learningaugmented algorithm literature is black-box advice -only providing a suggested action e u t without additional information. In contrast, on top of the action e u t , grey-box advice can also reveal the internal state of the learning algorithm, e.g., the Q-value e Q t in our setting. This contrast is illustrated in Figure <ref type="figure">1</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Consistency and Robustness in MDPs</head><p>Compared to black-box advice, grey-box advice has received much less attention in the literature, despite its potential to improve tradeoffs between consistency and robustness as recently shown in <ref type="bibr">[34,</ref><ref type="bibr">17]</ref>. Nonetheless, the extra information on top of the suggested action in a grey-box setting potentially allows the learning-augmented algorithm to make a better-informed decision based on the advice, thus achieving a better tradeoff between consistency and robustness than otherwise possible.</p><p>In the remainder of this section, we discuss the robustness properties for the algorithms we consider in our learning-augmented framework (Section 3.1), and introduce the notions of consistency in our grey-box and black-box models in Section 3.2.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Locally Wasserstein-Robust Policies</head><p>We begin with constructing a novel notion of robustness for our learning-augmented framework based on the Wasserstein distance as follows. Denote the robust policy by &#8673; := (&#8673; t : t 2 [T ]), where each &#8673; t maps a system state to a deterministic action (or a probability of actions in the stochastic setting). Denote by &#8674; t1:t2 (&#8674;) the joint distribution of the state-action pair (x t , u t ) 2 X &#8677; U at time t 2 2 [T ] when implementing the baselines &#8673; t1 , . . . , &#8673; t2 consecutively with an initial state-action distribution &#8674;. We use k &#8226; k X &#8677;U := k &#8226; k X + k &#8226; k U as the included norm for the product space X &#8677; U. Let W p (&#181;, &#9003;) denote the Wasserstein p-distance between distributions &#181; and &#9003; whose support set is X &#8677; U:</p><p>where p 2 [1, 1) and J (&#181;, &#9003;) denotes a set of all joint distributions J with a support set X &#8677; U that have marginals &#181; and &#9003;. Next, we define a robustness condition for our learning-augmented framework. Definition 3 (r-locally p-Wasserstein robustness). A policy &#8673; = (&#8673; t : t 2 [T ]) is r-locally p-Wasserstein-robust if for any 0 &#63743; t 1 &#63743; t 2 &lt; T and any pair of state-action distributions &#8674;, &#8674; 0 where the the p-Wasserstein distance between them is bounded by W p (&#8674;, &#8674; 0 ) &#63743; r, for some radius r &gt; 0, the following inequality holds:</p><p>for some function s : [T ] ! R + satisfying P t2[T ] s(t) &#63743; C s where C s &gt; 0 is a constant. Our robustness definition is naturally more relaxed than the usual contraction property in the control/optimization literature <ref type="bibr">[25,</ref><ref type="bibr">35]</ref> -if any two different state-action distributions converge exponentially with respect to the Wasserstein p-distance, then a policy &#8673; is r-locally p-Wasserstein-robust. This is illustrated in Figure <ref type="figure">2</ref>. Note that, although the Wasserstein robustness in Definition 3 well captures a variety of distributional robustness metrics such as the total variation robustness defined on finite state/action spaces, it can also be further generalized to other metrics for probability distributions.</p><p>As shown in Appendix A (provided in the supplementary material), by establishing a connection between the Wasserstein distance and the total variation metric, any policy that induces a regular Markov chain satisfies the fast mixing property and the state-action distribution will converge with respect to the total variation distance to a stationary distribution <ref type="bibr">[56]</ref>. A more detailed discussion can be found in Appendix A.2. Moreover, the Wasserstein-robustmess in Definition 3 includes a set of contraction properties in control theory as special cases. For example, for a locally Wasserstein-robust policy, if the transition kernel P and the baseline policy &#8673; are deterministic, then the state-action distributions become point masses, reducing Definition 3 to a state-action perturbation bound in terms of the `2-norm when implementing the policy &#8673; from different starting states <ref type="bibr">[35,</ref><ref type="bibr">19]</ref>.</p><p>The connections discussed above highlight the existence of several well-known robust policies that satisfy Definition 3. Besides the case of discrete MDPs discussed in Appendix A.2, another prominent example is model predictive control (MPC), for which robustness follows from the results in <ref type="bibr">[19]</ref> (see Appendix A.1 for details). The model assumption below will be useful in our main results. Assumption 2. There exists a -locally p-Wasserstein-robust baseline control policy (Definition 3) &#8673; for some p 1, where is the diameter of the action space U .</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Consistency and Robustness for RoE</head><p>In parallel with the notation of "consistency and robustness" in the existing literature on learningaugmented algorithms <ref type="bibr">[3,</ref><ref type="bibr">1]</ref>, we define a new metric of consistency and robustness in terms of RoE. To do so, we first introduce an optimal policy &#8673; ? . Based on MDP(X , U , T, P, c), let &#8673; ? t = (&#8673; ? t : t 2 [T ]) denote the optimal policy at each time step t 2 [T ], whose optimal Q-value function is</p><p>where E P,&#8673; denotes an expectation with respect to the randomness of the trajectory {(x t , u t ) : t 2</p><p>[T ]} obtained by following a policy &#8673; and the transition probability P at each step t 2 [T ]. The Bellman optimality equations can then be expressed as</p><p>where we write</p><p>This indicates that for each time step t 2 [T ], &#8673; ? t is the greedy policy with respect to its optimal Q-value functions (Q ? t : t 2 [T ]). Note that for any t 2 [T ], Q ? t (x, u) = 0. Given this setup, the value-based policies e &#8673; := (e &#8673; t : t 2 [T ]) take the following form. For any t 2 [T ], a value-based policy e &#8673; t : X ! U produces an action e u t 2 arg min v2U e Q t (x t , v) by minimizing an estimate of the optimal Q-value function e Q t .</p><p>We make the following assumption on the machine-learned untrusted policy e &#8673; and the Q-value advice.</p><p>Assumption 3. The machine-learned untrusted policy e &#8673; is value-based. The Q-value advice e</p><p>We can now define a consistency measure for Q-value advice e Q t , which measures the error of the estimates of the Q-value functions due to approximation error and time-varying environments, etc. Let p 2 (0, 1]. Fix a sequence of distributions &#8674; = (&#8674; t : t 2 [T ]) whose support set is X &#8677; U and let t be the marginal distribution of &#8674; t on X . We define a quantity representing the error of the Q-value advice</p><p>where</p><p>A policy with Q-value functions {Q t : t 2 [T ]} is said to be (", p, &#8674;)-consistent if there exists an " satisfying (4). In addition, a policy is (0, 1)-consistent if e Q t is a Lebesgue-measurable function for all t 2 [T ] and (1, ")-consistent if the L 1 -norm satisfies</p><p>The consistency error of a policy in (4) quantifies how the Q-value advice is close to optimal Q-value functions. It depends on various factors such the function approximation error or training error due to the distribution shift, and has a close connection to a rich literature on value function approximation <ref type="bibr">[57]</ref><ref type="bibr">[58]</ref><ref type="bibr">[59]</ref><ref type="bibr">[60]</ref><ref type="bibr">[61]</ref>. The results in <ref type="bibr">[59]</ref> generalized the worstcase L 1 guarantees to arbitrary L p,&#8674; -norms under some mixing assumptions via policy iteration for a stationary Markov decision process (MDP) with a continuous state space and a discrete action space. Recently, approximation guarantees for the average case for parametric policy classes (such as a neural network) of value functions have started to appear <ref type="bibr">[57,</ref><ref type="bibr">58,</ref><ref type="bibr">60]</ref>. These bounds are useful in lots of supervised machine learning methods such as classification and regression, whose bounds are typically given on the expected error under some distribution. These results exemplify richer instances of the consistency definition (see ( <ref type="formula">4</ref>)) and a summary of these bounds can be found in <ref type="bibr">[61]</ref>. Now, we are ready to introduce our definition of consistency and robustness with respect to the ratio of expectations, similar to the growing literature on learning-augmented algorithms <ref type="bibr">[3,</ref><ref type="bibr">22,</ref><ref type="bibr">23,</ref><ref type="bibr">4]</ref>. We write the ratio of expectations RoE(") of a policy &#8673; as a function of the Q-value advice error " in terms of the L 1 norm, defined in (4).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Definition 4 (Consistency and Robustness</head><p>). An algorithm &#8673; is said to be k-consistent if its worst-case (with respect to the MDP model MDP(X , U, T, P, c)) ratio of expectations satisfies RoE(") &#63743; k for " = 0. On the other hand, it is l-robust if RoE(") &#63743; l for any " &gt; 0.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">The Projection Pursuit Policy (PROP)</head><p>In this section we introduce our proposed algorithm (Algorithm 1), which achieves near-optimal consistency while bounding the robustness by leveraging a robust baseline (Section 3.1) in combination with value-based advice (Section 3.2). A key challenge in the design is how to exploit the benefits of good value-based advice while avoiding following it too closely when it performs poorly. To address this challenge, we propose to judiciously project the value-based advice into a neighborhood of the robust baseline. By doing so, the actions we choose can follow the value-based advice for consistency while staying close to the robust baseline for robustness. More specifically, at each step t 2 [T ], we choose u t = Proj U t (e u t ) where a projection operator Proj U t (&#8226;) : U ! U is defined as</p><p>corresponding to the projection of u onto a ball U t := {u 2 U : ku &#8673; t (x t )k U &#63743; R t }. Note that when the optimal solution of ( <ref type="formula">5</ref>) is not unique, we choose the one on the same line with &#8673; t (x t ) u.</p><p>The PROjection Pursuit policy, abbreviated as PROP, can be described as follows. For a time step t 2 [T ], let e &#8673; t : X ! U denote a policy that chooses an action e u t (arbitrarily choose one if there are multiple minimizers of e Q t ), given the current system state x t at time t 2 [T ] and step t 2 [T ]. An action u t = Proj U t (e u t (x t )) is selected by projecting the machine-learned action e u t (x t ) onto a norm ball U t defined by the robust policy &#8673; given a radius R t 0. Finally, PROP applies to both black-box and grey-box settings (which differ from each other in terms of how the radius R t is decided). The results under both settings are provided in Section 5, revealing a tradeoff between consistency and robustness.</p><p>The radii (R t : t 2 [T ]) can be interpreted as robustness budgets and are key design parameters that determine the consistency and robustness tradeoff. Intuitively, the robustness budgets reflect the trustworthiness on the value-based policy e &#8673; -the larger budgets, the more trustworthiness and hence the more freedom for PROP to follow e &#8673;. How the robustness budget is chosen differentiates the grey-box setting from the black-box one.</p><p>Algorithm 1 PROjection Pursuit Policy (PROP) Initialize :Untrusted policy e &#8673; = (e &#8673; t : t 2 [T ]) and baseline policy &#8673; = (&#8673; t : t 2 [T ]) 1 for t = 0, . . . , T 1 do 2 //Implement black-box (Section 4.1) or grey-box (Section 4.2) procedures 3 (e u t , R t ) BLACK-BOX(x t ) or (e u t , R t ) GREY-BOX(x t ) 4 Set action u t = Proj U t (e u t ) where U t := {u 2 U : ku &#8673; t (x t )k U &#63743; R t } 5 Sample next state x t+1 &#8672; P t (&#8226;|x t , u t ) 6 end 4.1 Black-Box Setting</p><p>In the black-box setting, the only information provided by e &#8673; is a suggested action e u for the learningaugmented algorithm. Meanwhile, the robust policy &#8673; can also be queried to provide advice u. Thus, without additional information, a natural way to utilize both e &#8673; and &#8673; is to decide a projection radius at each time based on the how the obtained e u and u. More concretely, at each time t 2 [T ], the robustness budget R t is chosen by the following BLACK-BOX Procedure, where we set R t = &#8984; t with &#8984; t := ke u t u t k U representing the difference between the two advice measured in terms of the norm k &#8226; k U and 0 &#63743; &#63743; 1 being a tradeoff hyper-parameter that measures the trustworthiness on the machine-learned advice. The choice of R t = &#8984; t can be explained as follows. The value of &#8984; t indicates the intrinsic discrepancy between the robust advice and the machine-learned untrusted advice -the larger discrepancy, the more difficult to achieve good consistency and robustness simultaneously. Given a robust policy and an untrusted policy, by setting a larger , we allow the actual action to deviate more from the robust advice and to follow the untrusted advice more closely, and vice versa. is a crucial hyper-parameter that can be pre-determined to yield a desired consistency and robustness tradeoff. The computation of R t is summarized in Procedure 1 below.</p><p>Procedure 1 BLACK-BOX Procedure at t 2 [T ] (Input: state x t and hyper-parameter 0 &#63743; &#63743; 1) Implement e &#8673; t and &#8673; t to obtain e u t and u t , respectively. Set robustness budget R t = &#8984; t where &#8984; t := ke u t u t k U ; Return (e u t , R t )</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Grey-Box Setting</head><p>In the grey-box setting, along with the suggested action e u, the value-based untrusted policy e &#8673; also provides an estimate of the Q-value function e Q that indicates the long-term cost impact of an action. To utilize such additional information informed by e Q t at each time t 2 [T ], we propose a novel algorithm that dynamically adjusts the budget R t to further improve the consistency and robustness tradeoff. More concretely, let us consider the Temporal-Difference (TD) error TD t = c t 1 + P t 1 e</p><p>V t e Q t 1 . Intuitively, if a non-zero TD-error is observed, the budget R t needs to be decreased so as to minimize the impact of the learning error. However, the exact TD-error is difficult to compute in practice, since it requires complete knowledge of the transition kernels (P t : t 2 [T ]). To address this challenge, we use the following estimated TD-error based on previous trajectories: t (x t , x t 1 , u t 1 ) :=c t 1 (x t 1 , u t 1 ) + inf v2U e Q t (x t , v) e Q t 1 (x t 1 , u t 1 ) .</p><p>Denote by &gt; 0 a hyper-parameter. Based on the estimated TD-error in (6), the robustness budget in Algorithm 1 is set as</p><p>which constitutes two terms. The first term &#8984; t := ke &#8673; t (x t ) &#8673; t (x t )k measures the decision discrepancy between the untrusted policy e &#8673; and the baseline policy &#8673;, which normalizes the total budget, similar to the one used in the black-box setting in Procedure 1. The second term is the approximate TD-error, which is normalized by the Lipschitz constant L Q of Q-value functions. With these terms defined, the GREY-BOX Procedure below first chooses a suggested action e u t by minimizing e Q t and then decides a robustness budget R t using <ref type="bibr">(7)</ref>.</p><p>may not be fully trusted. On the other hand, this sharp contrast between the black-box and grey-box settings reveals that having access to information of value function can improve the tradeoff between consistency and robustness (see Definition 4) non-trivially. A proof of Theorem 5.4 can be found in Appendix D. Applications of our main results are discussed in Appendix A.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Concluding Remarks</head><p>Our results contribute to the growing body of literature on learning-augmented algorithms for MDPs and highlight the importance of considering consistency and robustness in this context. In particular, we have shown that by utilizing the structural information of machine learning methods, it is possible to achieve improved performance over a black-box approach. The results demonstrate the potential benefits of utilizing value-based policies as advice; however, there remains room for future work in exploring other forms of structural information.</p><p>Limitations and Future Work. One limitation of our current work is the lack of analysis of more general forms of black-box procedures. Understanding and quantifying the available structural information in a more systematic way is another future direction that could lead to advances in the design of learning-augmented online algorithms and their applications in various domains.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0"><p>The action space U is assumed to be a continuous, convex, and compact set for more generality. When the actions are discrete, U can be defined as the set of all probability distributions on a finite action space. We relegate the detailed discussions in Appendix A.2 and F.</p></note>
		</body>
		</text>
</TEI>
