<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Learning-Augmented Online Control for Decarbonizing Water Infrastructures</title></titleStmt>
			<publicationStmt>
				<publisher>ACM</publisher>
				<date>06/16/2025</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10661266</idno>
					<idno type="doi">10.1145/3679240.3734595</idno>
					
					<author>Jianyi Yang</author><author>Pengfei Li</author><author>Tongxin Li</author><author>Adam Wierman</author><author>Shaolei Ren</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Not Available]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Water supply is a critical utility for numerous infrastructures, including residential and commercial buildings, manufacturing facilities, and data centers. Globally, water systems consume about 4% of the total electricity use <ref type="bibr">[3]</ref>. In municipalities, energy consumption of water systems typically accounts for approximately 30% to 40% of the total electricity use <ref type="bibr">[2]</ref>. In the United States alone, the energy costs associated with water infrastructure amount to around $4 billion annually and contribute over 45 million tons of greenhouse gases <ref type="bibr">[2]</ref>. Pumping is usually the most energy-intensive part of water infrastructures, representing up to 80% of the energy consumed by municipal water systems <ref type="bibr">[6]</ref>. This significant energy consumption has spurred widespread interest in optimizing water pump systems to reduce both greenhouse gas emissions and monetary costs <ref type="bibr">[2,</ref><ref type="bibr">3,</ref><ref type="bibr">6]</ref>.</p><p>In most critical infrastructures, water supply systems use storage tanks to ensure reliable water provision. Pumps are employed to maintain adequate water levels in these tanks to meet water demand. Beyond providing a reliable water supply, these tanks can serve as buffers that can be exploited to manage pumping systems more efficiently, thereby reducing greenhouse gas emissions and monetary costs. With the integration of renewable energy, both carbon intensity and electricity prices fluctuate over time <ref type="bibr">[52,</ref><ref type="bibr">87]</ref>. This time-varying property, combined with the widespread deployment of sensors, allows water supply systems to dynamically schedule the activation and/or the speed of pumps with the goal of optimizing carbon/energy efficiency <ref type="bibr">[19,</ref><ref type="bibr">75]</ref>. 
Importantly, the scheduling policy should ensure safe water levels in the tanks to handle emergencies.</p><p>Water supply management is an online control problem characterized by time-varying dynamics and cost functions that are revealed sequentially to the pump controller. Such problems are challenging due to the uncertainty of future contexts including demand, carbon intensity, and/or energy prices <ref type="bibr">[9,</ref><ref type="bibr">27,</ref><ref type="bibr">56,</ref><ref type="bibr">97]</ref>. Without precise knowledge of the future contexts, it is difficult for pump controllers to achieve high energy efficiency. Nevertheless, by exploiting data on water usage, carbon intensity, and energy prices, machine learning (ML) can be applied to overcome the uncertainties inherent in online control, often surpassing the performance of manually designed policies <ref type="bibr">[54,</ref><ref type="bibr">58,</ref><ref type="bibr">60,</ref><ref type="bibr">61]</ref>. Recently, ML predictions have been utilized in water supply systems to enhance cost savings and carbon efficiency <ref type="bibr">[19,</ref><ref type="bibr">93,</ref><ref type="bibr">94]</ref>.</p><p>However, ML can sometimes provide inaccurate predictions or low-quality advice, which can lead to arbitrarily poor performance and raise safety concerns for critical water infrastructures. For instance, a water tank in a conference center is crucial for ensuring a reliable water supply and fire protection. If the controller fails to maintain a safe water level, serious accidents can occur in the event of a municipal distribution system fault or a fire emergency. Naive deployments of ML-based controllers could result in such failures, leading to significant safety risks. 
Despite significant efforts to improve ML models for water supply systems <ref type="bibr">[19,</ref><ref type="bibr">83,</ref><ref type="bibr">84]</ref>, ML-based controllers fundamentally lack performance guarantees, especially for adversarial or out-of-distribution problem instances. This lack of performance guarantees hinders the deployment of ML in real-world critical infrastructures.</p><p>To address the fundamental challenge of ensuring worst-case performance guarantees for ML-based controllers, we propose a method that leverages control priors. Control priors are human-crafted online algorithms with provable worst-case performance guarantees <ref type="bibr">[39,</ref><ref type="bibr">43,</ref><ref type="bibr">78,</ref><ref type="bibr">79]</ref> or trusted rule-based heuristics that have been reliably used in real systems for a long time <ref type="bibr">[21,</ref><ref type="bibr">69]</ref>. These control priors are highly reliable in terms of safety metrics. By integrating these priors into ML-based controllers, we aim to develop an algorithm that ensures the safety performance of the ML-based controller is no worse than a safety performance benchmark. Drawing on the concept of learning-augmented algorithms that incorporate ML advice into algorithm design, we call our proposed algorithm Learning-Augmented Online Control (LAOC).</p><p>While initially developed for water systems, the proposed algorithm (LAOC) is versatile and can be applied to various practical online control and resource management problems, such as battery management for electric vehicle (EV) charging stations <ref type="bibr">[86]</ref>, workload scheduling for sustainable data centers <ref type="bibr">[77]</ref>, and control of cooling systems <ref type="bibr">[69]</ref>. Adapting LAOC to these applications can improve the average performance while providing a worst-case performance guarantee.</p><p>Contributions. The contributions of the paper are summarized as follows. 
First, it presents an online control framework designed to sustainably and safely manage water supply for critical infrastructures. The framework addresses the urgent need for a worst-case safety risk guarantee in decarbonizing critical infrastructures. Notably, this framework extends to various online control and resource management problems across different critical infrastructures. Central to the paper's contribution is the development of a novel learning-augmented algorithm named LAOC, which integrates a control prior into the ML-based controller to ensure worst-case safety risk constraints while optimizing decarbonization performance. Our analysis demonstrates that the proposed method reliably satisfies safety performance constraints for any problem instance while effectively leveraging ML predictions for decarbonization and cost saving. Furthermore, our analysis illuminates the tradeoff between the decarbonization and cost saving performance and the worst-case safety guarantee. Lastly, the paper evaluates the proposed algorithm for the water supply system of critical buildings. Results indicate that LAOC achieves significant carbon reduction and cost savings compared to traditional controllers used in water supply systems focusing on maintaining water levels. Moreover, it showcases the advantage of LAOC in guaranteeing worst-case safety performance compared to pure ML-based algorithms.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Related Work</head><p>Optimization of water supply systems. The considered problem stems from the traditional field of water supply management. In this area, many works consider scheduling for water distribution systems <ref type="bibr">[28,</ref><ref type="bibr">71,</ref><ref type="bibr">81,</ref><ref type="bibr">85]</ref>. Other works have developed pump control methods that maintain a water level for demand satisfaction while saving energy <ref type="bibr">[19,</ref><ref type="bibr">34,</ref><ref type="bibr">68,</ref><ref type="bibr">75,</ref><ref type="bibr">83,</ref><ref type="bibr">84,</ref><ref type="bibr">93,</ref><ref type="bibr">94]</ref>. Most of these works only consider the energy price, but do not explicitly consider the time-varying carbon intensity. The carbon emission of water infrastructures has recently become a crucial social concern <ref type="bibr">[1,</ref><ref type="bibr">2]</ref>, so we include the carbon emissions in the optimization objective to ensure sustainable operation.</p><p>Much of the literature, e.g., <ref type="bibr">[19,</ref><ref type="bibr">93,</ref><ref type="bibr">94]</ref>, utilizes ML predictions of the future demand and/or energy price to improve the control performance. To cope with future uncertainty, some works have developed robust control algorithms or constrained control algorithms for water supply systems <ref type="bibr">[35,</ref><ref type="bibr">44,</ref><ref type="bibr">81,</ref><ref type="bibr">85]</ref>. They either satisfy the safety constraints with a large probability or provide no guarantee on safety constraints. However, water infrastructures critically need a guarantee on the worst-case safety performance of water supply for any problem instance. In this paper, we address this challenge by designing a novel learning-augmented control algorithm utilizing the trusted control prior.</p><p>Online control. 
Our problem formulation is relevant to the literature of online competitive control. In our problem setting, the target is to minimize the cumulative cost under nonlinear dynamics, which is different from the traditional control literature that uses measures for stabilization purposes <ref type="bibr">[31,</ref><ref type="bibr">32,</ref><ref type="bibr">51,</ref><ref type="bibr">74]</ref>. Like the recent works on competitive control <ref type="bibr">[38,</ref><ref type="bibr">40,</ref><ref type="bibr">41,</ref><ref type="bibr">43,</ref><ref type="bibr">72,</ref><ref type="bibr">79,</ref><ref type="bibr">101]</ref>, our work considers guarantees on the worst-case competitiveness, but our main focus is different: we leverage ML to explore policies with low average cost while enforcing competitiveness guarantees for any step in any episode. This enables the use of the existing competitive control policies as priors. Achieving our objective requires a novel design of safe action sets and new analysis techniques to find the trade-off between the average performance and worst-case competitiveness.</p><p>Learning-based online control. Our algorithm is relevant to the broad area of learning-based control <ref type="bibr">[16,</ref><ref type="bibr">30,</ref><ref type="bibr">33,</ref><ref type="bibr">46,</ref><ref type="bibr">59,</ref><ref type="bibr">62,</ref><ref type="bibr">76,</ref><ref type="bibr">88]</ref>. These works have developed machine learning models to predict the system dynamics or control-relevant information, which is then used to decide the control actions <ref type="bibr">[13,</ref><ref type="bibr">30,</ref><ref type="bibr">33,</ref><ref type="bibr">48,</ref><ref type="bibr">49,</ref><ref type="bibr">62,</ref><ref type="bibr">88]</ref>. 
Recent works combine learning-based methods with system models in order to improve the safety or robustness of learning for control <ref type="bibr">[16,</ref><ref type="bibr">30,</ref><ref type="bibr">59,</ref><ref type="bibr">76,</ref><ref type="bibr">89,</ref><ref type="bibr">92]</ref>. Among them, learning-augmented online algorithms combine potentially untrusted ML predictions with robust policies (i.e., control priors). Learning-augmented algorithms have been developed for online control/optimization by combining ML predictions and control priors through online switching <ref type="bibr">[12,</ref><ref type="bibr">76]</ref> or adaptively setting a confidence on the ML prediction <ref type="bibr">[58,</ref><ref type="bibr">59]</ref>. Compared to these studies, we consider a more challenging setting, i.e., non-linear and time-varying dynamic models that are sequentially revealed online. Although some of the existing studies <ref type="bibr">[23,</ref><ref type="bibr">55,</ref><ref type="bibr">57,</ref><ref type="bibr">59,</ref><ref type="bibr">76]</ref> provide provable cost bounds, they cannot guarantee a flexible any-step safety constraint given an arbitrary control prior, which is needed in real problems <ref type="bibr">[69]</ref>.</p><p>Safe/Constrained Reinforcement Learning. Our algorithm is also relevant to the literature of safe/constrained Reinforcement Learning (RL). Some safe/constrained RL works focus on discrete actions and their regret scales with the size of the action set <ref type="bibr">[29,</ref><ref type="bibr">66,</ref><ref type="bibr">95]</ref> while others <ref type="bibr">[7,</ref><ref type="bibr">98,</ref><ref type="bibr">99,</ref><ref type="bibr">99]</ref> apply to continuous control problems. 
However, most of them only satisfy the constraints in expectation or with a high probability <ref type="bibr">[7,</ref><ref type="bibr">10,</ref><ref type="bibr">18,</ref><ref type="bibr">25,</ref><ref type="bibr">26,</ref><ref type="bibr">29,</ref><ref type="bibr">36,</ref><ref type="bibr">98,</ref><ref type="bibr">99,</ref><ref type="bibr">99]</ref>. A recent work <ref type="bibr">[82]</ref> tries to solve RL with safety constraints satisfied almost surely, but no theoretical constraint satisfaction is guaranteed. When these algorithms are applied to online control systems like water supply management, the safety constraints can still be violated for some adversarial sequences. By contrast, our algorithm exploits the control priors and provides a theoretical guarantee for the safety constraint satisfaction.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Problem Formulation</head><p>In this section, with the water supply management as the application scenario, we present the safe online control model. Next, we show the safe online control model applies to broader applications by specifying the dynamics, loss and risk functions. Finally, we give the assumptions on the dynamics and risk functions required for the algorithm design and analysis.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Safe Online Control for Water Supply</head><p>In this section, we formulate an online control problem with time-varying costs and dynamics that captures the task of water supply management. A problem instance consists of time slots. At the beginning of each time slot &#8462; &#8712; [ ], the controller observes the water level state &#8462; and decides on an action &#8462; &#8712; R to schedule the activation time and/or the speed of the pumps. This action incurs a non-negative carbon emission ( &#8462; , &#8462; ) related to the carbon intensity &#8462; , and a monetary cost ( &#8462; , &#8462; ) related to the energy price &#8462; . Given the water level state &#8462; and the action &#8462; , the system transitions to &#8462;+1 at the end of slot &#8462; following the dynamic function defined as:</p><p>where &#8462; is the water consumption within time slot &#8462;, and maps the control signal &#8462; to the amount of water supply within time slot &#8462;. Note that ( &#8462; ) is a linear function if we only control the activation time of pumping, and it is a nonlinear continuous function if we control the speed of pumps <ref type="bibr">[90]</ref>. The water level state is expected to remain close to a nominal water level &#175; in the water tanks. Deviation from this nominal water level incurs a penalty cost denoted as ( &#8462; ).</p><p>For convenience, we denote &#8462; ( &#8462; , &#8462; , &#8462; ), and so 1: = ( 1 , . . . , ) is the information for the entire episode. The total loss at slot &#8462; is expressed as:</p><p>where 1 , 2 , and 3 are weights used to convert the costs to a common unit. An online control policy, denoted by , outputs the action &#8462; . The cumulative loss within an episode of time slots, following policy , is expressed as:</p><p>The offline optimal loss is denoted by * .</p><p>Safety Constraint. Online control algorithms must guarantee safety performance. 
For critical infrastructures, water supply management should maintain a safe water level to ensure reliable supply during emergencies. Failure to maintain a safe water level incurs a safety risk, denoted as &#8462; ( &#8462; , &#8462; ). Given a nominal water level &#175; , a concrete form of safety risk can be denoted as</p><p>where dist( &#8462; , &#175; ) is a measure of distance between the water level &#8462; and the nominal water level &#175; , ( &#8462; ) penalizes the power load of the scheduling action &#8462; , and and are balancing weights for the two risk metrics. Note that the distance function dist( &#8462; , &#175; ) can be an asymmetric function which provides different penalties for &#8462; -&#175; &#8804; 0 and &#8462; -&#175; &gt; 0. The asymmetric distance measure is flexible to model different penalties of overly-high and overly-low water levels. We define the total safety risk of a policy over an episode with rounds as = &#8462;=1 &#8462; ( &#8462; , &#8462; ). To evaluate whether a controller is safe, we require a safety benchmark. In this paper, we use the scaled safety risk of an existing safe control prior &#8224; as our benchmark. This means that for any problem instance 1:&#8462; and &#8462; &#8712; [ ], the controller must satisfy the safety constraint expressed as:</p><p>where &#8224; &#8462; is the safety risk of the safe control prior &#8224; and &gt; 0 is a preset parameter indicating the safety requirement level. The constraint in (4) is called (1 + )-safety.</p><p>The intuition behind the safety constraint is that if the control prior has a worst-case safety performance guarantee for any instance, then a policy satisfying this constraint also ensures a performance guarantee adjustable by 1 + . This constraint must be satisfied in each round to provide a strong worst-case guarantee. 
The safe control prior can be a human-crafted algorithm with a theoretical worst-case performance guarantee or a reliable heuristic that has long been used in real systems. In water infrastructures, the control prior can be a traditional controller that is designed to maintain the safe water level <ref type="bibr">[93,</ref><ref type="bibr">94]</ref>.</p><p>Objective. We exploit ML predictions to optimize the expected loss while guaranteeing the safety constraint for any problem instance.</p><p>Given a safety requirement &gt; 0, the objective is:</p><p>For convenience, we define the collection of all control policies that satisfy the safety requirement with as</p><p>If is larger, then the size of &#928; is also larger, providing more flexibility to optimize the average loss. To achieve this objective, we need to integrate the control prior &#8224; into the ML-based controller.</p></div>
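<div xmlns="http://www.tei-c.org/ns/1.0"><p>The safety constraint above can be illustrated with a small numerical sketch. The Python code below is not the authors' implementation. It assumes a linear supply map (supply equals rate times action), a quadratic risk on the water-level deviation and on the pump action, and it reads the constraint as a bound on the cumulative risk at every round. The names step, risk, cumulative_risks, is_safe, the parameter lam (the safety requirement level), and both example policies are hypothetical.</p>
<ab><![CDATA[
```python
from typing import Callable, List, Sequence


def step(x: float, u: float, w: float, rate: float = 1.0) -> float:
    """Tank dynamics with an assumed linear supply map: level + rate * u - demand."""
    return x + rate * u - w


def risk(x: float, u: float, x_nom: float, a: float = 1.0, b: float = 0.1) -> float:
    """Assumed per-round safety risk: level deviation plus pump power penalty."""
    return a * (x - x_nom) ** 2 + b * u ** 2


def cumulative_risks(policy: Callable[[float], float], x0: float,
                     demand: Sequence[float], x_nom: float) -> List[float]:
    """Roll a state-feedback policy forward and return its running total risk."""
    x, total, out = x0, 0.0, []
    for w in demand:
        u = policy(x)                 # action chosen before the demand is known
        total += risk(x, u, x_nom)
        out.append(total)
        x = step(x, u, w)
    return out


def is_safe(policy_cum: Sequence[float], prior_cum: Sequence[float], lam: float) -> bool:
    """Check the cumulative-risk bound against the scaled prior at every round."""
    return all(r <= (1.0 + lam) * r_prior for r, r_prior in zip(policy_cum, prior_cum))


X_NOM = 10.0
DEMAND = [2.0, 2.0, 2.0, 2.0]


def prior(x: float) -> float:
    """Hypothetical level-tracking prior: pump the usual demand plus a correction."""
    return 2.0 + 0.5 * (X_NOM - x)


def idle(x: float) -> float:
    """A deliberately unsafe policy that never pumps, so the tank drains."""
    return 0.0


prior_cum = cumulative_risks(prior, X_NOM, DEMAND, X_NOM)
idle_cum = cumulative_risks(idle, X_NOM, DEMAND, X_NOM)
print(is_safe(prior_cum, prior_cum, lam=0.5))
print(is_safe(idle_cum, prior_cum, lam=0.5))
```
]]></ab>
<p>In this sketch the idle policy looks harmless in the first round, because it spends no power, and only breaks the bound once the water level has dropped. This is why the constraint is checked at every round rather than only at the end of the episode.</p></div>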
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Broader Applications</head><p>While the formulation is specifically given for water supply management, it applies to many other online control problems by replacing the dynamic function &#8462; , the cost function &#8462; and the risk function &#8462; with concrete expressions. Here, we give the following two application examples.</p><p>&#8226; Battery management of EV charging stations. The battery management of an Electric Vehicle (EV) charging station is an online control problem where the agent needs to decide the amount of battery charging or discharging &#8462; at each round to maintain a nominal State of Charge (SoC) &#8462; that satisfies the charging demand <ref type="bibr">[53,</ref><ref type="bibr">59]</ref>. In this problem, the dynamics of the SoC &#8462; are modeled by the dynamic function &#8462; ( &#8462; , &#8462; ) = &#8462; + &#8462;&#8462; where &#8462; is the charging demand, and the loss function &#8462; defines the cost of charging and discharging. The risk function &#8462; defines the risk of not satisfying the charging demand. Classic controllers <ref type="bibr">[11,</ref><ref type="bibr">100]</ref> can serve as the control prior &#8224; with a risk performance guarantee.</p><p>&#8226; Cooling control for sustainable data centers. In this application, the target of the data center agent is to maintain a temperature range with high carbon efficiency by making online decisions about cooling equipment management <ref type="bibr">[22,</ref><ref type="bibr">69,</ref><ref type="bibr">96]</ref>. Failure to maintain a suitable temperature range will overheat the devices and create a risk of critical service outages. The dynamic function &#8462; ( &#8462; , &#8462; ) models the temperature dynamics where &#8462; is the randomness factor affecting the temperature change. The cost function &#8462; captures the losses of carbon emission and energy costs. 
The risk function &#8462; measures the risk of deviating from the normal temperature range. The traditional rule-based heuristics <ref type="bibr">[69]</ref> that have verified performance in maintaining a suitable temperature can serve as the control prior &#8224; .</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Assumptions</head><p>In this paper, we assume the following conditions on the dynamic functions and the risk functions.</p><p>Assumption 3.1 (Lipschitz dynamics). For each time &#8462;, the function &#8462; is Lipschitz continuous with respect to &#8462; and &#8462; with Lipschitz constants &#8805; 0 and &#8805; 0, respectively, i.e., for any ( , ) and</p><p>Assumption 3.2 (Well-conditioned risk functions). For each time &#8462;, the risk function &#8462; is non-negative, -strongly convex, and -smooth with respect to ( &#8462; , &#8462; ).</p><p>The first assumption is the Lipschitz continuity of the dynamic functions, which is common in finite-horizon control models <ref type="bibr">[59,</ref><ref type="bibr">60,</ref><ref type="bibr">101]</ref>. For water supply management, the dynamic function &#8462; in (1) is Lipschitz continuous because the supply map is a Lipschitz continuous function.</p><p>The second assumption is the non-negativity, strong convexity and smoothness of the risk functions, which is a common regularity condition in control system costs <ref type="bibr">[60,</ref><ref type="bibr">64,</ref><ref type="bibr">65,</ref><ref type="bibr">78]</ref>. Any risk function that satisfies Assumption 3.2 can be used. For example, we can choose an asymmetric dist function as</p><p>and a quadratic penalty on the power load, and the obtained risk function satisfies Assumption 3.2.</p></div>
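<div xmlns="http://www.tei-c.org/ns/1.0"><p>As a concrete illustration of Assumption 3.2, the Python sketch below builds one possible asymmetric risk and numerically checks the strong-convexity and smoothness inequalities on random pairs of points. The piecewise-quadratic distance, the weights k_low, k_high and b, and the function names are assumptions made for this example and are not necessarily the choice made by the authors. For a piecewise quadratic with curvatures 2 k_low and 2 k_high plus a quadratic action penalty with curvature 2 b, the strong-convexity constant is the smallest of the three curvatures and the smoothness constant is the largest.</p>
<ab><![CDATA[
```python
import random

K_LOW, K_HIGH, B, X_NOM = 4.0, 1.0, 0.5, 10.0   # hypothetical weights


def dist(x: float, x_nom: float = X_NOM) -> float:
    """Asymmetric squared distance: a low level is penalised more than a high one."""
    k = K_LOW if x <= x_nom else K_HIGH
    return k * (x - x_nom) ** 2


def risk(x: float, u: float) -> float:
    """Asymmetric level penalty plus a quadratic penalty on the pump action."""
    return dist(x) + B * u ** 2


def grad(x: float, u: float) -> tuple:
    """Gradient of the risk; it is continuous at x = X_NOM despite the two branches."""
    k = K_LOW if x <= X_NOM else K_HIGH
    return (2.0 * k * (x - X_NOM), 2.0 * B * u)


MU = 2.0 * min(K_LOW, K_HIGH, B)    # strong-convexity constant for these weights
ELL = 2.0 * max(K_LOW, K_HIGH, B)   # smoothness constant for these weights


def check_bounds(trials: int = 1000, seed: int = 0, tol: float = 1e-9) -> bool:
    """Test the quadratic lower and upper bounds on randomly drawn point pairs."""
    rng = random.Random(seed)
    for _ in range(trials):
        p = (rng.uniform(0.0, 20.0), rng.uniform(0.0, 10.0))
        q = (rng.uniform(0.0, 20.0), rng.uniform(0.0, 10.0))
        g = grad(*p)
        linear = risk(*p) + g[0] * (q[0] - p[0]) + g[1] * (q[1] - p[1])
        sq = (q[0] - p[0]) ** 2 + (q[1] - p[1]) ** 2
        lower = linear + 0.5 * MU * sq
        upper = linear + 0.5 * ELL * sq
        if not (lower - tol <= risk(*q) <= upper + tol):
            return False
    return True


print(check_bounds())
```
]]></ab>
<p>A random check of this kind cannot prove the assumption, but it is a quick way to catch a candidate risk function that is not smooth, for example one built from an absolute-value penalty.</p></div>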
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Learning-Augmented Online Control (LAOC)</head><p>In this section, we present and analyze an algorithm, LAOC, to solve the online control problem introduced in the previous section. Before stating the algorithm, we highlight the challenges created by the safety requirements.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Challenges Due to Safety Requirements</head><p>Our goal is to find a policy satisfying the safety constraints in (5) while exploiting the ML predictions to achieve a low loss. However, this is very challenging for online control where the future contexts are unknown to the agent.</p><p>One might hope that a straightforward design that considers a linear combination of the control prior and the pure ML action (Lin) would be sufficient. Formally, Lin is defined as = &#732; + (1 -) &#8224; , where &#732; is the pure ML policy and &#8224; is the control prior. However, Proposition 4.1 shows that, unless we completely ignore the ML policy (i.e., = 0), Lin cannot guarantee (1 + )-safety for an arbitrary ML policy.</p><p>Proposition 4.1. Define the quality of pure ML as the normalized difference between the ML advice and the offline optimal action &#8741; &#732; - * &#8741; 2 / * . If the pure ML has an arbitrarily low quality (i.e., &#8741; &#732; - * &#8741; 2 / * &#8594; &#8734;), Lin with &#8712; (0, 1] cannot guarantee (1 + )-safety for any finite &gt; 0.</p><p>Proposition 4.1 is proven by contradiction: if (1 + )-safety with finite is satisfied by Lin with &#8712; (0, 1], then the quality of ML advice &#8741; &#732; - * &#8741; 2 / * must be bounded by a finite value. In other words, (1 + )-safety cannot be satisfied by Lin with a potentially unsafe ML model in the worst case.</p><p>Overcoming the limitation of Lin with respect to safety guarantees requires a more flexible combination of pure ML and the control prior. Thus, we consider a second natural approach that maps the ML advice into a safe action set defined by the safety constraint for each round &#8462; as</p><p>where &#8462; = &#8462; =0 ( , ) and</p><p>The mapping can be a linear combination that selects the action as</p><p>
We refer to this policy as Lin+.</p><p>Lin+ uses a time-varying combination variable &#175; &#8462; , so it is much more flexible than Lin and can strictly guarantee (1 + )-safety given any instance as long as the safe action set in (<ref type="formula">6</ref>) is non-empty. Unfortunately, the naive safe action set U ,&#8462; in (6) can be empty, which results in no feasible actions. This is illustrated by the following example.</p><p>Example 4.1. Suppose that &#8462; =0 ( ,</p><p>&#8462;+1 holds at round &#8462; + 1, the agent can always choose &#8462;+1 = &#8224; &#8462;+1 to satisfy (6) at round &#8462; + 1. However, when &#8462;+1 &#8800; &#8224; &#8462;+1 , it is possible that the control prior has a low loss for its state &#8224; &#8462;+1 at time &#8462; + 1 such that for any action &#8712; U the true loss &#8462;+1 ( &#8462;+1 , ) is larger than the scaled prior loss</p><p>). In such a case, the naive safe action set U ,&#8462;+1 is empty, and the control agent cannot maintain the inequality in (6), thus potentially violating the subsequent safety constraints.</p><p>The failures of the intuitive policies Lin and Lin+ show that for a policy to combine the ML advice and the control prior, it must be flexible and conservative enough to guarantee that feasible actions exist to meet the safety constraints. In the next section, we give a design that can theoretically guarantee (1 + )-safety for any sequence and ML advice.</p></div>
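<div xmlns="http://www.tei-c.org/ns/1.0"><p>The intuition behind the failure of Lin can be reproduced numerically. The Python sketch below is only an illustration of that intuition and is not the proof of Proposition 4.1. It assumes a single round in which the true state equals the prior's state, a quadratic risk, and hypothetical names (risk, lin_action, violating_ml_action, theta for the fixed combination weight, lam for the safety requirement level). For any fixed theta greater than zero, doubling the ML action eventually pushes the risk of the combined action above the scaled prior risk.</p>
<ab><![CDATA[
```python
def risk(x: float, u: float, x_nom: float = 10.0, b: float = 0.1) -> float:
    """Assumed per-round risk: level deviation plus pump power penalty."""
    return (x - x_nom) ** 2 + b * u ** 2


def lin_action(theta: float, u_ml: float, u_prior: float) -> float:
    """Lin: a fixed-weight combination of the ML action and the prior action."""
    return theta * u_ml + (1.0 - theta) * u_prior


def violating_ml_action(theta: float, lam: float,
                        x: float = 10.0, u_prior: float = 2.0) -> float:
    """Double the ML action until Lin breaks the single-round bound (theta > 0)."""
    if theta <= 0.0:
        raise ValueError("theta = 0 ignores the ML advice, so no violation exists")
    bound = (1.0 + lam) * risk(x, u_prior)
    u_ml = 1.0
    while risk(x, lin_action(theta, u_ml, u_prior)) <= bound:
        u_ml *= 2.0
    return u_ml


# Even a tiny weight on the ML advice and a loose safety level can be broken.
bad = violating_ml_action(theta=0.01, lam=10.0)
print(bad, risk(10.0, lin_action(0.01, bad, 2.0)), (1.0 + 10.0) * risk(10.0, 2.0))
```
]]></ab>
<p>Shrinking theta or enlarging lam only delays the violation, since the search loop always terminates for a positive weight. This matches the statement that a fixed combination weight cannot protect against arbitrarily poor ML advice.</p></div>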
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Algorithm Design</head><p>In this section, we give an overview of the design of Learning Augmented Online Control (LAOC).</p><p>First, we highlight the design of the safe action set used in the algorithm. Instead of directly guaranteeing the inequality in (6), we ensure that the resulting cumulative loss satisfies &#8462; =0 ( , ) + &#8462; &#8804; (1 + ) &#8462; =0 ( &#8224; , &#8224; ) with an added reservation &#8462; &#8805; 0 for hedging. With a proper design of the reservation, U ,&#8462; , &#8462; &#8712; [&#8462;, ] can be guaranteed to be not empty for all the possible future control environments &#8462;: . To this end, we design a reservation in the next proposition, whose proof is deferred to Appendix C. Proposition 4.2. Define a safe set U ,&#8462; , &gt; 0 as</p><p>where &#8462; = &#8462; =0 ( , ) and</p><p>) are the true loss and the loss of control prior, respectively. Moreover, &#8462; ( )</p><p>With Assumptions 3.1 and 3.2, if U ,&#8462;-1 is not empty and the action &#8462;-1 at round &#8462; -1 is selected from the safe set U ,&#8462;-1 , then U ,&#8462; is not empty and always includes &#8224; &#8462; . The key insight behind the formulation of the reservation &#8462; ( ) in ( <ref type="formula">7</ref>) is to hedge against the possible violation of safety constraints in future rounds. If the resulting state difference &#8741; &#8462;+1 - &#8224; &#8462;+1 &#8741; 2 from choosing &#8462; is greater, the possible loss difference =&#8462;+1 ( , )-(1 + ) ( &#8224; , &#8224; ) in the following rounds can also be greater in the worst case. 
Thus, the reservation &#8462; ( ) is designed as the scaled state difference to account for the worst-case future risk difference between the true control policy and the control prior &#8224; .</p><p>Algorithm 1 Learning-Augmented Online Control (LAOC)</p><p>Observe state &#8462; , information { &#8462; , &#8462; }, and last-step context &#8462;-1 .</p><p>Update the policy prior's state</p><p>Obtain an action &#8224; &#8462; by the prior &#8224; , and update prior risk</p><p>Obtain the ML action &#732; &#8462; via the ML model &#732;</p><p>else take &#8462; = ( &#732; &#8462; ) end if // Map to a safe action set U ,&#8462; (7) by (<ref type="formula">8</ref>) or (9)</p><p>Update true loss &#8462; = &#8462;-1 + &#8462; ( &#8462; , &#8462; ) and risk &#8462; = &#8462;-1 + &#8462; ( &#8462; , &#8462; )</p><p>end for</p><p>As a consequence of Proposition 4.2, if &#8462; is selected from U ,&#8462; for each round &#8462;, there always exists a non-empty safe action set U ,&#8462; in the subsequent steps, and thus (1 + )-safety is strictly satisfied for each round. Based on Proposition 4.2, given an ML policy &#732; and a control prior &#8224; , we design the online learning-augmented control policy LAOC as shown in Algorithm 1. At each round &#8462; within an episode, the controller first evaluates the loss of the control prior. To achieve this, after observing the true state &#8462; and &#8462;-1 , &#8462;-1 , we first calculate a "virtual state" corresponding to the control prior for the same online information 0:&#8462;-1 , denoted by</p><p>Next, we query the control prior &#8224; with a state &#8224; &#8462; and obtain an action &#8224; &#8462; , which can be used to update the cumulative risk &#8224; &#8462; at round &#8462;. By doing so, a safe action set U ,&#8462; can be constructed by Proposition 4.2.</p><p>To utilize the ML advice for loss performance, we select an action that is close enough to the pure ML action from the safe action set. 
If the ML action &#732; &#8462; is in the safe action set U ,&#8462; , then we simply select &#8462; = &#732; &#8462; . Otherwise, we can use a mapping function : R &#8594; U ,&#8462; that maps the ML action &#732; &#8462; into an action in the safe action set. One choice of is the projection operation which selects the action as</p><p>When the safe action set is a convex set (e.g., under linear dynamic functions <ref type="bibr">[38,</ref><ref type="bibr">40,</ref><ref type="bibr">101,</ref><ref type="bibr">103]</ref>), the projection can be efficiently solved. Otherwise, the complexity can be high, especially for high-dimensional actions <ref type="bibr">[20,</ref><ref type="bibr">63]</ref>. In such cases, we can choose as a linear combination as follows</p><p>where we need to solve for a one-dimensional combination variable</p><p>We will prove in Theorem 4.4 that LAOC has the same expected loss bound under both mapping functions in (<ref type="formula">8</ref>) and (<ref type="formula">9</ref>).</p><p>The time complexity of LAOC is ( ( ML + prior + map )) where ML , prior and map are the time complexities of the ML inference, the control prior and the mapping operations, respectively. ML is determined by the ML architecture. The time complexity of the control prior prior usually increases with the complexity of the control problem. Taking the control prior ROBD <ref type="bibr">[41]</ref> as an example, the complexity of solving the optimization in ROBD scales with the dimension of the action. Furthermore, the mapping complexity map depends on the action-state dimensions and the complexity of the control model. If the safe action set in (<ref type="formula">7</ref>) is convex (e.g., linear dynamics lead to a convex safe action set), we can use a convex optimization solver to efficiently solve the projection in (<ref type="formula">8</ref>). When the safe action set is non-convex, the projection into a non-convex action set has a high time complexity. 
In such cases, we can map the ML action to the safe action set by solving for a one-dimensional combination variable in <ref type="bibr">(9)</ref>.</p><p>Safety-aware finetuning. If we have access to the pure ML model, we can finetune it on available sequence data to further improve the average loss performance under the safety guarantee. Specifically, given the pure ML model &#732; which outputs the ML action &#732; &#8462; and the safe action set U ,&#8462; , we finetune the ML model by minimizing the empirical loss of safe actions:</p><p>where D is the finetuning dataset with sequences. To finetune the ML model with <ref type="bibr">(10)</ref>, we can directly backpropagate through the online process, in which all the operations are differentiable. The projection in (8) can be implicitly differentiated as shown in <ref type="bibr">[8]</ref>. The linear mapping in ( <ref type="formula">9</ref>) is also differentiable by differentiating the equation to solve &#8462; .</p></div>
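<div xmlns="http://www.tei-c.org/ns/1.0"><p>The mapping step described above can be illustrated with a minimal Python sketch. It is not the exact expression in (9): it assumes a scalar action, a hypothetical predicate is_safe standing in for membership in the safe action set (7), a prior action that is itself safe, and safety that is monotone along the segment between the prior action and the ML action. Under these assumptions, a bisection over the one-dimensional combination weight finds the point closest to the ML action that is still safe.</p><eg><![CDATA[
```python
from typing import Callable


def map_to_safe_action(
    ml_action: float,
    prior_action: float,
    is_safe: Callable[[float], bool],
    tol: float = 1e-6,
) -> float:
    """Return a safe action on the segment between the prior and ML actions.

    Hypothetical sketch: searches for the largest weight w in [0, 1] such that
    w * ml_action + (1 - w) * prior_action passes the safety check.
    Assumes the prior action is safe and safety is monotone in w.
    """
    if is_safe(ml_action):
        return ml_action  # the ML action is already safe, so follow it
    lo, hi = 0.0, 1.0  # lo is always a safe weight, hi an unsafe one
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        candidate = mid * ml_action + (1.0 - mid) * prior_action
        if is_safe(candidate):
            lo = mid
        else:
            hi = mid
    return lo * ml_action + (1.0 - lo) * prior_action


# Toy safe set (an assumption for illustration): actions within 2.0 of the prior.
prior, ml = 5.0, 9.0
safe_action = map_to_safe_action(ml, prior, lambda u: abs(u - prior) <= 2.0)
print(round(safe_action, 3))
```
]]></eg><p>Each bisection step costs one safety check, so the number of checks grows only logarithmically in 1/tol, which is why a one-dimensional search is attractive when projecting onto a non-convex set is expensive.</p></div>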
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Performance Bounds</head><p>We provide the performance analysis of LAOC in this section. We first show that the safety constraint in ( <ref type="formula">4</ref>) is always satisfied by LAOC. Next, we give the average performance bound of LAOC under the safety guarantee. Finally, we provide the performance bound under safety-aware finetuning in Eqn. <ref type="bibr">(10)</ref>.</p><p>4.3.1 Safety constraint satisfaction. In Proposition 4.2, we prove that the safe set U ,&#8462; in ( <ref type="formula">7</ref>) is not empty for each round &#8462;. Since LAOC (Algorithm 1) guarantees that the action &#8462; lies in the safe set at each round, safety constraint satisfaction follows, as stated in the next theorem.</p><p>Theorem 4.3. By LAOC (Algorithm 1) with safety set U ,&#8462; in (7), for any problem sequence 1: and any round &#8462; &#8712; [ ], we can guarantee that the safety risk constraint in (4) is satisfied.</p><p>Theorem 4.3 highlights that LAOC can strictly guarantee (1 + )-safety for any problem instance, even when the ML policy &#732; performs arbitrarily badly. Under the safety guarantee, we next turn to the expected loss performance, given in the next theorem.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.2">Average performance.</head><p>The expected loss depends heavily on the choices of 1 and 2 in (7) of Proposition 4.2. To see this, if 1 or 2 is larger, the reservation &#8462; ( ) becomes larger, so the safe action set U ,&#8462; contains fewer feasible actions. Thus, the policy cannot utilize the ML model to improve the average loss performance effectively. On the contrary, if 1 and 2 approach 1, it can happen in the earlier rounds that the safe action sets U ,&#8462; are too large and the selected action is too far from the prior action. This results in large state differences between &#8462; and &#8224; &#8462; in future rounds, which in turn leads to small safe action sets U ,&#8462; and impedes the exploitation of ML advice. The following analysis formally shows the factors that affect the expected performance and suggests the choices of 1 and 2 .</p><p>Theorem 4.4. Assume that the ML policy &#732; is -Lipschitz continuous and the function &#8462; is -Lipschitz continuous. By optimally choosing</p><p>2 )} in <ref type="bibr">(7)</ref>, the expected loss of LAOC that guarantees (1 + )-safety is bounded by</p><p>where &#8462; = &#8741; &#732; ( &#732; &#8462; ) - &#8224; &#8462; &#8741; is the action discrepancy between the pure ML action and the control prior,</p><p>) &#8462;--1 ) are constants of the control system, in which is the smoothness parameter of the risk function &#8462; , is the size of the state-action set, is the Lipschitz constant of the ML advice policy &#732; , and are the Lipschitz constants of the dynamics model &#8462; .</p><p>The expected loss bound in Theorem 4.4 relies on the choices of 1 and 2 . When becomes larger, the safety constraint is more relaxed, so a smaller 1 is chosen to get a smaller reservation (&#8462;) in Proposition 4.2, allowing more flexibility to follow the ML advice. 
Also, 2 is selected to alleviate the impact of the dynamic sensitivity measured by and (Assumption 3.1) on the expected loss. The expected loss bound in Theorem 4.4 can be interpreted as follows. First, the safety constraint naturally creates a gap in expected loss between LAOC and the ML advice &#732; . More specifically, given a control prior &#8224; , when &gt; 0 becomes smaller, the safety constraint is more stringent, which makes the actions of LAOC potentially deviate more from those of the ML advice policy &#732; and increases the bound in <ref type="bibr">(11)</ref>. On the contrary, when &gt; 0 becomes larger, the safety constraint is more relaxed, reducing the expected loss of LAOC. In particular, if is sufficiently large, the term</p><p>+ can reduce to zero, voiding the safety constraint and resulting in the same expected loss as pure ML. Additionally, the expected loss is affected by the action discrepancy &#8462; because a larger &#8462; means a larger difference between the prior &#8224; and the ML model &#732; , naturally making it more difficult for LAOC to approach ML &#732; while satisfying the safety constraints.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.3">Generalization performance of safety-aware finetuning.</head><p>In this section, we consider the case in which the ML policy in LAOC is trained on <ref type="bibr">(10)</ref>. We bound the average loss gap between the LAOC policy and the unconstrained-optimal policy * . We denote ] in the following theorem.</p><p>Theorem 4.5. If the ML policy is trained by the loss function in Eqn. <ref type="bibr">(10)</ref> with a training dataset with samples, with probability at least 1 -, &#8712; (0, 1), the expected loss of our competitiveness-constrained policy ( ) is bounded by</p><p>,</p><p>where the system-related parameters , and &#8462; have the same definition as in Theorem 4.4, ( , &#928; , &#710; 1 ) is the -covering number of the competitive policy space &#928; with 1 -norm as the distance measure (the distance of two policies and</p><p>&#8462; )&#8741; 1 ) on the training dataset D , and O indicates the scaling with the loss upper bound , the horizon , and the size of action-state space X &#215; U.</p><p>Theorem 4.5 shows that as the number of training samples &#8594; &#8734;, the expected loss is bounded by the unconstrained-optimal expected loss E * plus an additional term that depends on the expected loss of the prior &#8224; and the parameter &gt; 0. This additional term arises because the policy is optimized under the safety constraint in (4). When becomes larger, the constraint is more relaxed and the expected loss is closer to the unconstrained-optimal expected loss. Also, Theorem 4.5 shows that our policy with the online-trained ML model converges with a rate of 1/ . In particular, the convergence rate is affected by through the covering number ( , &#928; , &#710; 1 ) which indicates the richness of the competitive policy class &#928; . Compared to the unconstrained policy set &#928; &#8734; , the covering number of the competitive policy class &#928; is smaller. 
This is because, with the same ML model, the safety constraint reduces the set of feasible actions: with a smaller &gt; 0, the safe policy space becomes smaller, which makes it easier for LAOC to converge.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Case Study</head><p>In this section, we evaluate the performance of LAOC by experiments on a concrete water supply case and compare LAOC with different control baselines.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">Setup</head><p>In this section, we describe the experimental setup for water supply management. We first present the architecture of the water supply system with roof top water tanks. Next, we introduce the datasets used in the experiments, including the traces of water demand, carbon intensity, and energy price. Following that, we define the performance metrics of interest. Finally, we provide the settings of LAOC and the baselines.</p><p>5.1.1 Water supply system with roof top water tanks. The water supply systems of many modern buildings are equipped with roof top water tanks. The roof top water tanks have large water storage capacities and exploit gravity at their elevated level to supply water to building users. Water is pumped from municipal water sources to these roof top water tanks to maintain a water level. The water tanks can play an important part in making the water supply system sustainable because the manager can pump less water (by decreasing the activation time and/or the speed of pumps) to the water tanks when the carbon intensity and energy price are high while still satisfying the demand using the water stored in the water tanks. Beyond that, the water tanks are crucial for the safety of the buildings because they supply water for fire protection systems and the mission-critical functions of the buildings. Therefore, we must make sure that the water level in a water tank is not far from its nominal water level to meet the safety requirements.</p><p>To make the water supply system of a building with roof top water tanks sustainable, we need to know the energy consumed by its pumping system to pump water to the water tanks. 
In this paper, we estimate the pumping power <formula notation="TeX">P_{\mathrm{pump}}</formula> (kW) by the following formula converted from the horsepower formula used in engineering practice <ref type="bibr">[4]</ref>:</p><p><formula notation="TeX">P_{\mathrm{pump}} = \frac{\mathrm{WF} \times \mathrm{HD} \times \mathrm{SG}}{102 \cdot \eta_{\mathrm{pump}}} \qquad (12)</formula></p><p>where WF (L/s) is the water volume flow, HD (m) is the height of the water tank, SG = 1 is the specific gravity of water, and <formula notation="TeX">\eta_{\mathrm{pump}}</formula> is the power efficiency of the pumping system. We develop a controller that decides the amount of water &#8462; (m&#179;) pumped into the water tanks in each hour round &#8462;. The effective water flow is &#8462; /3.6 (L/s)<ref type="foot">foot_0</ref> , which corresponds to an energy consumption of ( &#8462; &#215; HD &#215; SG)/(367.2 &#8226; <formula notation="TeX">\eta_{\mathrm{pump}}</formula>) (in kWh) by <ref type="bibr">(12)</ref>. Here, we define the energy efficiency as the energy consumption to pump a unit m&#179; of water to the water tank in one hour and denote it as</p><p>The experimental setup is as follows. The buildings are 75 m high and have water tanks with a total volume of 80 m&#179; on the roof. Each control horizon spans 24 hours. By <ref type="bibr">(13)</ref>, the energy consumption to pump a unit m&#179; of water is = 0.272 kWh/m&#179; when the energy efficiency of the pumps is chosen as <formula notation="TeX">\eta_{\mathrm{pump}}</formula> = 75% according to <ref type="bibr">[91]</ref>. The controller decides the amount of water pumped into the water tanks in each hour as &#8462; (m&#179;), so the energy consumption at hour round &#8462; is &#8226; &#8462; (kWh).</p></div>
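<div xmlns="http://www.tei-c.org/ns/1.0"><p>As a quick numerical check of the setup above, the following Python sketch recomputes the per-unit pumping energy from the stated building height (75 m), specific gravity (1) and pump efficiency (75%). The function names are illustrative. The constant 102 in the power formula is inferred from 367.2 = 3.6 &#215; 102 and is therefore an assumption about the exact form of (12), and the 10 m&#179; hourly volume is a made-up example value.</p><eg><![CDATA[
```python
def pump_power_kw(flow_lps: float, height_m: float, efficiency: float,
                  specific_gravity: float = 1.0) -> float:
    """Pumping power in kW for a flow in L/s (assumed form of the power formula)."""
    return flow_lps * height_m * specific_gravity / (102.0 * efficiency)


def pump_energy_per_m3(height_m: float, efficiency: float,
                       specific_gravity: float = 1.0) -> float:
    """Energy in kWh needed to pump 1 m^3 of water within one hour."""
    return height_m * specific_gravity / (367.2 * efficiency)


energy_per_m3 = pump_energy_per_m3(height_m=75.0, efficiency=0.75)
print(f"{energy_per_m3:.3f} kWh/m^3")  # prints "0.272 kWh/m^3"

# Hypothetical hour in which 10 m^3 are pumped: flow is volume / 3.6 in L/s.
pumped_m3 = 10.0
hourly_energy_kwh = energy_per_m3 * pumped_m3
hourly_power_kw = pump_power_kw(pumped_m3 / 3.6, height_m=75.0, efficiency=0.75)
```
]]></eg><p>Because each round lasts one hour, the power in kW and the energy in kWh coincide numerically, which is the consistency the two functions are meant to show.</p></div>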
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.2">Performance Metrics.</head><p>In water supply management, the performance metrics of interest include the carbon and energy costs and the safety risk. Given the water supply system with roof top water tanks, the expressions of the objectives are given as below.</p><p>Carbon cost. Given an action &#8462; which is the amount of pumped water at hour round &#8462;, the energy consumption is &#8226; &#8462; (kWh). Therefore, given the carbon intensity &#8462; (g/kWh) at round &#8462;, the carbon emission at round &#8462; is</p><p>Energy cost. With the energy price &#8462; for round &#8462;, the total energy cost at round &#8462; is</p><p>Deviation from the safe level. We choose the nominal safe water level as &#175; = 40 m&#179; (half of the total water tank capacity). We choose a quadratic penalty for water level deviation, which restrains large deviations. Thus, the deviation is measured by the quadratic deviation from the nominal level &#175; , i.e.</p><p>The loss function is a weighted combination of the deviation and the carbon and energy costs, which is expressed as</p><p>We consider the expected loss</p><p>&#8462;=1 &#8462; ( &#8462; , &#8462; ) where the expectation is taken over the distribution of the water demand, carbon intensity and energy price traces.</p><p>Safety risk. The safety risk is determined by the deviation and the hourly energy consumption. A high deviation increases the risk of not satisfying the water demand, and a large energy consumption can add too much power load to the energy system. We consider a quadratic penalty to restrain large deviation and hourly energy consumption, and the safety risk is expressed as</p><p>In some scenarios, we need to consider different penalties for overly-low and overly-high water levels and model the deviation as an asymmetric function. If the asymmetric function satisfies Assumption 3.2 (e.g. 
an asymmetric dist function as</p><p>), the theoretical conclusions of LAOC still hold, and the key observations of experiments also generalize to such asymmetric penalties.</p><p>Given a control prior &#8224; , we consider (1 + )-safety which guarantees for any sequence that the safety risk is always bounded by the scaled safety risk of &#8224; , i.e. &#8704;&#8462; &#8712; [ ], &#8704; 1: &#8712; Y, &#8462; &#8804; (1 + ) &#8224; &#8462; . We also directly evaluate the safety risk performance by the maximum risk ratio on the testing dataset max 1: &#8712; D test / &#8224; , which is a commonly used metric for worst-case performance <ref type="bibr">[40,</ref><ref type="bibr">42,</ref><ref type="bibr">78]</ref>.</p></div>
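<div xmlns="http://www.tei-c.org/ns/1.0"><p>The metrics above can be sketched in Python as follows. This is an illustration rather than the exact loss used in the experiments: the unit weights w_dev, w_carbon and w_energy are hypothetical placeholders for the unspecified combination weights, the price is assumed to be quoted per kWh, and the function names are illustrative. The per-unit energy 0.272 kWh/m&#179; and the nominal level 40 m&#179; are taken from the setup.</p><eg><![CDATA[
```python
ENERGY_PER_M3 = 0.272   # kWh per m^3, value reported in the setup
NOMINAL_LEVEL = 40.0    # m^3, nominal safe water level


def carbon_emission(pumped_m3: float, carbon_intensity: float) -> float:
    """Carbon emission (g) for one round; carbon intensity in g/kWh."""
    return ENERGY_PER_M3 * pumped_m3 * carbon_intensity


def energy_cost(pumped_m3: float, price_per_kwh: float) -> float:
    """Energy cost for one round; price assumed to be per kWh."""
    return ENERGY_PER_M3 * pumped_m3 * price_per_kwh


def level_deviation(level_m3: float) -> float:
    """Quadratic deviation of the water level from the nominal level."""
    return (level_m3 - NOMINAL_LEVEL) ** 2


def round_loss(level_m3, pumped_m3, carbon_intensity, price_per_kwh,
               w_dev=1.0, w_carbon=1.0, w_energy=1.0):
    """Weighted combination of deviation, carbon and energy cost.

    The weights are hypothetical; the source does not list their values here.
    """
    return (w_dev * level_deviation(level_m3)
            + w_carbon * carbon_emission(pumped_m3, carbon_intensity)
            + w_energy * energy_cost(pumped_m3, price_per_kwh))


def max_risk_ratio(policy_risks, prior_risks):
    """Largest ratio of policy risk to prior risk over the test episodes."""
    return max(r / p for r, p in zip(policy_risks, prior_risks))


print(level_deviation(46.0))                       # 36.0
print(max_risk_ratio([2.0, 3.0], [1.0, 2.0]))      # 2.0
```
]]></eg></div>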
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.3">Water demand, carbon intensity and energy price.</head><p>The experiments are conducted on public datasets. We provide the details for the traces of water demand, carbon intensity and energy price as below.</p><p>Water demand trace. The consumed water at each hour round &#8462; is &#8462; and affects the water level dynamics through <ref type="bibr">(1)</ref>. In our experiments, we use the water demand dataset measured for university buildings in <ref type="bibr">[14]</ref>. For each building, the trace contains hourly water consumption from August 1st, 2018 to December 8th, 2018. Since the traces are measured on low-rise university buildings, we scale up the hourly water consumption by 10 to simulate a high-rise building with dense occupancy. The water consumption data of four residence halls are used for training the ML model for water supply management. We augment the water consumption data of another two residence halls to obtain 1-year demand traces for 20 buildings, which are held out for testing.</p><p>To further evaluate the robustness of the algorithms, we also create an Out-Of-Distribution (OOD) testing dataset on the basis of the original testing dataset. We generate the OOD demand dataset by adding Gaussian noise to each sample in the original dataset. The standard deviation of the Gaussian noise is set as 30% of the maximum demand value.</p><p>Carbon intensity trace. The carbon intensity datasets are from California Independent System Operator (CAISO) and are published on the website of Electricity Maps <ref type="bibr">[70]</ref>. 
The carbon intensity datasets contain the hourly carbon intensity of a city in California.</p><p>We use the carbon traces in 2022 to train the ML model, and we hold out the carbon traces in 2023 for validation and testing.</p><p>Energy price trace. The electricity price datasets are from CAISO and are published on the website of Energy Online <ref type="bibr">[5]</ref>. Each price trace in the dataset contains the energy price value every 5 minutes. We convert the original traces into hourly price traces by calculating the average price within each hour. We use the price data in 2022 to train the ML model while holding out the price data in 2023 for validation and testing.</p></div>
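<div xmlns="http://www.tei-c.org/ns/1.0"><p>The two preprocessing steps described above, averaging 5-minute prices into hourly prices and perturbing demand with Gaussian noise whose standard deviation is 30% of the maximum demand, can be sketched with the Python standard library. The function names, the fixed seed and the toy input values are illustrative assumptions. The sketch leaves noisy values unclipped because the text does not say how negative demand values are handled.</p><eg><![CDATA[
```python
import random
from statistics import mean


def to_hourly(prices_5min):
    """Average consecutive groups of twelve 5-minute prices into hourly prices."""
    if len(prices_5min) % 12 != 0:
        raise ValueError("expected a whole number of hours (12 samples each)")
    return [mean(prices_5min[i:i + 12]) for i in range(0, len(prices_5min), 12)]


def make_ood_demand(demand, noise_fraction=0.3, seed=0):
    """Add zero-mean Gaussian noise with std = noise_fraction * max(demand).

    Values are not clipped; whether to clip is left open here.
    """
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    sigma = noise_fraction * max(demand)
    return [d + rng.gauss(0.0, sigma) for d in demand]


# Toy traces (made-up numbers, not from the datasets).
hourly_prices = to_hourly([1.0] * 12 + [3.0] * 12)
ood_demand = make_ood_demand([2.0, 4.0, 10.0, 6.0])
print(hourly_prices)  # [1.0, 3.0]
```
]]></eg></div>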
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.4">Settings of LAOC.</head><p>To implement LAOC in Algorithm 1, we need an ML model &#732; and a control prior &#8224; as inputs. Also, we can perform safety-aware finetuning in Eqn. <ref type="bibr">(10)</ref> to learn an ML model for Algorithm 1. Thus, we summarize the variants of LAOC as follows.</p><p>&#8226; LAOC ( , &#732; , &#8224; ): We use ML model &#732; and control prior &#8224; as the inputs of LAOC. &#732; and &#8224; can be replaced with a concrete ML model and a concrete control prior, respectively. If not specified, LAOC uses Online Gradient Descent (OGD) as the control prior by default. If not specified, LAOC uses the ML model purely trained without considering the safety constraint by default. The safety parameter determines the (1 + )-safety in (4).</p><p>&#8226; LAOC-F( , &#8224; ): We use control prior &#8224; and an ML model obtained by safety-aware finetuning in Eqn. <ref type="bibr">(10)</ref> as the inputs of LAOC. If not specified, LAOC-F uses Online Gradient Descent (OGD) as the control prior by default. The safety parameter determines the (1 + )-safety in (4).</p><p>ML model. The ML model for LAOC is a recurrent neural network. It takes available information about demand, carbon intensity, and electricity price as inputs and outputs the action for each round. By default, the ML model has 2 hidden layers and each hidden layer has 12 neurons. The ML model is trained by the Adam optimizer with a learning rate of 5 &#215; 10&#8315;&#8308; for 400 epochs.</p><p>Control prior. The control prior can be selected from controllers that focus on reducing the safety risk <ref type="bibr">[93,</ref><ref type="bibr">94]</ref>. Some robust online optimization algorithms such as Online Gradient Descent (OGD) <ref type="bibr">[24]</ref> and Regularized Online Balanced Descent (ROBD) <ref type="bibr">[41]</ref> can be applied to optimize the safety risk, so they can serve as the control prior. 
Alternatively, we can use Model Predictive Control (MPC) <ref type="bibr">[93,</ref><ref type="bibr">94]</ref>, which is commonly used for water supply control, as a control prior that minimizes the safety risk.</p><p>Regarding the safe set in <ref type="bibr">(7)</ref>, the safety requirement parameter is chosen from [0, 2]. The parameters 1 and 2 are chosen based on Theorem 4.4.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.5">Baselines.</head><p>We compare LAOC with OGD <ref type="bibr">[24]</ref>, ROBD <ref type="bibr">[41]</ref> and MPC <ref type="bibr">[94]</ref> that focus on the safety risk, the pure ML that is trained on the average loss, and the naive learning-augmented design Lin.</p><p>&#8226; Offline Optimal Policy (OPT): This is the optimal offline policy that knows all the information in advance and obtains the optimal action for each episode.</p><p>&#8226; Online Gradient Descent (OGD): Online gradient descent <ref type="bibr">[24]</ref> is an online algorithm to minimize the safety risk without relying on any predictions. OGD has a provable regret bound and competitive ratio with a proper choice of step size. We use OGD as a control policy prior by default.</p><p>&#8226; Regularized Online Balanced Descent (ROBD): ROBD is an online optimization algorithm to minimize the safety risk with one-step demand prediction. It enjoys a provable competitive ratio given perfect one-step prediction <ref type="bibr">[41,</ref><ref type="bibr">72]</ref>. We use ROBD as a control policy prior. We set the parameters for ROBD optimally according to <ref type="bibr">[41]</ref>.</p><p>&#8226; Model Predictive Control (MPC): MPC <ref type="bibr">[17]</ref> solves the control problem by leveraging predictions of the future information. Here, we assume that at round &#8462;, the information &#8462;: is predicted as &#710; &#8462;: , and the per-round prediction error normalized by the maximum input range is = E 1 ( -&#8462;+1) ( max ) &#8741; &#8462;: -&#710; &#8462;: &#8741; . In this paper, we use MPC with a window size of 4 hours as a control prior to minimize the risk. 
MPC- is MPC with a generated prediction error of .</p><p>&#8226; Model Predictive Control with LSTM (MPC-LSTM): Owing to its powerful time-series prediction ability, Long Short-Term Memory (LSTM) has been utilized as a prediction model in MPC in recent studies <ref type="bibr">[47,</ref><ref type="bibr">50,</ref><ref type="bibr">102]</ref>. In the water supply control problem, we implement an LSTM model as the predictor and apply it in MPC, which is called MPC-LSTM. The LSTM model has one LSTM layer with 60 hidden neurons and predicts the demand over the next 4 hours. The same training dataset used for ML is used for LSTM training.</p><p>&#8226; Tube-based Model Predictive Control (TMPC): TMPC [67, 80] is a computationally efficient robust MPC approach which creates state constraints (a tube) based on a nominal dynamic model. TMPC makes sure that the true state of MPC stays within the tube. Since the nominal states are assumed to satisfy the constraint, TMPC can also guarantee a constraint. In our experiments, we design a tube based on a nominal dynamic model exploiting the expected demand information. On the basis of existing TMPC <ref type="bibr">[67,</ref><ref type="bibr">80]</ref>, we utilize the LSTM predictor in deciding an action while guaranteeing that the action stays in the created tube.</p><p>&#8226; Machine Learning (ML): This is the purely-trained ML model without safety constraints for any episode. For fair comparison, we use the same neural architecture for pure ML and LAOC.</p><p>&#8226; Constrained Reinforcement Learning (CRL): As an important safe reinforcement learning algorithm, CRL <ref type="bibr">[37,</ref><ref type="bibr">73]</ref> has been applied to control problems. Most CRL methods guarantee a constraint in expectation or with a high probability. In our experiments, we implement CRL to satisfy the expected safety risk constraint E[ &#8462; -(1 + ) &#8224; &#8462; ] &#8804; 0 with = 0. 
Unlike the original model-free CRL, we exploit the dynamic model information for value estimation in RL.</p><p>&#8226; Linear Combination (Lin): Lin- is the policy in Proposition 4.1 that linearly combines ML advice &#732; and ROBD &#8224; as = &#732; + (1 -) &#8224; with a combination factor &#8712; [0, 1].</p></div>
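<div xmlns="http://www.tei-c.org/ns/1.0"><p>The Lin baseline described above is simple enough to write out directly. The following Python sketch assumes scalar actions and uses illustrative names and made-up action values. It shows the fixed-weight combination for the two weights reported in the experiments (0.5 and 0.2); note that, unlike LAOC, nothing in this rule checks the action against a safe action set.</p><eg><![CDATA[
```python
def lin_action(ml_action: float, prior_action: float, weight: float) -> float:
    """Fixed-weight combination: weight * ML action + (1 - weight) * prior action."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("combination weight must lie in [0, 1]")
    return weight * ml_action + (1.0 - weight) * prior_action


# Made-up example actions (m^3 pumped in one hour).
ml_action, prior_action = 8.0, 4.0
lin_05 = lin_action(ml_action, prior_action, 0.5)  # halfway between the two
lin_02 = lin_action(ml_action, prior_action, 0.2)  # stays closer to the prior
print(lin_05)  # 6.0
```
]]></eg><p>A weight of 1 recovers the pure ML action and a weight of 0 recovers the control prior, which is why a single fixed weight must trade average cost against worst-case risk for all rounds at once.</p></div>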
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">Results</head><p>We provide our main results for the default setting in Table <ref type="table">1</ref>. We give the average energy cost, the average carbon cost, and the maximum risk ratio for the control priors, the pure ML model and the learning-augmented algorithms including Lin and LAOC. The results are evaluated for 20 buildings in one year.</p><p>First, we can find that the control prior ROBD, which requires an accurate one-step prediction of the water demand, achieves the lowest safety risk. The control prior OGD, which does not rely on any prediction, also achieves relatively low risk. However, their average energy costs and carbon emissions are relatively large. Assuming a nearly-accurate predictor with a prediction error of 0.03, MPC-0.03 can achieve a maximum risk ratio of 2.52, an average carbon cost of 17782 kg, and an average energy cost of $6924. Thus, MPC has good risk and cost performance when a nearly-accurate predictor is applied. However, a real predictor such as the LSTM in our experiments can have a large prediction error. The LSTM in our experiments has an average prediction error of 0.05. This results in a higher safety risk and larger average energy cost and carbon emission, as shown by the performance of MPC-LSTM in Table <ref type="table">1</ref>. This shows that the performance of MPC is largely affected by the quality of the predictor. Furthermore, we can observe from Table <ref type="table">1</ref> that, as a robust MPC method, TMPC can effectively reduce the safety risk, but it has much higher average energy costs and carbon emission.</p><p>Different from the control priors, the pure ML policy has the lowest energy cost and carbon emission, but it has a much higher safety risk ratio. This is because the ML models are trained to optimize the average performance, but they can perform arbitrarily badly when adversarial instances exist. 
With the expected safety constraints, CRL can reduce the safety risk while sacrificing some average cost performance. However, we can observe from Table <ref type="table">1</ref> that the worst-case risk of CRL can still be very high, which is due to the existence of adversarial instances. The vulnerability of ML and CRL impedes their deployment in real water supply systems, which are critical for the safety of the buildings.</p><p>Figure 2: (a) Safety violation probability (curves: ML (Ep. 150), ML (Ep. 400), Linear-0.5, Linear-0.2, MPC-LSTM, TMPC, CRL, LAOC); (b) Average carbon cost (ton); (c) Average energy cost (k$) (curves in (b) and (c): ML (Ep. 150), ML (Ep. 400), LAOC (ML (Ep. 200)), LAOC (ML (Ep. 400)), OGD).</p><p>The learning-augmented designs aim to achieve a tradeoff between the average performance and the worst-case risk. As a naive learning-augmented design, Lin can reduce the safety risk to some extent by choosing a proper combination weight . However, we can find that if we choose a large weight for the ML model (e.g. = 0.5 in Table <ref type="table">1</ref>), the average performance can be good but the maximum risk ratio is still very high. Actually, by Lin-0.5, the (1 + )-safety constraint is violated with a high probability, as we will show in Figure <ref type="figure">2</ref>(a). If we set a small weight for the ML model (e.g. = 0.2 in Table <ref type="table">1</ref>), we can get a low safety risk ratio, but the average costs become very large. LAOC is designed to optimize the average performance while guaranteeing the (1+ )-safety constraint in (4). We can observe that with a higher safety requirement (i.e. = 0.4), the safety risk ratio is low and close to that of the control priors while the average costs are much lower than those of Lin. 
Also, with a lower safety requirement (i.e. = 0.8), the average costs of LAOC are low and close to those of pure ML, and the risk ratio is also much lower than that of pure ML because the (1 + )-safety constraint is always satisfied. We provide more details below.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.1">Safety Violation.</head><p>The safety violation probability on the testing dataset is given in Figure <ref type="figure">2</ref>(a). The violation probability is the ratio of the number of instances that violate the safety constraint to the total number of testing instances. A higher in (1 + )-safety in (4) gives a less strict safety constraint, so the violation probability decreases with . We can observe that ML can have a high safety violation probability even when is large. If the ML model is not sufficiently trained (e.g. ML model at Epoch 150 in Figure <ref type="figure">2</ref>(a)), the safety violation probability is even higher. These results show that pure ML is not safe enough for water supply systems. Although CRL reduces the safety violation probability compared to ML, it still has a high safety violation rate. Moreover, MPC-LSTM has a high safety violation rate due to the lack of a prediction performance guarantee, and TMPC has a reduced but non-zero safety violation probability. As a learning-augmented design, Lin can also violate the safety constraint, especially when the safety requirement is high (small ). Decreasing the combination weight for the ML model from 0.5 to 0.2 can reduce the violation probability, but this results in a large increase in the average costs shown in Table <ref type="table">1</ref>. By contrast, LAOC never violates the safety constraint for any problem instance and any safety requirement parameter , which validates the effectiveness of LAOC in strictly guaranteeing the safety constraint as proved in Theorem 4.3.</p></div>
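<div xmlns="http://www.tei-c.org/ns/1.0"><p>The violation probability defined above can be computed as in the following Python sketch. It assumes that each testing instance is summarized by two sequences of cumulative risk, one for the evaluated policy and one for the control prior, and that an instance counts as a violation if the policy risk exceeds the scaled prior risk in any round. The function names and the toy numbers are illustrative and are not results from the experiments.</p><eg><![CDATA[
```python
def violates_safety(policy_risks, prior_risks, safety_param):
    """True if the policy risk exceeds (1 + safety_param) * prior risk in any round."""
    return any(
        r > (1.0 + safety_param) * p
        for r, p in zip(policy_risks, prior_risks)
    )


def violation_probability(instances, safety_param):
    """Fraction of instances with at least one violated round."""
    flagged = sum(
        violates_safety(policy, prior, safety_param)
        for policy, prior in instances
    )
    return flagged / len(instances)


# Toy instances: (policy cumulative risks, prior cumulative risks) per round.
toy_instances = [
    ([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]),
    ([1.5, 3.5, 4.0], [1.0, 2.0, 3.0]),
    ([1.0, 2.2, 3.1], [1.0, 2.0, 3.0]),
]
p_strict = violation_probability(toy_instances, safety_param=0.2)
p_loose = violation_probability(toy_instances, safety_param=1.0)
print(p_loose)  # 0.0
```
]]></eg><p>In this toy data only the second instance violates the stricter constraint, and none violates the looser one, mirroring the trend that the violation probability decreases as the safety parameter grows.</p></div>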
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.2">Cost-safety tradeoff.</head><p>Figure <ref type="figure">2</ref>(b) and Figure <ref type="figure">2</ref>(c) demonstrate the tradeoff between the average costs and the safety requirement for LAOC. The preset parameter in the safety constraint (4) indicates the level of safety requirement: with smaller , the safety constraint becomes stricter. When = 0, the safety constraint is so strict that LAOC reduces to OGD, the default control prior used in LAOC. Thus, the carbon and energy costs of LAOC are the same as those of OGD. When becomes larger, the (1 + )-safety constraint (4) becomes less strict and the average costs of LAOC approach the average costs of the corresponding pure ML models, so LAOC can achieve lower carbon and energy costs. When becomes large enough, we can observe that the average costs of LAOC are the same as those of the pure ML model. The carbon and energy costs also show the impact of the ML quality. The ML model at Epoch 400 is better than the ML model at Epoch 200, so the average costs of LAOC with the ML model at Epoch 400 are lower than those of LAOC with the ML model at Epoch 200. These observations coincide with Theorem 4.4, which theoretically shows the tradeoff between the average costs and the safety requirement.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.3">OOD Testing.</head><p>To further evaluate the robustness of the algorithms, we give results under the Out-Of-Distribution (OOD) setting in Table <ref type="table">2</ref>. We generate the OOD testing demand sequences by adding Gaussian noise to the original demand sequences. We can find that the control priors ROBD and OGD both achieve a low safety risk ratio, but their average costs are very high. The LSTM predictor is largely affected by the OOD testing, causing a very high safety risk for MPC-LSTM. TMPC can be applied to reduce the safety risk to some extent, but the worst-case safety risk is still very high. This is because the nominal model in TMPC cannot define a robust enough tube in the OOD setting. The pure ML policy and the CRL policy are also largely affected by OOD testing. We can observe from Table <ref type="table">2</ref> that ML and CRL both have a high safety risk in the worst case. The expected safety constraint satisfaction in CRL does not help much in OOD testing because CRL is trained on a distribution that is very different from the testing distribution. That being said, ML still achieves the lowest average energy and carbon costs.</p><p>The learning-augmented designs that combine ML with control priors are effective in achieving a low safety risk. Even Lin can achieve a low safety risk by choosing a good combination weight . However, Lin has high average energy and carbon costs because it is limited in exploiting the ML predictions. By comparison, LAOC (e.g. = 0.4) not only guarantees a small enough risk for any problem instance, but also achieves low average energy and carbon costs.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Concluding Remarks</head><p>This work considers an online control problem for water supply management. Besides minimizing the average energy cost, we consider the safety constraint against a given control prior. We design a learning-augmented algorithm, LAOC, that strictly ensures the safety constraint. Our analysis reveals the tradeoff between the cost performance and the safety requirement. We evaluate the performance in a case study of building water supply, showing the superiority of LAOC in reducing energy cost and carbon emission while guaranteeing the safety requirements. In the future, the proposed design can be extended to broader applications such as EV charging and sustainable data centers to improve the efficiency of these systems and provide safety guarantees for them.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Appendix A Additional Numerical Results</head><p>In this section, we give more numerical results of the case study in Section 5. We first give more details on the testing loss for different training epochs and safety parameter followed by the maximum risk ratio with different . Then, we provide an ablation study on the impact of different control priors on LAOC. Next, we show the safety violation probability under the OOD setting. Finally, we give an instance study to explain LAOC intuitively.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A.1 Training and Testing Details</head><p>A.1.1 Convergence. In Figure <ref type="figure">3</ref>(a), we show the average testing losses as the training evolves. We show the training sequences of pure ML (blue curve) and the safety-aware fine-tuning of LAOC-F (orange curve), respectively. LAOC (ML) (green curve) takes the purely trained ML model at the corresponding epoch as input. The average loss is normalized by the average loss of the optimal policy, i.e., it is reported as the ratio of the two expected losses. The testing losses converge after 400 epochs. ML trained purely without considering safety has the best testing loss convergence. Due to the safety constraint, the testing loss of LAOC (ML), which uses the purely trained ML model as input, increases substantially. With the safety-aware fine-tuning in <ref type="bibr">(10)</ref>, LAOC-F effectively reduces the testing loss of LAOC because the fine-tuning is performed on an objective that takes the safety set <ref type="bibr">(7)</ref> into consideration, which validates the conclusion in Theorem 4.5.</p><p>A.1.2 Testing loss with respect to the safety parameter. Figure <ref type="figure">3</ref>(b) shows how the average testing loss of LAOC changes with the safety parameter in the safety constraint (4). The average testing loss is the weighted combination of the energy cost, the carbon cost, and the deviation penalty, and is normalized by the average loss of the optimal policy. When the safety parameter becomes larger, the safety constraint (4) becomes less strict and the average loss of LAOC approaches that of the corresponding purely trained ML. When the safety parameter is 0, the safety constraint is the strictest and LAOC reduces to the control prior OGD. These observations coincide with the cost bound in Theorem 4.4. Additionally, we evaluate the average testing loss of Lin-0.2 and find that although Lin-0.2 has a low safety violation probability, it is so conservative that its average loss is very high. These results validate the superiority of LAOC in achieving a low average loss while guaranteeing safety.</p><p>A.1.3 Maximum risk ratio with respect to the safety parameter. In Figure <ref type="figure">3</ref>(c), we show how the worst-case risk ratio changes with the safety parameter in the safety constraint <ref type="bibr">(4)</ref>. As the safety parameter becomes larger, LAOC takes greater risks. Nevertheless, the risk is still lower than that of purely trained ML even for very large values of the safety parameter. These results show the advantage of LAOC in decarbonizing water supply systems under the safety guarantee.</p></div>
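<div xmlns="http://www.tei-c.org/ns/1.0"><p>To make the conservativeness of the Lin baseline concrete, the following is a minimal sketch that assumes Lin forms a convex combination of the ML action and the control-prior action, with the combination weight placed on the ML action (so that Lin-0.2 would put weight 0.2 on ML). Both the exact form of the combination and the side on which the weight is placed are assumptions here, and lin_action is a hypothetical name.</p>
```python
def lin_action(ml_action, prior_action, weight):
    """Convex combination of two actions (illustrative assumption).

    `weight` is the share given to the ML action; the remaining share
    goes to the control-prior action.
    """
    if weight > 1.0 or 0.0 > weight:
        raise ValueError("weight must lie in [0, 1]")
    return weight * ml_action + (1.0 - weight) * prior_action


# Hypothetical actions (water supply in cubic meters for one hour).
ml_suggestion = 10.0
prior_suggestion = 2.0
print(lin_action(ml_suggestion, prior_suggestion, 0.2))
```
<p>Under this assumption, a small weight keeps the action close to the prior in every round, regardless of how good the ML prediction is, which is consistent with the high average loss observed for Lin-0.2.</p></div>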
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A.2 LAOC with different control priors</head><p>In Figure <ref type="figure">4</ref>, we give the average costs of LAOC using different control priors (OGD, ROBD, MPC). Here, MPC represents MPC-0.03 with a generated prediction error of 0.03. MPC-0.03 achieves a maximum risk ratio of 2.52, an average carbon cost of 17782 kg, and an average energy cost of $6924. By the performance bound in Theorem 4.4, the expected loss is affected by the per-round risk performance of the control prior and by the action discrepancy between the pure ML action and the control prior. As shown in Table <ref type="table">1</ref>, ROBD has the lowest worst-case risk, which defines the most stringent safety constraint, so LAOC (ROBD) has a larger average loss and larger carbon/energy costs than LAOC with the other two priors. We also observe that although OGD has the largest average carbon/energy costs, LAOC (OGD) can achieve low carbon/energy costs when the safety parameter is slightly larger. This is because the safety constraint is defined by the risk of OGD, which is higher than that of ROBD. No matter which control prior is considered, LAOC always guarantees the safety constraint with respect to that control prior.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A.3 Safety Violation Probability Under OOD Setting</head><p>Under the OOD setting, the violation rates of the safety constraint (4) with respect to the control prior OGD are given in Figure <ref type="figure">5</ref>. A larger safety parameter in (4) gives a less strict safety constraint, so the violation probability decreases as the parameter increases. MPC-LSTM is strongly affected by the distribution shift and has the highest safety violation probability. TMPC reduces the violation probability, but it remains large. Both ML and CRL have non-zero violation probability, and the violation probability of CRL is even larger than that of ML when the safety parameter is small. CRL is ineffective because it guarantees an expected constraint on the training distribution, whereas the testing distribution differs substantially from the training distribution.</p><p>As a learning-augmented design, Lin can achieve a low safety constraint violation rate by choosing a small enough combination weight, but this results in a large increase in average costs, as shown in Table <ref type="table">2</ref>. By contrast, even in the OOD setting, LAOC never violates the safety constraint for any problem instance and any value of the safety requirement parameter, which validates the effectiveness of LAOC in strictly guaranteeing the safety constraint as proved in Theorem 4.3.</p></div>
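<div xmlns="http://www.tei-c.org/ns/1.0"><p>The violation probability reported above can be estimated empirically as the fraction of test instances on which a policy breaks the safety constraint. The sketch below assumes, for illustration only, that an instance counts as a violation when the total risk of the policy exceeds (1 + safety_param) times the total risk of the control prior; the precise form of constraint (4) is not reproduced in this section, and violation_rate and the sample numbers are hypothetical.</p>
```python
def violation_rate(policy_risks, prior_risks, safety_param):
    """Fraction of instances whose policy risk exceeds the allowed budget.

    Assumption: the budget for each instance is
    (1 + safety_param) * (risk of the control prior on that instance).
    """
    pairs = list(zip(policy_risks, prior_risks))
    violations = sum(
        1 for risk, prior in pairs if risk > (1.0 + safety_param) * prior
    )
    return violations / len(pairs)


# Hypothetical per-instance total risks, for illustration only.
policy = [1.0, 3.0, 5.0, 2.0]
prior = [1.0, 1.0, 2.0, 2.0]
for param in (0.0, 1.0, 2.0):
    print(param, violation_rate(policy, prior, param))
```
<p>Because the budget grows with the safety parameter, the estimated rate is non-increasing in that parameter, matching the trend described for Figure 5.</p></div>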
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A.4 Instance study</head><p>In Figure <ref type="figure">6</ref>, we give a snapshot of a 24-hour problem instance to build intuition about the control process. Figure <ref type="figure">6(a)</ref> shows the traces of carbon intensity, energy price, and water demand of the instance. From Figure <ref type="figure">6</ref>(b), we can observe that ML chooses to delay the water supply when the carbon intensity or energy price is high, and tends to schedule a large water supply when the carbon intensity or energy price is relatively low. This shows the effectiveness of the ML policy in utilizing the water tank to save energy costs by buffering the demand. However, Figure <ref type="figure">6</ref>(c) shows that the water level under ML can be very low at some hours. In this instance, the water level under ML drops to 10 m&#179;, which is much lower than the nominal safe water level of 40 m&#179;. This results in a high safety risk because there is not enough water if an emergency occurs in the building. By comparison, the control priors OGD and ROBD take much more conservative actions, as shown in Figure <ref type="figure">6</ref>(b), and maintain the nominal water level very well, as shown in Figure <ref type="figure">6</ref>(c). However, they are limited in predicting and exploiting the time-varying energy price and carbon intensity, and are thus ineffective in saving energy costs and reducing carbon emissions. In contrast, the proposed algorithm LAOC ( = 0.8) achieves a good trade-off between safety and costs. It maintains a water level not far from the nominal water level (orange curve in Figure 6(c)), so the safety risk of LAOC is low. At the same time, LAOC regulates the water supply with awareness of the time-varying carbon intensity and energy price (orange curve in Figure 6(b)), so it is also effective in saving energy costs and reducing carbon emissions.</p><p>Figure 3 panels: (a) average testing loss versus training epoch; (b) average testing loss versus the safety parameter; (c) maximum risk ratio versus the safety parameter.</p><p>Figure 4 panels: (a) average loss; (b) average carbon emission (ton); (c) average energy cost (k$).</p><p>Figure 5: Safety Violation Probability Under OOD Setting.</p><p>B Proof of Proposition 4.1</p><p>Proof. We prove the claim by constructing a counterexample. In this example, the dynamic function is linear, i.e. &#8462; ( , ) = &#8462; + &#8462; + &#8462; , and the control prior has a competitive ratio of &#8224; (i.e. &#8224; * &#8804; &#8224; ). We prove that, at least for this example, -competitiveness is not guaranteed by Lin.</p><p>If Lin guarantees competitiveness, since the competitive ratio of &#8224; is &#8224; , we must have</p><p>Since the risk function &#8462; is -strongly convex and the dynamic function is linear, the total risk is also strongly convex with parameter . By the smoothness of the cost function, we have &#9661; * * = 0, and so</p><p>where the inequality holds by the -strong convexity of ( ). 
Substituting (<ref type="formula">17</ref>) into (<ref type="formula">16</ref>), we have</p><p>and by rearranging terms and applying the triangle inequality, we have</p><p>Applying the -strong convexity of ( ) and &#9661; * * = 0 again, we have</p><p>Substituting (<ref type="formula">20</ref>) into (<ref type="formula">19</ref>), we have</p><p>If Lin guarantees the competitiveness, then the ML advice must satisfy</p><p>Figure 6 panels: (a) context sequence (carbon intensity in kgCO2eq/kWh, energy price in $/kWh, and water demand in m&#179; over 24 hours); (b) action sequence (m&#179;) for ML, OGD, ROBD, and LAOC ( = 0.8); (c) water level sequence (m&#179;) for ML, OGD, ROBD, and LAOC ( = 0.8).</p><p>Given &#8712; (0, 1] and finite &#8224; , the right-hand side is a finite value. Thus, when &#8800; 0, for arbitrary ML advice with unbounded <ref type="bibr">45]</ref>). For any convex and cost function with respect to its input ( , ), it holds for a parameter &gt; 0 that,</p><p>Proof of Proposition 4.2</p><p>Proof. Note that &#8462; is non-negative. Thus, if the safe action set U ,&#8462; in (<ref type="formula">7</ref>) is non-empty for each &#8462; &#8712; [ ], then we can always guarantee the competitiveness in Eqn. (<ref type="formula">5</ref>) by selecting an action in U ,&#8462; in Algorithm 1. We then prove the non-emptiness of the safe action set U ,&#8462; by induction.</p><p>First of all, U ,0 is not empty because &#8224; 0 is always in U ,0 . Then, assuming U ,&#8462;-1 is not empty, we prove that U ,&#8462; is not empty and contains at least the action &#8224; &#8462; as follows. 
Since U ,&#8462;-1 is not empty, we have &#8462;-1 &#8712; U ,&#8462;-1 by Algorithm 1, and it holds that &#8462;-1 + &#8462;-1 ( &#8462;-1 ) &#8804; (1 + ) &#8224; &#8462;-1 .</p><p>Thus if &#8462; = &#8224; &#8462; , we have</p><p>where the second inequality holds by Lemma C.1.</p><p>Since the reservation cost is chosen as</p><p>where the first inequality comes from the Lipschitz continuity of dynamic &#8462; , and the second inequality holds by the choice of 2 ) &#8462; &#8242; for some constant 1 &#8805; 1 <ref type="formula">27</ref>) into <ref type="bibr">(25)</ref>, it holds for &#8462; = &#8224; &#8462; that &#8462;-1 + ( &#8462; , &#8224; &#8462; ) + &#8462; ( &#8224; &#8462; ) &#8804; (1 + ) &#8224; &#8462;-1 + &#8462; ( &#8224; &#8462; , &#8224; &#8462; ) . Therefore, &#8224; &#8462; is in the safe action set U ,&#8462; and so U ,&#8462; is not empty.</p><p>Therefore, by the discussion at the beginning of this proof, the proposition is proved. &#9633;</p><p>D Proof of Theorem 4.4</p><p>We denote the policy LAOC on the basis of the ML policy &#732; and the action set U ,&#8462; as</p><p>where &#8462; is the ML input at round &#8462;, and is the projection function in <ref type="bibr">(8)</ref> or the linear function in <ref type="bibr">(9)</ref>. By directly applying the ML policy &#732; without projection or linear operations, we get the action sequence { &#732; &#8242; &#8462; , &#8462; &#8712; [ ]} and the state sequence { &#732; &#8462; , &#8462; &#8712; [ ]}, and the corresponding ML inputs (which include &#732; &#8462; ) are denoted as &#732; &#8462; .</p><p>Lemma D.1. 
Given two constants 1 &gt; 0 and 0 &#8712; (0, ), if the potential function is designed as &#8462; ( ) = &#8462; &#8741; &#8462; ( &#8462; , &#8462; ) -( &#8224; &#8462; , &#8224; &#8462; )&#8741; 2 with &#8462; &#8805; 0 satisfying 2 2 &#8462; &#8804; &#8462;-1 -(1 + 1 0 ) 2 for &#8462; &#8712; [ -1], = 0, then &#8462; is in the competitive action set <ref type="bibr">(7)</ref> if</p><p>&#8462; is the risk of the prior at time &#8462;, and =</p><p>Proof. Note that at time &#8462; -1, the competitiveness constraint holds as</p><p>and the sufficient condition for &#8462; &#8712; U ,&#8462; is &#8462; ( &#8462; , &#8462; ) + &#8462; ( &#8462; )&#8462;-1 ( &#8462;-1 ) &#8804; (1 + ) &#8462; ( &#8224; &#8462; , &#8224; &#8462; ).</p><p>Iteratively applying <ref type="bibr">(39)</ref>, we have</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>&#9633;</head><p>Proof of Theorem 4.4</p><p>Proof. Now we are ready to bound the difference of expected costs of LAOC and the pure ML policy &#732; which is E 0:</p><p>( 0: ) -E 0: &#732; ( 0: )</p><p>We can bound this difference as</p><p>( 0: ) -E 0: &#732; ( 0: ) =E 0: &#8462;=0 &#8462; &#8462; , ( &#732; ( &#8462; ), U ,&#8462; )&#8462; ( &#8462; , &#732; ( &#8462; ))</p><p>where the first inequality holds because the cost functions &#8462; are -Lipschitz continuous, &#732; is -Lipschitz and &#732; &#8462;&#8462; = &#732; &#8462;&#8462; for the same context instance. and the second equality is due to the definition of &#8462; ( &#8462; ) = &#8741; ( &#732; ( &#8462; ), U ,&#8462; ) -&#732; ( &#8462; )&#8741; in Lemma D.2 and &#8741; &#732; ( &#732; &#8462; ) -&#732; ( &#8462; )&#8741; &#8804; &#8741; &#732; &#8462;&#8462; &#8741; = &#8741; &#8462; -&#732; &#8462; &#8741;.</p><p>By Lemma D.2, we can further bound the expected cost difference as E 0:</p><p>( 0: ) -E 0: &#732; ( 0: )</p><p>where &#8242; = ( &#8730; 1 + -1) 2 , &#8462; = &#8741; &#732; ( &#732; &#8462; ) - &#8224; ( &#8224; &#8462; )&#8741;, and ( ) = 0 as there is no action at round . By Lemma D.3, the expected cost is bounded as E 0:</p><p>( 0: ) -E 0: &#732; ( 0: ) &#8804; E 0: Proof. Since the policy ( ) is one from the constrained policy set &#928; , we apply the statistical generalization theorem in <ref type="bibr">[15]</ref> and get with probability at least 1 -, &#8712; (0, 1),</p><note type="other">E</note><p>( ) -1 =1 ( ) ( ( ) 0: ) &#8804; 4 2 ln 4 ( , &#928; , &#710; 1 ) ,</p><p>where ( , &#928; , &#710; 1 ) is the -covering number of the competitive policy space &#928; with 1 -norm as the distance measure: the distance of two functions and &#8242; is &#8741; -&#8242; &#8741; &#710;</p><p>1 = 1 =1 &#8741; ( ( ) ) -&#8242; ( ( ) )&#8741; 1 . By Eqn. (10), we have 1 =1 ( ) ( ( ) 0: ) &#8804; 1 =1 * ( ( ) 0: ). 
Thus, we have E ( ) &#8804; 1 =1 * ( ( ) 0: ) + 4 2 ln 4 ( , &#928; , &#710; 1 ) &#8804; E * + 8 2 ln 4 ( , &#928; , &#710; 1 ) ,</p><p>where the last inequality holds by applying the generalization theorem in <ref type="bibr">[15]</ref>. By Eqn. (44), we have</p><p>where the O notation indicates growth with the episode length and the maximum loss value. &#9633;</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0"><p>The amount of water supply per hour can be adjusted by controlling either the activation time of the pumps or the speed of the pumps. For ease of computation, we assume the speed of the pumps is constant within each hour.</p></note>
		</body>
		</text>
</TEI>
