<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Optimization of Offloading Policies for Accuracy-Delay Tradeoffs in Hierarchical Inference</title></titleStmt>
			<publicationStmt>
				<publisher>IEEE</publisher>
				<date>05/20/2024</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10570474</idno>
					<idno type="doi">10.1109/INFOCOM52122.2024.10621325</idno>
					
					<author>Hasan Burhan Beytur</author><author>Ahmet Gunhan Aydin</author><author>Gustavo de Veciana</author><author>Haris Vikalo</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[We consider a hierarchical inference system with multiple clients connected to a server via a shared communication resource. When necessary, clients with low-accuracy machine learning models can offload classification tasks to a server for processing on a high-accuracy model. We propose a distributed online offloading algorithm which maximizes the accuracy subject to a shared resource utilization constraint thus indirectly realizing accuracy-delay tradeoffs possible given an underlying network scheduler. The proposed algorithm, named Lyapunov-EXP4, introduces a loss structure based on Lyapunov-drift minimization techniques to the bandits with expert advice framework. We prove that the algorithm converges to a near-optimal threshold policy on the confidence of the clients' local inference without prior knowledge of the system's statistics and efficiently solves a constrained bandit problem with sublinear regret. We further consider settings where clients may employ multiple thresholds, allowing more aggressive optimization of overall accuracy at a possible loss in fairness. Extensive simulation results on real and synthetic data demonstrate convergence of Lyapunov-EXP4, and show the accuracy-delay-fairness tradeoffs achievable in such systems.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>I. INTRODUCTION</head><p>Machine learning (ML) applications and services are evolving to enable finding solutions to increasingly more challenging problems by deploying computationally intense models <ref type="bibr">[1]</ref>, <ref type="bibr">[2]</ref>. This presents a challenge to their widespread deployment, especially on platforms with limited power and computational capabilities such as smartphones and IoT devices.</p><p>At one extreme, commonly seen in many real-world ML based applications <ref type="bibr">[3]</ref>, the entire ML model is executed on a server with sufficient computational power allowing devices to draw on minimal local computation. Although this approach relieves the computational burden on the users' devices, it leads to increased communication costs, reduced responsiveness, and may bring up privacy concerns.</p><p>At the other extreme, one can run ML based applications solely on users' devices. To facilitate this, there has been significant research on techniques for compressing complex ML models, resulting in what is commonly referred to as tinyML <ref type="bibr">[4]</ref>. While this allows running models with small computational and memory footprints, they often come at the cost of sacrificing performance and accuracy as compared to the larger complex ML models typically run at the edge/cloud. Between these extremes is Hierarchical Inference, a framework aiming to enable combining execution of ML tasks on users' devices with computational offloading to a server <ref type="bibr">[5]</ref>- <ref type="bibr">[8]</ref>. In particular, users' devices employ a low-complexity ML model (L-ML) while the server hosts a more powerful high-complexity ML model (H-ML). A task is first processed by L-ML on the user's device. If the result requires further investigation or is deemed to have low confidence, the task is offloaded to the server to be processed by H-ML. 
Unlike DNN Partitioning <ref type="bibr">[9]</ref>, <ref type="bibr">[10]</ref>, Hierarchical Inference does not (partially) offload every task to the server, which helps improve the system's responsiveness. Although a hierarchical inference system can be built from any pair of L-ML and H-ML, or from an ML model allowing early exit <ref type="bibr">[11]</ref>, it requires an offloading policy that decides, based on the output of the L-ML, whether offloading is needed. Furthermore, such decisions might depend on the congestion of the communication resources.</p><p>Our work focuses on hierarchical inference systems with multiple user devices and random offloading costs. We propose an online algorithm that learns how to manage offloading in such settings while being oblivious to the statistics of the ML models, the input tasks, and the offloading cost distributions. The proposed algorithm utilizes the multi-armed bandit with expert advice framework <ref type="bibr">[12]</ref> to maximize the system's inference accuracy while deploying a novel loss structure based on Lyapunov optimization theory to meet a constraint on resource utilization or, more generally, on offloading costs. The algorithm converges to an optimal threshold policy applied to the inference confidence of the L-ML model.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Related Work</head><p>Recently, hierarchical inference systems have attracted significant attention from the research community <ref type="bibr">[5]</ref>- <ref type="bibr">[8]</ref>. An important line of this research focuses on designing metrics that capture the confidence of the local inference without knowing the correctness of the result. Examples include <ref type="bibr">[6]</ref>, where a confidence metric based on the variance of the local classifier's output is proposed, and <ref type="bibr">[7]</ref>, where the confidence of a prediction is assessed by applying a radial function kernel to the entropy of the local classifier's output. Ultimately, confidence metrics are meant to aid in designing offloading algorithms for hierarchical inference systems such as the one in <ref type="bibr">[8]</ref>, which deploys multiple local ML models on devices and a more complex/accurate ML model on a server. Given the different accuracy, processing and computation times of the models, the accuracy maximization problem under a time constraint is formulated as an integer linear program. An online learning approach proposed in <ref type="bibr">[5]</ref> seeks a threshold policy on the confidence of L-ML that maximizes the accuracy under a fixed offloading cost. Although our work has a similar goal as <ref type="bibr">[5]</ref>, it is distinct as it focuses on multiple users under a shared utilization constraint and the distributed online offloading algorithm.</p><p>The long-term stochastic optimization problems encountered in various computation offloading problem, are studied using Lyapunov optimization techniques, specifically Driftplus-Penalty algorithm <ref type="bibr">[13]</ref>. 
The Drift-plus-Penalty formulation is used in <ref type="bibr">[14]</ref>, <ref type="bibr">[15]</ref> to solve offloading problems, where <ref type="bibr">[14]</ref> investigates a mobile-edge computing system with energy harvesting devices under a hard execution delay deadline, and <ref type="bibr">[15]</ref> formalizes the joint service caching and task offloading problem for minimizing computation latency under a long-term energy consumption constraint. Similarly, in <ref type="bibr">[16]</ref>, <ref type="bibr">[17]</ref> the authors present long-term optimization problems as two-step optimization problems and leverage the Drift-plus-Penalty algorithm to solve one of the steps. In all these works, the Lyapunov formulation involves an integer problem and the solution requires knowing the system parameters at each slot.</p><p>In another line of related recent work, Lyapunov optimization techniques have been combined with online learning mechanisms to develop algorithms addressing constrained bandit optimization problems <ref type="bibr">[18]</ref>-<ref type="bibr">[23]</ref>. Specifically, <ref type="bibr">[18]</ref>, <ref type="bibr">[19]</ref> utilize Lyapunov-drift minimization methods for constrained online dispatching via the UCB algorithm. In <ref type="bibr">[20]</ref>, a generic constrained bandit problem which incorporates random costs of actions, a knapsack constraint, and a stochastic feasibility constraint was studied; there, a Lyapunov-drift minimization technique was utilized to design a low-complexity bandit algorithm based on UCB. Our work addresses constrained contextual bandit problems by combining Lyapunov-drift minimization techniques with the EXP4 algorithm.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Contributions</head><p>The contributions of this paper are summarized as follows.</p><p>&#8226; For a hierarchical inference system supporting multiple clients, we formulate an accuracy maximization problem under a shared resource utilization constraint and explore the associated accuracy-delay trade-offs. We combine Lyapunov drift minimization techniques and bandits with expert advice framework to propose a low-complexity online offloading algorithm that we refer to as Lyapunov-EXP4 (Ly-EXP4). Ly-EXP4 converges to a near-optimal thresholding policy on the local inference confidence while satisfying the utilization constraint without prior knowledge of the statistics of tasks, confidence metric, and service times across clients. We prove that for finite horizon N and M discretized thresholds, Ly-EXP4 has a regret bound of O( p 2N &#8675; log(M )) and an optimality-gap of O( <ref type="formula">1</ref>V )+&#9999; 0 , where V is a parameter controlling tradeoff between accuracy and constraint violation. To the best of our knowledge, our work is the first one in the literature discussing a hierarchical inference problem with multiple clients under a utilization constraint.</p><p>&#8226; We show that Ly-EXP4 can be used as a distributed online offloading algorithm maximizing the total accuracy of a system under a resource utilization constraint. In such settings, each client or a group of clients simultaneously learns an individual thresholding policy. &#8226; Ly-EXP4 is not limited to the hierarchical inference problem; instead, it can be used in a more general class of constrained contextual bandit problems -a potentially valuable contribution in itself. &#8226; We provide extensive simulation results using both real and synthetic ML models and datasets, demonstrating the performance of Ly-EXP4 in terms of convergence, delay and fairness. 
Additionally, we explore the accuracy-fairness trade-off in settings with multiple thresholds, which allow Ly-EXP4 to achieve higher accuracy at the expense of fairness across clients.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>II. SYSTEM MODEL FOR THE HIERARCHICAL INFERENCE</head><p>We consider a hierarchical inference system with a set K of clients k 2 K connected to a server over a shared communication resource, executing classification tasks using a neural network based classifier. The classification tasks that the clients face are first processed locally using a small ML model with low complexity and low accuracy (L-ML); in addition to the prediction, L-ML quantifies the confidence in its decision. After comparing the confidence with a threshold, the client decides whether to offload the task to a server to be processed with a more complex ML classifier with higher accuracy (H-ML). However, shared communication resources limit the number of tasks that the clients can offload.</p><p>We assume that classification tasks arrive to a client according to a Poisson process. Let k denote the arrival rate of tasks to client k, and let = P k2K k . The n th task arriving to the system is received by client K n at time T n . Let us denote the true class of task n and the class predicted by the local L-ML by C n and &#264;n , respectively. We assume that the true class of a task received by client k is independent and identically distributed according to an unknown clientspecific distribution, i.e., &#8629; k = (&#8629; k (c) : c 2 C), where &#8629; k (c) denotes the probability that a task arriving to client k is of class c, and C is the set of classes. Since true class distributions of tasks differ from one client to another, and the clients deploy potentially different L-ML models, the predicted class distributions generally vary across the clients.</p><p>An L-ML model locally processing classification task n provides predicted class &#264;n and confidence measure Z n . Given C n and &#264;n , Z n is assumed to be independent and identically distributed on a bounded support &#8998; Z . 
As with C_n and Ĉ_n, we do not assume a specific distribution for Z_n.</p><p>Since in our hierarchical inference system the offloading decisions are based on the local inference confidence, the quality of the confidence metric is essential. Nevertheless, the offloading algorithm introduced in Section III converges to an optimal threshold policy without prior knowledge of the distribution of Z_n, regardless of the choice of metric.</p><p>When the L-ML is a neural network used for classification, a simple choice for the confidence metric is the largest output of the last soft-max layer; such a quantity must be at least 1/|C| (when all class outputs are equal), and cannot exceed 1.</p><p>Next, we define the stationary threshold policy for the hierarchical inference.</p><p>Definition 1 (Stationary Threshold Policy). The offloading decision under a threshold policy π_θ with a threshold θ ∈ Ω_Z is given by</p><p>where I_n^{π_θ} = 1 indicates that task n is offloaded. Offloading a task incurs resource expenditures, e.g., extra load on the communication channel or an operational cost of using the server. The offloading cost of task n is modelled by a random variable X_n; the offloading costs of the tasks received by client k are assumed to be independent and identically distributed according to an unknown client-specific distribution with a bounded support (without loss of generality, [0,1]) and mean</p><p>We consider the case where the clients share a pool of resources, e.g., wireless access to a BS/server. The constraint on resource utilization is given by</p><p>where λ̂_k^{π_θ} is the offloading rate of client k under the stationary threshold policy π_θ. In practical systems with shared resources, the offloading cost can be thought of as the monetary value of using computational resources at the server, and the constraint level can be interpreted as the budget per unit of operational time. 
When it comes to communications, the mean offloading cost µ_k⁻¹ can be thought of as the mean service time of client k, with the constraint level acting as a utilization budget which indirectly controls the average delay in the system. The latter interpretation follows from the observation that in a network where a scheduling policy controls how multiple clients share communication resources, utilization can be treated as a proxy for average delay, as the two quantities are monotonically related to each other.</p><p>Although our model and algorithm can be used in more general settings, in the remainder of this paper we focus on the accuracy-delay tradeoff in hierarchical inference systems where the task offloading follows a scheduling policy. Since the offloading rates λ̂_k^{π_θ} are well-defined under the stationary threshold policy π_θ, we have that</p><p>Once task n is received by client K_n at time T_n, the client processes the task using its L-ML model. Based on the confidence Z_n of the L-ML prediction, the client decides whether to send the task to the server to be classified by the H-ML model. We assume that the H-ML classifies the tasks perfectly, i.e., has 100% inference accuracy, and that the computation time at the client and the server is negligible compared to the communication time needed to offload the task. In addition, we assume that there is a feedback channel between the server and the clients, which is used to transmit parameters required by the algorithm (e.g., inference results and the bandit loss of an offload, formally defined later in Section III-B). Since the transmitted data is relatively small, we assume that the delay on the feedback channel is negligible.</p></div>
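For concreteness, the confidence metric and the threshold rule of Definition 1 can be sketched as follows (a minimal sketch with names of our own choosing; the offload-below-threshold form matches the expert decisions I{θ > Z_n} defined later in Section III-B):

```python
import numpy as np

def softmax_confidence(logits):
    """Confidence metric Z_n: the largest output of the last soft-max
    layer. As noted in the text, it lies in [1/|C|, 1]."""
    z = np.exp(np.asarray(logits) - np.max(logits))  # numerically stable soft-max
    return float((z / z.sum()).max())

def offload(z_n, theta):
    """Stationary threshold policy pi_theta (Definition 1):
    offload task n iff its local confidence Z_n falls below theta."""
    return z_n < theta
```

With equal logits over four classes the confidence is exactly 1/|C| = 0.25, the minimum possible value.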
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Problem Formulation</head><p>Given the shared communication resource and the hierarchical inference mechanism, we want to find an offloading policy that maximizes the overall inference accuracy of the system while satisfying the utilization constraint. By substituting (4) in (2), we write the inference accuracy maximization problem as the minimization min</p><p>where Y n = I{C n 6 = &#264;n }, e.g., the indicator of an incorrect local classification. If X n and Y n were available at each time step before taking the offloading decision, the constrained optimization problem in ( <ref type="formula">5</ref>) could be solved using the Lyapunov drift minimization technique referred to as the driftplus-penalty algorithm <ref type="bibr">[13]</ref>. However, due to the structure of the hierarchical inference system, X n and Y n are only revealed if task n is offloaded. Since they are observed in hindsight, we need a methodology that allows efficient exploration/exploitation of the system's unknown statistics. To this end, we adopt a constrained bandit framework. Next, we introduce an algorithm based on Lyapunov-EXP4 that makes offloading decision based on a single threshold. In Section III-C, this algorithm is extended to the multiple thresholds case, enabling its distributed implementation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>III. THRESHOLD OFFLOADING POLICY USING LYAPUNOV-EXP4 (LY-EXP4)</head><p>To solve the optimization problem (5), we propose a computationally efficient online learning algorithm Lyapunov-EXP4 (Ly-EXP4) based on the contextual bandits framework. Ly-EXP4 makes decisions using a loss structure that employs the Lyapunov drift minimization technique, maximizing accuracy while satisfying a long-term average expectation constraint.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Virtual Queue and Drift-plus-Penalty Ratio</head><p>In this subsection, we introduce a virtual queue capturing the constraint ( <ref type="formula">6</ref>) and a drift-plus-penalty ratio motivated by <ref type="bibr">[13]</ref>, which are then utilized to specify the bandit loss. To start, we define the causal policy space for our problem.</p><p>Definition 2 (Causal Policy). Let &#8673; be a policy that yields a sequence of offloading decisions {I &#8673; n 2 {0, 1} : N n 1}. Under &#8673;, the history until time n is the filtration</p><p>where (&#8226;) denotes the sigma-field, and X t and Y t are observed only if task t is offloaded (as implied by the terms X t I &#8673; t and Y t I &#8673; t ). A policy &#8673; is said to be causal if &#8673; is nonanticipating, i.e., {I &#8673; n = a} 2 F &#8673; n 1 for all a, n.</p><p>We define the virtual queue under casual policy &#8673; as</p><p>where Q &#8673; 1 = 0. The time average expectation constraint (6) can be viewed as a mean rate stability constraint of a virtual queue <ref type="bibr">[13]</ref> under causal policy &#8673;. To see how the mean rate stability of Q &#8673; n enforces <ref type="bibr">(6)</ref>, note that by the telescoping argument</p><p>Taking expectations and letting N ! 1 one can show that</p><p>If</p><p>where</p><p>denote the conditional expectation given the history up to time n. Denote the Lyapunov drift by</p><p>where Q &#8673; n is measurable with respect to F &#8673; n 1 for all n (&#8673; is causal). At each slot, the drift-plus-penalty algorithm opportunistically minimizes an upper bound on the drift <ref type="bibr">(13)</ref> summed with the one-slot penalty, which in our setting leads to</p><p>where V &gt; 0 is the parameter controlling the trade-off between penalty and drift. 
Note that if X_n and Y_n were observed before the offloading decision of task n, i.e., if X_n, Y_n ∈ F_{n−1}^π, the one-slot optimization problem the drift-plus-penalty algorithm tackles would become min</p><p>The aforementioned bandit algorithm Ly-EXP4, formally presented in the next subsection, deploys <ref type="bibr">(15)</ref> as the loss at time n and exhibits convergence to a near-optimal solution of (5).</p></div>
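Since the bodies of the queue update (8) and the one-slot problem (14)-(15) did not survive extraction, the sketch below shows only the standard drift-plus-penalty structure they build on; the per-task budget term and the exact weighting are assumptions, not the paper's equations:

```python
def virtual_queue_update(q_n, x_n, offloaded, budget):
    """Virtual queue for the utilization constraint (assumed form):
    the backlog grows by the offloading cost X_n when task n is
    offloaded and drains by the per-task budget. Mean rate stability
    of this queue enforces the time-average constraint."""
    arrival = x_n if offloaded else 0.0
    return max(q_n + arrival - budget, 0.0)

def one_slot_loss(q_n, v, x_n, y_n, offloaded):
    """One-slot drift-plus-penalty style loss (assumed form of (15)):
    V weighs the accuracy penalty of keeping an incorrectly classified
    task local against the queue-weighted cost of an offload."""
    return q_n * x_n if offloaded else v * y_n
```

A large V makes the accuracy penalty dominate (more offloading); a large backlog Q_n makes offloading expensive, pushing the policy back toward the constraint.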
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Ly-EXP4: A Bandit with Expert Advice</head><p>In the bandits with expert advice framework, instead of learning an arm selection policy, the agent learns a way of combining experts' predictions that minimizes the regret. One common algorithm to tackle this problem is the EXP4 <ref type="bibr">[12]</ref>, which randomly selects arms according to a distribution computed by a soft-max operation on the cumulative loss each expert received.</p><p>We formulate hierarchical inference as a multi-arm-bandit problem with expert advice; the offloading decisions are denoted by a 2 {0, 1}, where a = 0 corresponds to "not offloading" while a = 1 corresponds to "offloading". In addition, we assume that there are M experts, each corresponding to a threshold policy as in (1) with a threshold &#10003; m 2 M, m 2 [1, . . . , M], where M is a set of discretized thresholds on the support of the confidence metric &#8998; Z . Note that discretizing the thresholds induces a discretization error; while we do not assume a specific discretization scheme, we assume that the optimality gap between the best threshold in M and the optimal (continuous) solution to (5) is negligible. Assumption 1. Let A N (&#8673;) denote the accuracy at time N under policy &#8673;, i.e.,</p><p>Given a set of discretized thresholds M, there exists at least one threshold &#10003; m 2 M such that lim</p><p>where &#10003; &#8676; 2 &#8998; Z is the optimal threshold for problem <ref type="bibr">(5)</ref>.</p><p>A simple discretization scheme would be to select uniform discretization intervals, which ensures &#9999; 0 &#63743; 1 M . 
Note that Assumption 1 enforces not only a condition on the discretization scheme but also a smoothness property on the distribution of the confidence metric Z_n.</p><p>Next, we define the bandit loss based on (15); for task n, let l_{n,a} denote the loss corresponding to decision a ∈ {0, 1},</p><p>Note that such a loss is a combination of the minimization objective and the utilization constraint. Based on <ref type="bibr">(18)</ref>, the loss associated with the offloading decisions taken by the algorithm, L_n, and the loss associated with the m-th expert, L_{n,m}, can be written as</p><p>where I_{n,m} = I{θ_m &gt; Z_n} is the offloading decision of expert m for task n. Since the virtual queue process Q_n is driven by the decisions Ly-EXP4 makes, the expert loss L_{n,m} depends not only on the expert's own decisions I_{n,m} but also on the previous decisions of Ly-EXP4.</p><p>The regret that we want to minimize is defined relative to the best expert in hindsight,</p><p>The regret R_N, loss L_n and expert loss L_{n,m} are all incurred under the Ly-EXP4 policy. However, for notational simplicity, we drop the superscript π^{Ly-EXP4} from these expressions. Ly-EXP4 acts in this bandit setting similarly to EXP4. Depending on Z_n, each expert incurs a loss due to either incorrect local inference or offloading cost. Based on the experts' cumulative loss, Ly-EXP4 computes a normalized weight for each expert via soft-max, where a weight represents the "contribution" of an expert to the stochastic offloading decision.</p><p>Since X_n and Y_n are observed only if a task is offloaded, the experts' losses are not always known. To overcome this problem, we employ an importance-weighted estimator of the experts' loss, L̂_{n,m}, as defined below. Let Ŝ_{n,m} = ∑_{t=1}^{n} L̂_{t,m} denote the cumulative estimated loss for expert m until time n. 
The normalized weight of expert m at time n is defined as</p><p>It follows from (<ref type="formula">21</ref>) that an expert with a higher loss will be assigned a smaller weight. At time n, Ly-EXP4 takes a random offloading decision with probability p_n equal to the sum of the weights of the experts advising to offload based on the observed confidence Z_n,</p><p>Note that this procedure is effectively the same as sampling an expert m at random from the distribution ω_n = (ω_{n,m} : m ∈ M) and following the offloading decision of the selected expert, where the experts with lower cumulative loss have a higher probability of being chosen. Since p_n depends on the losses received until time (n−1), we define the importance-weighted loss estimate as</p><p>If task n is offloaded, the experts incur a loss according to their own offloading decisions I_{n,m}; if task n is not offloaded, the experts incur zero loss. The next lemma establishes that L̂_{n,m} is an unbiased estimator of l_{n,I_{n,m}}.</p><p>The proof of Lemma 1 is given in Appendix A.</p><p>In contrast to EXP4, the offloading decisions in Ly-EXP4 affect the loss in subsequent steps through the virtual queue Q_n, which represents the constraint violation. As a consequence of the adopted loss function, when the resource utilization grows, the weights of the experts in favor of offloading become smaller. Similarly, when the resource utilization is low, the algorithm tends to offload more due to the decreased offloading loss. Ly-EXP4 is formalized as Algorithm 1. In what follows, we prove finite-horizon regret bounds for Ly-EXP4.</p><p>Proposition 1 (Ly-EXP4 Regret Analysis). Given the number of experts M, learning rate η &gt; 0, and V &gt; 0, the regret under Ly-EXP4 satisfies</p><p>where ζ &gt; 2. Specifically, for η =</p><p>The proof is outlined in Appendix B. Note that Ly-EXP4 converges to the threshold minimizing the regret based on the loss formed as a combination of the objective and the problem constraint. 
Therefore, the best expert's policy to which Ly-EXP4 converges is not necessarily the optimal threshold policy for problem <ref type="bibr">(5)</ref>. In the next theorem, we prove that Ly-EXP4 is asymptotically near-optimal.</p><p>Theorem 1 (Optimality of Ly-EXP4). Let π and π_{θ*} denote the policy of Ly-EXP4 and the optimal threshold policy, respectively. By Assumption 1, the regret bound in Proposition 1, and for</p><p>where m₀ is the expert with the threshold θ_{m₀} ∈ M that yields the best solution to problem <ref type="bibr">(5)</ref>,</p><p>and as N → ∞, the optimality gap between Ly-EXP4 and the optimal threshold policy is lim</p><p>The proof is provided in Appendix C. Algorithm 1, Lyapunov-EXP4 (Ly-EXP4), proceeds per task as follows: observe Z_n; choose the offloading decision I_n ∼ p_n; compute the estimates of the expert losses L̂_{n,m} for all m; and update Ŝ_{n,m} = Ŝ_{n−1,m} + L̂_{n,m} for all m.</p><p>As seen from Proposition 1, the regret achieved by Ly-EXP4 differs from that of EXP4 by only a constant factor. This means that, by Theorem 1, Ly-EXP4 solves constrained bandit problems at a convergence rate similar to that of EXP4. Note that increasing M and/or V will decrease the sub-optimality error, but a larger M increases the complexity of the algorithm while a larger V increases the sensitivity to constraint violation and decreases the convergence speed.</p></div>
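A single round of Ly-EXP4 can be sketched as below: the soft-max weighting of (21), the offloading probability of (22), and an importance-weighted update in the spirit of (23). This is a simplified rendering under our assumptions; the paper's exact estimator and loss bookkeeping may differ:

```python
import numpy as np

def ly_exp4_round(S_hat, thetas, z_n, loss_if_offload, eta, rng):
    """One round of Ly-EXP4 (sketch). S_hat: cumulative estimated expert
    losses; thetas: the M expert thresholds; z_n: observed confidence;
    loss_if_offload: the bandit loss revealed when the task is offloaded."""
    # Soft-max over cumulative losses: lower loss -> higher weight (21).
    w = np.exp(-eta * (S_hat - S_hat.min()))
    w /= w.sum()
    advice = (thetas > z_n).astype(float)  # expert m offloads iff theta_m > Z_n
    p_n = float(w @ advice)                # offloading probability (22)
    offloaded = rng.random() < p_n
    # Importance-weighted estimate: only an offload reveals the loss.
    L_hat = (loss_if_offload / p_n) * advice if offloaded else np.zeros_like(advice)
    return S_hat + L_hat, p_n, offloaded
```

With two experts at equal cumulative loss, one advising offload at Z_n = 0.5 and one not, the offload probability is exactly 0.5, and only the offload-advising expert's estimate is ever charged.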
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Extension to Multiple Thresholds</head><p>In Section III-B, we provide a regret and discuss an optimality analysis for the setting where a single instance of Ly-EXP4 is used to find a single threshold policy applied by all clients. However, by treating different groups of clients as different "contexts", one can deploy separate instances of Ly-EXP4 for each group of clients to learn a multi-threshold policy for the offloading problem <ref type="bibr">(5)</ref>. At one extreme, each client can use a separate instance of Ly-EXP4 to find an individual threshold policy, which eventually allows our algorithm to be used in a distributed manner.</p><p>In what follows, we show the regret for the multi-threshold case based on the principles outlined in Chapter 18.1 of <ref type="bibr">[12]</ref>.</p><p>Let g 2 G denote the set of disjoint groups of clients, and G n denote the group task n belongs. By Proposition 1, the regret of the Ly-EXP4 instance executed for group</p><p>where the sum under the square root counts the number of tasks that belong to group g 2 G. Since the number of tasks received by the clients in a group is not known in advance, we use the changing learning rate encountered in the anytime version of EXP4,</p><p>Defining R N as the sum of individual regrets across all groups, the regret for multi-threshold case can be bounded as</p><p>where (32) is the worst-case regret corresponding to the setting where each group receives same number of tasks.</p><p>As seen from ( <ref type="formula">32</ref>), the regret bound for multiple threshold case scales up with the square root of the number of client groups. As discussed in Section IV, although Ly-EXP4 learns a multiple threshold policy slower than a single threshold policy, it achieves significantly better accuracy with a multiple threshold policy by exploiting the heterogeneity in service time and confidence metric distributions across client groups.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>IV. SIMULATION RESULTS</head><p>We conducted extensive simulations running Ly-EXP4 in both single and multiple thresholds settings on real and synthetic models. Specifically, we consider a hierarchical inference system with 4 client groups -with 25 clients each, each experiencing different service times due to various factors including distance to the server and channel quality. For this system, we report results that demonstrate convergence of Ly-EXP4 as well as delay-accuracy and fairness-accuracy tradeoffs.</p><p>In the synthetic model, client k 2 K (g) receives tasks according to a Poisson process with rate k , where K (g) denotes the set of clients in client group g 2 G. The true class C n is assigned to each task according to a client-specific class distribution &#8629; k , where the arrival rates k , and the class probabilities &#8629; k (c), c 2 C are sampled from a uniform distribution with a support [0, 1] and normalized so that total arrival rate = P g2G P k2K (g) k = 1 and</p><p>A synthetic classifier model for L-ML is characterized by a confusion matrix, denoted by H, where each element represents the likelihood of the L-ML model predicting class c 0 given that the true class is c, i.e., H c,c 0 = P( &#264;n = c 0 |C n = c). The elements of H are generated uniformly at random from [0, 1], and the diagonal elements are adjusted such that the model is typically most confident about the true class. Each row is then normalized so that</p><p>In addition to the above, confidence Z n is sampled from a distribution depending on the correctness of the inference,</p><p>In the case of real datasets and models, the tasks are assigned to the clients with probability k / regardless of the true class of a task; confidence Z n is observed as the maximum of the last soft-max layer of the classifier model. 
We utilize relatively small models that are deployable on IoT devices, including a simple neural network with a single dense layer applied to the FMNIST dataset <ref type="bibr">[24]</ref>, and a LeNet-5 model <ref type="bibr">[25]</ref> applied to the CIFAR-10 dataset <ref type="bibr">[26]</ref>.</p><p>The clients in a client group experience random service times with group-specific distributions. We assign equally separated mean service times [0.2, 0.4, 0.6, 0.8] to the client groups. The service times of a client in group g are distributed according to a beta distribution with mean µ_g⁻¹ and variance 0.01. The service times for the synthetic and real dataset/model cases are generated the same way.</p><p>To our knowledge, this is the first work focusing on hierarchical inference under a utilization constraint. To benchmark the proposed Ly-EXP4 algorithm, we compared it with the optimal threshold policies and a simple token bucket policy. When running Ly-EXP4, the confidence support Ω_Z is uniformly discretized using M = 1000 thresholds. To characterize upper performance limits, we introduce a genie algorithm: a noncausal offline algorithm with access to the ground truth, capable of offloading the maximum number of incorrectly classified tasks at the lowest cost within the constraint. We also compute the optimal single and multiple threshold policies using an exhaustive search over [1/|C|, 1]^|G|. The aforementioned token bucket policy makes greedy offloading decisions based on the service times of the tasks, with the token bucket replenished at each task arrival. It offloads task n if the token bucket holds more tokens than the service time X_n; when task n is offloaded, X_n is deducted from the token bucket. The token bucket policy is therefore aware of the service times before offloading.</p></div>
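The token bucket baseline just described can be sketched as follows; the per-arrival replenishment amount, whose symbol was lost in our source, is left as a parameter:

```python
def token_bucket(service_times, replenish):
    """Greedy token-bucket baseline (sketch): the bucket gains
    `replenish` tokens at each task arrival; a task is offloaded iff
    the bucket covers its (known) service time, which is then deducted."""
    bucket, decisions = 0.0, []
    for x_n in service_times:
        bucket += replenish
        if bucket >= x_n:       # enough tokens: offload greedily
            bucket -= x_n
            decisions.append(True)
        else:
            decisions.append(False)
    return decisions
```

Because the rule looks only at service times, never at confidence, it favors cheap-to-offload clients regardless of how accurate their local models are, which explains the inconsistent behavior observed in Section IV-A.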
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Convergence of Ly-EXP4</head><p>Figure <ref type="figure">1</ref> shows the average accuracy and utilization achieved by Ly-EXP4 and the benchmarks on both synthetic and real datasets/models. The service times, arrival rates, and utilization constraint parameter are kept constant across the simulations; however, the inference confidences, L-ML accuracies, and task class distributions π_k depend on the dataset and model used. From these graphs, it is evident that both versions of the Ly-EXP4 algorithm successfully converge to their respective optimal solutions within the specified constraint. Additionally, it is observed that Ly-EXP4 with multiple thresholds achieves higher accuracy, at a slightly slower convergence speed, by exploiting heterogeneity across clients while meeting the utilization constraint.</p><p>Note that in some scenarios the token bucket algorithm achieves accuracy similar to Ly-EXP4 with multiple thresholds (see Figure <ref type="figure">1a</ref>), while in other scenarios it performs worse than Ly-EXP4 with a single threshold (see Figure <ref type="figure">1c</ref>). This inconsistent performance of the token bucket algorithm is due to its greedy policy that prioritizes the clients with small service times. If those clients' L-ML models have low local accuracy, the token bucket achieves very good performance; otherwise, it offloads many correctly classified tasks and thus uses the resources inefficiently. Ly-EXP4 avoids this by making decisions based on the inference confidence.</p></div>
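The token bucket baseline described above can be illustrated with a minimal sketch. The per-arrival token increment is not named in the text, so it appears here as a generic `rate` parameter, and the bucket capacity is assumed unbounded:

```python
class TokenBucket:
    """Greedy token-bucket offloading baseline (a sketch).

    At each task arrival the bucket gains `rate` tokens; a task with
    service time x is offloaded iff the bucket holds at least x tokens,
    in which case x tokens are deducted.
    """
    def __init__(self, rate, capacity=float("inf")):
        self.rate = rate          # tokens added per task arrival (assumed)
        self.capacity = capacity  # assumed unbounded by default
        self.tokens = 0.0

    def offload(self, service_time):
        self.tokens = min(self.tokens + self.rate, self.capacity)
        if self.tokens >= service_time:
            self.tokens -= service_time
            return True   # offload task n, spending X_n tokens
        return False      # keep the task local
```

With `rate=0.5` and unit service times, the policy offloads every other task, so its long-run utilization matches the token fill rate.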
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Delay-Accuracy Tradeoff</head><p>We further investigate the accuracy-delay tradeoff, relying on the monotonic relation between utilization and average delay that is typical of scheduling policies. We simulate a network scenario where at most one task from each client can be offloaded at a time over a shared communication channel under processor-sharing scheduling. If a client has an ongoing offload, any new tasks received by the client wait in a client-specific FCFS queue. If a task is offloaded, its delay includes the waiting time in the client's FCFS queue and the transmission time on the processor-sharing channel; if a task is not offloaded, its delay is zero. The communication time of task n thus depends not only on the random service time X_n but also on the other tasks simultaneously being offloaded.</p><p>Ly-EXP4 updates its weights based on the loss it receives after each offload, which is acceptable for scheduling policies where offloads are performed sequentially (e.g., FCFS). However, this poses a problem for scheduling policies that allow simultaneous offloads, such as ours, since some offloading decisions may be taken based on outdated parameters. It was shown in <ref type="bibr">[27]</ref> that delayed feedback in EXP3 increases the regret bound by a term that scales with the number of decisions based on outdated parameters. However, the effect of delayed feedback is negligible in our setting since each client's offloading decisions for upcoming tasks are made only after its ongoing offload is completed.</p><p>Figure <ref type="figure">2</ref> illustrates the average delay and accuracy obtained as the utilization constraint increases. One can observe diminishing returns in accuracy with respect to delay. 
Moreover, the accuracy improvements obtained by utilizing multiple thresholds are more significant when the utilization constraint is low, implying that multiple thresholds are most beneficial when resources are scarce and more efficient offloading decisions are needed.</p></div>
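The shared channel in this section serves concurrent offloads in an egalitarian processor-sharing fashion. The following is a minimal sketch of how completion times can be computed under that discipline (ignoring the per-client FCFS queues described above): with m active offloads, each is served at rate 1/m.

```python
import math

def processor_sharing_completions(jobs):
    """Completion times on an egalitarian processor-sharing channel.

    `jobs` is a list of (arrival_time, work) pairs; when m jobs are
    active, each is served at rate 1/m. Returns completion times in
    input order.
    """
    order = sorted(range(len(jobs)), key=lambda i: jobs[i][0])
    remaining = {}                      # active job index -> remaining work
    finish = [0.0] * len(jobs)
    t, k = 0.0, 0                       # current time, next-arrival pointer
    while k < len(order) or remaining:
        t_arr = jobs[order[k]][0] if k < len(order) else math.inf
        if remaining:
            rate = 1.0 / len(remaining)
            i_min = min(remaining, key=remaining.get)
            t_fin = t + remaining[i_min] / rate
        else:
            rate, t_fin = 0.0, math.inf
        if t_arr <= t_fin:
            # drain work until the arrival, then admit the new job
            for i in remaining:
                remaining[i] -= (t_arr - t) * rate
            remaining[order[k]] = jobs[order[k]][1]
            t, k = t_arr, k + 1
        else:
            # drain work until the earliest completion
            for i in remaining:
                remaining[i] -= (t_fin - t) * rate
            del remaining[i_min]
            finish[i_min] = t_fin
            t = t_fin
    return finish
```

For instance, two unit-work jobs arriving together each receive half the channel, so neither finishes before time 2; this is the coupling between X_n and concurrent offloads noted above.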
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Fairness-Accuracy Tradeoff</head><p>As the results show, Ly-EXP4 with multiple thresholds generally achieves higher accuracy than the single-threshold strategy. This is accomplished by exploiting the heterogeneity in service times and in task and class distributions across clients. We complement those results by characterizing fairness as quantified by Jain's index J(A_N^(1), . . . , A_N^(K)) = (Σ_k A_N^(k))² / (K Σ_k (A_N^(k))²), where A_N^(k) denotes the average accuracy achieved by client k.</p><p>Figure <ref type="figure">3</ref> shows that Ly-EXP4 with a single threshold is almost perfectly fair across clients, which also means that a single instance of Ly-EXP4 is fair to the clients within a group. As the number of client groups grows, one observes an improvement in the overall accuracy at the expense of fairness. This allows flexibility when utilizing Ly-EXP4 in different settings. By comparison, the token bucket policy achieves fairness similar to Ly-EXP4 with 4 groups. However, the token bucket cannot provide the option of trading fairness for accuracy the way Ly-EXP4 can.</p></div><div xmlns="http://www.tei-c.org/ns/1.0"><head>V. CONCLUSION</head><p>In this paper, we proposed a low-complexity online offloading algorithm that we refer to as Lyapunov-EXP4 (Ly-EXP4). Ly-EXP4 combines Lyapunov drift minimization techniques with the bandits with expert advice framework to maximize accuracy in a hierarchical system with multiple clients, converging to an optimal threshold policy within the utilization constraint. We proved a sublinear regret bound and near-optimality of Ly-EXP4; the proposed framework and presented analysis may be relevant to a broader class of constrained bandit problems. The convergence of Ly-EXP4 and the delay-accuracy-fairness tradeoffs achievable in hierarchical inference systems are demonstrated via simulations on real and synthetic datasets and models. 
As part of future work, we aim to explore settings with multiple servers having varying communication capacities and model accuracies, as well as offloading policies that exploit heterogeneity not only across clients but also across task classes. Exploring fairness-performance tradeoffs in such systems presents another line of potentially interesting future work.</p></div></body>
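For completeness, the Jain's index used to quantify fairness in Section IV-C can be computed as in the following sketch, where the input is the list of clients' average accuracies:

```python
def jains_index(accuracies):
    """Jain's fairness index J = (sum a_k)^2 / (K * sum a_k^2).

    Equals 1 when all K clients achieve the same accuracy, and 1/K in
    the most unfair case where a single client gets all the accuracy.
    """
    K = len(accuracies)
    s = sum(accuracies)
    sq = sum(a * a for a in accuracies)
    return (s * s) / (K * sq)
```

A perfectly fair allocation such as four clients at accuracy 0.8 yields an index of 1, while a maximally unfair one over four clients yields 0.25.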
		</text>
</TEI>
