<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Edge-RT: OS Support for Controlled Latency in the Multi-Tenant, Real-Time Edge</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>December 2022</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10398133</idno>
					<idno type="doi">10.1109/RTSS55097.2022.00011</idno>
					<title level='j'>IEEE Real Time Systems Symposium</title>
<idno></idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Wenyuan Shao</author><author>Bite Ye</author><author>Huachuan Wang</author><author>Gabriel Parmer</author><author>Yuxin Ren</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Embedded and real-time devices in many domains are increasingly dependent on network connectivity. The ability to offload computations encourages Cost, Size, Weight and Power (C-SWaP) optimizations, while coordination over the network effectively enables systems to sense the environment beyond their own local sensors, and to collaborate globally. The promise is significant: Autonomous Vehicles (AVs) coordinating with each other through infrastructure, factories aggregating data for global optimization, and power-constrained devices leveraging offloaded inference tasks. Low-latency wireless (e.g., 5G) technologies paired with the edge cloud are further enabling these trends. Unfortunately, computation at the edge poses significant challenges due to the combination of limited resources, high performance requirements, the security demands of multi-tenancy, and real-time latency constraints. This paper introduces Edge-RT, a set of OS extensions for the edge designed to meet end-to-end (packet reception to transmission) deadlines across chains of computations. It supports strong security by executing a chain per-client device, thus isolating tenant and device computations. Despite a practical focus on deadlines and strong isolation, it maintains high system efficiency. To do so, Edge-RT focuses on per-packet deadlines inherited by the computations that operate on each packet. It introduces mechanisms to avoid per-packet system overheads, while trading only bounded impacts on predictable scheduling. Results show that, compared to Linux and EdgeOS, Edge-RT can both maintain higher throughput and meet significantly more deadlines for systems with bimodal workloads at utilization above 60%, in the presence of malicious tasks, and as the system scales up in clients.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>I. INTRODUCTION</head><p>Embedded and real-time systems are increasingly required to provide features that must interact with the broader environment beyond their local sensors and actuators. Though IoT systems are notable for their Internet connectivity, even systems with strict predictability requirements are now commonly network-connected to leverage distributed sensors and computation. Industry 4.0 focuses on the interlinking of real-time machinery with network-connected aggregation and analytics to better manage factories, and vehicle-to-everything (V2X) communications acknowledge that vehicular decisions are empowered by communicating with and understanding the environment. Enabling this, modern millimeter-wave wireless technologies such as 5G aim for 1ms round-trip times (RTT), with current latencies sub-10ms <ref type="bibr">[1]</ref>, thus providing latency levels that can potentially fit into the decision loops of many real-time systems. In contrast to computation hosted in the cloud's datacenters, whose access imposes the significant jitter and high latency of the WAN, the edge cloud has basestations that are proximate to the embedded systems. For example, high-frequency 5G basestations have an effective range in the 100s of meters, enabling low-latency RTTs. 
Embedded systems can benefit from low-latency access to the edge for (1) offloading computation <ref type="bibr">[2]</ref> from embedded devices to leverage the more capable hardware of the edge for higher-performance or memory-hungry computations, while potentially lowering device power requirements, and (2) sensor aggregation, in which sensor information from many devices is aggregated so decisions can consider a more global state of the physical environment.</p><p>Unfortunately, unlike traditional cloud datacenters that contain hundreds of thousands of cores and provide the illusion of capacity elasticity, edge-clouds have much more constrained resources <ref type="bibr">[3]</ref>, <ref type="bibr">[4]</ref>. Despite this, the multi-tenant model that has driven the prominence of the cloud is desirable, and in some cases necessary, in the edge-cloud. A multi-tenant infrastructure enables the execution of untrusted code, provided by potentially many different tenants that rent capacity on the server. Multi-tenancy is common on current edge deployments: (1) network slicing enables multiple cellular carriers to process packets using Network Functions (NFs) to share the basestation infrastructure <ref type="bibr">[5]</ref>, <ref type="bibr">[6]</ref>, <ref type="bibr">[7]</ref>, <ref type="bibr">[8]</ref>, and (2) edge computation providers, such as Fastly, support tenant executions in isolated WebAssembly sandboxes. For embedded systems to leverage the edge in general deployments (e.g., 5G basestations), there is a strong need for multi-tenancy. Unfortunately, this is difficult due to the relatively constrained resources of the edge cloud.</p><p>To effectively leverage edge-cloud systems, system software must meet the following requirements:</p><p>&#8226; End-to-end deadlines -at its core, the real-time edge must accommodate computations with deadline requirements on the order of a few milliseconds, given the lower-bound of 1ms RTT for 5G. 
The edge must manage the end-to-end latency of requests between when they arrive, and when the reply is transmitted.</p><p>&#8226; Performance -the edge-cloud requires an efficiency capable of driving many 10s of Gb/s of network throughput. High performance makes worst-case provisioning unappealing, motivating a system that seeks to provide predictable latencies with unpredictable workloads.</p><p>&#8226; Density -unlike the traditional cloud that uses VMs and containers to isolate tenants, the smaller computational resource availability at the edge requires abstractions and mechanisms that efficiently enable higher density.</p><p>&#8226; Multi-tenancy -similar to data-center-based clouds, the edge must support multi-tenancy. For example, content delivery networks (CDNs) already find such support necessary (e.g., Fastly and Cloudflare). This requires untrusted code to execute in the edge, with unknown execution times, and potentially faulty or malicious logic.</p><p>&#8226; Dynamic workloads -the client workload changes with the environment. As autonomous vehicles (AVs), drones, and</p><p>&#8226; Network processing -basestations traditionally focus on network processing, including properly accounting for bandwidth, and slicing the network <ref type="bibr">[5]</ref>, <ref type="bibr">[6]</ref>, <ref type="bibr">[7]</ref>, <ref type="bibr">[8]</ref> across carriers. This network processing is done by network functions (NFs) that transform and filter packets, and are often composed into chains that process packets and pass them on to the next NF. Chains of isolated NFs enable multiple applications to process packets, and enable NFs to impose limitations on each other. For example, the first and last NFs can provide firewall-like functionality to limit which packets can be processed and transmitted by NFs in the middle of the chain. 
The core question this paper seeks to answer is: is it possible to practically meet end-to-end deadlines of packets while still maintaining high-throughput and strong isolation in a multi-tenant, edge-cloud for dynamic and dense workloads?</p><p>One tempting answer is to directly adapt existing deadline-driven scheduling systems (e.g., EDF) to edge-clouds. We argue that this is not sufficient because (1) many edge cloud infrastructures do not support preemptive scheduling ( &#167;II-A), (2) to optimize for meeting end-to-end deadlines across chains of computations, normal per-thread prioritization is not a good fit for dynamic workloads ( &#167;III), and (3) high-throughput network systems seek to avoid per-message overheads, which is a bad match for OS abstractions that require locks and Inter-Processor Interrupts (IPIs) for coordination ( &#167;III).</p><p>This paper presents Edge-RT, an OS infrastructure built on the public EdgeOS <ref type="bibr">[9]</ref>, that focuses on packet- or message-based deadline scheduling across chains of computations, while maintaining high performance, density, and isolation between client computations. Edge-RT focuses on practical mechanisms to meet deadlines while minimizing per-message system overheads: (1) it associates deadlines with packets, and threads inherit these message deadlines as packets flow through the computation chains to provide end-to-end, deadline-based scheduling, and (2) it creates mechanisms for coordination and execution that avoid per-message overheads for scheduling, batching, and inter-FWP, inter-core coordination while bounding interference. Contributions. Edge-RT's contributions center on providing deadline-focused computation in a high-throughput, high-density environment. 
The contributions include:</p><p>&#8226; a system design that (1) focuses on per-packet end-to-end deadline scheduling with dynamic, dense workloads, and</p><p>(2) minimizes per-message system overheads.</p><p>&#8226; the mechanisms and abstractions for predictable buffering, inter-core coordination, and scheduling that enable efficient packet processing.</p><p>&#8226; the implementation of Edge-RT and the parameter studies to understand system overhead trade-offs to guide the system's configuration.</p><p>&#8226; the evaluation of Edge-RT for various workloads compared with Linux and EdgeOS.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>II. BACKGROUND A. Linux Kernel Bypass and In-kernel Sandbox</head><p>Kernel-bypass networking. Kernel-bypass networking (e.g., Data Plane Development Kit (DPDK) <ref type="bibr">[10]</ref>) enables user-level code to directly interact with networking devices. DPDK avoids both system call and interrupt overheads by polling. Using interrupt-driven execution instead increases the round-trip time to 110&#181;s, which is close to the overheads of Linux sockets. Despite DPDK's efficiency, it has limitations in the edge cloud: low scalability and non-preemptive packet processing.</p><p>Tenant isolation requires the use of a separate DPDK instance per tenant. This is accomplished using virtual NICs through SR-IOV, or using Open vSwitch (OVS) <ref type="bibr">[11]</ref>. SR-IOV only supports up to 256 virtual NICs, thus cannot scale up to many tenants. OVS can route packets among multiple DPDK applications as a virtual switch, but it, too, limits scalability and density. For applications using DPDK and virtio <ref type="bibr">[12]</ref>, OVS with 256 DPDK applications requires one polling thread per 160K PPS, and each (minimal) application requires 110 MiB of memory.</p><p>Each tenant's DPDK application processes their client's requests sequentially, thus non-preemptively. To understand this effect, we run two types of computations in each DPDK application: a "Light" computation which is quick to process (40&#181;s), and a "Heavy" computation (varying between 10-25ms). This sequential execution model is common in throughput-centric network processing systems such as E2 <ref type="bibr">[13]</ref> and Netbricks <ref type="bibr">[14]</ref>. Figure <ref type="figure">1</ref> compares the tail latency of bypass techniques to native Linux sockets. Linux's execution of clients in separate processes is preemptive, in contrast to the sequential execution of client requests in bypass systems. 
Note that adding preemptive execution into the bypass requires multiple threads or processes, which incurs kernel overheads that bypass systems are designed to avoid. The workload maintains 50% utilization across 48 cores. Four cores are devoted to OVS, and four DPDK tenants execute per core. Each receives four concurrent client requests at 200 packets per second for light computation, and the heavy computation rate is adjusted with its changing weight (x-axis). Linux socket applications use a process per client. The hardware is detailed in &#167;VI.</p><p>This example demonstrates that maintaining kernel bypass for each tenant imposes non-preemptive execution in which the most latency-sensitive tasks suffer from convoy effects from heavy computations. Bypass properties are summarized in Table <ref type="table">I</ref>. The edge cloud must re-imagine such high-throughput networking techniques to intelligently, preemptively schedule clients.</p><p>Extended Berkeley Packet Filter (eBPF). eBPF <ref type="bibr">[15]</ref> is an in-kernel virtual machine which allows injecting application logic into the kernel at run time. XDP <ref type="bibr">[16]</ref> executes an eBPF program to process packets as part of the in-kernel packet reception path. eBPF has a few limitations <ref type="bibr">[15]</ref>, <ref type="bibr">[17]</ref>. (1) An eBPF program is restricted to an execution budget of 1 million instructions. (2) eBPF programs execute non-preemptively (unless their budget runs out), even if executing in a high-priority NIC interrupt. (3) Loading eBPF programs is a privileged operation (e.g., root user) as they have sensitive access to kernel abstractions. These factors (summarized in Table <ref type="table">I</ref>) make eBPF a challenging choice for a multi-tenant execution environment -especially one that requires chains of computations and deadline-aware scheduling.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Thread-based Prioritization and Deadline Scheduling.</head><p>Given our focus on deadline-based scheduling, it is important to understand SCHED_DEADLINE in Linux and its applicability to the edge cloud. SCHED_DEADLINE adds EDF support to Linux, along with constant bandwidth server (CBS) <ref type="bibr">[18]</ref> logic to rate-limit computation. We use a process-per-client so that clients can each be preemptively scheduled with separate deadlines, and use budget reclaiming to handle budget underutilization. Heavy tasks have 5ms executions, 10ms budgets, 100ms deadlines, and receive 10 requests per second. Light tasks reply immediately, have a sufficient budget of 8&#181;s, a deadline of 5ms, and receive 100 requests per second.</p><p>The utilization of the system is low (from 6% up to 50%), yet Figure <ref type="figure">2</ref> demonstrates that with an increasing number of light tasks, tail latency increases significantly. The request workload is not perfectly periodic, which mimics the dynamic workload on the edge. SCHED_DEADLINE relies on per-thread prioritization. Thus, aperiodic workloads will cause significant deviation in desired executions. In contrast, the CFS Linux scheduler is designed to accommodate dynamic workloads and requires no periodicity, budget, or deadline parameters. Correspondingly, this figure demonstrates its ability to maintain tight latency properties for the light tasks.</p><p>We argue that SCHED_DEADLINE under-performs CFS because (1) thread-based prioritization is not a good fit for the edge cloud which aims to meet deadlines for packets/requests; (2) SCHED_DEADLINE is a bad fit for dynamic workloads on the multi-tenant edge in which execution time and request distributions are unknown; (3) the red-black tree based implementation of SCHED_DEADLINE imposes overheads ( &#167;VI-A) with many computations (over 1200). 
Generally, we have observed CFS to have better properties when scheduling high-throughput, dynamic workloads. Table <ref type="table">I</ref> includes a summary of some of the properties of Linux scheduling options.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. EdgeOS Background</head><p>Edge-RT adds real-time capabilities to the publicly available EdgeOS system <ref type="bibr">[9]</ref>. EdgeOS is implemented as a set of user-level components above the Composite <ref type="bibr">[19]</ref> &#181;-kernel. In EdgeOS, each tenant provides chains of computations that process client requests. EdgeOS isolates the computations into Feather Weight Processes (FWPs). Each FWP is single-threaded, and has memory accesses restricted by hardware page-tables. EdgeOS uses DPDK for efficient networking. Two cores in EdgeOS are dedicated to polling/sending packets directly from/to the NIC. Two additional cores provide FWP creation and fast message<ref type="foot">foot_1</ref> movement in an FWP chain, a service called the Memory Movement Accelerator (MMA). All other cores run preemptively-scheduled FWPs. Table <ref type="table">I</ref> summarizes key features of EdgeOS, and also how Edge-RT expands on them.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>III. SYSTEM MODEL</head><p>Edge-RT aims to provide predictable services to clients with dynamic workloads <ref type="bibr">[20]</ref>, <ref type="bibr">[21]</ref> provided by tenant computations. As such, we make three workload assumptions:</p><p>&#8226; Tenant-provided deadlines. Each tenant provides deadlines for its services. For example, an AV manufacturer provides deadlines for chains focusing on planning or control.</p><p>&#8226; Unknown execution profiles. In contrast, it isn't practical for a tenant to understand its computation's execution time due to inaccessible edge hardware. Given this, we assume the edge system has no knowledge of average or worst-case execution times.</p><p>&#8226; Controlled utilization. This research focuses on predictable scheduling, and not on admission control. We assume load balancers sitting before computation nodes control utilization to not exceed capacity in nominal conditions. We do study the impact of malicious or erroneous clients that cause certain computations to over-consume CPU in &#167;VI-D.</p><p>Edge-RT can be extended in the future to provide workload shaping (e.g., by incorporating reservations <ref type="bibr">[22]</ref>). Edge-RT targets the edge, whose limited hardware resources require high-density, multi-tenant workloads. Thus we design it to handle 1000s of stage computations. At this level of load, with fast per-message computation, even the logarithmic overheads of EDF's deadline sorting are significant. Given this, Edge-RT investigates a Constant-Time EDF (CT-EDF) design that achieves constant &#8710; impl by trading a bounded coarsening of deadline resolution. 
Instead of using a typical O(log(N)) balanced tree data structure sorted by deadline, CT-EDF's core data structure is implemented using three techniques: 1) quantize time into a set of fixed-sized quanta, &#8710; window , 2) track a relative timeline of quanta as an array, and 3) have each quantum index hold a linked list of stage computations with deadlines within that quantum into the future. Finding the stage with the nearest deadline requires a scan of the array. Though this is constant (given a fixed-size array), it is expensive. Thus, inspired by fixed priority implementations, we use a trie of bitmaps to index the timeline and provide both constant and efficient lookup of the stage with the earliest deadline. A side-effect of CT-EDF's quantization is that all computations with deadlines that fall into the same quantum are identically prioritized. This effectively means that messages inherit a deadline at most &#8710; window higher than their normal deadline (with reasoning similar to that for &#8710; batch ).</p><p>As overflow conditions aren't a focus of this research, we choose a simple policy for when a stage's thread misses its deadline. Threads that miss their deadline are run at a lower priority than EDF threads, and are arbitrated using round-robin (to prevent starvation among such threads) with a quantum &#8710; timer . Once a stage's thread finishes its computation, its next activation will place it back into the normal EDF scheduling logic.</p><p>Figure <ref type="figure">5</ref> demonstrates a timeline of processing over three stages with the deadline being bounded by &#8710; window , and constant-time EDF contributing a small delay of &#8710; impl .</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D. Analysis of Timing Properties</head><p>Reflecting on the model in &#167;III, Edge-RT aims to reduce per-message system overheads around &#8710; evt and &#8710; impl , while increasing &#8710; interference by a bounded amount. &#8710; evt is limited to the overhead of dequeuing from a wait-free ring buffer, which is on the order of 200 cycles (depending on cache-coherency overheads from access patterns). &#8710; impl is dominated by CT-EDF, which is constant and low (see &#167;VI-A).</p><p>The message passing overheads, &#8710; msg , are due to the MMA, and constitute a straightforward delay in the activation of a receiving stage. To pass messages between stages, the MMA iterates through all inter-stage connections, processes the delayed batching logic, and sends/receives notifications from each core's scheduler logic. This adds a latency to message delivery, thus to the activation of computations. Though we assessed adding intelligence to the MMA to separately prioritize different computation chains to control their latency, the performance of the MMA is sensitive, and such attempts negatively impacted system throughput by complicating the tight loop. We find that in a 48-core system (details in &#167;VI) with almost two thousand computations, the latency for the MMA to cycle through all connections is around 1ms for a system at around 80% load.</p><p>There are two types of priority inversions in Edge-RT that contribute to &#8710; interference . 1) Bounded priority inversion. Event notification is performed periodically using polling ( &#167;IV-B), thus avoiding per-message overheads. However, this means a message transmitted to a stage might be delayed by up to &#8710; timer , causing a bounded inversion. Figure <ref type="figure">5</ref> depicts this overhead as simply contributing to the delay of the next stage's activation. 2) Deadline coarsening. 
Edge-RT's deadline-bounded batching ( &#167;IV-A) results in a message's computation potentially inheriting an earlier deadline, but only if the deadlines are within a sliding window of &#8710; batch . Similarly, the quantization of deadlines to enable O(1) EDF ( &#167;IV-C) "bins" stages together with deadlines within a fixed window of &#8710; window . Figure <ref type="figure">5</ref> depicts this as the deadline for the end-to-end processing of a message being d &#8712; [d i -(&#8710; batch + &#8710; window ), d i ].</p><p>Were the workload predictable, this deadline inaccuracy could be compensated for by practically lowering the system's effective utilization by a factor related to (d i -d)/d i . Recall that both &#8710; batch and &#8710; window are on the order of 0.5ms, which limits their impact.</p><p>This deadline drives all resource management policy as the message processing flows through the system. The MMA copies messages (black squares) between the driver and FWPs, and between isolated FWPs. To wake FWPs, the MMA notifies the scheduler, passing the message's deadline. Notifications are handled with high-frequency time-triggers to avoid IPI overhead. The scheduler is a constant-time EDF that has FWPs inherit the deadline of messages. The MMA batches messages only up until message deadlines diverge by a limit to avoid interference. The scheduler notifies the MMA when additional messages can be transmitted. Finally, the MMA transfers messages to the NIC driver which transmits them onto the network.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>V. IMPLEMENTATION</head><p>Edge-RT extends EdgeOS ( &#167;II-C) which is built as a set of user-level components on the Composite &#181;-kernel <ref type="bibr">[19]</ref>. Edge-RT changes the core policies for message movement, event notification, and scheduling to focus on timely per-client execution on edge clouds. We make no modifications to the Composite kernel.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Edge-RT Core Services</head><p>To reduce per-message overheads, and provide end-to-end packet deadline scheduling, &#167;IV introduces per-packet deadlines, deadline-aware batching, periodic inter-core coordination and constant-time EDF scheduling. Figure <ref type="figure">7</ref> gives an overview of Edge-RT. Edge-RT uses a number of mechanisms and policies to support the implementation of end-to-end packet deadline scheduling. (1) The client chain to process a packet that arrives from the NIC is identified using a flow-table. The flow-table maps IP/port values to the chain, and the relative deadline to use for the packet. Both the chain of computations, and this deadline are provided by the tenant. (2) The MMA is significantly enhanced to track FWP deadlines (described in &#167;IV-A) and perform deadline-aware batching. (3) The MMA interacts with per-core scheduling logic by sending FWP activation events and receiving FWP blocking notifications. (4) The scheduler is replaced with a partitioned, preemptive EDF implementation that uses constant-time logic, and frequent timer activations and polling for event handling and inter-core coordination.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Deadline-aware, MMA-based Message Processing</head><p>Figure <ref type="figure">7</ref> depicts the flow of messages through the system. Packets received by DPDK are queued into a per-chain queue with the deadline retrieved from the flow-table. The MMA transfers this packet to the first FWP's input queue, and sends an activation notification with the deadline of the FWP to the scheduler. As such, all deadline tracking and policy, from flow-table to transmission, are coordinated by system components.</p><p>Fig. <ref type="figure">8</ref>: Flow of data (dashed lines) and control (solid lines). Top: Message transfer from a transmitting FWP to a receiver, annotated with steps. Bottom: A timeline annotated with the various software and steps.</p><p>The MMA is central to providing end-to-end packet scheduling, and Figure <ref type="figure">8</ref> depicts the coordination between MMA and scheduler to transmit a message between stages. The MMA's core role is to transfer messages between transmission and reception rings. When an FWP enqueues a message for transfer ( 1 ), Edge-RT's MMA does the following: (1) track the deadline of each message in shadow message rings that are inaccessible to FWPs; (2) transfer a message, m i,m , from a transmission ring to a reception ring only if the deadlines of the messages already in the reception ring are in [d i,m -&#8710; batch , d i,m ] ( 2 ); (3) when message data is transferred between FWPs, also transfer their deadlines between shadow message rings; (4) when a transfer to an FWP is made, if there were no messages already in its ring, send an FWP thread activation notification to its core's scheduler logic ( 3 ); and (5) receive notifications from each core's scheduler logic for when an FWP blocks awaiting messages, which enables the MMA to transfer the next deadline-limited batch of messages.</p><p>The scheduler will execute an activated FWP according to its EDF policy ( 4 ). 
Note that if a receiving FWP is already active, messages transferred to it are batched, and avoid event notifications.</p><p>A side-effect of the MMA's limitations on batching is that back-pressure is provided naturally. If an FWP is not executed, any messages awaiting transfer into it (with deadlines higher than &#8710; batch + d i,m ) are not transferred out of the upstream FWP. This logic repeats until the receive ring of the NIC driver cannot receive any more packets for the flow, thus dropping packets. Though this outcome isn't ideal, deadline-aware work shedding is necessary in an over-committed system.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Inter-core Coordination</head><p>FWPs on separate cores interact through the MMA and its coordination with core scheduling logic. Conventional implementations of inter-process coordination use shared data-structures and IPIs with the accompanying costs <ref type="bibr">[35]</ref>. In contrast, Edge-RT's design minimizes these potentially per-message overheads by remaking OS coordination primitives based on message passing and frequent periodic polling for events. Edge-RT implements a new user-level scheduler <ref type="bibr">[36]</ref>, <ref type="bibr">[37]</ref> that integrates with the MMA to enable this coordination. This scheduler's key functionalities include the following.</p><p>(1) When switching to a thread (i.e., an FWP), it passes a timeout in cycles which is used to program the LAPIC timer, at which point a timer interrupt will reactivate the scheduler. We program the timer to fire periodically at &#8710; timer to bound notification latency. (2) The scheduler currently uses simple partitioned scheduling. Each core's scheduler code shares two wait-free ring buffers with the MMA used to pass notifications in both directions. When the timer activates the scheduler, it polls for FWP activation notifications from the MMA and handles them. Each notification includes a deadline with which to execute the FWP. As such, this tight coordination between MMA and scheduler enables the thread's inheritance of message deadlines. (3) When an FWP blocks after finishing message processing, the scheduler uses the other ring to send a corresponding notification to the MMA. This enables the MMA to resume transferring messages to the FWP if more were pending, but not sent due to the deadline-aware batching.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D. Constant-Time Earliest Deadline First Scheduling</head><p>A traditional implementation of EDF sorts threads by deadline, often using a balanced binary tree (e.g., a min-heap or red-black tree). Most real-time systems contain a small number of tasks (on the order of 10s or low 100s), so any data-structure overheads are sufficiently small. However, Edge-RT is focused on dense, multi-tenant execution of chains of concurrent computations, thus must be able to run on the order of 1000s of threads (&gt;6000 in &#167;VI). This has two negative effects: (1) with such high density, the frequency of FWP activation/blocking -bounded only by packets per second -forces frequent scheduling decisions, and (2) the number of active threads makes the per-decision sorting overheads have a non-trivial impact.</p><p>CT-EDF enables constant-time and fast EDF decisions. The key idea is that the timeline is quantized into fixed quanta of a specific length, &#8710; window , and is represented as a circular array (i.e., with wrap-around logic) with an entry per quantum. If the array has S entries, it tracks up to &#8710; window &#215; S time units into the future. The current time, C, defines an index into the array, and all times into the future are indexed from there (treating the array as circular). Calculating times relative to the current time means that at each quantum, only C needs to be incremented rather than shifting all entries down. Finding the FWP with the lowest deadline means finding the first quantum that contains an FWP, starting the search at index C. At each quantum, any tasks that miss their deadlines are placed into the lower-priority, round-robin queue. Future enhancements can include more intelligent failure logic. In this paper, we restrict our focus to meeting end-to-end packet deadlines.</p><p>The data-structures for CT-EDF are depicted in Figure <ref type="figure">9</ref>. 
Finding the first quantum into the future involves iterating through potentially the entire array, which is unacceptably expensive. As such we use a set of bitmaps to index the quanta that have runnable FWPs. If a quantum in the timeline has runnable FWPs, the bit corresponding to that quanta is 1. We use the clz instruction to efficiently count the leading zeros within the bitmap. Additionally, to avoid needing to iterate through the words of the bitmap -using clz on each -we define multiple levels (2 in our case) of bitmaps that each index the next level. At each level, a bit is 1 if the corresponding next level bitmap indexes a quantum with runnable FWPs. This multi-level bitmap index forms a radixtrie, and enables constant-time, efficient identification of the earliest-deadline FWPs (with accuracy within &#8710; window ).</p><p>Edge-RT uses a 32-bit level-1 index, that index 32, 32 bit level-2 indices (for a total of 1024 tracked quanta). This enables using only two clz to determine the earliest deadline. Each CT-EDF quantum is 0.5ms for a maximum deadline of 0.5s. The timeline can be extended by adding another index level (to support 16s), or increasing the quantum.</p></div>
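As a rough illustration of CT-EDF's index, the sketch below quantizes deadlines into a 1024-entry circular timeline with a two-level bitmap, matching the 32 &#215; 32 layout described above. It is a simplified assumption-laden sketch, not Edge-RT's implementation: it uses `__builtin_ctz` (find-first-set from the low end) where the paper's clz would be used with the opposite bit ordering, and it stores only per-quantum counts rather than FWP lists.

```c
#include <stdint.h>
#include <string.h>

#define CT_EDF_QUANTA 1024u            /* 32 x 32 two-level bitmap */

struct ct_edf {
    uint32_t l1;                       /* bit i set => l2[i] has some bit set */
    uint32_t l2[32];                   /* bit j of l2[i] => quantum 32*i+j non-empty */
    uint32_t nthds[CT_EDF_QUANTA];     /* runnable FWPs per quantum (stand-in for a list) */
    uint32_t now;                      /* current quantum index C (wraps mod CT_EDF_QUANTA) */
};

/* Insert a runnable FWP at its deadline's quantum: O(1), two bit sets. */
static void ct_edf_add(struct ct_edf *q, uint32_t deadline_quantum)
{
    uint32_t slot = deadline_quantum % CT_EDF_QUANTA;
    q->nthds[slot]++;
    q->l2[slot / 32] |= 1u << (slot % 32);
    q->l1            |= 1u << (slot / 32);
}

/* Find the earliest non-empty quantum at or after C. Pass 0 searches
 * [C, end of timeline]; pass 1 handles wrap-around by searching from 0.
 * Each pass costs at most a few find-first-set operations, so the lookup
 * is constant time regardless of the number of runnable FWPs. */
static int ct_edf_earliest(const struct ct_edf *q)
{
    for (int pass = 0; pass < 2; pass++) {
        uint32_t base = (pass == 0) ? q->now % CT_EDF_QUANTA : 0;
        uint32_t l1   = q->l1 & ~((1u << (base / 32)) - 1);  /* drop l2 words before base */
        while (l1) {
            uint32_t i  = (uint32_t)__builtin_ctz(l1);       /* lowest non-empty l2 word */
            uint32_t l2 = q->l2[i];
            if (i == base / 32)
                l2 &= ~((1u << (base % 32)) - 1);            /* drop quanta before base */
            if (l2)
                return (int)(i * 32 + (uint32_t)__builtin_ctz(l2));
            l1 &= l1 - 1;                                    /* word only held past quanta */
        }
    }
    return -1;                                               /* no runnable FWPs */
}
```

The second-level masking mirrors the wrap-around cost noted in the evaluation: when C sits partway through an index word, the bits before it must be excluded before the find-first-set is meaningful.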
<div xmlns="http://www.tei-c.org/ns/1.0"><head>VI. EVALUATION</head><p>This section evaluates Edge-RT relative to EdgeOS and Linux. All experiments use a PowerEdge R740 server with two-socket Intel(R) Xeon(R) Platinum 8160 CPUs @ 2.10GHz, each with 24 cores. We use an Intel X710 10GbE SFP+ NIC for networking, and a similarly equipped client drives workload generation. The DPDK and OVS versions for the results in &#167;II-A are 19.11 LTS and 2.13, respectively. Linux results use kernel version 5.4. We don't use the PREEMPT_RT patches, as tight interrupt response times (on the order of a &#181;s) are not the focus. Client machines use a modified memblaster to generate workloads and measure round-trip latencies. Edge-RT uses DPDK version 17.11.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. CT-EDF Overheads</head><p>CT-EDF uses a constant-time data structure to find the task with the earliest deadline ( &#167;IV-C). Figure <ref type="figure">10</ref> shows the overhead of insertion and removal for scheduling data structures. We compare the following. (1) A traditional O(1) fixed-priority, round-robin (O(1) fixed priority) structure which includes an array of linked lists that track threads within a priority. It uses a two-level bitmap index, queried with the clz instruction, that tracks priorities with active threads.</p><p>(2) CT-EDF is detailed in &#167;V-D. (3) A traditional EDF implementation (Binary tree EDF) using a red-black tree to sort threads by deadline. Though not shown here, we also evaluated an in-place heap and found it to have higher overhead. Threads with uniformly random deadlines are added to the queue (i.e., they are woken), and the thread with the highest priority (earliest deadline) is removed from the queue (i.e., it blocks). Discussion. With an increasing number of threads, the traditional EDF with O(log(N)) complexity imposes increasing overheads, while the other two approaches demonstrate near-constant overheads. The CT-EDF policy demonstrates higher overhead than fixed priority when inserting, despite using a similar index. This is due to the wrap-around logic in CT-EDF: 1) if the current time is half-way into the timeline, the clz operations must be performed only on indices after a circular shift has been performed; and 2) at the second-level index node, it is possible that not all bits should be queried (if the current time is indexed by this node).</p></div><div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Linux and Edge-RT Utilization Sensitivity</head><p>&#167;II argues that system organizations and policies that target high throughput, such as kernel-bypass and eBPF, or predictable execution, such as SCHED_DEADLINE, have trouble scaling to multi-tenant, deadline-driven environments. Table <ref type="table">I</ref> summarizes these arguments. 
Figures <ref type="figure">2</ref> and <ref type="figure">1</ref> demonstrate that Linux's CFS scheduling and normal sockets perform best with increasing tenants and uncertain workloads. Therefore, we configure Linux using sockets for networking and CFS for scheduling. To provide the same level of isolation as Edge-RT, we use separate processes for each stage in Linux. For chain processing on Linux, pipes slightly outperform socket variants for event notification latency; thus, we use pipes for IPC between processes in chains. To avoid data-copying costs, we pass only 8-byte messages to provide event notification, while larger messages use shared memory.</p><p>In this evaluation, we configure both Linux and Edge-RT to use 48 cores, with Edge-RT specializing four cores ( &#167;II-C). Thus Edge-RT has a practical maximum utilization of around 91.6%. Figure <ref type="figure">11</ref> depicts a bimodal workload under varying system utilization. Utilization is reported as the fraction of the 48 cores consumed by application computation. Edge-RT is configured with &#8710;batch = 8ms and &#8710;timer = 250&#181;s, guided by Figure <ref type="figure">6a</ref> and Figure <ref type="figure">6b</ref>. On the client side, we use the same bimodal workload from Figure <ref type="figure">6a</ref>, with light tasks that execute for 40&#181;s and heavy tasks that execute for 5ms, with deadlines of 10ms and 500ms, respectively. As a result, light tasks comprise 80% of the total execution of the workload. We control the utilization of the workload by adjusting the sending rate (proportionately for light and heavy) on the client side. We use 480 clients/chains with 1920 FWPs.</p><p>We also compare against Linux using SCHED_DEADLINE. To admit enough tasks onto the system to be able to use 100% utilization, we set the CBS budgets equal to the average execution time for heavy and light tasks. 
We omit these results, as light tasks meet only 0.25% of their deadlines at 60% utilization, while even heavy tasks meet only 57% of deadlines at 60% utilization, decreasing to 32% at 100% utilization. This provides further evidence for the claims in &#167;II-A that SCHED_DEADLINE is not a good fit for systems with dynamic workloads that are not periodic. Discussion. Figure <ref type="figure">11a</ref> shows that high-throughput systems have difficulty meeting deadlines. As the utilization increases from 60% to 100%, the percent of deadlines met by Linux drops from 95.9% to 15.9%. Additionally, Linux starts dropping packets on light flows when utilization grows over 90%. In the case of heavy tasks in Figure <ref type="figure">11b</ref>, Linux does not drop packets, and meets deadlines even under 100% utilization. Linux's best-effort focus on fairness favors heavy workloads despite CFS's heuristic prioritization of I/O-bound workloads.</p><p>EdgeOS meets most of the deadlines when the system is at relatively low utilization (less than 80%). Since EdgeOS is not deadline-aware and uses a fixed-priority round-robin scheduler, it misses most of the deadlines when utilization grows over 80% and barely meets any deadlines at 100% utilization. Recall that the four specialized cores decrease the effective maximum utilization to around 91%. However, EdgeOS drops fewer packets for heavy tasks compared to Edge-RT at 100% utilization, due to EdgeOS's focus on throughput. At 90% utilization, Edge-RT meets over 95% of deadlines on both light and heavy computations. Even at 100% utilization, Edge-RT meets 92.13% of deadlines on light tasks (compared to Linux's 15.95% and EdgeOS's 0%). In contrast, heavy tasks cause backpressure through chains, and Edge-RT drops 25.6% of them while maintaining progress on other tasks. 
This demonstrates Edge-RT's focus on end-to-end, deadline-centric scheduling, while still maintaining high throughput.</p><p>We don't compare against kernel-bypass techniques given the challenges with non-preemptive client execution discussed in &#167;II-A. The round-trip latency for Linux sockets is only 20&#181;s (15%) slower than DPDK using interrupt mode (which is necessary with more tenants than cores), so we believe that Linux is a competitive environment for these comparisons.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Scalability</head><p>Edge-RT is designed to use constant-overhead mechanisms where possible. CT-EDF and periodic activation processing both enable schedulers to spend more time running FWPs regardless of the number of FWPs. Thus, we want to evaluate the performance of Edge-RT when scheduling thousands of FWPs. Here we evaluate Edge-RT's and Linux's ability to control latency while increasing the number of processes. By default, we use the same system setup as the previous bimodal tests. As we scale up the number of tasks, we proportionately adjust the sending rate for light and heavy tasks to keep the system utilization at 50%, to avoid either system dropping packets. Thus, the 99th-percentile latency depicts the system's ability to bound latency. Discussion. As shown in Figure <ref type="figure">12</ref>, with 1000 computations, Linux and Edge-RT have similar tail latencies around 5ms. Edge-RT maintains a flat tail latency between 4 and 5ms up to 5000 tasks. In contrast, the tail latency of light tasks on Linux grows as the number of tasks increases. At 5000 tasks, the tail latency of light tasks in Linux is 18ms (recall: the deadline is 10ms). The tail latency of heavy tasks for both Linux and Edge-RT increases as the number of tasks increases in Figure <ref type="figure">13</ref>, but both remain below their deadline of 500ms. Linux slightly outperforms Edge-RT on heavy tasks, because CFS's fairness ensures constant progress for heavy tasks over light tasks. These results demonstrate that Edge-RT can maintain its focus on end-to-end deadline scheduling despite significantly scaling up the number of clients.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D. Unpredictable Workloads</head><p>The edge must be able to handle workloads with unpredictable execution. Tenants can provide code that attempts to monopolize CPU resources, and clients can provide inputs that cause errant behavior. To evaluate the impact of execution overruns, we maintain a consistent workload, with an increasing number of malicious tasks. We use 48 cores and 768 clients with 768 FWPs. System utilization is 80% without malicious tasks. The well-behaved tasks execute for 5ms, and clients send a request every 100ms with a 500ms deadline. Chain lengths are set to one (K = 1) to emphasize that malicious tasks can cause significant interference even without back-pressure from chains of computations. Malicious tasks are simply infinite loops that monopolize the CPU. Figure <ref type="figure">14</ref> shows the behavior of well-behaved tasks in the presence of CPU-hogging malicious tasks. In these results, we filter out the deadlines missed by the malicious tasks themselves. Discussion. With even a small number of malicious tasks, EdgeOS's round-robin scheduling policy, which focuses on progress and fairness, demonstrates an inability to adequately isolate the deadline-sensitive tasks from those that are simply CPU hogs. At only two malicious tasks per core, on average, nearly all deadlines are missed.</p><p>CFS in Linux fares better with this workload, because it attempts to prioritize newly arrived (I/O-bound) threads over malicious tasks that are CPU-bound. As such, it is able to meet 63% of the deadlines with two malicious tasks present, but quickly degrades to 39% with four.</p><p>Once a deadline is missed in Edge-RT, the FWP is deprioritized and moved to a best-effort, round-robin low-priority queue. This simple policy is relatively effective at constraining malicious computation by deprioritizing it. 
This demonstrates the unintuitive benefit of an end-to-end, deadline-centric policy for the edge: deadlines provide semantic information about expected execution behavior. The edge can use them to constrain FWPs' negative behaviors when deadlines are overrun. Though stronger assumptions about FWP execution times could further constrain their impact (for example, by enabling rate-limiting), such assumptions are challenging in high-throughput, dynamic environments.</p></div>
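The deprioritization policy above amounts to a one-way demotion, which can be sketched in a few lines. This is an illustrative sketch only; the names (`demote_on_miss`, the two-queue enum) are assumptions, not Edge-RT's code.

```c
#include <stdint.h>

enum queue { Q_EDF, Q_BEST_EFFORT };

struct fwp {
    uint64_t deadline;   /* absolute deadline, in quanta */
    enum queue q;        /* which runqueue the FWP currently lives in */
};

/* Hypothetical miss handler run at each quantum boundary: an FWP that has
 * overrun its deadline (e.g., a CPU-hogging malicious task) is demoted to
 * the best-effort, round-robin queue, so it can no longer displace
 * deadline-sensitive FWPs in the EDF queue. */
static void demote_on_miss(struct fwp *f, uint64_t now)
{
    if (f->q == Q_EDF && now >= f->deadline)
        f->q = Q_BEST_EFFORT;
}
```

The key design property is that demotion requires no execution-time assumptions: the deadline itself is the only budget the system needs.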
<div xmlns="http://www.tei-c.org/ns/1.0"><head>VII. RELATED WORK</head><p>Edge applications. Offloading computation from devices to infrastructure has a long history. For example, it has been shown to prolong battery life <ref type="bibr">[38]</ref>, <ref type="bibr">[2]</ref>, and even to service mission-critical tasks <ref type="bibr">[39]</ref> with degraded local computation. It enables global coordination to gain benefits beyond a system's local sensors <ref type="bibr">[40]</ref>. Edge-RT assumes an unreliable network, but one that is low-latency enough to motivate deadline-sensitive computation. It attempts to enable these benefits in a high-throughput, high-density, and real-time infrastructure. This assumption does not prevent Edge-RT from including fallback logic in the future to detect and retransmit dropped packets. Shared hardware resources. Edge-RT focuses mainly on sharing NIC hardware resources. FWPs share no memory, as the MMA arbitrates copying messages between them. We do not focus on shared resources like caches <ref type="bibr">[41]</ref>, <ref type="bibr">[42]</ref>, <ref type="bibr">[43]</ref>, <ref type="bibr">[44]</ref> and memory buses <ref type="bibr">[45]</ref>; approaches that manage the sharing of such resources are complementary to Edge-RT. Data-age in cause-effect chains. Chains of computation are a common abstraction in embedded systems (e.g., robotics systems such as ROS) and represent stages of computation from sensing to actuation. They are often called cause-effect chains, and real-time systems are concerned with controlling and constraining the data-age (the latency of chain computation from sensing to actuation).</p><p>Large bodies of work have focused on creating policies and analysis to provide predictable, bounded data-age across chains <ref type="bibr">[46]</ref>, <ref type="bibr">[47]</ref>, <ref type="bibr">[48]</ref>, <ref type="bibr">[49]</ref>, <ref type="bibr">[50]</ref>. 
Recent work <ref type="bibr">[51]</ref> focuses on end-to-end response-time analysis of event chains. This work is often concerned with strong predictability and meeting all deadlines, and does so by assuming knowledge of execution properties such as fixed rates and WCETs.</p><p>Edge-RT is focused on controlled latency in a high-throughput environment. Instead of the traditional approach that often separately prioritizes chain computations, controls their periodicity, and orchestrates dependencies, Edge-RT focuses on end-to-end deadline scheduling of messages, rather than of computations.</p><p>Blass et al. <ref type="bibr">[52]</ref> take a more dynamic approach by monitoring latencies and dependencies in chains, and dynamically updating priorities, reservations, and affinities to control end-to-end latency. Edge-RT doesn't attempt to optimize such parameters to control latencies; instead, it explicitly maintains end-to-end deadline scheduling mechanisms and policies. Network Function Virtualization. The edge must execute not only multi-tenant, offloaded computation, but also Network Functions (NFs) for slicing, transformation, and shaping. Network Function Virtualization (NFV) <ref type="bibr">[53]</ref> provides a software environment for the isolated execution of NFs. Edge-RT is based on EdgeOS <ref type="bibr">[9]</ref>, which supports high-throughput, high-density NFV, but additionally aims to provide strong latency properties. Similarly, systems have focused on soft-real-time NFV computation <ref type="bibr">[54]</ref>, <ref type="bibr">[55]</ref>, but they neither target high-throughput systems nor manage resources based on average (rather than worst-case) execution time. Rethinking OS-level batching and scheduling. The trade-off in batching between throughput and latency is well known, and has enabled efficient core specialization <ref type="bibr">[56]</ref> and a successful decoupling of OS control and datapaths <ref type="bibr">[27]</ref>. 
It has also been an integral technique in shaping back-pressure <ref type="bibr">[57]</ref> and controlling cross-core interference <ref type="bibr">[35]</ref> in real-time systems. It has been paired with fine-grained inter-core coordination to control microsecond-level latencies in the presence of head-of-line blocking <ref type="bibr">[58]</ref>. User-defined scheduling <ref type="bibr">[59]</ref>, <ref type="bibr">[30]</ref> is another approach that aims to reduce latency by deploying a user-defined scheduling policy. Edge-RT is inspired by these approaches, and uses batching to achieve high throughput, but constrains the impact that large batches of processing can have on the latency of other system computations.</p><p>Summary. Edge-RT's contributions are in a less-investigated direction for real-time systems: high-throughput, networked systems that aim to meet deadlines despite high density and multi-tenancy. The techniques it must use to achieve all of these goals aim to practically meet end-to-end deadlines while making efficient use of limited resources.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>VIII. CONCLUSIONS</head><p>In this paper we introduce Edge-RT. It represents a significant step in the direction of novel real-time policies and mechanisms for high-throughput, high-density, multi-tenant systems. Edge-RT introduces a practical policy to control latency over a chain of isolated computations: deadlines are associated with packets/messages, and computations inherit them, enabling optimization toward controlled end-to-end latency. We investigate policies that increase throughput while having limited impact on latency. Results demonstrate that the system can effectively handle significantly higher load while meeting deadlines, as the number of clients scales up, and in the presence of malicious tasks, thus enabling higher edge density and decreasing the amount of computational resources required at each base station. We believe this marks a significant advance toward principled latency management for the multi-tenant edge.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_0"><p>This material is based upon work supported by the National Science Foundation under Grants No. CNS 1815690 and CPS 1837382, through SRC under grants GRC task 2911.001 and SRC JUMP task 2779.030, and ONR N000142212084. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of these agencies.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_1"><p>We'll use the terms messages and packets synonymously.</p></note>
		</body>
		</text>
</TEI>
