<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Enabling Resilience in Virtualized RANs with Atlas</title></titleStmt>
			<publicationStmt>
				<publisher>ACM</publisher>
				<date>10/02/2023</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10467036</idno>
					<idno type="doi">10.1145/3570361.3613276</idno>
					
					<author>Jiarong Xing</author><author>Junzhi Gong</author><author>Xenofon Foukas</author><author>Anuj Kalia</author><author>Daehyeok Kim</author><author>Manikanta Kotaru</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Virtualized radio access networks (vRANs), which allow running RAN processing on commodity servers instead of proprietary hardware, are gaining adoption in cellular networks. Two properties of the vRAN's "Distributed Unit (DU)" that implements the lower RAN layers-its real-time deadlines and its black-box nature-make it challenging to provide resilience features such as upgrades and failover without long service disruptions. These properties preclude the use of existing resilience techniques like virtual machine migration or state replication that are used for typical workloads. This paper presents Atlas, the first system that provides resilience for the DU. The central insight in Atlas is to repurpose existing cellular mechanisms for wireless resilience, namely handovers and cell reselection, to provide software resilience for the DU. For planned resilience events like upgrades, we design a novel technique that simultaneously serves cells from both the old and new DUs via the same radio, and uses handovers between these cells to migrate user devices. For unplanned failures, we identify deficiencies in existing RAN protocols that disrupt cell reselection after DU failure, and show how we can eliminate these disruptions using a middlebox between the DU and higher layers. Our evaluation with a state-of-the-art 5G vRAN testbed shows that Atlas achieves minimal disruption to cellular connectivity during resilience events, while incurring low overhead.
CCS CONCEPTS• Networks → Mobile networks; Wireless access points, base stations and infrastructure; • Computer systems organization → Real-time systems; Reliability; Availability.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">INTRODUCTION</head><p>The emergence of vRANs <ref type="bibr">[3,</ref><ref type="bibr">13,</ref><ref type="bibr">49]</ref> brings new promises and challenges for cellular network operators. On the one hand, softwarization of the RAN promises benefits like increased feature velocity, a larger vendor ecosystem, security, and reduced CapEx/OpEx. On the other hand, it opens up a variety of unexplored problems that naturally arise from running the vRAN's real-time and often black-box software on commodity Linux servers.</p><p>This paper focuses on one such problem: the lack of resilience in today's vRANs. Cellular networks provide a critical infrastructure that is relied upon for emergency services and mission-critical applications. However, there are no methods for vRAN maintenance or upgrades, or to recover from crashes in vRAN software or hardware, without causing significant service disruption lasting many seconds to minutes ( &#167;7). Operators therefore rely on planned late-night downtime windows for maintenance, and accept outages during crashes. This limits the benefits of vRANs, e.g., since version upgrades or security patches are difficult to apply. This paper presents Atlas, the first system to provide resilience for the vRAN's real-time component, called the "Distributed Unit" (DU). The DU is deployed in far edge datacenters close to the radios, and consists of the lower layers of the vRAN stack, including the Physical (PHY), Media Access Control (MAC), and Radio Link Control (RLC) layers. Atlas provides a new primitive called "DU migration", which allows moving the PHY-RLC processing for a cell from one DU to another DU on a different server. With Atlas' proactive migration, operators can upgrade or service a DU without downtime, seamlessly moving each user device (also called user equipment, or UE) attached to the source DU to another DU. 
Atlas' reactive migration handles abrupt hardware and software failures with sub-second disruption.</p><p>There are two main challenges in making the DU resilient. First, the DU has strict sub-millisecond real-time latency deadlines, which precludes the use of general-purpose resilience approaches like virtual machine or container migration that require pausing the running software for hundreds of milliseconds. Second, since the DU requires highly complex and optimized software written by domain experts, it is often a black box where the source code is proprietary or unavailable. For example, Radisys' minimal open-source DU that implements only a small subset of DU functionality contains 180K source lines of code <ref type="bibr">[17]</ref>. This precludes resilience techniques used for higher layers of the cellular stack that significantly modify the source code to externalize all computation states into a reliable state store <ref type="bibr">[29,</ref><ref type="bibr">40]</ref>.</p><p>The observation behind Atlas is that we can use cellular networks' existing mechanisms for wireless-level resilience to provide software-level resilience. For example, during proactive DU migration, we need to move UE processing from the source DU to the destination DU; this happens naturally during UE handovers, where UEs disconnect from one DU and reconnect to another. Similarly, during reactive DU migration, we need to re-create UE sessions at the destination DU; this happens naturally during a process called "cell reselection", where UEs lose coverage from their original DU (e.g., after turning a sharp corner while walking), and then quickly discover and reconnect to a different DU.</p><p>Using the existing resilience mechanisms in Atlas presents two challenges. First, for proactive migration, we need a way to serve two cells, one each for the source and destination DUs, from the same radio unit (RU) during a transient period. 
This is challenging because the entire vRAN stack is designed assuming a one-to-one RU-DU association. We investigate intuitive RU-sharing approaches, e.g., via splitting only along the time and frequency dimensions, and discuss their infeasibility. Instead, we design a novel lightweight RU-sharing mechanism that also splits along the spatial dimension by using the RU's multiple antenna ports, and exploits special properties of cellular control channel signals. We implement the RU sharing by introducing a vendor-agnostic fronthaul network function for multiplexing the fronthaul traffic of the DUs, and by introducing radio resource scheduling logic for mitigating interference at the MAC layer.</p><p>Second, for reactive migration, we identify 5G RAN protocol limitations, caused by their design being agnostic to DU failures. We show how these limitations can delay UEs' reconnection to a backup DU after the primary DU fails, and cause the UEs to subsequently disconnect. In lieu of the standards' protocol fixes, we present a clean and lightweight midhaul network function that brings DU failure awareness to the "Centralized Unit" (CU) above the DU, enabling fast failover.</p><p>To demonstrate the generality and portability of our design, we implement Atlas' fronthaul NF for three different targets (an eBPF hook in Intel FlexRAN PHY, DPDK-based software middlebox, and Tofino switching ASICs), and we realize the interference mitigation logic at the MAC layer by leveraging parameters exposed by the open O-RAN E2 interface. We also implement the midhaul NF in C++ as a lightweight process running on the CU server. Our implementation requires minimal modifications to the vRAN software, which is well aligned with 3GPP and O-RAN specifications. We evaluate Atlas in a production-grade 5G vRAN testbed with commercial UEs. Our experiments show that Atlas supports proactive migration with zero UE downtime, and reactive migration with as little as 700 ms of downtime. 
In both cases, UEs regain their full performance after migration finishes. We show how Atlas can be implemented with near-zero compute and latency overhead.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">MOTIVATION AND BACKGROUND</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">A primer on virtualized RANs</head><p>RAN components. Figure <ref type="figure">1</ref> illustrates a typical 5G vRAN deployment, with its three main components: the Radio Unit (RU), the Distributed Unit (DU), and the Centralized Unit (CU). The RU is typically implemented in fixed-function hardware (e.g., ASICs or FPGAs), while the DU and CU are software applications running on commodity servers. Each RU constitutes a "cell", and connects to the DU via a fronthaul network. For this work, and without loss of generality, we use the terms RU and cell interchangeably to indicate a one-to-one mapping of a cell to the radio hardware.</p><p>The "virtualized" DU and CU serve the various RAN protocol layers (see below). The DU has strict real-time latency requirements and therefore runs close to the RUs. The CU is delay-tolerant and can therefore run farther away from the RU. A DU server typically hosts several (e.g., ten) cells, and a CU server hosts hundreds of DUs. A RAN Intelligent Controller (RIC) facilitates the programmability of the RAN.</p><p>RAN protocol layers. RAN functionality is divided into several layers, each responsible for a distinct set of control and/or data plane operations (Figure <ref type="figure">1</ref>). The DU implements the real-time layers, e.g., the PHY layer for wireless signal processing, and the MAC layer for scheduling radio resources among UEs. The CU implements the more delay-tolerant layers, such as the Radio Resource Control (RRC) layer that manages radio-related UE operations, including handovers that move the UE from one cell to another.</p><p>Open RAN interfaces. The RAN components communicate with each other via a set of standardized interfaces, designed for interoperability by consortiums like 3GPP and the O-RAN Alliance <ref type="bibr">[9]</ref>. 
Three DU interfaces are relevant to this work: the xRAN fronthaul interface with the RU, the F1-C interface with the CU's control plane, and the E2 interface with the RIC. The xRAN interface that carries digitized radio signals (IQ samples) has real-time latency requirements.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Need for resilience in vRANs</head><p>The cellular network is a crucial piece of infrastructure that needs to be highly available to support critical applications, such as public safety and factory automation. Resilience has been studied extensively for the cellular network's non-real-time components, i.e., the mobile core and CU (e.g., <ref type="bibr">[29,</ref><ref type="bibr">40,</ref><ref type="bibr">44,</ref><ref type="bibr">46,</ref><ref type="bibr">47]</ref>). Supporting resilience for the DU, however, poses unique challenges due to its strict real-time latency requirements and black-box nature, with no solution existing at this time. We divide DU resilience events into two types:</p><p>Planned events. A key promise of vRANs is the ease of (1) rolling out updates, such as new RAN features and OS-/security patches; and (2) hardware servicing via planned maintenance. These happen frequently, e.g., according to AT&amp;T, some subsets of their RANs are upgraded daily, but with pre-planned downtime <ref type="bibr">[45,</ref><ref type="bibr">51]</ref>. However, due to the lack of resilience mechanisms for vRAN DUs, such updates today create significant downtime for users, lasting several seconds to minutes (see &#167;7).</p><p>Unplanned events. The vRAN must quickly recover from unplanned failures like software crashes or hardware malfunctions. Mean time of server hardware failures can range between 10 and 60 days <ref type="bibr">[7,</ref><ref type="bibr">15]</ref>, with repairs taking several hours <ref type="bibr">[7]</ref>. If a vRAN DU server fails, UEs attached to the DU should quickly be migrated to another DU server.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3">Requirements for resilient vRANs</head><p>Ideally, a system that supports resilience for vRANs should provide the following three properties:</p><p>Minimal downtime during resilience events. To allow operators to perform vRAN upgrades and security patches without concern for their impact on user connectivity, the system must incur zero downtime during planned resilience events. For failures, it must keep downtime comparable to wireless-related service interruptions that users naturally experience, e.g., due to handover failures.</p><p>Minimize overhead and cost. Under normal operations, it must add near-zero overhead. vRANs operate with tight CPU and latency budgets, so any overhead must be carefully considered. Cost should also be kept to a minimum, considering the large scale of DU deployments (e.g., Dish and Rakuten report more than 5K DU sites in their networks <ref type="bibr">[38,</ref><ref type="bibr">48]</ref>).</p><p>Vendor agnostic. It must be compatible with any standards-compliant vRAN implementation without modifications, for two reasons. First, commercial-grade vRAN software is typically proprietary, written by domain experts, and has extreme complexity. Recreating it from scratch is not feasible. For example, even the simplified open-source reference implementation of OpenAirInterface <ref type="bibr">[4]</ref> has more than 600K lines of code. Second, there are numerous vRAN implementations for different hardware architectures (e.g., GPUs and SoCs), so a vendor-specific approach has limited applicability.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.4">Limitations of existing RAN resilience approaches</head><p>We now discuss how existing cellular resilience approaches are insufficient for DUs. While no existing techniques target the DU directly, we discuss their natural DU adaptations.</p><p>Offloading UEs to neighbor cells. In traditional RANs, each cell site has a separate DU appliance. This contains the impact of DU upgrades or failures to one cell site, since these events can often be handled by offloading affected UEs to neighboring cells <ref type="bibr">[45,</ref><ref type="bibr">51,</ref><ref type="bibr">52]</ref>. For planned events, the network can do this gracefully using handovers. For unplanned events, UEs can re-attach to a neighboring cell. Adapting neighbor cell offload for DU resilience has several limitations. First, it is feasible only with dense cell coverage, which is not always available, e.g., in sparse rural and enterprise deployments. Second, for unplanned failures, UEs fail to reconnect reliably to the neighboring cell due to the failure-agnostic nature of existing 5G protocols ( &#167;5.2). Third, ensuring good overlapping coverage requires maintenance windows, as well as complex cell planning and antenna management to adjust the power and tilt of RU antennas <ref type="bibr">[51]</ref> during maintenance. Fourth, UEs experience reduced throughput due to worse signal quality from the neighbor, e.g., up to 25% worse TCP throughput in our indoor testbed ( &#167;7).</p><p>In addition to the above issues, due to the small size of vRAN datacenters, it is not always possible to even place neighboring cells' DUs on different servers. Unlike traditional RANs, one DU server may host multiple cell sites, and a server failure may bring down all sites providing coverage overlap. Imposing physical world constraints on DU placement also goes against the principles of virtualization, which treats every server as a homogeneous resource. 
Such constraints will limit the benefits of virtualization, such as load-aware dynamic RU-to-server bin packing (called BBU pooling <ref type="bibr">[42]</ref>), an important vRAN feature for efficiency.</p><p>Fault-tolerant state store. One technique to build resilient network functions is to externalize their state in a replicated, fault-tolerant state store. For example, ECHO <ref type="bibr">[40]</ref> uses a key-value store to replicate the state of an LTE Evolved Packet Core (EPC), which is delay-tolerant. This approach does not apply to the DU for two reasons. First, these techniques require extensive modifications to the entire DU source code, making it vendor-specific. Second, the latency of fault-tolerant key-value stores (roughly 100 &#181;s tail latency) takes away a significant portion of the DU network functions' TTI-sized (500 &#181;s) processing budgets. For example, the MAC layer alone may generate tens of state updates in every TTI.</p><p>Creating a new DU instance. One can consider using a general-purpose resilience mechanism provided by a cluster orchestration system such as Kubernetes <ref type="bibr">[14]</ref> for DU resilience. In this approach, the orchestration system handles updates and failures by creating a new DU instance (e.g., Kubernetes brings up a new DU pod that runs the DU software stack) and starts routing traffic to the new instance. Such approaches are not designed for real-time systems, e.g., in our experiments the long startup time of Kubernetes results in UE disconnection lasting over 50 seconds (&#167;7).</p><p>Stateless migration to a hot-standby DU. Slingshot <ref type="bibr">[32]</ref> shows how the DU's PHY processing can be seamlessly migrated to another PHY without transferring PHY state between the two. This works because the PHY has no long-lived state that affects UEs. In contrast, the DU layers above the PHY (i.e., MAC and RLC) maintain long-lived UE state (e.g., RLC acknowledgement tracking). 
Our experiments in &#167;7 show that simply bringing up a hot-standby DU disconnects attached UEs for over three seconds.</p></div>
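<div xmlns="http://www.tei-c.org/ns/1.0"><p>The latency argument against a fault-tolerant state store can be made concrete with a back-of-the-envelope calculation. The sketch below uses only the two figures quoted above (roughly 100 &#181;s tail latency per store operation and a 500 &#181;s TTI-sized budget). The function name, and the assumption that state updates are written synchronously and one after another, are hypothetical and purely illustrative.</p>
<p><code lang="python"><![CDATA[
```python
# Figures quoted in the text; everything else is an illustrative assumption.
TTI_BUDGET_US = 500          # TTI-sized processing budget (microseconds)
STORE_TAIL_LATENCY_US = 100  # approximate tail latency of one store operation


def budget_fraction(num_sync_updates: int) -> float:
    """Fraction of one TTI budget spent waiting on the state store.

    Assumes (hypothetically) that every update is written synchronously
    and serially, each one paying the full tail latency.
    """
    if num_sync_updates < 0:
        raise ValueError("number of updates must be non-negative")
    return num_sync_updates * STORE_TAIL_LATENCY_US / TTI_BUDGET_US


if __name__ == "__main__":
    for n in (1, 5, 10):
        print(f"{n:2d} updates -> {budget_fraction(n):.0%} of the TTI budget")
```
]]></code></p>
<p>Under these assumptions, a single synchronous update consumes a fifth of the TTI budget, and ten serialized updates take twice the budget. Batching or asynchronous replication would lower these figures; the sketch does not model them.</p></div>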
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">ATLAS OVERVIEW</head><p>Atlas provides a new primitive that we call "DU migration". A DU typically serves multiple cells, each connected to a separate RU. Since Atlas migrates all cells of a DU between servers during resilience events, we use the terms cell and DU interchangeably.</p><p>To provide resilience, Atlas creates an illusion of two colocated cells providing the same coverage and signal quality by exposing a single RU to two DUs-the source and destination DUs for migration. Building on this illusion, Atlas uses standard 3GPP mechanisms to move UEs between cells: 1. Proactive migration &#8764; handovers: We repurpose the handover mechanism <ref type="bibr">[5]</ref> to seamlessly migrate UEs from the source cell to the destination cell. 2. Reactive migration &#8764; recovery from handover failures: We repurpose the cell reselection mechanism <ref type="bibr">[6]</ref>, originally designed to recover from handover failures, to quickly reattach UEs to the destination cell.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Key ideas</head><p>3.1.1 Efficient and interference-free RU sharing.</p><p>For proactive migration, the main challenge is moving UEs to the destination cell without disrupting their connections. Given the limited radio resources and the inherent 3GPP handover protocol delays, not all UEs can be migrated simultaneously. This limitation of handovers introduces a transient period during which some UEs may have moved to the destination DU, while others still remain connected to the source DU. The duration of this transient period depends on the vendor-specific handover procedure implementations (e.g., the time gap between two consecutive handovers, the number of simultaneous handovers that can be triggered, etc.). As a result, to guarantee zero downtime, the source and destination DUs must co-exist during the transient period, which presents two challenges.</p><p>(1) Spectrum sharing. Allowing both DUs to transmit simultaneously causes radio interference, which significantly degrades performance. (2) RU hardware sharing. Today's commodity RUs support only one DU. Provisioning a spare RU for redundancy has a prohibitive cost. We solve these problems with the following two contributions. First, we develop a method for sharing the wireless spectrum between the two DUs during migration with minimal interference. For user data transfers, we observe that we can configure the MAC schedulers to multiplex user data for the two DUs in the time domain by scheduling them in different TTIs. For the time-sensitive downlink control channel signals that cannot be similarly re-scheduled, we exploit their inherent robustness to interference and multiplex them in the spatial antenna domain.</p><p>Second, we develop a method for two DUs to share an RU's fronthaul link, giving the illusion that each DU communicates with a separate physical RU. 
We implement Atlas' RU sharing logic in a "logical" fronthaul network function (NF) that manipulates and forwards fronthaul packets to/from the RU. We provide three alternative fronthaul NF implementations, tailored to deployments with different hardware, compute, and source-code modification requirements.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.2">Fast UE reconnection after failure.</head><p>How quickly can a UE reconnect to a backup DU after the primary DU's failure? We make two contributions to answer this question. First, we perform a detailed measurement of the timing of UE actions during cell reselection. We show that when adapted well for DU failovers, this can provide a downtime comparable to normal recovery from failed handovers, which UEs experience occasionally in normal operation.</p><p>Second, we show that existing 5G protocols are agnostic to DU failures. Although UEs can quickly reattach to a second DU after losing connectivity with a first DU that remains operational, UEs cannot regain connectivity when the first DU is no longer operational. As we show in &#167;5, long protocol timeouts delay the UE's initial re-attachment attempt, and subsequent timeouts completely disconnect the UE. We identify these protocol limitations along with possible fixes for the standards in the future. In the meantime, we design a lightweight midhaul control plane NF that interposes on the CU-DU control plane traffic to bypass these limitations.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Atlas architecture</head><p>Figure <ref type="figure">2</ref> illustrates a logical view of Atlas' architecture, consisting of the Atlas controller, a fronthaul NF, and a midhaul control plane NF. Several agents run on the Atlas controller; these are responsible for initiating DU migration by notifying the fronthaul/midhaul NFs, for interacting with the DU and CU through hooks that expose parameters for the scheduling of radio resources and handovers, as well as for receiving telemetry feedback about the state of the DUs and the CU. The controller communicates with the NFs and the DU through an RPC channel.</p><p>The Atlas NFs are logical entities that can run in different physical locations. The fronthaul NF modifies/forwards/drops fronthaul packets for RU hardware sharing, and redirects the fronthaul traffic from the primary DU to the backup DU for failover. We present three implementation options for the fronthaul NF: i) inline in the DU process using userspace eBPF hooks, ii) on an existing programmable top-of-rack switch, and iii) on a software switch. The midhaul NF intercepts the SCTP connection between DUs and the CU to provide the CU with early DU failure notifications. The midhaul NF can reside in the CU, or a separate process; we currently implement the latter approach. DU configuration. The source and destination DUs for migration connect to the same CU, and they both have the same configuration in terms of the physical layer grid (e.g., bandwidth, TDD/FDD config, and MIMO capabilities). 
A key difference is that the source and destination cells have distinct Physical Cell IDs (PCIs); we reserve a few PCIs for destination cells from the set of reserved PCIs that operators maintain for future cell deployment <ref type="bibr">[41]</ref>.</p><p>Additionally, we configure the broadcast channels carrying the Master Information Block (MIB) and System Information Block (SIB) to use non-overlapping resources in the time domain (MIB and SIB are crucial information for UEs for cell detection, synchronization, and attachment). These configurations are required for successful handovers during proactive migration. They have no negative effect on the cell's performance. RIC integration. Our current prototype uses a custom-built controller, NFs, RPC channels and CU/DU hooks. However, we have taken care to align Atlas' design with the O-RAN RAN Intelligent Controller (RIC) specifications <ref type="bibr">[11]</ref>. The Atlas controller can be implemented as a RIC xApp. The NFs and CU/DU hooks can be implemented as service models running inside E2 Nodes <ref type="bibr">[10]</ref>, communicating with the RIC via the E2 interface, with only minor enhancements to the existing O-RAN specs (see &#167;6).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">PROACTIVE MIGRATION</head><p>Proactive DU migration in Atlas uses handovers (Figure <ref type="figure">3</ref>), achieved by serving two 5G cells, one for the source DU and one for the destination DU, via the same RU. The challenge here is simultaneously serving two 5G cells that UEs can successfully communicate with, via one RU shared between the two DUs. As already discussed in &#167;3, triggering and executing handovers takes time, depending on the RAN implementation, leading to a transient period where some UEs have moved to the destination DU while others still remain on the source DU. By sharing the RU between the DUs, we can prevent UE connection disruptions during this period. However, the current 5G vRAN stack is designed assuming a 1:1 DU-RU relationship. For example, the RU is configured with one destination fronthaul address (e.g., IP or Ethernet) of its peer DU, and it expects to receive an xRAN protocol-compliant stream of packets from the DU. Similarly, the DU's MAC scheduler and PHY layer expect signals to be sent/received via the entire RU, assuming exclusive use.</p><p>We begin by describing two intuitive RU sharing approaches, time sharing and frequency sharing, and discussing their limitations. We then present our insight of sharing RU antenna ports in combination with time sharing. Finally, we discuss how our RU sharing approach fits within the constraints imposed by the xRAN fronthaul protocol and by our requirement to be vendor-agnostic. We focus our discussion on the downlink, and briefly summarize the uplink direction in &#167;4.3.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">RU resources available for sharing</head><p>There are two intuitive RU resources that the source and destination DUs can share: time and frequency. We discuss below why sharing these resources does not work or is inefficient. As a piece of background, note that the DU's MAC scheduler makes scheduling decisions once every TTI (lasting for 1 ms or less in 5G) <ref type="bibr">[53]</ref>, and over-the-air signal exchanges happen once every symbol. Each TTI duration contains multiple symbol durations (e.g., 14 in a typical 5G configuration).</p><p>Time sharing. Is it possible to simply time-share the RU? For example, our fronthaul NF could allow source and destination DUs to communicate with the RU in alternating TTIs. This can be done, e.g., by dropping the source DU's downlink fronthaul packets in even-numbered TTIs, and the destination's in odd-numbered TTIs. However, this approach is infeasible because both DUs have to send crucial and time-sensitive control information on their Physical Downlink Control Channel (PDCCH) in every TTI. In the symbol in which one DU sends PDCCH, we must drop either the other DU's PDCCH signals, or its user data signals; the latter could contain important RRC signaling. We implement this approach with multiple optimizations (e.g., using non-overlapping PDCCH symbols for the two DUs) and find that UEs either cannot connect or get near-zero throughput.</p><p>Frequency sharing. Can we split the frequency spectrum available to the RU between the source and destination DUs? For example, during migration, the source and destination DUs could use the lower and upper 50 MHz halves of a 100 MHz RU, respectively. Unlike time sharing, we find that frequency sharing is feasible through the 3GPP mechanism of Bandwidth Parts (BWP) <ref type="bibr">[37]</ref>. 
BWP allows dividing a single 5G carrier into multiple segments, each of which can be assigned to a different service or, in our case, to a different DU. Unfortunately, BWP is an optional 5G feature that some UEs and vRAN software may not support to minimize their complexity and cost, limiting the generality of this approach.</p><p>Furthermore, even in scenarios where all participants support BWPs, statically splitting the bandwidth between source and destination DUs could degrade user performance. During the start of the migration, more UEs are connected to the source DU, which thus has higher radio resource requirements compared to the destination DU. As UEs are gradually moved to the destination DU, the traffic load and the radio resource requirements also shift. Considering that the migration period can be non-negligible when a cell serves numerous UEs, such a capacity crunch is not acceptable.</p></div>
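<div xmlns="http://www.tei-c.org/ns/1.0"><p>To see why naive TTI alternation fails, the drop rule described in this section can be sketched as a per-packet forwarding predicate. This is a minimal illustration only: the DlPacket type, the "src"/"dst" labels, and the function names are hypothetical and do not reflect the xRAN packet format or the Atlas implementation.</p>
<p><code lang="python"><![CDATA[
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DlPacket:
    """Hypothetical downlink fronthaul packet descriptor."""
    du: str       # "src" or "dst"
    tti: int      # TTI index
    symbol: int   # symbol index within the TTI (0..13 in a typical config)
    channel: str  # "PDCCH" or "DATA"


def naive_time_share(pkt: DlPacket) -> bool:
    """Return True to forward the packet to the RU, False to drop it.

    Source-DU packets are dropped in even-numbered TTIs and
    destination-DU packets in odd-numbered TTIs, as in the text.
    """
    owner = "dst" if pkt.tti % 2 == 0 else "src"
    return pkt.du == owner


def pdcch_delivery_ratio(du: str, num_ttis: int) -> float:
    """Fraction of a DU's per-TTI PDCCH transmissions that reach the RU."""
    sent = [DlPacket(du, tti, 0, "PDCCH") for tti in range(num_ttis)]
    kept = [pkt for pkt in sent if naive_time_share(pkt)]
    return len(kept) / len(sent)
```
]]></code></p>
<p>Under this rule each DU's PDCCH reaches the RU in only half of the TTIs, although both DUs need to send it in every TTI.</p></div>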
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">RU sharing in Atlas</head><p>Given that the obvious resource-sharing approaches are not applicable, we investigate whether one of them can be enabled once we remove its limitations. We observe that time sharing is feasible if we solve the problem of control channels collocated over the same TTI. In Atlas, we solve this problem by observing that a third, less obvious resource can be shared between the two DUs in addition to time and frequency: the spatial resource, arising from the RU's multiple antenna ports. To simplify the discussion below, we use the terms "antenna ports", "antennas", and "spatial streams" interchangeably, though, in practice, their mapping is a complex process determined by the DU configuration <ref type="bibr">[36]</ref>. For each antenna port, the RU expects to receive (at most) one packet from the DU for every symbol duration. In the following, we explain how, by using the spatial domain, we can bypass the limitation of RU sharing for control data. With this problem solved, we present our solution for time-sharing the RU between the source and destination DUs.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.1">Antenna port sharing for control channel data.</head><p>We use the following RU sharing approach for control channel data in Atlas: When both DUs send fronthaul packets with control signals in the same symbol, Atlas splits the antenna ports and maps the signals to non-overlapping subsets of ports as illustrated in Figure <ref type="figure">4</ref>. The symbols carrying control signals are known a priori to Atlas, since they are statically configured for all cells at deployment time and source and destination cells have the same configuration.</p><p>Naturally, transmitting the control signals of both DUs at the same time (possibly with some spatial streams dropped) leads to destructive interference. Our choice to adopt this approach, despite the interference, is driven by the following observations. The downlink control channel carries two broad types of messages: i) data-related (e.g., UE scheduling decisions, HARQ feedback), and ii) non-data-related (e.g., paging, group UL power control). We observe that data-related control messages from the two DUs do not interfere because they implicitly fall under Atlas' user data time sharing, i.e., for downlink, they are sent in the same TTI as the corresponding data; for uplink, they are sent in a previous Atlas-controlled TTI. Non-data-related control messages do not fall under the same time-division scheme and can interfere, but they are i) relatively rare, and ii) highly robust (QPSK with high code rate, 24 CRC bits), because they target UEs with unknown signal quality. Given this, we expect interference for edge users to be minimal and no worse than the interference caused by neighboring cells.</p><p>Antenna sharing can be implemented with near-zero overhead using simple forwarding rules, making it cost-efficient and easy to implement. 
It should be noted that to handle the DUs' inability to share spectrum for control channels, we also considered an alternative approach of combining the two DUs' downlink fronthaul packets in our fronthaul NF, by summing up their IQ samples. However, we opted for our current solution because, while the alternative works, it adds latency overhead and cost, since it cannot be implemented inline in the hardware switch or the DU. Instead, it requires a software switch that, as we show in &#167;7.4, has an order of magnitude higher latency compared to the other two implementation approaches.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.2">Time sharing for user data and broadcast channels.</head><p>Unlike control channel data, which is low-rate and designed to be robust, user data may use MIMO for high-performance transmissions, which is significantly more prone to interference. For example, a typical transmission of user data could use 4-layer MIMO with 256QAM, while control signal transmissions typically use diversity gain SISO with QPSK and high code rates. If we use the same spatial multiplexing technique as in the case of control channels, and we drop spatial streams of user data from the other DU, this can result in significant performance degradation. Furthermore, user data may be multiplexed with broadcast messages (MIB/SIB). If we opt to drop the broadcast data, UEs will lose synchronization and will be disconnected (or will fail to attach).</p><p>We overcome this problem by using time-sharing for user data. Using dynamic RIC policies, we configure the MAC layer of the source and destination DUs to schedule user data in symbols of non-overlapping TTIs. Time sharing allows us to adjust the resource allocation on the fly to avoid capacity crunches during the migration period. Atlas collects telemetry data from the DUs (utilization of radio resources) and changes the TTI allocation ratio accordingly (see more details in &#167;4.4). Our fronthaul NFs work synchronously, ensuring that only one DU is allowed to send downlink data in all the symbols of each TTI. The scheduling policy is configured to guarantee that DUs will send downlink data whenever they have broadcast messages (configured in non-overlapping TTIs, as explained in &#167;3).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">xRAN protocol-compatible RU sharing</head><p>We have described a high-level view of Atlas' RU sharing. We now show how to make our approach compatible with the standard xRAN fronthaul protocol. For simplicity, we omit some details such as xRAN "control plane" packets (distinct from the control channel signals discussed above).</p><p>In every symbol duration, the DU and RU exchange xRAN packets as follows. For every antenna port, the RU receives one downlink packet from the DU, and generates one uplink packet for the DU. Violating this causes undefined RU behavior, affecting user connectivity. To share the RU between two DUs while following the xRAN protocol, we need to carefully handle xRAN packets. We realize this as follows (Algorithm 1): Time alignment. The xRAN packet sequences of the two DUs must be time-aligned. For this, we time-synchronize the DUs with the RU using the Precision Time Protocol (PTP). Uplink packets. For any given symbol, each DU must receive an uplink packet from the RU. To enable this, Atlas duplicates uplink packets and sends them to both DUs. This works because although a DU's PHY layer may receive signals meant for the other DU, this is no different from normal interference. The vRAN layers naturally filter out only the desired information via the MAC scheduling decisions and standard error checking such as forward error correction. Downlink packets. For any given symbol and antenna, the RU must receive only one downlink packet, either from the source or destination DU. As discussed in &#167;4.2, our fronthaul NF is aware of the type of signals (i.e., control or data) in each packet. The NF parses the downlink packet headers to extract the information, including the TTI and symbol number, and antenna ID (called extended Antenna Carrier (eAxC) <ref type="bibr">[9]</ref> in xRAN parlance). 
In our configuration, the source DU always uses antenna port 0 for its critical downlink signals (i.e., downlink control channel). During the migration process, the NF remaps the destination DU to use antenna 1 for its critical downlink signals by overwriting the antenna ID field in the packets from the destination DU carrying such signals. This can be trivially extended for cell configurations with downlink signals using more than one antenna port.</p></div>
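<div xmlns="http://www.tei-c.org/ns/1.0"><p>The downlink steering rule described in this section can be sketched as follows. This is a minimal illustration in Python, not the Atlas implementation. The names DownlinkPacket, steer, control_symbols, and tti_owner are hypothetical. The sketch also assumes that each symbol carries either control signals or user data, that each DU's critical control signals use antenna port 0, and that all other streams in control symbols are dropped.</p>
<p><code lang="python"><![CDATA[
```python
from dataclasses import dataclass, replace
from typing import Callable, Optional, Set

SOURCE, DEST = "source", "destination"
SOURCE_CONTROL_PORT = 0  # source DU keeps antenna port 0 for critical signals
DEST_CONTROL_PORT = 1    # destination DU is remapped to antenna port 1


@dataclass(frozen=True)
class DownlinkPacket:
    du: str      # sender: SOURCE or DEST
    tti: int     # TTI number parsed from the packet header
    symbol: int  # symbol index within the TTI
    eaxc: int    # antenna ID (eAxC)


def steer(pkt: DownlinkPacket,
          control_symbols: Set[int],
          tti_owner: Callable[[int], str]) -> Optional[DownlinkPacket]:
    """Return the packet to forward to the RU, or None to drop it."""
    if pkt.symbol in control_symbols:
        # Spatial sharing: keep only each DU's port-0 control stream
        # (an assumption of this sketch) and move the destination DU's
        # stream to port 1 by rewriting the antenna ID.
        if pkt.eaxc != SOURCE_CONTROL_PORT:
            return None
        if pkt.du == DEST:
            return replace(pkt, eaxc=DEST_CONTROL_PORT)
        return pkt
    # Time sharing: only the DU that owns this TTI may send user data.
    return pkt if tti_owner(pkt.tti) == pkt.du else None
```
]]></code></p>
<p>For example, with control signals in symbol 0 and even TTIs assigned to the source DU, a destination-DU packet for symbol 0 on port 0 is forwarded with its antenna ID rewritten to 1, while a destination-DU data packet in an even TTI is dropped. With this rule, the RU never receives more than one downlink packet for a given symbol and antenna.</p></div>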
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4">Adaptive resource allocation</head><p>We further optimize our DU migration to improve resource utilization and minimize the impact on UEs. First, Atlas dynamically adjusts the time slots allocated to each DU, using telemetry collected from the DUs about the resource block (RB) utilization. The TTIs are allocated to the DUs proportionally based on their average RB utilization over a small window (e.g., 10 ms). Moreover, when choosing UEs to migrate, Atlas prioritizes inactive UEs with low data rates and waits for active UEs to complete their transmissions. This minimizes the impact on UEs' activities, e.g., due to interference on the downlink control channel or packets dropped during the handover.</p></div>
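<div xmlns="http://www.tei-c.org/ns/1.0"><p>The proportional TTI allocation can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not Atlas' actual policy. The function name allocate_ttis and the min_ttis floor (which keeps a few TTIs available to each DU, e.g., for broadcast messages) are hypothetical. The default of 20 TTIs per window corresponds to the 10 ms example window combined with the 500 &#181;s TTI duration mentioned in &#167;6.2.</p>
<p><code lang="python"><![CDATA[
```python
def allocate_ttis(rb_util_source, rb_util_dest, ttis_per_window=20, min_ttis=1):
    """Split the TTIs of one window between the source and destination DUs.

    rb_util_source and rb_util_dest are the average RB utilizations of the
    two DUs over the last window, in any common unit (e.g., percent).
    Returns (ttis_for_source, ttis_for_destination).
    """
    total = rb_util_source + rb_util_dest
    if total == 0:
        # No load on either DU: fall back to an even split.
        share = ttis_per_window // 2
    else:
        share = round(ttis_per_window * rb_util_source / total)
    # Hypothetical floor: never starve a DU completely, so that it can
    # still transmit in at least min_ttis TTIs per window.
    share = max(min_ttis, min(ttis_per_window - min_ttis, share))
    return share, ttis_per_window - share
```
]]></code></p>
<p>As UEs are handed over, the destination DU's utilization grows and its share of TTIs grows with it, which is the behavior that a static even split cannot provide.</p></div>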
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">REACTIVE MIGRATION</head><p>Atlas handles unplanned resilience events (i.e., DU software or hardware failures) differently from proactive migration, for two reasons. First, proactive migration requires the original DU to participate in handovers, which is not possible when it has crashed. Second, failures are much less frequent than planned events, so the RAN can afford a short downtime, e.g., comparable to handover failures that UEs experience occasionally during normal operation. Reactive migration &#8764; handover failure recovery. Recall that our approach in Atlas is to use existing mechanisms for wireless resilience. For reactive migration (i.e., failover), we use the mechanism that UEs use to recover from handover failures. When a primary DU fails, Atlas quickly pairs the RU with a backup DU on a different server. Since today's O-RAN RUs lack the functionality to quickly re-route the fronthaul traffic to a different DU, we use the fronthaul NF for this purpose.</p><p>To the UE, this appears as if it has lost the signal from one cell and entered the coverage area of a different cell. This is identical to a phenomenon called "too-late handover" that happens, for example, when a mobile UE turns a sharp corner in a city. Since the change in coverage happens abruptly, the network has no time to smoothly hand over the UE between cells, requiring the UE to reconnect to the new cell.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">High-level approach</head><p>To tolerate the failure of a DU server in an edge datacenter with N primary DUs (one per server), Atlas maintains a shared backup DU on a separate server that can take over if any of the N DUs fail. Since it takes a long time (over 40 s in our case, see &#167;7) to bring up a new DU from scratch, the backup DU is active and ready to take over. During failure-free operation, the backup DU performs no work (e.g., PHY signal processing or MAC scheduling), so it can run in a low power mode.</p><p>When a primary DU fails, Atlas' fronthaul NF starts forwarding fronthaul traffic for the RUs associated with the failed DU to the backup DU. This brings the cell served by the failed DU back online before affected UEs begin a process called "cell reselection" to regain connectivity. The UE notices that its wireless signal to the primary DU is not working when it repeatedly fails to decode downlink signals it expects from the primary DU. After the UE's Radio Link Failure (RLF) timer expires, the UE searches for the best available cell and tries to reconnect to it. Since Atlas' backup DU is online at this time with the same signal strength as the UE's previous cell, the UE chooses the backup DU's cell.</p><p>Comparison with handover failure recovery. Our key finding is that this approach provides a downtime comparable to the UE's time to recover from handover failures. We argue that Atlas' reactive migration downtime is therefore acceptable since handover failures are far more likely than DU failures. We expect DU failures to happen at most a few times per year. In comparison, Li et al. <ref type="bibr">[34]</ref> report that walking UEs in a commercial LTE network experience handovers once every 70 s; approximately 1 % of handovers fail, resulting in handover failures every ten minutes. Li et al. 
<ref type="bibr">[33]</ref> report handovers every 50 s while driving in a 5G network, of which 1.1 % fail as too-late handovers.</p><p>The Radio Link Failure timeout is network-configurable. In our testbed, we set it to 5G's lowest possible value (250 ms) to emulate reliable 5G networks without coverage holes.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">Lack of DU failure awareness in vRAN</head><p>The main challenge in making failovers quick is that existing vRAN protocols, specifications, and software implementations were not designed with DU failures in mind. We believe that this is because historically, RANs have used specialized hardened appliances that are less prone to failures. While one would expect that simply re-routing the RU's fronthaul link to the backup DU would suffice, we find that this approach results in (1) an initial downtime lasting over 3 s, and (2) complete disconnection of UEs after some time. Figure <ref type="figure">6</ref> shows these disruptions for a downlink TCP transfer with one UE in our testbed (&#167;7). We discuss the causes below.</p><p>For background, the control plane protocol between the CU and DU is called the "F1 Application Protocol" (F1AP) <ref type="bibr">[2]</ref>, which uses the Stream Control Transmission Protocol (SCTP) as its underlying transport. Figure <ref type="figure">5</ref> shows a UE trying to reattach via the backup DU, using an RRC message sequence. The CU treats this as a normal cell reselection attempt. Problem 1. Since the CU is unaware of the primary DU's failure, on receiving the UE's re-attachment request via the backup DU, it sends an F1AP request to the primary DU to release the UE's connection. The primary DU cannot respond to this request since it has crashed, but the CU keeps waiting for a response. Progress can be made only after the 5G core (Figure <ref type="figure">1</ref>) triggers a timeout (3 s in our case), and allows the CU to establish the UE's connection on the backup DU without releasing it on the primary DU. Problem 2. The CU later learns about the primary DU's failure when their F1AP session's underlying SCTP connection times out, which takes 30 s in our case. The F1AP protocol <ref type="bibr">[2]</ref> does not clarify how the CU should react in this case. 
In the vRAN implementation that we use, the CU handles the primary DU's SCTP timeout by deleting all UEs that were attached to the primary DU. This disconnects any of these UEs that had successfully reconnected via the backup DU.</p><p>Before describing Atlas' failover handling in detail, we discuss the limitations of two straightforward approaches to tackle the above problems. F1AP reset messages. Although F1AP supports "reset" messages allowing the DU to notify the CU of failures, these seem designed for only partial failures. For example, management threads in a DU with irrecoverable errors limited to its MAC layer could send such messages. A full DU failure (e.g., due to a server crash) does not allow the DU to send such messages. Timeout tuning. While it may be possible to tune the 5G core or SCTP timeouts to reduce the downtime, doing so would affect failure-free operations which we wish to avoid. Lowering timeouts to hundreds of milliseconds is dangerous since such low timeouts can be triggered even during normal operation, such as when the CU server or a backhaul network is overloaded. A single CU or 5G core may serve thousands of remote DUs at various cell sites with different latency and bandwidth congestion, and we do not want to tune the timeout for each DU. (Even intra-datacenter transports employ retransmission timeouts of hundreds of milliseconds, and connections time-out only after tens of retransmissions).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3">Adding DU failure awareness to vRAN</head><p>To overcome the above limitations, we design a midhaul network function and a failover manager running in our controller (Figure <ref type="figure">2</ref>), which provide low-latency notification of DU failures to the CU. The NF interposes on the SCTP protocol between the CU and the DUs. For each DU, the NF creates separate SCTP connections with the CU and the DU, and transparently relays messages between the two during failure-free operation. When the primary DU fails, the failover manager detects the failure by noticing the lack of heartbeats from the DU, and notifies the midhaul NF.</p><p>How can our midhaul NF quickly indicate the DU failure to the CU? One could consider sending an F1AP reset message to the CU. However, since F1AP reset messages must contain the list of UEs affected by the failure, our NF would need to decode all F1AP messages, which is infeasible in cases where F1AP messages are encrypted. Instead, our NF simply closes its SCTP connection to the CU that it created for the primary DU, which quickly signals the CU that the DU has failed. The CU then releases all UEs associated with the primary DU. This happens before the UEs try to reattach, avoiding the two problems described above.</p><p>Since the midhaul NF interposes on only the delay-tolerant low-bandwidth CU-DU control plane (roughly 10 Kbps per UE), it can scale to a large number of DUs and UEs. In addition, since this NF does not handle realtime data, existing NF resilience techniques (e.g., ECHO <ref type="bibr">[40]</ref>) can make it resilient.</p></div>
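<div xmlns="http://www.tei-c.org/ns/1.0"><p>The failure-signaling behavior of the midhaul NF can be illustrated with a small sketch. It is not the Atlas code: local stream socket pairs stand in for the SCTP connections, and the class MidhaulRelay, its methods, and the demo function are hypothetical names. The sketch shows that the NF relays opaque bytes without decoding them, and that closing only the CU-facing connection is enough for the CU to observe the failure.</p>
<p><code lang="python"><![CDATA[
```python
import socket


class MidhaulRelay:
    """Relays bytes between a CU-facing and a DU-facing connection for one DU."""

    def __init__(self, to_cu: socket.socket, to_du: socket.socket):
        self.to_cu = to_cu
        self.to_du = to_du

    def relay_du_to_cu(self, bufsize: int = 4096) -> int:
        """Forward one chunk from the DU to the CU without inspecting it."""
        data = self.to_du.recv(bufsize)
        if data:
            self.to_cu.sendall(data)
        return len(data)

    def on_du_failure(self) -> None:
        """Signal the DU failure by closing only the CU-facing connection."""
        self.to_cu.close()


def demo():
    # Stand-ins for the two separate connections that the NF keeps per DU.
    cu, nf_cu = socket.socketpair()
    nf_du, du = socket.socketpair()
    relay = MidhaulRelay(nf_cu, nf_du)

    du.sendall(b"opaque F1AP bytes")   # failure-free operation
    relay.relay_du_to_cu()
    first = cu.recv(4096)

    relay.on_du_failure()              # controller reported a DU failure
    after = cu.recv(4096)              # empty read: the CU sees the close

    for s in (cu, nf_du, du):
        s.close()
    return first, after
```
]]></code></p>
<p>Because the relay never parses the payload, the same approach works when the F1AP messages are encrypted, which is the reason given above for preferring a connection close over an F1AP reset message.</p></div>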
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.4">Failover steps</head><p>To complete the picture, we break down Atlas' failover process. For simplicity, we use average numbers for the various phases, measured with one UE in our testbed (&#167;7).</p><p>(1) T = 0 ms: The primary DU crashes.</p><p>(2) T = 50 ms: Our failover manager notices the lack of fronthaul packets from the primary DU, and notifies the fronthaul and midhaul NFs. The former re-routes the affected RU's fronthaul to the backup DU, and notifies the midhaul NF of the primary DU's failure.</p><p>Importantly, it takes only around 700 - 600 = 100 ms for the vRAN to re-establish the UE's connection after the RLF timeout. This is because many of the UE's state elements, including its connection contexts and authentication keys, do not need to be reconstructed since they are maintained in the higher layers of the protocol stack (i.e., the CU and core network). During reactive migration, these get reused to quickly restart the UE's connection at the backup DU. For comparison, it takes approximately 950 ms instead of 100 ms for the UE to connect from scratch after the vRAN receives the UE's RRC setup message.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">IMPLEMENTATION</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.1">Atlas network functions</head><p>Atlas fronthaul NF. To demonstrate the generality and portability of our design, we implement the Atlas fronthaul NF on three different targets. First, we implement it as a software middlebox for x86 CPUs in &#8776; 4500 lines of C++, using DPDK <ref type="bibr">[25]</ref> for low latency.</p><p>Second, we implement it in a distributed manner on each DU using the Janus RAN programmability framework <ref type="bibr">[22]</ref>, which allows the introduction of hooks in the RAN CU/DU functions, for injecting sandboxed userspace eBPF code <ref type="bibr">[18,</ref><ref type="bibr">27]</ref>. Using Janus, we introduce a hook at the xRAN layer of our PHY software (Intel FlexRAN <ref type="bibr">[26]</ref>), and load eBPF code (&#8776; 500 lines of C), which intercepts the downlink xRAN packets and implements the steering decisions described in &#167;4.3. In addition, our top-of-rack (ToR) switch mirrors uplink packets received from the RU to both DUs using the traffic mirroring feature supported by commodity Ethernet switches (e.g., port mirroring in Arista EOS <ref type="bibr">[23]</ref>).</p><p>Third, we implement it for an Intel Tofino switching ASIC in &#8776; 600 lines of P4-16 <ref type="bibr">[43]</ref>, and deploy it on a programmable ToR switch. We find that the fronthaul NF logic can be implemented using only a single match-action table by effectively leveraging the range-matching feature. This implementation can process the fronthaul traffic at line rate (e.g., 6.5 Tbps using a 32-port switch).</p><p>Midhaul control plane NF. The Atlas midhaul NF is a lightweight application running on the same server as the CU. It communicates with the CU-CP and DU control planes using Linux SCTP sockets. During normal operation, it simply forwards CU-DU F1AP messages. 
Upon receiving a DU failure notification from the Atlas controller, it notifies the CU of this failure by closing the corresponding SCTP socket.  </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.2">Atlas controller</head><p>The Atlas controller orchestrates migration by managing the fronthaul and midhaul NFs via their corresponding control APIs, e.g., Barefoot Runtime RPC APIs for a P4-based fronthaul NF, and via gRPC with our midhaul NF. It also interacts with the CU and DU using Janus control and telemetry hooks that have been integrated in the commercial-grade CU and DU software of Capgemini <ref type="bibr">[19]</ref>.</p><p>Handover and resource manager. First, the handover manager uses a non-realtime hook in the CU to trigger handovers. Second, our resource manager uses a hook in the DU's MAC scheduler to implement RU time sharing for user data. We also use telemetry information from the CU's RRC layer to prioritize UEs during migration. Failover manager. The failover manager uses DU telemetry probes in the PHY layer to detect hardware and software crashes. The probe sends heartbeats to our controller in every 500 &#181;s TTI duration while the DU is alive. If heartbeats are not received for a timeout duration (set conservatively to 20 ms), our controller notifies the fronthaul and midhaul NFs to trigger DU failover.</p><p>It should be noted that our current implementation relies on Janus for realizing the CU/DU hooks, due to the flexibility that it offers for changing the control and telemetry logic on-the-fly. However, the same functionality could also be implemented using existing O-RAN E2 service models and 3GPP specifications, with minor enhancements. Specifically, the handover functionality could be implemented using the traffic steering E2 service model <ref type="bibr">[31]</ref> and the required telemetry information could be obtained using the KPM <ref type="bibr">[12]</ref> and Network Information <ref type="bibr">[8]</ref> service models. 
Finally, the per-TTI scheduling of the radio resources between the two DUs could be achieved using the 3GPP slicing parameters exposed in <ref type="bibr">[1]</ref>, using a service model similar to the one described in <ref type="bibr">[28]</ref>.</p></div>
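<div xmlns="http://www.tei-c.org/ns/1.0"><p>The heartbeat-based failure detection performed by the failover manager can be sketched as follows. This is a simplified Python model rather than the controller's code: the class HeartbeatMonitor and its methods are hypothetical, and timestamps are passed in explicitly (in microseconds) instead of being read from a clock. The default timeout matches the 20 ms value stated above.</p>
<p><code lang="python"><![CDATA[
```python
class HeartbeatMonitor:
    """Declares a DU failed when no heartbeat arrives within a timeout."""

    def __init__(self, timeout_us=20_000):
        self.timeout_us = timeout_us
        self.last_seen_us = {}   # DU id -> time of the latest heartbeat
        self.failed = set()      # DUs already reported as failed

    def heartbeat(self, du_id, now_us):
        """Record a heartbeat sent by the DU's PHY-layer probe."""
        self.last_seen_us[du_id] = now_us

    def check(self, now_us):
        """Return the DUs that are newly considered failed at now_us."""
        newly_failed = []
        for du_id, seen in self.last_seen_us.items():
            if du_id not in self.failed and now_us - seen >= self.timeout_us:
                self.failed.add(du_id)
                newly_failed.append(du_id)
        return newly_failed
```
]]></code></p>
<p>With one heartbeat per 500 &#181;s TTI, a 20 ms timeout corresponds to 40 consecutive missed heartbeats. In this sketch, each DU returned by check would cause the controller to notify the fronthaul and midhaul NFs exactly once.</p></div>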
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7">EVALUATION</head><p>Testbed hardware. We evaluate Atlas on a commercial-grade 5G vRAN testbed (Figure <ref type="figure">7</ref>). Our testbed uses ceiling-mounted O-RAN RUs from FoxConn, with 100 MHz 4x4 MIMO operating at 3.5 GHz (Figure <ref type="figure">7a</ref>). We use three HPE Telco DL110 servers, each with one Intel Xeon 6338N CPU, an Intel E810 100 GbE NIC, and an Intel ACC100 accelerator for PHY forward error correction. The RU and servers are connected via a 100 GbE Arista 7170 Tofino-based P4 switch, and synchronized with a Qulsar QG2 PTP grandmaster clock. There are five UEs, which are a mix of Raspberry Pis with Quectel RM502Q-AE modems, and OnePlus Nord N10 5G phones (Figure <ref type="figure">7b</ref>). Software. We use Intel FlexRAN v22.03 for the PHY, CapGemini's 5G stack for the higher DU and CU layers, and Metaswitch's Fusion 5G Core. The servers run real-time Linux kernel v5.15. We run the primary and secondary DUs and the CU on different servers. For proactive migration, we start the destination DU only when needed. For reactive failovers, we keep a backup DU always running as a hot standby. Cell configuration. We use the two RUs and five UEs in our building to emulate a realistic deployment. We deploy two cells with PCIs 11 and 51, named "cell 11" and "cell 51", respectively. Figure <ref type="figure">8</ref> illustrates the radio signal coverage and UE placement in our testbed.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.1">Comparison with baselines</head><p>We use one UE in this section for clarity, and show results with multiple UEs in later sections. In the experiment, we use iperf to send downlink TCP traffic with a target rate of 40 Mbps and measure the throughput once every second. We compare Atlas with three baselines for vRAN resilience:</p><p>(1) Neighbor cell offload. Although offloading UEs to a neighboring cell may not always be possible due to DU centralization (&#167;2.4), we evaluate it for completeness. We offload the UE from cell 11 (served by DU 1) to cell 51 (served by DU 2) as follows. For proactive migration, we use a CU-triggered handover. For failover, we kill DU 1 and let the UE re-attach to DU 2.</p><p>(2) Stateless migration to a hot-standby DU. We use only one RU (cell 11) in this case. Both DUs connect to Atlas' fronthaul NF, which initially forwards traffic between RU 1 and DU 1. For proactive migration, we steer the fronthaul traffic from the RU to DU 2. For failover, we kill DU 1 and then immediately steer the fronthaul traffic to DU 2.</p><p>(3) Creating a new DU instance. Only cell 11 is used. For both proactive and reactive migration, we kill DU 1 and immediately launch a new DU on the same server.</p><p>Figure <ref type="figure">9</ref> shows the results. For proactive migration, Atlas seamlessly moves the UE to the new DU with no visible connectivity disruption, other than minor TCP rate variations that also happen in normal operation. In comparison, offloading to the neighbor cell results in a persistent throughput drop of around 25%, because RU 2 is farther and has worse signal quality than RU 1 for the UE (UE #5). The other approaches (stateless migration to a hot-standby DU, and creating a new DU instance) result in complete disconnection for around 3 s and 45 s, respectively.</p><p>For reactive migration during failover, Atlas maintains non-zero TCP throughput in every one-second interval. 
With the neighbor cell offload and stateless migration approaches, we observe UE disconnections, shown by the red arrows in Figure <ref type="figure">9b</ref>. As discussed in &#167;5.2, the UE completely disconnects from the network at around 30-40 s after failure, caused by the CU deleting the primary DU's UEs when their SCTP connection times out. Results for the "new DU instance" approach are identical to those for proactive migration, and are omitted for brevity.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.2">Multi-UE performance</head><p>We next evaluate Atlas' performance using five UEs. TCP and UDP throughput. We measure the five UEs' UL (uplink) and DL (downlink) throughput changes during DU  migration when they all concurrently transmit TCP or UDP traffic. All UEs initially connect to cell 11, and run iperf with a target rate of 8 Mbps for DL and 2 Mbps for UL. Figure <ref type="figure">10</ref> shows the average throughput of all UEs during the migration, with the minimum/maximum throughput across UEs. We make the following observations. Overall, Atlas' migration techniques have little impact on the UEs' connectivity and throughput. For proactive migration, both the TCP and UDP throughput remain largely unaffected. There are small throughput variations caused by the migration event, but these are comparable to natural throughput variations caused by the noisy nature of wireless channels.</p><p>For failover, the TCP throughput drops to zero, but for only 600 ms and 850 ms for DL and UL, respectively. As discussed in &#167;5.1, this is comparable to the recovery timeout from handover failures (e.g., 670 ms in an LTE study by Li et al. <ref type="bibr">[35]</ref>). After failover, the throughput returns to normal, with high variation caused by TCP buffering lasting approximately 5 s. The UDP throughput drops to zero for only 750 ms. Ping latency. Figure <ref type="figure">11</ref> shows the average RTT across the five UEs to a remote server during migration, measured every 200 ms. Proactive migration has no visible impact on ping latency. For failover, the RTT increases to at most 650 ms for less than one second and then returns to normal values. Adaptive time slot allocation. We evaluate the effectiveness of Atlas' adaptive resource allocation ( &#167;4.4) by comparing it with a static allocation scheme that statically partitions TTIs evenly between the source and destination DU. 
In this experiment, we send 3 Mbps UL TCP traffic from each of the 5 UEs (15 Mbps overall), while the total cell capacity is 20 Mbps. Figure <ref type="figure">12</ref> shows that with static resource allocation, </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.3">Real-world applications</head><p>We use live video streaming as an example to evaluate how well Atlas preserves connectivity for low-latency applications. Our experiments show that Atlas migrates the DU with negligible impact on video streaming. Our experiment uses the DASH.js framework, which plays low-latency streams and reports live latency, video bitrates, and video buffer sizes <ref type="bibr">[21]</ref>. We set the target latency to 3 s, with which the DASH player tries to stream the video while staying at most 3 s behind the live event. We use framework-recommended values for other settings, resulting in a 6 Mbps video bitrate. Figure <ref type="figure">13</ref> presents the lag behind the live event, and the server's buffer size during proactive migration and failover. We find no significant change in either metric. The video quality and fluency are stable during the migration, as the video buffer is never fully drained.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.4">Microbenchmarks</head><p>Migration time. We measure the time that Atlas needs to complete DU migration with different numbers of UEs. Overall, Atlas can complete the migration within a few seconds. For proactive migration, Atlas takes 0.5 s for one UE, which increases linearly with the number of UEs to 1.3 s for five UEs. Much of this time comes from a limitation of the DU implementation used in our experiments, which allows triggering only one handover every 0.5 seconds. This time will decrease by an order of magnitude with more optimized handover triggers (e.g., multiple simultaneous handovers) since handovers take &lt;100 ms to complete. A typical number of users that are RRC-connected to a cell is 1-25 <ref type="bibr">[20]</ref>, so all UEs will migrate in a few seconds. In addition, each UE retains connectivity for most of this duration since the actual over-the-air handover signaling lasts for only around 20-50 ms. During failovers, the time to re-attach one and five UEs is 0.7 s and 0.9 s, respectively. CPU and latency overhead. Atlas' components add little CPU overhead and latency. Our eBPF hooks in the PHY, MAC, and RRC layers add at most 2 &#181;s per TTI. This is negligible for the non-realtime RRC layer, and only 0.4 % of the PHY and MAC's 500 &#181;s TTI budget.</p><p>The overhead of our fronthaul NF depends on the implementation used. The P4 switch-based implementation adds no additional latency beyond the switch's 800 ns port-to-port forwarding latency, which is negligible compared to the xRAN fronthaul's 100 &#181;s latency budget. If implemented in a software switch, the NF adds up to 10 &#181;s. 
A distributed implementation based on eBPF hooks in the PHY also adds negligible latency (0.8 &#181;s).</p><p>We have not yet optimized our midhaul NF for performance, because it interposes on the delay-tolerant control plane between the CU and DU, which carries little traffic (e.g., compared to the data plane). Our Linux sockets-based implementation adds under 100 &#181;s per message.</p></div>
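<div xmlns="http://www.tei-c.org/ns/1.0"><p>The relative overheads quoted in this section follow from simple arithmetic, reproduced in the short Python sketch below. The helper percent_of_budget and the dictionary keys are hypothetical names. Comparing the software switch's 10 &#181;s against the 100 &#181;s fronthaul budget is an assumption of this sketch; the other two comparisons are the ones made in the text.</p>
<p><code lang="python"><![CDATA[
```python
def percent_of_budget(cost_us, budget_us):
    """Express a latency cost as a percentage of a latency budget."""
    return 100.0 * cost_us / budget_us


TTI_BUDGET_US = 500.0        # PHY and MAC budget per TTI
FRONTHAUL_BUDGET_US = 100.0  # xRAN fronthaul latency budget

OVERHEAD_PERCENT = {
    # eBPF hooks: at most 2 us per TTI
    "eBPF hooks per TTI": percent_of_budget(2.0, TTI_BUDGET_US),
    # P4 switch: 800 ns port-to-port forwarding latency
    "P4 switch forwarding": percent_of_budget(0.8, FRONTHAUL_BUDGET_US),
    # Software switch: up to 10 us (comparison is an assumption here)
    "software switch NF": percent_of_budget(10.0, FRONTHAUL_BUDGET_US),
}
```
]]></code></p>
<p>The first entry evaluates to the 0.4 % figure stated above. The switch-based and software-based fronthaul NFs come to roughly 0.8 % and 10 % of the fronthaul budget, respectively, which reflects the order-of-magnitude gap between the two.</p></div>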
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8">RELATED WORK</head><p>Supporting resilience in cellular networks. Existing literature studies offloading UEs to neighboring cells to make cellular networks more resilient <ref type="bibr">[45,</ref><ref type="bibr">51,</ref><ref type="bibr">52]</ref>. Yang et al. <ref type="bibr">[52]</ref> design a way to estimate the service impact on UEs with hypothetical cellular-tower outages when multiple cells exist around the UEs. Concord <ref type="bibr">[45]</ref> and Magus <ref type="bibr">[51]</ref> propose solutions to minimize the disruption due to network upgrades by carefully coordinating neighboring cells. These approaches are constrained by traditional RAN deployments having only one DU per cell site, with no backup DUs for the cell site's radio. In contrast, Atlas develops the necessary techniques to use the multiple servers available to handle a cell site in vRAN deployments. The result is a design that works without needing inter-cell coordination or coverage from neighboring cells, and has less impact on UE performance since it does not affect signal quality.</p><p>Another line of work replicates the computation state of cellular network functions for high availability <ref type="bibr">[29,</ref><ref type="bibr">40]</ref>. For example, ECHO <ref type="bibr">[40]</ref> replicates the state of an LTE core network to an external key-value store. We believe that it is infeasible to apply such approaches to DUs due to their stricter real-time requirements. Slingshot <ref type="bibr">[32]</ref> migrates the DU's stateless PHY processing to another server, but it does not handle the DU's higher stateful layers. RU sharing/virtualization. Picasso <ref type="bibr">[24]</ref> proposes a new RF front-end design that enables simultaneous transmission on arbitrary spectrum fragments with a single RF front end and antenna. 
Although such features would be useful for vRAN resilience, they are not available in current commercial RU designs. HyDRA <ref type="bibr">[30]</ref>, SVL <ref type="bibr">[50]</ref>, and Mendes et al. <ref type="bibr">[39]</ref> multiplex the RF front-end in the frequency domain, based on FFT or filterbank, creating virtual RF front-ends using isolated spectrum bands. However, more work is needed to realize these techniques in real vRAN setups. NRflex <ref type="bibr">[16]</ref> uses Bandwidth Parts (BWP) for slicing radio resources, which is an optional feature. In contrast, Atlas can be implemented in the existing vRAN stack without requiring such features.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="9">CONCLUSIONS</head><p>We have shown how Atlas solves a key problem in the journey toward resilient vRANs: making the real-time black-box DU upgradeable and fault-tolerant. The key insight is to repurpose existing cellular resilience mechanisms for software resilience. To accomplish this, we develop a novel method for sharing an RU between two DUs by exploiting the spatial antenna dimension; and we address limitations in existing 5G protocols arising from their failure-agnostic nature. Experiments in a commercial-grade 5G vRAN testbed show that Atlas handles DU upgrades and failures with little connectivity disruption, and requires few vRAN modifications. We believe that these techniques form a practical basis for future resilient vRANs.</p></div></body>
		</text>
</TEI>
