<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>GranularNF: Granular Decomposition of Stateful NFV at 100 Gbps Line Speed and Beyond</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>08/30/2022</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10356664</idno>
					<idno type="doi">10.1145/3561074.3561092</idno>
					<title level='j'>ACM SIGMETRICS Performance Evaluation Review</title>
<idno>0163-5999</idno>
<biblScope unit="volume">50</biblScope>
<biblScope unit="issue">2</biblScope>					

					<author>Ziyan Wu</author><author>Tianming Cui</author><author>Arvind Narayanan</author><author>Yang Zhang</author><author>Kangjie Lu</author><author>Antonia Zhai</author><author>Zhi-Li Zhang</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[In this paper, we consider the challenges that arise from the need to scale virtualized network functions (VNFs) at 100 Gbps line speed and beyond. Traditional VNF designs are monolithic in state management and scheduling: each VNF internally maintains all states and the operations associated with them. Without proper design considerations, such designs suffer from limitations when scaling at 100 Gbps link speed and beyond: inefficient cache utilization because of contention caused by frequent control plane activities, computation- and memory-intensive tasks taking up CPU time, and shared states that force synchronization among the cores. We address these limitations by arguing for the need to granularly decompose a VNF into data/control components that are co-located within a server but can be independently scaled among the cores. To realize the approach, we design a "serverless" programming framework with novel abstractions to optimize the data components that must process packets at line speed, reduce contention on the data states, and enable run-time scheduling of different components for improved resource utilization. The abstractions, combined with the runtime system that we design, help NFV developers focus on the logic and correctness of VNF programming without worrying about how VNFs may be scaled in or out. We evaluate our platform by comparing it with monolithic approaches using different workloads and by analyzing the advantages of separation for scalability, performance determinism, and feature velocity.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">INTRODUCTION</head><p>Unlike traditional data centers, edge clouds have limited real estate (e.g., rack space), so packing more networking capacity (e.g., 100 Gbps network interface cards (NICs)) and more cores into each server is crucial. Network Function Virtualization (NFV) systems running in edge clouds, e.g., those implementing 5G core network functions, must therefore be capable of processing packets in software at 100 Gbps line speed and beyond.</p><p>VNF designs (see, e.g., <ref type="bibr">[9,</ref><ref type="bibr">17,</ref><ref type="bibr">18]</ref>) often apply the SBA (service-based architecture), which decouples high-level policymaking into a separate network of network functions using the idea of CUPS (Control and User Plane Separation) and lets the data network functions communicate with the The architecture improves the flexibility of NF software, but the principle is not sufficient to address the issue of scaling a stateful network function within a multi-core server. Existing programming abstractions <ref type="bibr">[6,</ref><ref type="bibr">8,</ref><ref type="bibr">15]</ref> focus on the reusability of packet processing modules. They compose different building blocks or elements into more complex network functions and states, exploiting the reusability of different modules to increase performance (Figure <ref type="figure">1</ref> left).</p><p>While previous works show considerable improvement when combined with a fast packet processing framework [4], they face limitations when scaling at 100 Gbps link speed and beyond unless additional design considerations are made. First, cache utilization often suffers from interference because these designs do not differentiate states/actions with different properties. As a result, different states often contend with each other, and the states that most affect performance may not reside in the cache. Second, there is a lack of proper scheduling of routines that perform computation- or memory-intensive tasks. Scaling out often treats an NF as a whole with coarse resource allocation, regardless of the real bottlenecks.</p><p>For the efficient execution of a single stateful network function on an NFV system, we argue that in order to scale it effectively on a multi-core system, we need to decompose it into components with different properties. 
The granularly decomposed NF (Figure <ref type="figure">1 right</ref>) is an organization of independently schedulable data components (data actions and data states), control components (control actions and control states), and offloading components, which communicate with Compared to the monolithic NFs, granularly decomposed ones not only inherit the benefits of a serverless framework, such as independent scaling, flexible placement, and increased velocity of innovation for the data and control components, but also decouple the data states from the control states. This lets the states of the data components fit into the L1/L2/last-level cache, which is essential for network functions to scale to 100 Gbps <ref type="bibr">[25]</ref> and beyond on COTS (commercial off-the-shelf) hardware, considering that the per-packet processing budget is under 6.7 ns.</p><p>While the granular decomposition is arguably the right abstraction, realizing it in practice is challenging for three reasons. (1) Keeping the cache footprint small is hard when the number of active flows is high. For traffic at 100 Gbps and beyond, the number of flows is generally high, which leads to a large flow table that exceeds the capacity of the cache. (2) The activities of the control/offloading components negatively impact the performance of the data components. (3) Shared states among the cores negatively impact performance.</p><p>We observe an opportunity to solve these challenges with (1) a smaller fast table for storing per-flow states. This is driven by the insight that prioritizing flows with higher occurrence or higher priority makes efficient use of the private cache of the core. (2) a runtime system that supports flexible placement of the data components and control/offloading components, placing them on different cores when they cause contention and using cache allocation technology to limit the resource usage of actions with lower priority. 
(3) a mechanism that delays the synchronization of shared states, driven by the insight that shared states do not need updates on a per-packet basis.</p><p>We present GranularNF (Section 4), a micro-architecture-aware serverless network platform/abstraction that supports the aforementioned programming framework on COTS hardware. GranularNF consists of a director as the control plane and a runtime as the data plane of the system. It provides an interface for application developers to define data/control/offloading actions and states without worrying about the details of scheduling or the burdensome procedures of upgrading components. We illustrate the benefits of the system (Section 5) using a load balancer, an IDS, and a 5G UPF as examples. We show that (1) granular decomposition can improve system throughput through better utilization of the cache, and (2) the system inherits the benefits of the SBA design: horizontal scalability and feature velocity.</p></div>
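<div xmlns="http://www.tei-c.org/ns/1.0"><p>The per-packet budget of roughly 6.7 ns mentioned in the introduction can be reproduced with a short calculation. The sketch below assumes minimum-size 64-byte Ethernet frames plus 20 bytes of per-frame wire overhead (preamble, start-of-frame delimiter, and inter-frame gap); these framing assumptions and the function name are ours for illustration and are not stated in the text.</p>
<p><code lang="python"><![CDATA[
```python
def per_packet_budget_ns(link_gbps, frame_bytes=64, overhead_bytes=20):
    """Time available per packet, in nanoseconds, at full line rate.

    frame_bytes and overhead_bytes are illustrative assumptions
    (minimum Ethernet frame plus preamble/SFD/inter-frame gap).
    """
    bits_on_wire = (frame_bytes + overhead_bytes) * 8
    # A rate of N Gbit/s is the same as N bits per nanosecond.
    return bits_on_wire / link_gbps


if __name__ == "__main__":
    for rate in (10, 40, 100):
        budget = per_packet_budget_ns(rate)
        print(f"{rate:>3} Gbps: {budget:.2f} ns per packet")
```
]]></code></p>
<p>Under these assumptions, a 100 Gbps link leaves about 6.72 ns per minimum-size packet, which is why per-flow state lookups that miss the cache are so costly at this rate.</p></div>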
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">MOTIVATION</head><p>We argue that the existing monolithic abstraction of network functions is not sufficient. Scaling stateful NFs imposes serious challenges because (i) states that cannot fit into the private cache of a core lower the performance, (ii) shared states among the NF instances result in diminishing returns when scaling, and (iii) frequent calls to computationally expensive routines degrade data plane performance. All three factors are influenced by dynamic traffic patterns at run time. We use three common NFs to illustrate these challenges.</p><p>The first example is a stateful load balancer, for which we measure single-core performance. NF instances with a high number of flows (a large working set) perform poorly. As illustrated in Figure <ref type="figure">2</ref>, when the number of flows grows and the state can no longer fit in the cache, the L1 miss rate increases and the utilization of the L2 and L3 caches degrades significantly. For the optimal placement of NF instances, it is essential to be aware of the size of the working set of the corresponding instance.</p><p>To illustrate the performance bottleneck caused by shared states, consider the traffic accounting of a UPF with states shared among network instances. For per-packet network monitoring, shared states that are accessed frequently hurt performance (Figure <ref type="figure">3</ref>). The more frequently the shared state is accessed, the greater the performance degradation. The cause is contention on access to the data structure as well as contention in the LLC. For states evicted from the LLC, the associated operations become memory-bound <ref type="bibr">[26]</ref>.</p><p>Taking an IDS as an example, higher network traffic can cause non-negligible computation overhead. 
In the example, we send evenly balanced traffic to each core but with a different amount of suspicious flows. Instances receiving the suspicious traffic go through a series of calculations (pattern matching) to decide whether to blacklist the flows. As illustrated in Figure <ref type="figure">4</ref>, the overhead is unevenly distributed across instances even though the throughput of the input traffic is the same.</p><p>These examples show how the performance of an NF can be influenced by the different characteristics of its components. These behaviors motivate us to break down monolithic NFs and provide the right abstractions to describe their behavior. Next, we show how example stateful NFs can be described using our abstractions.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">PROGRAMMING EXAMPLES</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Load Balancer</head><p>The main function of a load balancer (Figure <ref type="figure">5 left</ref>) is to map and rewrite m (public/virtual) destination IP addresses to n (private/physical) IP addresses (e.g., those of n backend servers). The flow mapper is a data action that maps the packet's destination IP address to a new server address. The flow mapper contains a classifier that determines whether a flow is new. For an existing flow, it rewrites the header based on the stored mapping. For a new flow, it sends a message to the server selector, a control component that selects the new destination address according to the policy.</p></div>
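<div xmlns="http://www.tei-c.org/ns/1.0"><p>The split between the flow mapper (data action) and the server selector (control component) described above can be sketched as follows. This is a minimal illustration, not the GranularNF API: the class names, the 5-tuple flow key, and the round-robin selection policy are all assumptions, and the message exchange between the two components is modeled as a direct function call rather than a mailbox.</p>
<p><code lang="python"><![CDATA[
```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Hypothetical flow key: (src_ip, src_port, dst_ip, dst_port, proto)
FlowKey = Tuple[str, int, str, int, int]


@dataclass
class ServerSelector:
    """Control component: chooses a backend for each new flow.

    Round-robin is an assumed policy used only for illustration.
    """
    backends: List[str]
    next_index: int = 0

    def select(self, flow: FlowKey) -> str:
        backend = self.backends[self.next_index % len(self.backends)]
        self.next_index += 1
        return backend


@dataclass
class FlowMapper:
    """Data action: classifies flows and rewrites the destination IP."""
    selector: ServerSelector
    mapping: Dict[FlowKey, str] = field(default_factory=dict)  # data state

    def process(self, flow: FlowKey) -> FlowKey:
        backend = self.mapping.get(flow)
        if backend is None:
            # New flow: hand it to the control component, then cache
            # the decision in the data state for later packets.
            backend = self.selector.select(flow)
            self.mapping[flow] = backend
        src_ip, src_port, _virtual_ip, dst_port, proto = flow
        return (src_ip, src_port, backend, dst_port, proto)


if __name__ == "__main__":
    mapper = FlowMapper(ServerSelector(["10.0.0.1", "10.0.0.2"]))
    flow = ("1.2.3.4", 1234, "203.0.113.10", 80, 6)
    print(mapper.process(flow))
```
]]></code></p>
<p>Only the first packet of a flow reaches the selector in this sketch; subsequent packets are handled entirely from the data state, which is the property that keeps the control component off the per-packet path.</p></div>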
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">IDS</head><p>An IDS (Figure <ref type="figure">5</ref> middle) needs to inspect the content of the traffic to decide whether to blacklist a flow. The data state consists of a whitelist of flows, a blacklist of flows, and a list of inspection rules. Whitelisted flows are forwarded without further inspection, and blacklisted flows are blocked. Besides this list matching, the regular expression matching engines use the inspection rules to examine the payload of the packets. Policies such as the list of inspection rules can be configured by the remote policy server.</p></div>
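<div xmlns="http://www.tei-c.org/ns/1.0"><p>The IDS decision path described above (list lookup first, payload inspection only when needed) might look like the sketch below. It uses Python's standard re module in place of the regular expression engine, and the class name, the string flow identifier, and the choice to blacklist a flow on the first rule match are assumptions made for illustration. How flows enter the whitelist is not specified in the text, so the sketch leaves the whitelist to be filled externally.</p>
<p><code lang="python"><![CDATA[
```python
import re
from typing import Iterable, List, Pattern, Set


class SimpleIds:
    """Illustrative IDS: list lookup on the fast path, regex on the slow path."""

    def __init__(self, rules: Iterable[str]):
        self.whitelist: Set[str] = set()   # data state
        self.blacklist: Set[str] = set()   # data state
        self.rules: List[Pattern[str]] = []
        self.update_rules(rules)

    def update_rules(self, rules: Iterable[str]) -> None:
        # Stands in for configuration pushed by a remote policy server.
        self.rules = [re.compile(rule) for rule in rules]

    def inspect(self, payload: str) -> bool:
        # Offloading action: the expensive pattern matching step.
        return any(rule.search(payload) for rule in self.rules)

    def process(self, flow_id: str, payload: str) -> str:
        if flow_id in self.blacklist:
            return "drop"
        if flow_id in self.whitelist:
            return "forward"
        if self.inspect(payload):
            # Assumed policy: a single match blacklists the flow.
            self.blacklist.add(flow_id)
            return "drop"
        return "forward"
```
]]></code></p>
<p>Separating inspect from process mirrors the idea that the matching engine is an offloading component that could run on a different core from the list lookups.</p></div>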
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">5G UPF</head><p>A 5G UPF (Figure <ref type="figure">5</ref> right) acts as a router between the Access Network and the Data Network. The data actions handle uplink and downlink traffic. They read the data states to map a part of the 5-tuple to the PDR (Packet Detection Rule), which defines how to handle the packets (forwarding, buffering, dropping, accounting). The policy of the rules is defined and installed by the SMF (Session Management Function). The control actions handle requests from the SMF, and the control states store the PFCP (Packet Forwarding Control Protocol) session states. The buffer server contains a memory pool that stores downlink packets while the user devices are idle. While the forwarding table part of the data states can be duplicated among the cores, the accounting part is shared among the cores.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">SYSTEM DESIGN</head><p>We schematically depict the overall system architecture in Fig. <ref type="figure">6</ref>. The system has two subsystems: the director and the runtime. They communicate with each other over TCP. The director sends commands to configure and control the runtime, and the runtime is responsible for executing network functions.</p><p>The director (Fig 6 <ref type="figure">left</ref>) is the control plane of the system. It is responsible for the logical representation of NFs, resource management of the physical machines underlying the runtime, setting up the deployment plan, and analyzing metrics/logs collected from the runtime to support auto-scaling and placement. The NF model is a registry of all available NFs, with information on the representation (granular decomposition) of each NF, the properties of its actions and states, and the dependencies among the actions. The resource management module records how many cores each physical machine has, along with its cache size and memory size. The goal of the runtime model is to characterize current performance using metrics such as the throughput of each worker and micro-architectural metrics such as the L1/L2/L3 cache hit rates. The PCM module [3] on the runtime periodically sends the collected metrics to the director. The orchestration layer is where we implement our agent for orchestration tasks (auto-scaling, auto-placement). This module aims to provide an automated solution for reducing the space of scaling and placement strategies for optimal performance and efficiency.</p><p>The goal of the runtime (Fig 6 <ref type="figure">right</ref>) is to provide an environment for the NFs to execute. More specifically, it is responsible for fast packet receiving/sending, scheduling different actions for execution, and routing messages to them correctly. Each worker has exclusive use of the resources (CPU time, cache, NIC queues, RSS buckets) of a core. 
A worker also has a mailbox, implemented as a ring structure, to send messages to other workers. When a worker obtains the next action and that action is assigned to a different core, the worker puts the message into the mailbox, and the destination worker receives it through its own mailbox.</p><p>Data State Management: It is hard to fit the state into the cache because the number of flows is high and the flow table is therefore large. However, performance can be improved if we exploit the spatial locality of the flow table and the temporal locality of the arriving flows. For each flow table in the data state, we maintain a fast table and a slow table. The fast table stores the most frequently accessed flows and is more compact to leverage spatial locality. The slow table stores the mappings of all flows. The runtime periodically updates the fast table based on the cache miss rate provided by the profiling modules and the arrival pattern of the flows.</p><p>For shared state management, we make a trade-off between the freshness of the shared states and performance. Instead of updating the shared state on a per-packet basis, we let each core running data actions keep its own private state. The coordinators periodically merge the private states of each data action. By sacrificing the freshness of the shared states, we avoid the cost of frequent synchronization among cores accessing the shared states.</p><p>Minimizing Interference on the data components: While the control components can negatively impact the performance of the data components, the performance requirement of the control components is much lower than that of the data plane, which means that the control components benefit little from the cache or from frequent message exchange. The overhead of frequent message passing can be reduced through batch processing. 
The CAT (Cache Allocation Technology) feature of Intel CPUs can limit the number of cache ways that the control components can use, decreasing the interference between the data components and the control components. For the offloading components, flexibly placing them on other cores/sockets can relieve the bottleneck on the data plane.</p><p>Sharding: By sharding, the state of the network functions is partitioned. To achieve better load balancing among instances, we set the number of shards greater than the number of cores. For better resource utilization, each shard can be freely migrated from one core in the runtime to another by configuring the RSS buckets dynamically.</p><p>Implementation: We implement the platform using DPDK for fast packet processing. We implement the load balancer from scratch. For the IDS, we use the code from tiny-regex-c <ref type="bibr">[5]</ref> to perform regular expression matching on the payload. For the 5G UPF, we implement a version by extracting the data plane components for PDR matching, encapsulation, decapsulation, and routing from gtp5g <ref type="bibr">[1]</ref>, and the logic of the buffer server and the PFCP request handlers from free5gc [2]. Compared to the kernel module used by the original implementation, the kernel-bypass techniques that we apply also reduce data plane and control plane interference by eliminating the cost of context switching between user space and kernel space.</p></div>
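<div xmlns="http://www.tei-c.org/ns/1.0"><p>The fast table and slow table arrangement described under Data State Management can be sketched as a two-tier lookup with a periodic refresh. The sketch below is only a model of the policy: the class and method names are hypothetical, the refresh is driven purely by lookup counts (the text also mentions cache miss rates from the profiling modules, which are not modeled here), and Python dictionaries do not reproduce the compact, cache-resident layout of a real fast table.</p>
<p><code lang="python"><![CDATA[
```python
from collections import Counter
from typing import Dict, Hashable, Optional


class TwoTierFlowTable:
    """Illustrative fast/slow flow table with periodic promotion."""

    def __init__(self, fast_capacity: int):
        self.fast_capacity = fast_capacity
        self.fast: Dict[Hashable, str] = {}  # small table of hot flows
        self.slow: Dict[Hashable, str] = {}  # complete table of all flows
        self.lookups: Counter = Counter()    # per-flow counts since last refresh

    def insert(self, flow: Hashable, state: str) -> None:
        self.slow[flow] = state

    def lookup(self, flow: Hashable) -> Optional[str]:
        self.lookups[flow] += 1
        state = self.fast.get(flow)
        if state is None:
            # Fast-table miss: fall back to the complete table.
            state = self.slow.get(flow)
        return state

    def refresh_fast_table(self) -> None:
        # Called periodically by the runtime: keep only the most
        # frequently looked-up flows that actually exist.
        known = [f for f, _ in self.lookups.most_common() if f in self.slow]
        hottest = known[: self.fast_capacity]
        self.fast = {f: self.slow[f] for f in hottest}
        self.lookups.clear()
```
]]></code></p>
<p>With a fast_capacity much smaller than the total number of flows, the hot subset of a skewed traffic mix is served from the small table, which is the effect the evaluation attributes to the 5K-entry fast table.</p></div>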
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">EVALUATION</head><p>Experiment setup: To evaluate our platform, we use two machines, each with Intel(R) Xeon(R) Platinum 8168 CPUs providing 48 cores on two sockets and two 100 Gbps ConnectX-5Ex NICs. One machine runs a traffic generator that sends 100 Gbps traffic to the second machine, which runs our runtime. The director is placed on the traffic generator machine.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">Data Components Optimization</head><p>Improvement from the fast table: We assume that the generated traffic contains 50K flows that arrive frequently (90% of the overall traffic) compared to the rest of the flows. For the decomposed network functions, we put the control components on different cores. For all the network functions, we generate simulated policy configuration requests for the control components. Figure <ref type="figure">7</ref> compares the performance of the decomposed load balancers to the monolithic ones using 16 cores. The comparison shows that the decomposed version outperforms the monolithic one because accesses to the data states and the control states are isolated on different cores. The fast table version with 5K entries per core achieves better performance because it consolidates the most frequently accessed flows into a smaller table that fits better into the cache.</p><p>Improvement from delayed synchronization: We now evaluate how delaying the synchronization of the shared state improves performance. We assume that an IP flow is distributed to different cores using RSS. We compare the performance of the 5G UPF with and without the shared state optimization (Figure <ref type="figure">8</ref>). With strict per-packet updates of the shared states, scalability suffers. In comparison, delayed synchronization of the states across cores achieves linear scalability.</p></div>
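<div xmlns="http://www.tei-c.org/ns/1.0"><p>The delayed synchronization evaluated above trades freshness of the shared state for fewer cross-core updates. A minimal single-threaded model of the idea is sketched below, using UPF-style byte accounting as the shared state. The class names and the merge interface are hypothetical; a real implementation would also need the merge to be safe against concurrent per-packet updates, which this sketch does not address.</p>
<p><code lang="python"><![CDATA[
```python
from typing import List


class PerCoreAccounting:
    """Private accounting state owned by a single core."""

    def __init__(self) -> None:
        self.byte_count = 0

    def on_packet(self, size_bytes: int) -> None:
        # Per-packet update touches only core-private state.
        self.byte_count += size_bytes


class AccountingCoordinator:
    """Periodically folds private counters into the shared total."""

    def __init__(self, num_cores: int) -> None:
        self.private: List[PerCoreAccounting] = [
            PerCoreAccounting() for _ in range(num_cores)
        ]
        self.shared_byte_count = 0  # may lag behind the true total

    def merge(self) -> None:
        for core_state in self.private:
            self.shared_byte_count += core_state.byte_count
            core_state.byte_count = 0


if __name__ == "__main__":
    coordinator = AccountingCoordinator(num_cores=2)
    coordinator.private[0].on_packet(1500)
    coordinator.private[1].on_packet(64)
    coordinator.merge()
    print(coordinator.shared_byte_count)
```
]]></code></p>
<p>Between merges the shared total is stale by at most the traffic seen since the previous merge, so the merge period directly sets the freshness bound.</p></div>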
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">Velocity</head><p>We run a workload that requires a deployment scheme in which the control component is dynamically placed on a separate core. Figure <ref type="figure">9</ref> shows that our design supports the flexible dynamic placement of each component within a server. More important than the speed of the transition, which is largely attributable to the low-latency nature of the messages, is the fact that traffic was not dropped and did not need to be steered away from this data plane actor during the transition.</p><p>For the second part of the velocity experiment, while running the NF, we periodically upgrade the offloading component using a dynamic library through the orchestrator. Using the modified NF model in the director, the orchestrator sends instructions to the runtime to install the new action and change the related data structures accordingly. Figure <ref type="figure">10</ref> shows that in our system, updates can happen without disruption.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3">Flexible Placement</head><p>Placement of offloading actions: First, we evaluate the flexible placement of the regular expression matching engine of the IDS. We generate a workload containing only suspicious flows. We compare the performance of an IDS that co-locates the data actions and the offloading action on the same core with one that places them on different cores. Figure <ref type="figure">8</ref> illustrates the relationship between the throughput, the number of cores that run the offloading action, and the total number of cores used. It shows that when the workload on the regular expression matching engine is high, adding more cores for the data components does not improve performance. A similar situation occurs when using 1, 2, or 3 offloading cores. When we use 4 offloading cores, the bottleneck on the data plane is completely removed. The figure also shows that fine-grained resource allocation leads to efficient resource utilization, achieving the same performance with fewer cores.</p><p>Placement Strategy Analysis: Depending on workload characteristics and performance requirements, the optimal placement plan differs, i.e., whether we should co-locate the data action and the offloading action on the same cores. We can adjust the computation overhead of the IDS offloading actions by changing the number of pattern matching rules they need to process, and the frequency of their invocation by changing the inter-arrival time of suspicious packets. Both Figure <ref type="figure">11a</ref> and Figure <ref type="figure">11b</ref> show that increased computational complexity of offloading actions and decreased inter-arrival time of offloading actions lead to performance degradation. Figure <ref type="figure">11a</ref> shows that placing them on different cores boosts throughput, since the work is offloaded to other cores. 
The comparison in Figure <ref type="figure">11c</ref> shows when each placement plan outperforms the other. When the frequency is lower and the cost of the offloading action is higher, placement on different cores tends to perform better. This implies that, for optimal performance, we need to change the placement at run time based on the workload.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">RELATED WORK</head><p>A number of studies have been devoted to the state management problem for scaling stateful NFs and its consistency and robustness implications. For example, Split-&amp;-Merge <ref type="bibr">[20]</ref> separates per-flow states and shared states, with the latter managed centrally; this direction is further extended in <ref type="bibr">[10,</ref><ref type="bibr">13]</ref>. In contrast, Stateless <ref type="bibr">[12]</ref> advocates centrally managing all NF states (in a centralized controller), whereas S6 <ref type="bibr">[24]</ref> employs a DHT key-value store maintained among servers running various NF instances for distributed state management. The consistency issues due to sharing of NF states are addressed in <ref type="bibr">[14]</ref>, which advocates using locking mechanisms to ensure correctness; this work is further extended in <ref type="bibr">[11]</ref>, which employs database techniques. All these studies factor out only the NF state from NFs and focus on scaling across servers. There are existing serverless NFV platforms <ref type="bibr">[19,</ref><ref type="bibr">21,</ref><ref type="bibr">22,</ref><ref type="bibr">23]</ref> that provide a containerized environment for flexible NF deployment, but they do not distinguish data from control components, a distinction that is crucial for scaling complex stateful NFs. Without such decoupling and visibility into the NF's stateful operations, it is infeasible to scale existing monolithic "opaque" and "black box" NFs to 100 Gbps and beyond. OpenBox <ref type="bibr">[7]</ref> and NetBricks <ref type="bibr">[16]</ref> advocate decomposing NFs into reusable modules, but the focus is more on removing redundant operations to speed up packet processing. 
In contrast, we tackle challenges in scaling stateful NFs by calling for a need to re-architect the NF's state and behavior and develop novel abstractions and a principled approach for refactoring and scaling stateful NFs.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7">CONCLUSION</head><p>Without proper design considerations, monolithic scaling solutions with coarse resource allocation fail to scale to 100 Gbps network traffic on modern multi-core architectures. To fill this gap, finer-grained visibility, obtained by separating the control/data/offloading states and actions of network functions, is imperative for optimal performance. We propose GranularNF, a programming framework that is useful for reasoning about scaling network functions. We also provide the runtime and the director, which support the dynamic scaling of stateful network functions. We empirically show the efficacy of the framework using a load balancer, an IDS, and a 5G UPF.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>ACKNOWLEDGEMENT</head><p>We would like to thank the anonymous reviewers for their comments. The research was supported in part by NSF under Grants CNS-1814322, CNS-1831140, CNS-1836772, CNS-1901103, CNS-2106771, CNS-2045478 and CCF-2123987.</p></div></body>
		</text>
</TEI>
