<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>A survey on TCP enhancements using P4-programmable devices</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>07/01/2022</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10359077</idno>
					<idno type="doi">10.1016/j.comnet.2022.109030</idno>
					<title level='j'>Computer Networks</title>
<idno type="ISSN">1389-1286</idno>
<biblScope unit="volume">212</biblScope>
<biblScope unit="issue">C</biblScope>					

					<author>Jose Gomez</author><author>Elie F. Kfoury</author><author>Jorge Crichigno</author><author>Gautam Srivastava</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[The increasing performance requirements of today's Internet applications demand a reliable mechanism to transfer data. Many applications rely on the Transmission Control Protocol (TCP) as the transport protocol, due to its ability to adapt to properties of the network and to be robust in the face of many kinds of failures. However, improving the performance of applications that rely on TCP has been limited by the closed nature of legacy switches, which do not provide accurate visibility of network events. With the emergence of P4-programmable devices, developers can rapidly implement and test customized solutions that use fine-grained telemetry, provide sub round-trip time feedback to end devices to enhance congestion control, precisely isolate traffic to offer better Quality of Service (QoS), quickly detect congestion and re-route traffic via alternate paths, and optimize server resources by offloading protocols.This paper first surveys recent works on P4-programmable devices, focusing on schemes aimed at enhancing TCP performance. It provides a taxonomy classifying the aspects that impact TCP's behavior, such as congestion control, Active Queue Management (AQM) algorithms, TCP offloading, and network measurement schemes. Then, it compares the P4-based solutions, and contrasts those solutions with legacy implementations. Lastly, the paper presents challenges and future trends.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>The Transmission Control Protocol (TCP) <ref type="bibr">[1]</ref> has been adopted as the standard transport protocol used by most Internet applications to transfer data. Efforts to enhance the performance and efficiency of TCP and other transport protocols <ref type="bibr">[2]</ref><ref type="bibr">[3]</ref><ref type="bibr">[4]</ref><ref type="bibr">[5]</ref> focus on improving congestion control algorithms, performing fine-grained network measurements, and offloading transport protocol functions to specialized network hardware.</p><p>With the advent of programmable data planes, developing custom protocols is performed in a top-down approach, facilitating the experimentation of novel ideas, diverging from the development process that the networking industry traditionally follows. The Programming Protocol-independent Packet Processor (P4) <ref type="bibr">[6]</ref> language is widely adopted to describe the forwarding behavior of programmable data planes. P4 is a domain-specific language, which provides a standardized way to describe how packets are processed independently of the target's architecture. TCP congestion control and Active Queue Management (AQM) algorithms can benefit from the P4's language packet manipulation features to have more control and a better view of network events. Moreover, P4 is supported by various network components such as programmable Network Interface Cards (NICs). They can Email addresses: gomezgaj@email.sc.edu (Jose Gomez), ekfoury@email.sc.edu (Elie F. 
<ref type="bibr">Kfoury)</ref>, jcrichigno@cec.sc.edu (Jorge Crichigno * ), srivastavag@brandonu.ca (Gautam Srivastava).</p><p>* Corresponding author offload network stack operations and server workloads by executing them locally and reducing the communication overhead via bypassing the Operating System (OS) kernel.</p><p>Emerging technologies such as 5G, cloud computing services, big data analysis, and the Internet of Things (IoT) demand high utilization, low latency, and dynamic management of computer networks. Nevertheless, employing legacy networking devices to implement such technologies increases the complexity of the underlying network and provides less flexibility to apply changes or add new features. The management and maintenance of the legacy network devices depend on the features that the vendor included in the device, which constrains operators to define the forwarding behavior and customize traffic monitoring. Furthermore, legacy devices suffer from the network ossification problem <ref type="bibr">[7]</ref>, which is a significant barrier to implementing new protocols.</p><p>This paper surveys and discusses schemes that enhance TCP performance using P4-programmable data planes, focusing on four main categories: congestion control, AQM algorithms, TCP offloading using programmable NICs, and network diagnosis schemes that leverage P4-programmable data planes to detect, report, and troubleshoot TCP performance degradation.</p><p>Traditionally, congestion control is achieved using end-to-end mechanisms with little support from the network, relying on packet losses to share resources fairly. AQM algorithms complement congestion control schemes by taking actions in the intermediate nodes (i.e., switches) rather than waiting for the end host to react. In such a way, AQMs prevent network congestion from escalating. 
This survey covers RFC-standardized and custom AQM algorithms implemented in P4, showing how P4-programmable data planes are employed to realize protocols such as Random Early Detection (RED) <ref type="bibr">[8]</ref>, Proportional-Integral controller Enhanced (PIE) <ref type="bibr">[9]</ref>, CoDel <ref type="bibr">[10]</ref>, Common Application Kept Enhanced (CAKE) <ref type="bibr">[11]</ref> and Fair Queueing (FQ) <ref type="bibr">[12,</ref><ref type="bibr">13]</ref>. Additionally, this paper covers custom AQMs designed to solve performance issues by implementing solutions based on theoretical approaches. Another critical limitation that affects TCP performance is the processing time that packets spend in the end host. TCP and other transport protocols are implemented in the end hosts, which use the general-purpose CPU to encapsulate, verify, and fragment TCP segments. P4-programmable NICs can offload TCP and part of the kernel networking processes, moving packet processing closer to the network, reducing the processing overhead, and freeing the server's resources. This survey covers works related to P4-programmable NICs that extend, offload, and optimize the resources used by protocols such as TCP and applications entirely implemented in the NIC.</p><p>This paper also presents relevant works on network monitoring and diagnosis. Programmable data planes have more granularity to detect, report, and react to network issues at line rate. Compared to legacy flow export protocols such as NetFlow <ref type="bibr">[23]</ref> and IPFIX <ref type="bibr">[24]</ref>, programmable data planes can be employed to detect network flaws and mitigate them at an early stage. Moreover, either the control or data plane can take actions (e.g., marking, dropping, prioritizing packets, precomputing arithmetic operations, and rerouting traffic) to mitigate a failure and provide an accurate report of network events.</p></div>
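To make the role of an AQM concrete, the following Python sketch models the classic RED drop decision named above: the drop probability grows linearly between a minimum and maximum threshold of the (averaged) queue occupancy. The threshold and probability values are illustrative, not taken from any surveyed implementation.

```python
import random

def red_drop(avg_queue, min_th=5, max_th=15, max_p=0.1):
    """RED sketch: drop probability rises linearly between min_th and max_th."""
    if avg_queue < min_th:
        return False          # queue is short: never drop
    if avg_queue >= max_th:
        return True           # queue is long: always drop
    # Linear ramp between the two thresholds, capped at max_p.
    p = max_p * (avg_queue - min_th) / (max_th - min_th)
    return random.random() < p
```

Dropping probabilistically before the buffer is full is what lets RED signal TCP senders early and avoid global synchronization, the behavior the P4 implementations surveyed later reproduce in the data plane.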
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.1.">Contributions</head><p>To the best of the authors' knowledge, the literature has been missing a survey on TCP enhancement leveraged on P4-programmable data planes. Programmable data planes reduce the complexity of forwarding devices, facilitate adding new features, and supports fine-grained telemetry. This paper also discusses the usage of P4programmable NICs to offload the server's resources. Performing part of the computation in the NIC, offloads servers from handling network stack processing. Furthermore, it also describes P4-based network diagnosis schemes used to report and mitigate performance issues. The granularity offered by programmable data planes is essential to identify and minimize performance degradation issues in an early stage.</p><p>This paper also provides a comparison between P4 schemes in the same category and with legacy approaches. Additionally, the challenges and future trends are also discussed. The main contributions of this survey can be framed as follows:</p><p>&#8226; Discussing the key aspects that make programmable data planes good candidates for improving network performance.</p><p>&#8226; Surveying the most relevant works that leverage data plane programmability to improve TCP performance and efficiency.</p><p>&#8226; Proposing a taxonomy that categorizes the works that enhance network and TCP performance.</p><p>&#8226; Describing the common challenges and considerations and listing current and future initiatives that can advance the state-of-the-art systems.</p><p>&#8226; Discussing challenges and limitations of the surveyed works.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.2.">Paper Organization</head><p>Figure <ref type="figure">1</ref> illustrates the paper roadmap. Section 2 describes the related surveys. Section 3 provides a background congestion control algorithms, AQMs and P4. Section 4 described the proposed taxonomy, section 5 surveys the P4-assisted congestion control, section 6 describes the P4-based AQMs, section 7 covers the improvements that P4 programmable NICs provide to offload network operations, section 8 explains how P4 devices are employed to detect, prevent and mitigate network performance issues. Section 9 discusses the challenges and future work and, section 10 concludes the paper.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.3.">Routers and Switches</head><p>Traditionally, a switch refers to a layer-2 device, and a router refers to a layer-3 device. However, with programmable data planes, the programmer decides what layers and protocols to parse and process. Thus, in this article, the terms switch and router will be used interchangeably, noting that a switch can process not only layer-2 protocols but also upper-layer protocols.  </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Surveys</head><p>Dargahi et al. <ref type="bibr">[14]</ref> surveyed the security applications enabled by Software-Defined Networking (SDN). They provide a background on stateful SDN data planes, dissect the relevant security issues and present an analysis of concrete use case applications such as the port knocking application, the UDP flooding mitigation application, and the failure recovery. They conclude the paper by discussing significant vulnerabilities and possible attacks to the stateful SDN data plane. This survey does not cover any topic related to TCP and how P4-programmable data planes can improve aspects of such protocol.</p><p>Bifulco and Retvari, <ref type="bibr">[15]</ref> presented a survey on programmable data planes that focuses on the recent trends and issues in the design and implementation of programmable network devices. They cover the abstractions and architectures proposed, debated, realized, and deployed during the last decade. The authors discuss future research direction, providing a list of open problems based on abstractions, simplified operations, data plane optimization, and verification frameworks. The survey does not describe current works that aim at solving the issues covered in the paper.</p><p>Satapathy <ref type="bibr">[16]</ref> presented a survey on SDN and P4. The author analyses the issues of traditional networks and how programmable data planes can be employed to support and improve cloud services and big data applications. The author presents two use-cases consisting of advanced network telemetry and DDoS attack mitigation using programmable data planes. The survey also collects and compares P4 related work. The study concludes by focusing on future works related to the evolution of programmable data planes, P4 language, and defining highlevel policies that can be implemented using P4.</p><p>Kaljic et al. 
<ref type="bibr">[17]</ref> provides a comprehensive survey on programmable data planes architectures with a particular emphasis on the problem of programmability and flexibility. The authors provide an overview of software and hardware technologies that enable SDN data plane implementations, identifying key factors that deviate from the original data plane architectures related to Forwarding and Control Element Separation-based (ForCES) and OpenFlow specifications. The paper categorizes and describes works that include performance improvements, energy consumption, quality of service, measurement and monitoring, security, and integration with other network technologies. Finally, they discuss and propose future research directions that should be considered to develop data plane architectures. This survey does not provide any comparison between related schemes and legacy devices.</p><p>Tan et al. <ref type="bibr">[18]</ref> presented a survey about the latest applications leveraged on In-band Network Telemetry (INT). The authors analyze the development history, research situation, and application results of INT technology. The work covers INT applications such as congestion control algorithms, network troubleshooting, microburst detection, and traffic engineering. Although the survey discusses key technologies in the field, it does not compare the works within the same category and with legacy devices.</p><p>Hauser et al. <ref type="bibr">[19]</ref> present a survey covering P4 and its applied research domains. The paper starts by introducing the concept of data plane programming and the relationship with adjacent concepts. They describe the P4 programming language, architecture, compilers, targets, and data plane APIs. They also include works in areas such as optimization of development and deployment, research on P4 targets, and P4-specific approaches for control plane operations. 
This survey contains brief descriptions of P4-related works but lacks an in-depth analysis and comparison of each category's results.</p><p>Kaur et al. <ref type="bibr">[20]</ref> present a survey on programmable data planes in the context of SDN. The authors provide a comparison of the related works, background on the P4 language, and the most relevant data plane architectures. They summarize and discuss works related to network monitoring, security, load balancing, and packet aggregation. The paper lacks inter- and intra-category comparisons and does not explore aspects that involve TCP performance. The paper concludes by providing a discussion of research gaps in programmable data planes.</p><p>Kfoury et al. <ref type="bibr">[21]</ref> presented an exhaustive survey on P4-programmable data planes. It includes a general overview of the P4 language and its applications. The authors thoroughly reviewed the literature to provide a taxonomy of applications developed with the P4 language, surveying, classifying, and discussing challenges collected from more than 200 articles. It also presents future perspectives and open research issues. However, the survey provides a general description of each topic and does not cover TCP-related aspects.</p><p>Paolucci et al. <ref type="bibr">[22]</ref> surveyed P4-programmable data plane use cases in the context of the 5G SDN and Network Function Virtualization (NFV) edge. The authors present the benefits of P4-programmable devices for edge networks. Then, they present use cases and discuss research directions related to traffic engineering, in-band telemetry, network slicing and multi-tenancy, cybersecurity, and 5G VNF offloading. The authors conclude by highlighting the benefits of P4 in edge/fog and industrial applications and argue that the flexibility of programmable data planes paves the way for massive adoption.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Scope of this Survey</head><p>To the best of the authors' knowledge, this is the first survey that focuses on the TCP enhancements using P4programmable data planes. This survey compels works that employ P4-programmable data planes to improve or solve current issues that involve congestion control algorithms, AQMs, and P4-programmable NICs. Additionally, it presents schemes that aim at diagnosing and mitigating performance issues in the network. Each section of this survey provides an overview of P4-based schemes, highlighting key features and comparing them with legacy approaches (i.e., inter-category comparison). Moreover, this paper also discusses the limitations that result from comparing the works in the same category (i.e., intracategory comparison). Finally, this survey analyzes the most relevant aspects of current P4-based schemes and presents the challenges and future trends.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Background in Programmable Data Planes</head><p>The emergence of programmable data planes results from the technology evolution of SDN, which provides flexibility, enables programmability, and facilitates network automation by centralizing the network intelligence into a controller. An SDN controller keeps a global view of network topology, making network management more efficient than legacy approaches. However, SDN devices do not allow data plane programmability. They only recognize a fixed set of header fields defined by a communication protocol such as OpenFlow <ref type="bibr">[26]</ref>.</p><p>With P4, the programmer defines the packet headers and the forwarding behavior. Furthermore, P4 programs can run on different platforms without modifying the runtime application (i.e., the control plane and the interface between control and data planes are target agnostic). Although the technology maturity and support for P4 devices can still be considered low compared to legacy and SDN devices, programmable data planes enable operators to gain more control over the network.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">P4 Architectures</head><p>Data plane architectures have been proposed to map P4 programs onto high-speed network ASICs. Examples of P4 architectures include the Protocol Independent Switching Architecture (PISA), the Portable Switch Architecture (PSA), and the Tofino Native Architecture (TNA). These architectures represent a forwarding model that can be implemented in various target devices such as software switches, hardware switches, and Network Interface Cards (NICs). The goal of a P4 architecture is to define a programming model to describe how packets are being processed. Although data plane algorithms can be characterized using programming languages such as Python or C++, they do not map well into high-speed ASICs. Thus, P4 has been proposed as a tool to describe the packet processing behavior.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.1.">The PISA Architecture</head><p>Figure <ref type="figure">2</ref> illustrates the PISA architecture. PISA is an abstraction of a packet processing model comprising the following elements:</p><p>&#8226; Programmable parser: extracts the packet header information for further processing and can be represented as a state machine.</p><p>&#8226; Programmable match-action pipeline: performs actions over the packet headers and intermediate results. A match-action stage includes multiple memory (e.g., tables, registers) and Arithmetic Logic Units (ALUs) to enable multiple lookups and actions simultaneously.</p><p>&#8226; Programmable deparser: recomposes the modified packet headers to emit them with the corresponding payload.</p><p>The relevance of using a multi-stage pipeline to process packets rather than a general-purpose single stage CPU is that processing network packets involves looking up into multiple header fields. Although a multi-stage pipeline could add processing latency, it can process multiple packets simultaneously. For instance, if a packet is being modified in stage 2 of the pipeline, a lookup is taking place in stage 1, thus enabling high-speed packet processing.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.2.">The PSA Architecture</head><p>P4-programmable data plane devices solve many of the SDN limitations. Over the last few years, a group of researchers developed a machine model for networking. The model is known as the Portable Switch Architecture (PSA) <ref type="bibr">[27,</ref><ref type="bibr">28]</ref> and describes the standard capabilities of a network switch. The PSA was designed with instruction sets optimized for network operations, and P4 is the highlevel language for programming PSA devices.</p><p>The PSA is a P4 architecture created and developed by the P4 consortium, more specifically the P4 working group <ref type="bibr">[25]</ref>. Figure <ref type="figure">3</ref> depicts the processing pipeline.  <ref type="bibr">[25]</ref>.</p><p>The PSA also provides P4 libraries known as externs, which extend data plane features by supporting functional constructs, namely counters, meters, registers, and a set of "packet paths" that enable P4 programs to control the flow of packets in a forwarding plane. The PSA processing model comprises the following elements: parser, control blocks, and deparser. The architecture also defines configurable fixed-function components and packet primitives such as:</p><p>&#8226; Sending a packet to a unicast port: It is involved in regular forwarding operations.</p><p>&#8226; Dropping a packet: It is used to respond to events that demand regulating congestion or queue length. P4 programs might induce packet drops to regulate TCP's sending rate, avoid TCP synchronization, and improve fairness, link utilization and other metrics.</p><p>&#8226; Sending a packet to a multicast group: It can be used to efficiently send packets to a group of hosts, which can facilitate the implementation of streaming applications, multicast VPN, Content-Delivery Networks (CDN), and other data broadcasting operations.</p><p>&#8226; Cloning: It creates a copy of the packet. 
The original packet is independent of the cloning operation. For example, packet cloning can ensure that all packets arrive at their destination over lossy links <ref type="bibr">[29]</ref>.</p><p>&#8226; Resubmission: It permits reprocessing the current packet by moving it from the end of the ingress pipeline to its beginning. Resubmission is typically applied for packet re-parsing purposes. A scheme might need to parse a header field multiple times before moving it to the packet buffer and replication engine.</p><p>&#8226; Recirculation: It restarts the ingress processing of a packet after finishing the egress pipeline. It is used when packet headers contain a large number of fields that cannot be parsed at once. This primitive can also be used to reprocess queuing statistics (e.g., in approaches that implement AQMs).</p><p>Currently, vendors do not provide the compiler backend that maps the PSA architecture onto their respective ASICs. Therefore, the PSA architecture remains a theoretical model. However, other packet processing architectures present variations of the arrangement of the blocks present in the PSA architecture. Examples of such variations are the V1model, the Simple SUME architecture, and the Tofino Native Architecture (TNA). The latter was developed to describe Intel Tofino packet processing pipelines and presents similarities to the PSA architecture.</p></div>
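The effect of the first four packet paths listed above (unicast, drop, multicast, clone) can be summarized as "how many copies leave the traffic manager, and on which ports". The Python sketch below models that dispatch; group membership, port numbers, and the verdict format are invented for illustration.

```python
# Toy dispatcher for PSA-style packet paths. Returns the list of
# (egress_port, packet) copies the traffic manager would emit.
# Group ids, ports, and the verdict dict format are hypothetical.

MULTICAST_GROUPS = {1: [2, 3, 4]}   # group id -> member ports

def apply_path(pkt, verdict):
    if verdict["action"] == "drop":
        return []                                   # packet is discarded
    if verdict["action"] == "unicast":
        return [(verdict["port"], pkt)]             # one copy, one port
    if verdict["action"] == "multicast":
        # Independent copy per group member.
        return [(p, dict(pkt)) for p in MULTICAST_GROUPS[verdict["group"]]]
    if verdict["action"] == "clone":
        # Original continues to its egress port; an independent copy goes
        # to a mirror port (e.g., toward a telemetry collector).
        return [(verdict["port"], pkt), (verdict["mirror_port"], dict(pkt))]
    raise ValueError(f"unknown action: {verdict['action']}")
```

Resubmission and recirculation are omitted because they re-enter the pipeline rather than produce egress copies; modeling them would require looping back into the ingress stage.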
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.3.">The Tofino Native Architecture (TNA)</head><p>TNA is an architecture model <ref type="bibr">[30]</ref> defined by Intel for their family of Tofino switching ASICs. TNA consists of an ingress parser, an ingress match-action control block, an ingress deparser, an egress parser, an egress match-action control block, and an egress deparser. The TNA architecture model is implemented in the Intel Tofino ASIC, which processes packets at a line rate independently of the complexity of the P4 program. This increased performance is enabled by a high degree of pipelining and parallelization. Recently, Intel released the second generation of Tofino ASICs that offers more flexibility, more performance (up to 12.9Tbps), more P4 resources, and less energy consumption <ref type="bibr">[31]</ref>. The performance and programmability of Tofino 2 aim at fulfilling the needs of large data centers and cloud service provider networks.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">P4-programmable Devices</head><p>P4-programmable devices (i.e., P4 targets) are packet processing devices in which their behavior is described by the P4-language <ref type="bibr">[6]</ref>. That means that P4-programmable devices execute an externally defined processing algorithm that differentiates them from configurable or flexible approaches <ref type="bibr">[32]</ref>. P4-programmable devices allow programmers to determine the processing algorithm implemented in a network and individual packet processing nodes (e.g., hardware and software switches, NICs, and network appliances).</p><p>Depending on the purpose of these devices, they usually present variations of the PSA, which includes switches, NICs, NetFPGAs, and virtualized data planes such as P4-OVS <ref type="bibr">[33]</ref> and P4rt-OVS <ref type="bibr">[34]</ref> compatible hardware and virtual targets.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">P4-language</head><p>P4 is a language for programming protocolindepenent packet processors. It provides an abstract model and constructs for programming the data plane optimized to describe the forwarding behavior. A P4enabled device is protocol-independent, which means that it is not tied to any specific network protocol. P4 programs allow the definition of packet headers and specifying packet parsing and processing. The P4 language is developed and standardized by the P4 language consortium <ref type="bibr">[36]</ref>, it is supported by various software-based and hardware-based target platforms, and it is widely applied in academia and industry.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4.">P4-programmable Devices and TCP performance</head><p>P4 allows the programmer to define custom packet processing schemes to improve TCP and other protocols' performance. Such features allow more granular control over the network by monitoring and mitigating issues that impact TCP performance, such as congestion, low link utilization, and unfairness. P4 is also used to describe the behavior of network elements, such as hardware and software switches, NICs, and FPGA-based prototypes. P4programmable devices have several unique features that differentiate them from legacy devices (i.e., fixed-function switches, non-programmable NICs). Table <ref type="table">2</ref> compares some features of P4-programmable devices and legacy/ non-programmable devices.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4.1.">Fine-grained Telemetry</head><p>Network telemetry is necessary for troubleshooting TCP performance problems. Programmable data planes supports In-band Network Telemetry, which allows the development of novel schemes that benefit from finegrained telemetry data to report and take actions in the presence of network flaws. On the other hand, in legacy networks, important information is often missed due to the coarse nature of traditional monitoring schemes (e.g., NetFlow <ref type="bibr">[23]</ref>, sFlow <ref type="bibr">[38]</ref> and IPFIX <ref type="bibr">[39]</ref>). Programmable data planes can also add telemetry data to packet headers and use the metadata to infer queue latency and detect microburst. P4-programmable devices enable custom packet processing. Each packet in the pipeline is accompanied by its metadata (e.g., queue length, timestamps from different processing stages) which can be used as inputs to determine the actions taken. Moreover, programmable data planes can calculate per-flow statistics at line rate and metrics such as packet loss, RTT, and queueing delay.</p><p>The P4 language consortium became part of the Open Networking Foundation <ref type="bibr">[35]</ref> since 2019.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4.2.">Traffic Isolation</head><p>Using P4 common constructs, programmers can establish traffic isolation by slicing the network according to policies (e.g., latency, drop, bandwidth allocation, etc.) and QoS requirements. P4-targets can handle this operation with little or no performance penalty. For instance, TCP congestion control algorithms presents different variants (e.g., CUBIC, Reno, BBR, DCTCP), leading to unfairness issues when several flows share a link. The flexibility provided by match-action tables permits the allocation of TCP flows in different queues. For instance, flows can be classified according to the type of congestion control, the duration of the flow (i.e., separating elephants and mice flows), or according to userdefined policies such as prioritizing flows of an application.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4.3.">Fast Reaction Time upon Congestion</head><p>Both programmable data planes and legacy devices process packets at line rate. With programmable data planes, developers can specify the precise actions to mitigate issues such as congestion, high latency, or poor link utilization. For instance, data planes can avoid congestion by applying fast rerouting. This approach consists of having a primary and multiple backup routes if the primary link cannot fulfill the QoS requirements imposed by the application. Rerouting can quickly adapt if the network condition changes or the operator wants to apply policy changes. On the other hand, operators must adapt predefined functions with legacy devices to mitigate the previously mentioned issues.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4.4.">Protocol and Application Offloading</head><p>P4 targets can handle protocol mechanisms previously managed by the server's CPU. For example, the threeway handshake can be handled by a P4-programmable NIC to offload the server's CPU. Additionally, cloud applications such as microservices, can be entirely implemented in a P4-programmable NIC. TCP is a transport protocol that is implemented in the end host. Therefore, offloading some operations to a Network Processor Unit (NPU) can distribute the workloads and improve performance.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4.5.">Microburst Detection</head><p>Switch's buffers aim to absorb the surge of packets, and thus minimize packet losses. However, legacy network devices usually do not provide good visibility of the queue dynamics, making it harder to detect, diagnose and fix network issues. Furthermore, in highspeed network environments (e.g., data centers), shortlived spikes in network traffic can lead to periods of high queue utilization. These network spikes are known as microbursts. Microbursts produce a sudden increase in the queue length within a sub-millisecond time scale exhausting the egress buffer. Microbursts detection is a functionality that is not available in legacy network devices but can be implemented using P4-programmable devices. P4 enables the development of fine-grained measurements schemes that can help to monitor the queue, detect microbursts and perform actions to counter them. The correct handling of microbursts in the network reduces the packet loss rate and contains congestion at prudent levels.</p><p>A Survey on TCP Enhancements using P4-programmable Devices</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Figure <ref type="figure">4</ref>: Taxonomy of TCP enhancements using P4-programmable devices</head><p>Congestion Control: End-host Implementation <ref type="bibr">[40]</ref><ref type="bibr">[41]</ref><ref type="bibr">[42]</ref><ref type="bibr">[43]</ref><ref type="bibr">[44]</ref><ref type="bibr">[45]</ref><ref type="bibr">[46]</ref>; In-network Implementation <ref type="bibr">[47]</ref><ref type="bibr">[48]</ref><ref type="bibr">[49]</ref><ref type="bibr">[50]</ref><ref type="bibr">[51]</ref><ref type="bibr">[52]</ref>. Active Queue Management: RFC-standardized Algorithms <ref type="bibr">[53]</ref><ref type="bibr">[54]</ref><ref type="bibr">[55]</ref><ref type="bibr">[56]</ref><ref type="bibr">[57]</ref><ref type="bibr">[58]</ref>; Custom Algorithms <ref type="bibr">[59]</ref><ref type="bibr">[60]</ref><ref type="bibr">[61]</ref><ref type="bibr">[62]</ref><ref type="bibr">[63]</ref><ref type="bibr">[64]</ref>. TCP Offloading: Protocol Offloading <ref type="bibr">[65]</ref><ref type="bibr">[66]</ref><ref type="bibr">[67]</ref><ref type="bibr">[68]</ref><ref type="bibr">[69]</ref>; Application Offloading <ref type="bibr">[70]</ref><ref type="bibr">[71]</ref><ref type="bibr">[72]</ref><ref type="bibr">[73]</ref><ref type="bibr">[74]</ref>. Network Measurements: Network Diagnosis <ref type="bibr">[75]</ref><ref type="bibr">[76]</ref><ref type="bibr">[77]</ref><ref type="bibr">[78]</ref><ref type="bibr">[79]</ref><ref type="bibr">[80]</ref><ref type="bibr">[81]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Methodology and Taxonomy</head><p>Figure <ref type="figure">4</ref> depicts the proposed taxonomy. The taxonomy was designed to cover relevant works on P4-programmable data planes where the scheme aims to improve TCP performance. This work categorizes the surveyed works based on various high-level disciplines.</p><p>The taxonomy separates categories so that a reader interested in a specific topic can move to the corresponding section and find the necessary background to understand its content. Each high-level category is further divided into sub-categories. For instance, works related to "End-host Implementation" fall under the "Congestion Control" category.</p><p>Each section has a table that summarizes relevant characteristics of each work discussed in that section. Following the literature review, each section presents an inter- and intra-category comparison. At the end of each section, relevant outcomes are discussed, highlighting the surveyed schemes' strengths and weaknesses. Figure <ref type="figure">5</ref> shows the distribution of the surveyed papers per year and category.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Congestion Control</head><p>Applications need a reliable rate control mechanism to send data with high performance and fully utilize the bandwidth while avoiding congestion. TCP congestion control aims to fulfill those demands by maximizing the rate and managing how many packets are injected into the network. Congestion control algorithms aim at efficiently using network resources and minimizing congestion escalation, thus preventing the network from collapsing. P4-programmable data planes provide the means to enhance existing congestion control algorithms and to create new ones. Methods such as isolating flows' dynamics by allocating them to different queues, using telemetry information to gain a better view of network events, and enhancing current TCP schemes that rely on congestion signals such as ECN can all be implemented with P4-programmable data planes.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.">Understanding TCP Issues</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.1.">Unfair Resource Allocation</head><p>Allocating resources to match the demand of applications implies that the network takes an active role in controlling congestion. The distributed nature of the network components (i.e., senders, middleboxes, and receivers) makes designing a resource allocation mechanism challenging. Thus, developers opted to implement the congestion control mechanism in the end host. This approach allows end hosts to decide the amount of data sent into the network and to use a control mechanism to manage congestion. However, some applications could experience an unfair resource allocation.</p><p>Congestion control mechanisms aim to ensure high throughput and low latency by controlling the sending rate. For instance, a congestion control mechanism that injects packets into the network to keep network links busy also increases the number of packets in routers' queues, resulting in increased queuing delay. With programmable data planes, developers can implement schemes that accurately report queueing metrics to end hosts without incurring excessive overhead.</p></div>
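The queueing metric a switch could report back to end hosts is easy to derive: a backlog of a given size drains at the egress link rate, so the queuing delay is roughly backlog divided by rate. The following sketch uses illustrative values.

```python
# Sketch: estimating the queuing delay a programmable switch could report.
# A queue of depth_bytes drains at link_rate_bps, so a newly enqueued packet
# waits roughly depth / rate. Values below are illustrative.

def queuing_delay_us(depth_bytes, link_rate_bps):
    """Estimated time (microseconds) for the current backlog to drain."""
    return depth_bytes * 8 / link_rate_bps * 1e6

# A 125 KB backlog on a 10 Gbps link corresponds to 100 us of queuing delay.
delay = queuing_delay_us(125_000, 10_000_000_000)
```

In an INT-style deployment, the switch would instead stamp the measured queue depth into packet metadata and let the receiver or sender perform this conversion.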
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.2.">Unfairness among Competing TCP Flows</head><p>Congestion is an undesirable event for the network: when a switch can no longer handle the incoming packets, TCP suffers performance degradation <ref type="bibr">[82]</ref>. Congestion can be prevented by restraining some nodes from sending data, thus releasing resources that other nodes can use. Another method to mitigate congestion consists of reducing the sending rate of end hosts to allow a fair resource allocation among competing participants. TCP congestion control algorithms attempt to ensure fairness. However, enforcing fairness among competing flows might require explicit information from the network, such as the number of flows using a congested link. A possible solution is implementing a reservation-based resource allocation scheme that can enforce rate policies based on information obtained from the network. P4-programmable data planes can support schemes that collect and process network metrics in a custom manner.</p></div>
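A reservation-based scheme that knows the flows on a congested link could apply a classic max-min fair allocation. The sketch below is a generic illustration of that computation, not an algorithm from any surveyed paper; capacities and demands are illustrative.

```python
# Sketch of max-min fair allocation over a congested link, the kind of policy
# a reservation-based scheme could enforce once the switch knows the per-flow
# demands. Demands and capacity below are illustrative (in Gbps).

def max_min_fair(capacity, demands):
    """Return the max-min fair allocation for the given demands."""
    alloc = [0.0] * len(demands)
    remaining = sorted(range(len(demands)), key=lambda i: demands[i])
    cap = capacity
    while remaining:
        share = cap / len(remaining)
        i = remaining[0]
        if demands[i] <= share:
            alloc[i] = demands[i]   # small demand fully satisfied
            cap -= demands[i]
            remaining.pop(0)
        else:
            for j in remaining:     # the rest split what is left equally
                alloc[j] = share
            return alloc
    return alloc

# Three flows on a 10 Gbps link: the small flow keeps its 2 Gbps demand,
# and the two large flows split the remaining 8 Gbps.
alloc = max_min_fair(10.0, [2.0, 9.0, 9.0])
```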
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.3.">Inaccurate Congestion Feedback</head><p>Receiving precise and timely feedback is essential for TCP to make good decisions when controlling the sending rate. Schemes that focus on improving feedback information are usually modeled as an observation control loop (i.e., they react after observing events such as packet losses or failures). Legacy networks are limited in what they can observe, which constrains the control decisions they can make <ref type="bibr">[83]</ref>. For example, delay-based congestion control algorithms treat delay as a congestion signal, which can be a poor indicator and trigger retransmissions uncorrelated with actual congestion. P4-programmable data planes can feed congestion control algorithms with precise information to perform more specific actions, such as reducing the sending rate based on metrics not exposed by standard protocols.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.4.">TCP Incast</head><p>TCP incast occurs when multiple servers concurrently send packets to a single client, blocking other servers from transmitting data (see Figure <ref type="figure">6</ref>) <ref type="bibr">[84]</ref>. In such a scenario, the bottleneck switch cannot handle all the incoming flows, and the client experiences much lower throughput than the link capacity. TCP incast events increase flows' queuing delay and reduce application-level throughput to well below the link bandwidth. With P4-programmable data planes, data center operators have a granular view of the events occurring in the network, providing the tools to mitigate or reduce the impact of TCP incast events. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.">End host Implementations</head><p>This section describes TCP congestion control implementations that require the end hosts to participate in regulating the sending rate.</p><p>Traditional TCP congestion control algorithms are mainly end-host implementations. The basic strategy of TCP congestion control consists of sending packets into the network and reacting to congestion events. This approach allows sending hosts to determine how many packets can safely traverse the network. TCP regulates the sending rate by changing the size of its congestion window according to a control function that dictates how the window varies in the presence of congestion. The most popular mechanism governing the congestion window size is Additive Increase-Multiplicative Decrease (AIMD). Figure <ref type="figure">7</ref>(a) shows the AIMD function used to regulate the sending rate.</p><p>Figure <ref type="figure">7</ref>(b) explains how the AIMD mechanism helps competing flows converge to fairness. The bandwidth allocated to flow 1 is on the x-axis, and that of flow 2 is on the y-axis. If TCP shares the bottleneck bandwidth equally between the two flows, the allocation falls along the fairness line that starts from the origin. When the sum of the allocations equals 100% of the bottleneck capacity, the allocation is efficient. At point A, flow 1 receives 100% of the capacity, and at point B, flow 2 receives 100% of the capacity. These solutions are not desirable, as they lead to starvation and unfairness. Assume that point p1 depicts the sending rates of senders 1 and 2 at a given time. The dynamics governed by the AIMD rule will eventually guide the two bandwidths to converge at the optimal point (R/2, R/2). Chiu et al. <ref type="bibr">[85]</ref> describe the reasons why TCP converges to a fair and efficient allocation. 
This convergence occurs independently of the starting point <ref type="bibr">[86]</ref>.</p><p>Not all congestion control algorithms are governed by the AIMD rule. For instance, BBRv1 <ref type="bibr">[87]</ref> is a rate-based algorithm: at any given time it sends data at a computed rate rather than reacting to packet losses. During the probe period (1 RTT in duration), the sender probes for additional bandwidth, sending at 125% of the bottleneck bandwidth. During the subsequent drain period (also 1 RTT), the sender reduces the rate to 75% of the bottleneck bandwidth, allowing any bottleneck queue to drain. Figure <ref type="figure">8</ref> shows the RTT and delivery rate as functions of the inflight data. Loss-based congestion control algorithms operate at the right edge of the bandwidth-limited region, using the full bottleneck bandwidth at the cost of high delay and frequent packet loss. At the left edge lies Kleinrock's operating point <ref type="bibr">[88]</ref>, where the delivery rate is maximal and the RTT is minimal. BBRv1 operates at this point. ECN is a TCP/IP mechanism that notifies the sender of congestion through explicit feedback before congestion escalates. Dropping a packet certainly works as a congestion signal for traditional congestion control algorithms (e.g., Reno, CUBIC), which improves the performance of long-lived bulk data transfers. However, real-time applications (e.g., videoconferencing, video streaming, industrial control) are affected by the resulting delay and losses. For such low-bandwidth, delay-sensitive TCP traffic, packet drops and retransmissions can cause noticeable delays for the user. These delays can escalate for some connections due to the coarse granularity of the TCP timer that delays the source's retransmission of the packet <ref type="bibr">[89]</ref>. 
Therefore, notifying the sender through an explicit notification signal is more appropriate for improving such applications' performance. Figure <ref type="figure">9</ref> describes the behavior of ECN. When the number of packets buffered in a switch exceeds a certain threshold, the Congestion Experienced (CE) codepoint in the IP header is set. After the marked packets reach the receiver, the ECN field in the TCP header notifies the sender that the link is experiencing congestion. Consequently, the sender can slow down the sending rate before packets are dropped.</p></div>
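The marking decision described above can be sketched in a few lines; the codepoint values follow RFC 3168, while the threshold value is illustrative.

```python
# Sketch of the ECN marking decision: when queue occupancy exceeds a
# configured threshold, the switch sets the CE (Congestion Experienced)
# codepoint instead of dropping. Threshold value is illustrative.

ECT0, CE = 0b10, 0b11  # ECN codepoints in the IP header (RFC 3168)

def ecn_mark(ecn_bits, queue_depth, threshold):
    """Return the ECN codepoint to write on the departing packet."""
    if queue_depth > threshold and ecn_bits in (0b01, 0b10):  # ECT(1)/ECT(0)
        return CE      # ECN-capable flow: mark instead of dropping
    return ecn_bits    # below threshold, or flow is not ECN-capable

marked = ecn_mark(ECT0, queue_depth=120, threshold=100)
```

Note that a non-ECN-capable packet (codepoint 00) is left untouched; a real switch would fall back to dropping it once the queue fills.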
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.1.">Literature Review</head><p>Feldman et al. <ref type="bibr">[40]</ref> proposed a system called Network-assisted Congestion Feedback (NCF). It controls the rate by sending an explicit congestion notification to the sender using NACKs. The system ensures a fair sharing of resources across both mice and elephant flows, requiring only short queues. The authors consider that elephant flows are responsible for congestion in the network and thus must be separated from mice flows. NCF separates them, resulting in higher throughput and reducing the impact of TCP incast. NCF works in both data centers and Wide Area Networks (WANs). Figure <ref type="figure">10</ref> shows an overview of the NCF mechanism. The system classifies the traffic into two main categories, elephant and mice flows, and applies congestion control only to elephant flows; mice flows are not subject to any congestion control. NCF identifies elephant flows by constantly updating rolling sketches (i.e., counters). Rolling sketches are P4 data structures that, in this use case, help to dynamically identify elephants without keeping state for all flows.</p><p>Figure <ref type="figure">9</ref>: ECN mechanism. The switch sets the ECN field in the packet header after congestion surpasses a predefined threshold, notifying the sender that the network is experiencing congestion before the switch starts dropping packets.</p><p>Handley et al. <ref type="bibr">[41]</ref> presented Novel Datacenter Protocol (NDP), aiming to achieve low delay and high throughput simultaneously. NDP avoids the slow start phase and immediately sends data at the maximum rate. The method behind this scheme is called per-packet multipath load balancing, which avoids core network congestion at the expense of reordering. It also reduces unfairness in incast situations. Incast increases the queuing delay of flows and decreases application-level throughput to far below the link bandwidth. To prevent congestion, NDP uses per-packet multipath routing and trims the packet's payload when the switch's queue is filling. Although the scheme produces some packet reordering, the receiver takes advantage of the metadata to achieve very low latency for short flows, with minimal interference between flows destined for different destinations. 
The authors claim that NDP improves the FCT of short flows in comparison to DCTCP and DCQCN. Under a heavily loaded network with a small switch buffer size, NDP utilizes over 95% of the maximum network capacity and lowers the delay during TCP incast. Kfoury et al. <ref type="bibr">[42]</ref> proposed a novel scheme based on programmable data plane switches. In this scheme, the programmable switch notifies the sender of the right TCP pacing rate, resulting in a fair share of the bandwidth. The switch embeds the pacing rate in the IP options header, making the scheme compatible with legacy devices. Figure <ref type="figure">11</ref> shows the high-level architecture of the system. 1) To initiate a TCP connection, an end host inserts a custom header in place of the IP options field during the 3-way handshake, while the switches parse the received packets' headers and check whether the end host is requesting a new flow. 2) Each switch stores the latest state S of the network and inserts its bottleneck link capacity and the current total number of hosts into the SYN-ACK message. 3) The switches then broadcast the updated state S' to all previously connected hosts; when the hosts receive the SYN-ACK message, they adjust their transmission rates according to the IP options header information. 4) Upon receiving a TCP packet on the port used for initiating its TCP connection, the end host parses the custom header and adjusts its transmission rate by applying pacing with a rate derived from the custom header field values. The processing overhead is minimal, as custom packets are only generated when a flow joins or leaves the network. With input from switches, end devices are dynamically notified to adjust the pacing rate. The scheme increases throughput and enhances fairness.</p><p>Li et al. <ref type="bibr">[43]</ref> presented a congestion control algorithm called High-Precision Congestion Control (HPCC). 
The congestion control is based on Multiplicative-Increase Multiplicative-Decrease (MIMD), which makes it fast, stable, and responsive in the presence of packet losses. It simultaneously achieves three principal goals in large-scale high-speed networks: ultra-low latency, high bandwidth, and increased stability. It is easy to deploy and integrate with legacy networks, requiring only the activation of standard INT features. Results show a reduction of the FCT by at least a factor of four, and congestion is reduced compared to similar congestion control algorithms used in data centers such as DCTCP <ref type="bibr">[92]</ref>, TIMELY <ref type="bibr">[93]</ref>, and DCQCN <ref type="bibr">[94]</ref>. Figure <ref type="figure">12</ref> illustrates how HPCC works. When a packet arrives at the switch's ingress interface, the switch adds an INT header to the packet. When the TCP data packet reaches the egress port of an edge switch, the INT information is piggybacked onto the TCP packet that carries the ACK. The end hosts then use this information to adjust the sending rate on every ACK. Shahzan et al. <ref type="bibr">[44]</ref> optimized traditional ECN-based congestion notification systems, which rely on the receiver to indicate congestion. Usually, ECN-based systems wait for an RTT before the sender reacts to congestion. The authors state that this can be an issue in networks with a high bandwidth-delay product (BDP). Therefore, they propose an enhanced ECN mechanism for the early detection of congestion using P4 devices, called Enhanced Explicit Congestion Notification (EECN). In this approach, the sender does not have to wait for the receiver to indicate congestion; instead, the switches notify the end hosts. Experimental results show that the system does not generate any additional traffic in the network and reduces the notification time compared to traditional ECN-based schemes.</p><p>Kang et al. 
<ref type="bibr">[90]</ref> presented Proactive Congestion Notification (PCN), which uses Distributed Deep Learning (DDL) techniques to avoid congestion. The authors leverage DDL to collect training data from distributed workers (i.e., end hosts), thus reducing the total training time. However, experiments demonstrated that the DDL architecture induces bursty traffic; therefore, the authors proposed PCN to prevent congestion. PCN proactively regulates the switch's queue length to prepare it for arriving bursty traffic. Results show that PCN improves the throughput of DDL traffic by 72% on average. Laraba et al. <ref type="bibr">[91]</ref> presented an approach for detecting and mitigating the impact of TCP misbehaviors. The scheme leverages programmable data planes to mitigate optimistic ACK attacks <ref type="bibr">[95]</ref> and ECN abuses <ref type="bibr">[96]</ref>. The authors propose a security monitoring system based on an Extended Finite State Machine (EFSM) to monitor stateful protocols in the data plane. Results show that the system can detect and discard optimistic ACKs at the switch before they reach the server. Consequently, legitimate TCP connections do not experience performance degradation. With respect to ECN abuses, the authors report that the scheme achieves a fair bandwidth share during an attack. They evaluated their approach in scenarios where ECN reaction is not enabled and where RED with ECN is configured in the switch.</p></div>
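The INT-driven feedback loop used by schemes such as HPCC can be illustrated with a heavily simplified sketch: each ACK carries per-hop telemetry, the sender estimates utilization at the most loaded hop, and scales its window toward a target utilization. This is a didactic reduction under assumed formulas and parameter values, not a faithful reimplementation of the published algorithm.

```python
# Simplified sketch of an INT-driven multiplicative rate update in the spirit
# of HPCC: per-hop telemetry (queue bytes, tx rate, capacity) arrives in the
# ACK, and the window is scaled toward a target utilization ETA. The
# utilization formula and constants here are illustrative assumptions.

ETA = 0.95  # target link utilization

def bottleneck_utilization(hops, base_rtt_s):
    """hops: list of (queue_bytes, tx_rate_bps, capacity_bps) from INT."""
    def u(queue_bytes, tx_rate_bps, capacity_bps):
        # Queued bytes count against one capacity*RTT worth of data,
        # plus the hop's current transmit share of its capacity.
        return (queue_bytes * 8 / (capacity_bps * base_rtt_s)
                + tx_rate_bps / capacity_bps)
    return max(u(*hop) for hop in hops)   # most congested hop dominates

def update_window(window_bytes, hops, base_rtt_s):
    """Multiplicative update: shrink when the bottleneck runs above ETA."""
    return window_bytes * ETA / bottleneck_utilization(hops, base_rtt_s)

# One congested hop: 100 KB queued on a 10 Gbps link, 9 Gbps tx, 100 us RTT.
w = update_window(150_000, [(100_000, 9e9, 10e9)], 100e-6)
```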
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.2.">End hosts Implementations Comparison, Discussions, and Limitations</head><p>Table <ref type="table">3</ref> compares the schemes described in the previous section. In low-latency environments, HPCC incurs a high per-packet bit overhead. PINT handles concurrent queries while bounding the per-packet bit overhead, using each packet for a subset of queries with cumulative overhead according to predefined policies. PINT thus mitigates the high bit overhead produced by schemes such as HPCC, achieving similar performance while retaining good visibility. The works described in the previous section report the FCT as a metric to evaluate the performance of the schemes. The FCT is a key metric that describes how long it takes to complete a data transfer. <ref type="bibr">Dukkipati and McKeown [97]</ref> demonstrated that high throughput does not necessarily lead to a better FCT; on the contrary, the FCT can increase due to several factors (e.g., high RTT, bloated buffers, flows taking different paths). Schemes such as EECN do not require any change in the end host and thus can be easily integrated with current deployments. Although ECN does not provide the visibility of INT (i.e., it only notifies whether there is congestion), enabling explicit network feedback improves metrics such as the FCT, reduces packet losses, and ensures better fairness among competing flows.</p><p>NDP and NCF use NACKs as congestion feedback but with different purposes. NDP avoids congestion by applying per-packet multipath load balancing. NCF dynamically tracks elephant flows and restricts them, using NACKs to notify the sender. Another difference is that NCF can be used both in data center and Internet-wide deployments, although there are no implementation results to evaluate the solution's effectiveness. Moreover, NCF addresses the incast problem by separating elephant from mice flows. 
NDP's principal limitations are its high retransmission rate and the moderately expensive resources required to implement the protocol. NCF sends network congestion information through the NACKs, which might produce unnecessary overhead, making it less attractive to operators. Both NDP and NCF require operators to manually tune predefined parameters such as thresholds, queue size, and maximum latency. Both also perform traffic isolation; the difference is that NDP requires end-host modification to interact with a custom header, while NCF does not. NDP has hardware implementations and customizations on different targets, such as in <ref type="bibr">[74]</ref>. NCF, on the other hand, is a proposal that requires further testing on P4 targets.</p><p>The method in <ref type="bibr">[42]</ref> uses TCP pacing. Pacing smooths throughput variations and traffic burstiness, maximizing link utilization and minimizing queuing delays. A drawback of this method is that it produces unnecessary overhead when the number of flows is high. Also, the pacing approach by itself often performs significantly worse than regular TCP due to synchronization with packet losses <ref type="bibr">[98]</ref>. Therefore, it is mainly suitable for deployments that involve a small number of large flows (e.g., Science Demilitarized Zone (Science DMZ) networks <ref type="bibr">[99]</ref>).</p></div>
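NCF's rolling-sketch approach to elephant detection can be illustrated with a count-min-style structure. The sketch below is a didactic Python rendering: the dimensions, hash function, flow identifiers, and elephant threshold are all illustrative assumptions, not values from the paper.

```python
# Sketch of elephant-flow identification with a count-min-style structure,
# in the spirit of NCF's rolling sketches. Sizes, hashes, and the elephant
# threshold are illustrative, not taken from the surveyed paper.

WIDTH, DEPTH = 64, 3
counters = [[0] * WIDTH for _ in range(DEPTH)]

def _bucket(row, flow_id):
    return hash((row, flow_id)) % WIDTH

def count_packet(flow_id):
    """Update the sketch for one packet of flow_id."""
    for row in range(DEPTH):
        counters[row][_bucket(row, flow_id)] += 1

def is_elephant(flow_id, threshold=100):
    """Min over rows bounds the true count from above (count-min estimate)."""
    est = min(counters[row][_bucket(row, flow_id)] for row in range(DEPTH))
    return est >= threshold

for _ in range(150):
    count_packet("10.0.0.1:443->10.0.0.2:5000")  # heavy (elephant) flow
count_packet("10.0.0.3:80->10.0.0.4:6000")       # mouse flow
```

The memory cost is fixed (WIDTH × DEPTH counters) regardless of the number of flows, which is exactly why such structures suit P4 register arrays.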
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.3.">Comparison with Legacy Approaches</head><p>Table <ref type="table">4</ref> compares the main characteristics of end-host congestion control implementations against legacy schemes. P4 helps developers improve TCP feedback by including standard (e.g., INT, ECN) and custom (e.g., pacing rate in the IP options) information in packets. Legacy schemes, on the other hand, only work with standard headers, which makes them less flexible.</p><p>ECN approaches experience moderate convergence times, as they rely on a single bit to notify congestion without providing any information about the level of congestion or the location of the congested link. In contrast, programmable switches can give richer feedback to indicate congestion (e.g., INT metadata in HPCC), resulting in higher detection accuracy, near-zero queueing delays, and faster convergence. However, the distributed nature of traditional congestion control algorithms allows them to operate without modifying the network infrastructure or the end hosts.</p></div>
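The custom "pacing rate in the IP options" feedback mentioned above (the scheme of [42]) boils down to an equal-share computation at the end host. The sketch below assumes the switch advertises its bottleneck capacity and active host count in the custom header; field names are illustrative.

```python
# Sketch of the equal-share pacing computation in the spirit of [42]: the
# switch advertises its bottleneck link capacity and the current number of
# active hosts in a custom header, and each sender paces at an equal share.
# Field names and values are illustrative.

def pacing_rate_bps(bottleneck_capacity_bps, num_active_hosts):
    """Equal-share pacing rate derived from the custom header fields."""
    return bottleneck_capacity_bps / max(num_active_hosts, 1)

# Four hosts sharing a 10 Gbps bottleneck each pace at 2.5 Gbps.
rate = pacing_rate_bps(10e9, 4)
```

On Linux, such a rate could be applied per socket with `SO_MAX_PACING_RATE` or via the `fq` qdisc; the switch-side header format is the scheme-specific part.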
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3.">In-network Implementations</head><p>This category describes implementations where P4 switches are used to modify the dynamics of the traffic and perform actions to reduce congestion (e.g., rerouting).</p><p>Many authors have proposed schemes to reduce congestion in response to the continuously increasing demand for higher bandwidth, lower latency, and higher reliability. P4 switches facilitate the implementation of such schemes, giving the programmer complete control over the scheme's design. In high-speed networks, developing a good congestion control algorithm depends on how timely and accurate the actions to control congestion are. Therefore, a precise observation loop must complement the control loop <ref type="bibr">[83]</ref>. P4 switches provide the tools to extend congestion control algorithms and better understand the events in the network. Traditional congestion control identifies losses as a signal of congestion. However, other feedback signals contribute to a better description of the congestion event (e.g., ECN, INT headers, custom headers). With P4-programmable data planes, developers can define how these signals are interpreted and create custom mechanisms to react to a broader range of congestion events, including schemes oriented toward lowering delays, maximizing throughput, and mitigating the incast problem (e.g., <ref type="bibr">[92,</ref><ref type="bibr">100]</ref>).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3.1.">Literature Review</head><p>Turkovic and Kuipers <ref type="bibr">[47]</ref> proposed P4air, a method that classifies traffic based on its congestion control algorithm. The flows are allocated to different queues to isolate them from the dynamics of other congestion control algorithms, thereby improving inter-protocol fairness and the utilization of network resources in the presence of a large number of flows. In a previous study, Turkovic et al. <ref type="bibr">[101]</ref> observed the interaction among different congestion control algorithms, dividing them into categories: loss-based, delay-based, and hybrid algorithms. The results show that a good TCP congestion control algorithm alone is not enough to achieve low latency, high performance, and optimal link utilization. The authors recommend grouping flows according to their congestion control algorithm to ensure the expected resource optimization. The system actively monitors the characteristics of the flows that pass through a switch. The switch then groups the flows based on the type of congestion control (e.g., loss-based, delay-based, loss-delay, hybrid). Figure <ref type="figure">13</ref> illustrates the components of the scheme, which is divided into two subsystems: 1) a fingerprinting or identification process that identifies the congestion control algorithm from the metadata of every packet, such as timestamps from different processing stages and queue depth; and 2) a stateful process that tracks the reactions of the flows to certain events, such as a loss or an increase in queue size. The scheme aims at enforcing fairness by combining these two subsystems. 
P4air identifies the congestion control algorithm by analyzing packets' metadata, performing actions (i.e., dropping a packet, delaying a packet, or changing the advertised window in the TCP header), and observing the reaction of the TCP flows. The system uses recirculation to reallocate packets after they have been enqueued, placing them in a different queue if needed.</p><p>Apostolaki et al. <ref type="bibr">[49]</ref> presented a dynamic buffer sizing technique called Flow-Aware Buffer (FAB) sharing. The system addresses the problem of having multiple flows allocated to different egress ports. The algorithm assigns flows according to their destination IP addresses to make more efficient use of the ASIC's memory. Moreover, it dynamically adjusts the buffer size for each output queue, which considerably reduces packet loss and latency and improves the FCT.</p><p>The authors remark that allocating buffers across ports is not trivial; the choice of resource sharing algorithm can alter metrics such as throughput (e.g., when TCP incast occurs). Traditionally, routing devices allocate buffer space considering the available memory, but they do not consider the traffic type. Results show an FCT improved by an order of magnitude compared to conventional buffer management techniques such as Tail-Drop (TD).</p><p>In another work, Turkovic et al. <ref type="bibr">[50]</ref> developed a congestion avoidance method that uses packets' metadata to track the processing and queuing delays of latency-sensitive flows. The system is called Fast-TCP, and its main idea is rerouting traffic across a backup path when congestion occurs. Figure <ref type="figure">14</ref> depicts a scenario where the link between S1 and S3 is congested. S1 already has a backup path to reroute all the traffic to the same destination in such a situation. The authors created a P4 program that detects congestion and uses digests to notify the participants when the delay surpasses a certain threshold. 
Experiments conducted on real hardware show decreased average and maximum delay and jitter compared to legacy approaches. This system aims to improve the QoS of time-sensitive applications such as audio, video, and haptics; the last consists of transporting the sense of touch over the Internet. Such applications require very low latency (&#8764;1 ms), low jitter, high bandwidth, and high reliability.</p><p>Benet et al. <ref type="bibr">[46]</ref> presented a scheme to load-balance Multipath TCP (MPTCP) subflows. The scheme, called MP-HULA, associates MPTCP subflows with MPTCP connections by parsing MPTCP header fields. MP-HULA addresses the particular case of load-balancing MPTCP traffic using the HULA data plane algorithm <ref type="bibr">[102]</ref>. MP-HULA comprises an adaptive probing mechanism that keeps congestion state for the best k next hops per destination. The system then uses the congestion state to route different MPTCP subflowlets along different paths. Flowlets are packet bursts used to split TCP paths without causing packet reordering. Results show that MP-HULA reduces the FCT by 1.7&#215; compared to HULA with MPTCP and uncoupled congestion control. Compared to TCP, MPTCP combined with MP-HULA reduces the FCT by 2.9&#215; to 3.4&#215;.</p><p>Geng et al. <ref type="bibr">[51]</ref> proposed a congestion control protocol that can alleviate the problems caused by the Priority-based Flow Control (PFC) mechanism. PFC is a link-level flow control mechanism that allows operators to selectively pause traffic according to its class, aiming at preventing packet losses. The mechanism consists of pausing the transmission when a queue exceeds a threshold; once the queue length is restored, the transmission resumes. Figure <ref type="figure">15</ref> explains the Head-of-Line (HOL) blocking issue. Consider switch S3. 
When congestion occurs (i.e., the queue size of the ingress port exceeds a threshold), the PFC algorithm pauses the transmission by sending a pause notification to switch S2 and consequently to switch S1, producing a cascading effect. Because PFC stops all traffic from host h1 to host h3, traffic from host h2 to host h4 is also paused, even though the port that connects switch S3 to host h4 is not congested. Thus, the cascade effect of the pause frames can also escalate congestion.</p><p>This problem usually occurs in data centers when applications that involve large-scale online data-intensive services use PFC and Remote Direct Memory Access (RDMA). The overhead produced by PFC can lead to performance degradation, the spread of congestion, and even a permanent loss of communication. The authors proposed a system called P4QCN to mitigate the HOL blocking issue produced by PFC. P4QCN is a congestion control scheme that uses programmable switches to implement an improved version of Quantized Congestion Notification (QCN), a flow-level, rate-based congestion control mechanism. The authors implemented the QCN standard in P4 to make it compatible with IP-routed networks. Results show that P4QCN achieves the expected performance in terms of latency and throughput while mitigating the cascade effect produced by the HOL blocking issue, reducing the packet loss rate and latency when congestion occurs. P4QCN also achieves better bandwidth utilization than PFC under the same conditions (e.g., same thresholds, packet losses).</p><p>Jinhao and Yue <ref type="bibr">[52]</ref> proposed a novel mechanism to regulate the size of the advertised TCP congestion window, arguing that its current calculation can be improved. Another surveyed work presented PL2, a rack-scale lossless scheme that achieves low latency and high throughput independently of the workload and the transport protocol. 
PL2 reduces end-to-end latency by proactively avoiding losses, implementing a protocol in which end hosts schedule a timeslot to start a data transfer. PL2 is transport-agnostic and capable of accommodating flows at 100 Gbps line rates. Additionally, the scheme does not require any prior knowledge of the workload characteristics, given that it is hard to anticipate traffic and workload patterns. Figure 16 illustrates PL2's behavior. In this example, the sender is sending a packet burst to the receiver. 1) The sender starts by sending an RSV (i.e., reserve) message to the switch to schedule a reservation. 2) The switch keeps timeslot reservations for every host's input and output ports. It reserves the earliest available timeslot on the input port of the sender (i.e., T=4) and the receiver's output port (T=5).</p><p>3) The switch notifies the sender of the available timeslots with a GRT (i.e., grant) message.</p><p>4) The sender then transmits at the maximum of the two timeslots indicated in the GRT, T=5. The authors evaluated PL2 using several transport protocols. Results show that TCP flows experience low latency (i.e., &lt;25 us) and high throughput even in noisy scenarios that involve background traffic.</p></div>
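The reservation handshake in Figure 16 can be sketched as follows. The class and method names are illustrative, not taken from the PL2 paper; the switch tracks the next free timeslot on each port and the sender transmits at the later of the two granted slots.

```python
# Sketch of a PL2-style timeslot reservation (illustrative names, not the
# actual PL2 implementation). The switch tracks the earliest free timeslot
# on each host's input and output port.

class TimeslotSwitch:
    def __init__(self):
        self.next_free = {}  # port -> earliest available timeslot

    def reserve(self, sender_port, receiver_port):
        """Handle an RSV message: reserve one slot on each port and
        return the GRT (grant) carrying both reserved timeslots."""
        s = self.next_free.get(sender_port, 0)
        r = self.next_free.get(receiver_port, 0)
        self.next_free[sender_port] = s + 1
        self.next_free[receiver_port] = r + 1
        return s, r  # the sender transmits at max(s, r)

switch = TimeslotSwitch()
# Pre-load the state from the paper's example: the sender's input port is
# free at T=4 and the receiver's output port at T=5.
switch.next_free["h1-in"] = 4
switch.next_free["h2-out"] = 5
grant = switch.reserve("h1-in", "h2-out")
transmit_at = max(grant)  # the sender transmits at T=5, as in Figure 16
```

Transmitting at the maximum of the two slots guarantees that neither the sender's uplink nor the receiver's downlink is double-booked, which is how PL2 avoids queue build-up.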
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3.2.">In-network Implementation Comparison, Discussions, and Limitations</head><p>Table <ref type="table">5</ref> compares the schemes described in the previous section. P4air achieves significant improvements in fairness over current solutions by applying traffic separation. However, it requires allocating a queue for each congestion control algorithm group (e.g., loss-based (CUBIC), delay-based (TCP Vegas), hybrid (BBR), etc.). The latter might pose an issue due to the limited number of queues in switches, which are mainly used for QoS applications. A similar solution is presented in <ref type="bibr">[49]</ref>: FAB separates the traffic into multiple queues according to the destination IP. In this way, the system performs per-destination traffic isolation.</p><p>The method proposed in <ref type="bibr">[48]</ref> classifies the traffic according to its QoS requirements. Experimental results show that the proposed design effectively limits the maximum allowed rate and guarantees each flow's minimum bandwidth. However, this approach does not consider more than three traffic categories (i.e., guaranteed, best effort, drop), limiting flexibility as network traffic becomes more heterogeneous.</p><p>P4QCN presents a congestion feedback mechanism that enables switches to check the egress port for congestion before forwarding packets. Fast-TCP reacts to congestion by rerouting traffic through a backup path. In comparison to Fast-TCP, P4QCN reacts faster, before congestion escalates. Both schemes require establishing predefined parameters such as thresholds, maximum queue length, congestion sensitivity, etc. However, Fast-TCP suffers from some performance penalties due to disabled caching and the lack of information about the queueing delay.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3.3.">Comparison with Legacy Approaches</head><p>Table <ref type="table">6</ref> compares the main characteristics of P4-based and legacy in-network congestion control schemes. P4-programmable data planes provide the tools to implement schemes that use various network metrics (e.g., queue length, packet losses, sequence and acknowledgment number pairs, and metrics that can be inferred from the packet's header). Such inputs allow the programmer to have a better view of the network and perform more accurate actions to reduce congestion. Legacy devices are restricted to fixed functions, making it difficult to arbitrarily parse and modify TCP headers (e.g., AAW) or to implement custom congestion control mechanisms. Legacy devices can perform traffic separation using QoS features defined by the vendor; however, this requires configuring QoS policies, which are limited to a few categories. For example, separating real-time traffic such as video streaming from bulk data transfers can be performed using legacy devices. On the other hand, separating traffic according to its duration (e.g., splitting elephant from mice flows) or according to its congestion control algorithm is a more challenging task on legacy devices.</p><p>Moreover, the per-port buffer allocation observed in legacy devices does not consider the type of traffic. This behavior resembles a dynamic buffer management technique that limits the buffer each queue can use to a fraction of the unused buffer size. With programmable data planes, the user can create a program that splits buffer cells among the ports of a device while considering all ports' utilization and the expected benefit of buffering for each flow.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.4.">Summary and Lessons Learned</head><p>According to Li <ref type="bibr">[83]</ref>, a well-designed congestion control mechanism must guarantee adequate link utilization, low latency, near-zero queueing, robustness, and fast convergence. The capability of P4-programmable devices to perform packet processing at line rate enables the design, testing, and evaluation of precise congestion control algorithms. Moreover, programmable switches allow the development of custom congestion control algorithms that meet specific requirements such as isolating the dynamics of TCP flows, reducing latency, achieving better link utilization, and improving FCT. Congestion control schemes that use telemetry information to regulate the sending rate have demonstrated reaction times at the sub-millisecond level, making them suitable for high-speed network environments such as Science DMZs and data centers.</p><p>A limitation of INT-based schemes is that appending telemetry information reduces the payload size. New schemes aim at reducing such overhead <ref type="bibr">[45]</ref>. Future work should focus on incremental deployment (i.e., inter-networking with legacy schemes) to allow more experimentation and mitigate current network issues. Schemes that perform traffic isolation present favorable results in managing congestion. Such schemes address the case where competing flows have different congestion control algorithms, which can lead to unfairness problems. For instance, if a flow yields throughput (e.g., due to a congestion signal), another flow whose congestion control adapts faster can occupy the freed resources, leading to an unfair allocation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Active Queue Management (AQM)</head><p>The standard TCP strategy consists of regulating the sending rate once congestion occurs. Most traditional TCP congestion control mechanisms consider packet loss as a signal of congestion. A practical alternative consists of avoiding congestion by informing the sending host that the network is close to experiencing congestion. This approach is implemented by AQMs, which consider that network switches are in a key position to detect and manage congestion (i.e., they know how much congestion the network is experiencing <ref type="bibr">[105]</ref>). Although incorporating new features into the switch was not the preferred way to introduce improvements, programmable data planes can change this paradigm.</p><p>AQM algorithms cooperate with end-host protocols such as TCP to manage congestion, increasing network utilization while limiting packet loss and delay. Nadas et al. <ref type="bibr">[106]</ref> demonstrated that delegating all congestion management to the end hosts' congestion control algorithm is not suitable for heterogeneous networks. Nevertheless, congestion control algorithms are typically designed without considering feedback from the routers. Moreover, TCP congestion control algorithms assume that the switch implements a FIFO queueing discipline and TD as the dropping policy. AQM algorithms operate within the routers along the path by taking actions such as dropping packets, allocating flows to different queues, and managing the buffer space used to store packets. The cooperation between TCP and AQM is therefore the best treatment for congestion; however, TCP congestion control must be able to interpret the feedback information provided by the switches.</p><p>6.1. Challenges in Designing AQM Schemes</p><p>6.1.1. Understanding Queues</p><p>Understanding the cause and meaning of queues is an important step toward developing an effective AQM.
Network buffers are employed to smooth packet bursts. In other words, they act as shock absorbers that convert bursty packet arrivals into smooth, steady departures. Queues appear in buffers due to short-term mismatches between traffic arrival and departure rates; when the mismatch persists, they become standing queues that cannot dissipate. The persistent problem of a long-standing queue is also known as the bufferbloat problem. Standing queues create considerable delays without improving the throughput. Usually, such an event is interpreted as congestion by traditional congestion control algorithms. BBRv1 and BBRv2 maintain a low queueing delay; however, in a recent study, Mishra et al. <ref type="bibr">[107]</ref> measured that Internet traffic is dominated by CUBIC (i.e., 36% of flows), which is known to produce bufferbloat <ref type="bibr">[108]</ref>.</p></div>
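The difference between a transient burst (which a buffer should absorb) and a standing queue (which only adds delay) can be illustrated with a toy fluid model; the rates and units below are arbitrary and chosen only for illustration.

```python
def simulate_queue(arrivals, service_rate):
    """Return the queue length after each tick for a FIFO fluid model
    with per-tick arrivals and a fixed per-tick service rate."""
    q, history = 0, []
    for a in arrivals:
        q = max(0, q + a - service_rate)
        history.append(q)
    return history

# A short burst above the 10-unit/tick service rate drains back to zero:
burst = simulate_queue([30, 0, 0, 0, 0], service_rate=10)
# A persistent mismatch (12 > 10) builds a standing queue that never drains:
standing = simulate_queue([12] * 5, service_rate=10)
```

The first case is the buffer doing its job; the second is bufferbloat in miniature: the queue (and therefore the delay) grows without any gain in throughput, since the link is already fully utilized.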
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.1.2.">Dropping the Right Packets</head><p>Undersized buffers do not solve the bufferbloat issue <ref type="bibr">[109,</ref><ref type="bibr">110]</ref>; instead, they increase the packet loss rate. AQM algorithms address the issue by discriminating which packets are suitable to be dropped, thus making more efficient use of the switch's buffer. According to Nichols and Jacobson <ref type="bibr">[111]</ref>, high RTTs (i.e., &#8764;1 second) are caused by bloated buffers and are not related to the path characteristics. Therefore, designers should consider these variations to design a robust algorithm that can dynamically manage switch buffers. For example, the TD queueing discipline drops packets only as a consequence of a full queue. AQMs such as RED <ref type="bibr">[8]</ref>, ARED <ref type="bibr">[112]</ref>, and WRED <ref type="bibr">[113]</ref> require constant tuning of their parameters as network conditions change. Novel AQMs such as CoDel <ref type="bibr">[111]</ref> and CAKE <ref type="bibr">[11]</ref> do not require tuning any parameters and were designed to adapt to changing network conditions. For instance, CoDel drops packets whose queueing delay has exceeded the standing-queue target (default target &#8764;5 ms) for too long. P4-programmable data planes comprise primitives and constructs to implement schemes that identify which packets are more suitable for dropping.</p></div>
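CoDel's drop decision can be sketched in a few lines; this is a simplified rendering of the published algorithm (RFC 8289) that keeps the target/interval logic and the square-root control law but omits parts of its two-state machine.

```python
import math

TARGET = 0.005    # 5 ms standing-queue target (CoDel's default)
INTERVAL = 0.100  # 100 ms window in which the delay must clear the target

class CoDelSketch:
    """Simplified CoDel drop decision (not a full RFC 8289 implementation)."""
    def __init__(self):
        self.first_above_time = None  # deadline set when delay first exceeds TARGET
        self.drop_count = 0

    def should_drop(self, sojourn_time, now):
        if sojourn_time < TARGET:
            self.first_above_time = None   # queue drained: reset state
            self.drop_count = 0
            return False
        if self.first_above_time is None:
            # Delay just crossed the target: arm a deadline one interval away.
            self.first_above_time = now + INTERVAL
            return False
        if now >= self.first_above_time:   # above target for a full interval
            self.drop_count += 1
            # Control law: successive drops come closer together.
            self.first_above_time = now + INTERVAL / math.sqrt(self.drop_count)
            return True
        return False

codel = CoDelSketch()
decisions = [codel.should_drop(s, t)
             for s, t in [(0.001, 0.0), (0.010, 0.1), (0.010, 0.3)]]
# low delay -> no drop; delay above target -> arm deadline; still above
# target past the deadline -> drop
```

The `INTERVAL / sqrt(drop_count)` term is precisely the square-root operation that the P4 implementations discussed later must approximate, since programmable ASICs do not offer it natively.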
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.2.">RFC-standardized Algorithms</head><p>The following sections discuss AQM algorithms that implement standards defined by the Internet Engineering Task Force (IETF) (e.g., RED, CoDel, PI, PIE).</p><p>Traditional TCP congestion control adjusts the transmission rate only after a packet loss event. Therefore, there is a long delay between a packet loss event and the moment the sender is notified about the loss <ref type="bibr">[114]</ref>. During this period, the end host continues transmitting at a rate the network cannot support, thus producing even more packet losses. AQMs aim at alleviating packet losses by notifying the sender of incipient congestion, so that end hosts can reduce their sending rate before the switch's queue overflows.</p><p>When a packet traverses the network, it might encounter four types of delay. Figure <ref type="figure">17</ref> illustrates the types of network delays and where they occur.</p><p>&#8226; Processing delay: the time taken by the router to process a packet as it traverses the parser, ingress and egress pipelines, and deparser.</p><p>&#8226; Queuing delay: the amount of time a packet waits in a queue until it can be transmitted.</p><p>&#8226; Transmission delay: the time required to put all the bits of a packet on the wire.</p><p>&#8226; Propagation delay: a function of the physical distance and the propagation speed of a link.</p><p>The processing and transmission delays have little impact on the total network delay. The propagation delay depends on the type of network (e.g., LANs, WANs). The queueing delay, on the other hand, is the component that can significantly increase the delay a packet experiences across the network.
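The four components above combine additively into the one-hop delay. The small calculation below uses illustrative numbers (the standard formulas: transmission delay is packet size over link rate; propagation delay is distance over signal speed):

```python
def one_hop_delay(packet_bits, link_bps, distance_m,
                  propagation_mps=2e8, processing_s=0.0, queueing_s=0.0):
    """Total one-hop delay = processing + queueing + transmission + propagation.
    The default propagation speed (~2e8 m/s) is roughly 2/3 of c, typical
    for fiber."""
    transmission = packet_bits / link_bps
    propagation = distance_m / propagation_mps
    return processing_s + queueing_s + transmission + propagation

# A 1500-byte packet on a 1 Gbps, 100 km link with 2 ms of queueing delay:
# transmission = 12 us, propagation = 0.5 ms, so queueing dominates.
d = one_hop_delay(1500 * 8, 1e9, 100e3, queueing_s=0.002)
```

Even on this long link, the 2 ms of queueing dwarfs the 12 µs transmission delay, which is why AQMs target the queueing component.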
Gettys and Nichols <ref type="bibr">[108]</ref> reported the presence of unnecessarily large buffers on the Internet that contribute to the queuing delay. They also noted that routers with an AQM enabled are rare.</p><p>The buffer size directly impacts the queueing delay <ref type="bibr">[115]</ref><ref type="bibr">[116]</ref><ref type="bibr">[117]</ref>. The rule of thumb proposed by Villamizar and Song <ref type="bibr">[116]</ref> suggests that the amount of buffering (in bits) in a router's port should be greater than or equal to the average Round-Trip Time (RTT) (in seconds) multiplied by the capacity C (in bits per second) of the port (i.e., B &#8805; C &#215; RTT). The resulting buffer size ensures that the link is fully utilized with a small number of TCP flows. However, Appenzeller et al. <ref type="bibr">[117]</ref> demonstrated that large static buffer sizes produce unnecessarily high latency, leading to the problem known as bufferbloat. They claimed that when the number of TCP flows N is large, the buffer size can be reduced by a factor of the square root of N (i.e., B &#8805; C &#215; RTT / &#8730;N). In this way, the link is fully utilized and the latency is reduced.</p><p>The queueing discipline adopted in many routers is TD, in which the maximum length of each queue is a fixed value. Packets from all senders are enqueued until the queue is full, and incoming packets that find the queue full are dropped. However, this approach presents several problems. For instance, TCP can rapidly fill the queue, causing a high loss rate for other flows.</p><p>Additionally, TD increases the delay since the queue can be persistently full. RED <ref type="bibr">[8]</ref> was introduced as an improvement over TD. RED drops packets according to predefined thresholds.
With the increase of the queue length, the probability of dropping packets also increases. However, RED only considers the queue length as an indication of congestion. It does not infer the number of flows sharing the bottleneck link, and it requires constant parameter tuning to achieve good performance. The latter is a problem when network conditions change.</p><p>CoDel <ref type="bibr">[118]</ref> was proposed as an improvement over existing AQMs. CoDel aims to control the queue length by dropping packets as a function of the sojourn time. If the sojourn time exceeds a threshold value (i.e., target) for some time (i.e., interval), CoDel starts dropping packets until the queueing delay settles below the threshold. CoDel has been designed to work across a wide range of network conditions, requiring little to no tuning.</p><p>Fair-Queueing CoDel (FQ-CoDel) <ref type="bibr">[119]</ref> mitigates the unfairness problem resulting from having multiple flows sharing a single queue. FQ-CoDel uses a hybrid approach that combines two queueing disciplines: fair queueing and CoDel. This approach separates the dynamics of competing flows by allocating them to different queues. PIE <ref type="bibr">[9]</ref> is an AQM that controls the queue latency around a reference value using a Proportional-Integral (PI) controller. It aims at providing low-latency control, high link utilization, simple implementation, stability, and fast responsiveness. PIE does not require any tuning parameters, making it robust and suited for various network scenarios. The recently proposed CAKE <ref type="bibr">[11]</ref> AQM controls the queue latency by using a rate-based shaper. This mechanism schedules packet transmissions at precise intervals using a virtual transmission clock. Similarly, CAKE does not require any tuning parameters.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.2.1.">Literature Review</head><p>Kundel et al. <ref type="bibr">[53]</ref> implemented the CoDel queueing discipline on a programmable switch. CoDel significantly reduces the latency of TCP connections and mitigates the bufferbloat problem. They also demonstrated that it is possible to implement AQMs in programmable data planes using approximation techniques. Figure <ref type="figure">18</ref> depicts the CoDel scheme, which uses each packet's timestamps to establish the dropping policy. The algorithm considers three situations: 1) If the queueing delay is below a threshold, a packet is never dropped. 2) If the queuing delay exceeds the threshold for more than a predefined interval, packets will be dropped. 3) Packets will be discarded until the queueing delay falls below the threshold. The same authors extended this work by implementing the CoDel algorithm on different P4-programmable data plane hardware targets <ref type="bibr">[54]</ref>. They also discuss the challenges of implementing CoDel on different targets, which are usually limited by hardware constraints. Evaluations compare the performance of the CoDel algorithm running in the Linux kernel, on the Intel Tofino ASIC, and on a Netronome NPU-based SmartNIC.</p><p>Kunze et al. <ref type="bibr">[55]</ref> recently implemented and tested three versions of the PIE AQM on a P4 Tofino target. They show the inherent trade-offs of the implementations and their impact on performance, specifically the mismatch between AQM designs and the capabilities of the Intel Tofino switch. The authors argue that conceptual challenges make it impossible to implement a fully RFC-compliant PIE version on Tofino. They also found that transferring RFC schemes to resource-constrained P4 targets is a non-trivial task.</p><p>Papagianni and De Schepper <ref type="bibr">[56]</ref> implemented the Proportional Integral squared (PI2) AQM in a programmable data plane.
PI2 is an extension of the PIE AQM described in <ref type="bibr">[120]</ref> that smooths out the non-linearities of the original PI controller. The main goal is to support the coexistence of classic and scalable congestion control algorithms. The authors implemented the system in a P4 software switch (BMv2) and verified the design through a proof-of-concept implementation. Experimental evaluations report that when the link capacity decreases, the queue drains 10x slower, resulting in a 10x higher delay in the TD case (e.g., up to 200 ms). The PI2 controller, on the other hand, keeps the queueing delay at the target levels. Furthermore, the authors argue that there is no need to re-tune the AQM parameters if network conditions change.</p><p>Toresson <ref type="bibr">[57]</ref> presented a packet value-based AQM using programmable data planes. It uses a combination of the PIE AQM scheme and the Per-Packet Value (PPV) <ref type="bibr">[121]</ref> concept to determine the dropping policy and reduce the queuing delay. PPV establishes resource-sharing policies by marking packets with packet values. This value determines the relative importance of one flow over another (e.g., different flows might have different throughput or delay requirements). The system was implemented on the Barefoot Tofino ASIC. Packet value statistics are collected through the P4-programmable data plane to maintain knowledge about the packet value distribution. With the dropping probability calculated through the PIE AQM scheme, a decision can be made about which packets should be dropped. An evaluation shows that the implemented PPV AQM achieves a low queuing delay by dropping an appropriate number of packets. It also shows that the PPV AQM controls resource sharing between different traffic flows according to a predefined marking policy. Sharma et al.
<ref type="bibr">[122]</ref> implemented a queueing discipline called Approximate Fair Queueing (AFQ) entirely in the data plane. The algorithm is based on the fair queueing discipline, which ensures a fair bandwidth allocation for each flow. The implementation leverages programmable data planes' ability to maintain switch state on a per-packet basis, perform limited computations, and dynamically select the egress port. Results show that AFQ allows incoming flows to achieve a fair share immediately and isolates them from other concurrent flows, leading to significantly more predictable performance. The authors report that the average FCT is improved for short TCP flows. Also, AFQ enhances the performance of DCTCP by 2x and TCP performance by 10x for both average and tail FCT in scenarios with high loads.</p></div>
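The idea behind approximate fair queueing can be illustrated with a toy model. The class below is a sketch of the concept only, not the AFQ implementation: packets are stamped with a per-flow "round" number and served in round order, whereas the real AFQ maps rounds onto a small set of FIFO queues in the switch (here a heap stands in for that machinery).

```python
import heapq

class AFQSketch:
    """Toy fair-queueing approximation: each flow accumulates a finish
    'round' proportional to the bytes it has sent, and packets are
    dequeued in round order, so heavy flows cannot starve light ones."""
    def __init__(self, quantum=1500):
        self.quantum = quantum
        self.round = 0            # round number of the last departure
        self.flow_finish = {}     # flow -> finish round of its last packet
        self.heap = []
        self.seq = 0              # FIFO tie-break within a round

    def enqueue(self, flow, length, payload):
        start = max(self.flow_finish.get(flow, 0), self.round)
        finish = start + length / self.quantum
        self.flow_finish[flow] = finish
        heapq.heappush(self.heap, (finish, self.seq, payload))
        self.seq += 1

    def dequeue(self):
        finish, _, payload = heapq.heappop(self.heap)
        self.round = max(self.round, finish)
        return payload

q = AFQSketch()
for i in range(3):
    q.enqueue("elephant", 1500, f"E{i}")  # a backlogged heavy flow
q.enqueue("mouse", 1500, "M0")            # a light flow arriving later
order = [q.dequeue() for _ in range(4)]
# The mouse packet is served after only one elephant packet, not after
# all three, approximating a fair per-flow share.
```

This is the "immediate fair share" property the AFQ results describe: a newly arriving flow does not wait behind the entire backlog of an established heavy flow.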
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.2.2.">RFC-standardized AQM Algorithms Comparison, Discussions, and Limitations</head><p>Table <ref type="table">7</ref> compares the surveyed schemes. RFC-standardized AQMs can be described in P4, although some are not very precise due to the approximation techniques they use. For instance, in <ref type="bibr">[53,</ref><ref type="bibr">54]</ref> the CoDel algorithm uses simple arithmetic and packet counting to approximate the square root operation. However, results show that these limitations do not significantly impact the performance. Kunze et al. <ref type="bibr">[55]</ref> demonstrated that implementing an RFC-standardized AQM in a programmable switch is constrained by resource limitations (i.e., the number of bytes allocated to queues) and by the complexity of arithmetic operations such as multiplication and square root. They also described general implications for AQM design in hardware ASICs. First, gathering queueing information at the ingress facilitates the implementation of AQMs such as RED and PIE in the ingress pipeline. Second, reducing the complexity of data plane calculations keeps the AQM control law simple, avoiding reliance on complex operations; flexible memory access (an early read followed by a later write) enables the calculation of more complex control rules. Third, an AQM that is tuning-free, or that has concise and precise ways to adapt its parameters to new settings, is preferred. Fourth, AQM algorithms should avoid underutilizing ASIC resources, so as to fully leverage the capabilities of the switch. Finally, the authors conclude by remarking on the need for a general framework to directly manage queues with P4.</p><p>Sharma et al. <ref type="bibr">[58]</ref> propose a solution based on building blocks that support complex arithmetic operations in the data plane.
The goal is to provide functional blocks (e.g., extern components) that perform complex arithmetic operations (e.g., division, multiplication, square root) without using approximations or lookup tables (e.g., approximating the square root function can be achieved by counting the number of leading zeros through a longest-prefix match). These features can enable the development of more complex AQMs in the data plane.</p><p>With P4, programmers have access to packet metadata such as the queue depth, queueing time variations, ingress and egress timestamps, and packet length to implement AQM schemes. Although some packet metadata are only available at the egress control block (see Figure <ref type="figure">2</ref>), the programmer can use cloning and recirculation primitives to reprocess packets that already include queueing information. Another challenge when implementing AQMs is that the use of complex arithmetic operations is constrained in P4 hardware targets.</p></div>
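The leading-zeros trick mentioned above can be illustrated in a few lines of ordinary code; in P4 hardware the same computation is typically done with a single longest-prefix-match table lookup rather than a loop or library call.

```python
def approx_sqrt(x: int) -> int:
    """Approximate sqrt(x) as 2^(floor(log2 x) / 2), using only the
    position of the most significant set bit -- equivalent to counting
    leading zeros, the kind of operation an LPM table can resolve in
    one lookup."""
    if x <= 0:
        return 0
    msb = x.bit_length() - 1      # floor(log2 x)
    return 1 << (msb >> 1)        # 2^(floor(log2 x) / 2)

# Exact for even powers of two, within a small constant factor otherwise:
# approx_sqrt(256) == 16 (exact); approx_sqrt(100) == 8 (true value ~10).
```

A coarse power-of-two estimate like this is often sufficient for control laws such as CoDel's `interval / sqrt(drop_count)`, where only the order of magnitude matters.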
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.2.3.">Comparison with Legacy Approaches</head><p>Table <ref type="table">8</ref> presents the limitations of the surveyed RFC-standardized AQMs. Several RFC-standardized algorithms have been proposed since the first implementation of RED <ref type="bibr">[8]</ref>. Many of the RFC-standardized AQMs are implemented in Linux as queueing disciplines; however, these can only be used with software switches. RFC-standardized AQMs can be implemented in P4-programmable data planes to complement and improve existing deployments.</p><p>On the other hand, legacy devices implement out- </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.3.">Custom AQM Algorithms</head><p>Researchers have focused on improving TCP and the overall network performance by proposing novel AQM schemes to mitigate congestion and bufferbloat. With the advent of P4, the description, validation, and evaluation of custom AQM algorithms became faster. Although P4 was not designed to manage buffers and packet scheduling, developers can combine queueing disciplines and dropping policies to create AQMs that are more flexible than RFC-standardized AQMs. However, the match-action pipeline constraints, such as hard limits on multiplications, memory accesses, and access to the queue state, can limit the implementation of some schemes.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.3.1.">Literature Review</head><p>Mushtaq et al. <ref type="bibr">[59]</ref> argue that the traditional approach, where end host-based congestion control algorithms respond to simple congestion signals from the network, has a negative impact if the goal is to reduce the FCT. Therefore, they propose Approximate and Deployable SRPT (ADS), which uses approximation techniques to implement the Shortest Remaining Processing Time (SRPT) discipline. SRPT <ref type="bibr">[124]</ref> is a queueing discipline that prioritizes flows with fewer outstanding bytes in the queue (i.e., with less remaining time). The authors claim that ADS can manage congestion so as to minimize the average FCT in data centers. Their evaluations demonstrated that ADS achieves a performance close to SRPT, which has an advantage over the simple FIFO scheduling policy. Their approach leverages a small number of priority queues and requires only software changes in the hosts and the switches.</p><p>Menth et al. <ref type="bibr">[60]</ref> proposed a system called Activity-Based Congestion management (ABC), based on a domain-based QoS mechanism. This scheme aims at providing more fairness among customers on bottleneck links using scalable bandwidth-sharing mechanisms for congestion management.</p><p>Figure <ref type="figure">20</ref>: SP-PIFO approximates the behavior of PIFO queues by adapting how packet ranks are mapped to priority queues. Strategy A presents a sub-optimal solution; by re-assigning the ranks, the system produces an optimal solution (e.g., Strategy B) <ref type="bibr">[61]</ref>.</p><p>Figure <ref type="figure">19</ref> depicts the ABC system overview. Ingress nodes work as activity meters to measure the rate of each traffic aggregate that enters the ABC domain; such an aggregate may be, e.g., the traffic of a single user or a user group. They set an activity value for each packet and mark that value in its header.
Thus, ingress nodes require traffic descriptors for any aggregate that should be tracked. Ingress nodes and core nodes of an ABC domain are forwarding nodes. They use an activity AQM on each of their egress interfaces within the ABC domain to perform an acceptance decision for every packet. That is, they decide whether to forward or drop a packet depending on its activity, enforcing fair resource sharing among traffic aggregates within an ABC domain. Egress nodes remove the activity information from packets leaving the ABC domain.</p><p>Alcoz et al. <ref type="bibr">[61]</ref> approximated the behavior of PIFO (Push-In First-Out) queues using programmable data planes. PIFO is a queueing discipline that allows enqueued packets to be pushed into arbitrary positions according to the packet's rank. However, implementing PIFO in hardware is challenging because arbitrarily sorting packets at line rate is not a trivial task. The authors leveraged the capabilities provided by programmable data planes to implement PIFO using approximations. The system, called SP-PIFO (Strict-Priority PIFO), maps packet ranks to Strict-Priority (SP) queues so as to minimize scheduling mistakes relative to an ideal PIFO implementation. Figure <ref type="figure">20</ref> depicts how PIFO works and its approximation scheme (SP-PIFO), where the number inside each box specifies the priority of the packet. PIFO serves incoming packets in sorted order. Results show that SP-PIFO minimizes the FCT while enforcing fairness among competing flows. Although the scheme approximates PIFO behavior, matching it in many cases, it cannot guarantee a perfect emulation of a theoretical PIFO queue for all ranks.</p><p>Cascone et al. <ref type="bibr">[62]</ref> proposed a P4-based system to mitigate the problem of enforcing, at a network switch, fair bandwidth sharing of the same link among TCP and non-TCP senders.
The scheme is called FDPA (Fair Dynamic Priority Assignment) and is based on the Fair Queuing (FQ) scheduling algorithm. FDPA gives priority to the packets of a user whose arrival bit rate is equal to or less than its fair share over those of users generating traffic at higher rates. However, FDPA does not provide precise bit-level or packet-level fairness. Instead, it approximates a fair share over longer timescales (e.g., in the order of a few RTTs). Figure <ref type="figure">21</ref> shows the system's pipeline. Packets are first classified per user and then processed by a rate estimator, which measures the specific user's arrival bit rate. Then, packets are allocated to a priority queue Q_n, such that the higher the arrival rate, the lower the priority.</p><p>A strict-priority (SP) scheduler serves queues in priority order: packets of priority q are dequeued only if all queues with higher priority are empty, where q = 1 is the highest priority. Experiments conducted at 10 Gbps show that the performance of FDPA approximates the ideal Deficit Round-Robin (DRR) with dedicated per-user queues. However, FDPA does not require implementing per-user queues.</p><p>Harkous et al. <ref type="bibr">[63]</ref> propose a scheme that uses the target's match-action tables to implement virtual queues in P4-programmable data planes. The scheme aims to provide customizable traffic management by establishing how packets are served to the non-programmable traffic manager. The system works by allocating different network slices to virtual queues (vQueues) depending on the rules defined in the control plane (e.g., limiting the rate, controlling the delay, prioritizing flows). A slice is a unique logical, virtualized network built on top of a shared network. The scheme employs match-action tables to perform traffic management tasks such as rate limiting and queue management.
The system assumes that slice priorities are calculated and pushed top-down from the control plane to the data plane. Results show that policing approaches implemented with vQueues are effective in controlling the TCP throughput per slice.</p><p>In recent work, Turkovic et al. <ref type="bibr">[64]</ref> presented a scheme that implements QoS-based forwarding, called P4QoS. It allows applications to establish their QoS requirements by appending conditions to the packet header. The system combines the information added by the application with the Link Latency Estimation Protocol (LLEP), which mitigates the lack of coordination among the clocks of the switches that add timestamps to the packet. Figure <ref type="figure">22</ref> shows how LLEP works. In step 1, the ingress switch S1 appends an LLEP header containing an egress timestamp, t_send, representing the time the packet left the switch. In step 2, switch S2 saves its ingress timestamp, representing the time the packet was received. It then clones the packet, forwards the original application data packet further along the path, and calculates the processing plus queueing delay (t_p + t_q), which is the difference between the current egress timestamp and the saved ingress timestamp. Next, this information is added to the LLEP header. In step 3, the information produced in step 2 is sent back to the previous switch. Finally, in step 4, switch S1 uses the cloned packet it receives to calculate the estimated link latency LL_est. With this new value, the latency register at the ingress switch is updated and used to estimate the packet deadlines for all the flows on the same output link. The whole process is repeated periodically. Additionally, the authors developed a custom mechanism to determine the maximum allowed queuing delay that satisfies the deadlines of all flows processed at the switch.
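Assuming symmetric links, the LLEP steps suggest a calculation along the following lines. This is a sketch of the idea only, not the exact formula from the P4QoS paper: the round trip of the cloned packet, minus the reported processing and queueing time at S2, is halved to obtain a one-way estimate.

```python
def estimate_link_latency(t_send, t_clone_rx, processing_plus_queueing):
    """Sketch of an LLEP-style estimate: S1 sent the packet at t_send,
    received the cloned reply at t_clone_rx, and learned S2's processing
    plus queueing delay (t_p + t_q) from the LLEP header. Subtracting
    that delay from the round trip and halving the remainder (assuming
    symmetric links) yields the one-way link latency estimate LL_est."""
    return (t_clone_rx - t_send - processing_plus_queueing) / 2

# Example with illustrative numbers: the clone arrives 1.2 ms after
# sending and S2 reports 0.2 ms of processing + queueing, giving an
# estimated one-way link latency of 0.5 ms.
ll_est = estimate_link_latency(0.0, 0.0012, 0.0002)
```

Because both timestamps in the subtraction (t_send and t_clone_rx) are taken by the same switch, the estimate does not depend on clock synchronization between S1 and S2, which is precisely the problem LLEP is designed to avoid.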
Results show that P4QoS can significantly improve network performance, particularly for low-latency traffic, by reducing the number of marked packets without affecting the throughput.</p><p>Chen et al. <ref type="bibr">[125]</ref> proposed QoSTCP, a P4-based scheme that enforces QoS policies on TCP flows. The system uses meters and Rate-Limiting Notification (RLN) to smoothly adjust the TCP congestion window. QoSTCP requires modifying the end-host congestion control to interpret the RLN notification, which indicates the growth rate of the congestion window before reaching a threshold. In the switch, QoSTCP implements a Two Rate Three Color Marker (trTCM) meter and ECN using P4. Results show that QoSTCP prevents high variations in the throughput and reduces the average number of packet losses.</p></div>
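The trTCM used by QoSTCP follows the two-token-bucket logic of RFC 2698; a minimal color-blind sketch (parameter values illustrative) is:

```python
class TwoRateThreeColorMarker:
    """Color-blind trTCM sketch in the spirit of RFC 2698.

    Rates in bytes/s, burst sizes in bytes; all parameters illustrative.
    A packet is red if it exceeds the peak rate, yellow if it exceeds
    only the committed rate, and green otherwise.
    """
    def __init__(self, cir, cbs, pir, pbs):
        self.cir, self.cbs = cir, cbs   # committed rate / burst
        self.pir, self.pbs = pir, pbs   # peak rate / burst
        self.tc, self.tp = cbs, pbs     # token buckets start full
        self.last = 0.0

    def mark(self, now, pkt_len):
        # Refill both buckets for the elapsed time, capped at burst size.
        dt = now - self.last
        self.last = now
        self.tc = min(self.cbs, self.tc + self.cir * dt)
        self.tp = min(self.pbs, self.tp + self.pir * dt)
        if pkt_len > self.tp:
            return "red"              # exceeds peak rate
        if pkt_len > self.tc:
            self.tp -= pkt_len
            return "yellow"           # exceeds committed rate only
        self.tp -= pkt_len
        self.tc -= pkt_len
        return "green"
```

On a P4 target the same logic is typically expressed with the built-in meter extern rather than explicit arithmetic; the sketch only makes the coloring rule concrete.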
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.3.2.">Custom AQM Algorithms Comparison, Discussions, and Limitations</head><p>Table <ref type="table">9</ref> compares the schemes described in the previous section. FDPA and SP-PIFO use approximations in the data plane to implement a theoretical queueing model. Both schemes enforce fairness while minimizing the FCT. Although approximations in the data plane can be inaccurate depending on the demands of the theoretical model, both schemes obtain good results. P4QoS and <ref type="bibr">[63]</ref> leverage multiple queues to implement custom policies. P4QoS requires tuning at least four parameters to define the slices. In contrast, <ref type="bibr">[63]</ref> does not require any tuning. Both schemes separate traffic; however, P4QoS enables applications to request specific QoS from the network (i.e., latency deadlines). On the other hand, the scheme in <ref type="bibr">[63]</ref> overcomes a data plane architectural limitation by implementing virtual queues in the P4 pipeline. This solution is portable to other architectures.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.3.3.">Comparison with Legacy Approaches</head><p>P4 provides flexibility in the creation of custom AQMs. Therefore, more features can be added to programmable data planes compared to legacy devices. For instance, the behavior of schemes such as RED (including its variants ARED and WRED) depends on tuning several parameters. Those parameters require adjustment depending on the network conditions, making the scheme hard to manage and less autonomous; consequently, many operators continue using simple Tail Drop (TD) (i.e., no AQM). More recent AQMs such as FQ-CoDel and CAKE present fewer parameters, making them easier to manage. P4 approaches address many of the shortcomings observed in legacy AQMs. P4-based AQMs can dynamically adjust their parameters based on the data plane's measurements. With improved feedback, the switches and end hosts have a better view of the events in the network. Thus, they can react more precisely to network events.</p></div>
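To make the tuning burden concrete, classic RED already exposes four interacting parameters; a minimal sketch of its averaging and drop-probability logic (threshold values illustrative) is:

```python
# Hedged illustration of why RED needs tuning: four interacting
# parameters (min_th, max_th, max_p, w_q) must match the traffic mix.

def red_avg(avg, qlen, w_q=0.002):
    """EWMA of the instantaneous queue length (classic RED)."""
    return (1.0 - w_q) * avg + w_q * qlen

def red_drop_prob(avg, min_th=5, max_th=15, max_p=0.1):
    """Early-drop probability as a function of the average queue length."""
    if avg < min_th:
        return 0.0          # below min_th: never drop early
    if avg >= max_th:
        return 1.0          # above max_th: drop every packet
    # Linear ramp between the two thresholds, scaled by max_p.
    return max_p * (avg - min_th) / (max_th - min_th)
```

A w_q that is too small reacts slowly to bursts, while thresholds set for one RTT/bandwidth regime misbehave in another; this coupling is exactly what makes legacy RED hard to operate and what P4-based AQMs can adjust at run time.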
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.4.">Summary and Lessons Learned</head><p>The surveyed works agree that packet losses are an inaccurate indicator of congestion, suggesting that AQMs should complement congestion control schemes. Improving TCP congestion feedback via custom AQM schemes provides an efficient solution to the bufferbloat issue and improves other TCP performance metrics such as link utilization, fairness, and FCT. Most of the AQMs summarized in the previous sections consider packet loss or ECN as a congestion signal, making them TCP-oriented AQMs.</p><p>AQMs such as CAKE and FQ-CoDel do not require active adjustment of any parameters. Instead, they have an integrated dynamic feedback mechanism to ensure a low queueing delay. With P4, developers can implement RFC-standardized AQMs in a top-down approach and control the system's behavior under different network conditions on a per-packet basis. The use of approximations in the data plane has proven to be a feasible way to approximate theoretical models. Additionally, P4 facilitates measuring queue metrics via the standard metadata and allows reprocessing packets by invoking primitives such as resubmission and recirculation. These features enable programmers to find alternatives to complex arithmetic operations.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.">TCP Offloading</head><p>The Network Interface Card (NIC) plays a vital role in network performance. Currently, network processing at low to medium speeds (&lt;10Gbps) is possible through standard Operating System (OS) drivers and multicore processors available in the market. However, to process rates around 100Gbps, servers must bypass the kernel to save CPU cycles. Offloading processing and network functions to the NICs (e.g., SSL and TCP segmentation offloading) <ref type="bibr">[126]</ref> can release the CPU from some network stack processes, thus improving the overall performance of the server. However, offloading operations come with some constraints, such as the availability of application data, which is usually served by the general-purpose CPU. Table <ref type="table">10</ref> compares programmable and non-programmable NICs. P4-programmable NICs offer great flexibility to perform custom packet processing. The increasing gap between CPU capacity and network bandwidth makes it challenging to ensure TCP's desirable properties while saving CPU cycles in the server. In <ref type="bibr">[127]</ref>, Mogul identified the complexities that legacy NICs face to offload TCP in practice. A persistent problem described in his work is currently observed in short-lived TCP connections, which are prevalent in data centers and WANs <ref type="bibr">[128]</ref><ref type="bibr">[129]</ref><ref type="bibr">[130]</ref><ref type="bibr">[131]</ref>. Short-lived TCP connections produce significant overhead because the TCP stack must maintain the correct protocol behavior independently of the application implemented over it. Jacobson <ref type="bibr">[132]</ref> proposed using header prediction to process common TCP scenarios using few instructions. Offloading TCP operations using P4-programmable NICs has gained relevance in the last few years, as these NICs can be used for custom packet processing and for offloading server operations.
Leveraging the flexibility provided by P4, many authors have implemented schemes that offload transport protocols such as TCP.</p></div>
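Jacobson's header-prediction idea mentioned above can be sketched as a handful of comparisons that detect the common case; the field names below are illustrative stand-ins, not a real stack's data structures:

```python
# Hedged sketch of TCP header prediction (after Jacobson): the common
# case -- an in-order segment with no unusual flags -- is detected with
# a few comparisons, so it can take a fast path and skip the full
# protocol machinery.

def predict_fast_path(seg, conn):
    """seg/conn are illustrative dicts, not a real stack's structures."""
    return (seg["flags"] == {"ACK"}            # plain ACK, nothing unusual
            and seg["seq"] == conn["rcv_nxt"]  # exactly the next expected seq
            and seg["wnd"] == conn["snd_wnd"]) # no window update needed
```

Anything failing these checks (SYN/FIN/RST, out-of-order data, window changes) falls through to the slow path; this is the kind of shallow, per-packet test that maps naturally onto a NIC pipeline.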
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.1.2.">Implementing Transparent Application Offloading Schemes</head><p>Offloading an application to the NIC without significantly modifying the original code is a challenging task that requires the development of compilation tools. Fully offloaded application schemes are constrained by the intrinsic requirements of each application. Therefore, developers must decide which features are more suitable to be handled by the NIC. Sapio et al. <ref type="bibr">[133]</ref> argue that transparent in-network computation must be implemented with caution. Programmers have to identify what type of computation can be performed in the network to reduce communication overhead.</p><p>Several works propose using P4-programmable NICs to support and run microservices. The microservice architecture consists of an application composed of smaller, loosely coupled components that communicate over well-defined APIs. Although modern non-programmable NICs partially support TCP offloading operations (e.g., checksum calculation, TCP Segmentation Offload (TSO), and Large Receive Offload (LRO)), they mainly benefit large data transfers. P4-programmable NICs allow programmers to define whether an operation can be fully or partially offloaded to reduce the timescale of a process. P4-programmable NICs can perform computation closer to the network edge, offering better bandwidth/cost efficiency than legacy approaches.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.2.">Protocol Offloading</head><p>According to <ref type="bibr">[136]</ref>, TCP protocol processing becomes the dominant overhead compared to application processing and other system overheads. Therefore, the hardware of the network nodes must handle this processing to fulfill the speed requirements. Figure 23: (a) Conventional operating systems implement all network processing in the kernel space. (b) Kernel bypass architectures avoid overheads for kernel crossings by moving the protocol implementation into the application and providing applications with direct access to the NIC for sending and receiving packets <ref type="bibr">[134]</ref>. (c) The hardware-accelerated data-path architecture requires a target-specific application <ref type="bibr">[135]</ref>.</p><p>The origin of this issue resides in the TCP stack design, which is implemented in the end devices and traditionally handled by the OS kernel. Consequently, allocating a considerable amount of computing resources to it results in reduced network packet processing efficiency. Therefore, a software-only approach to packet processing cannot keep up with the increasing network transmission speeds.</p><p>Several efforts have been made to optimize TCP packet handling by improving what are known as TCP Offload Engine (TOE) adapters. A TOE refers to a dedicated network function that handles a significant part of the TCP protocol in hardware to offload a general-purpose CPU from TCP processing. TOE devices can parse and remove TCP headers from incoming packets before transferring the data to the host, which reduces the burden of processing data in kernel memory. A TOE device can buffer incoming packets and transfer them directly to the application's buffer, bypassing the kernel.
Similarly, the application can directly transmit data via the TOE's buffer instead of holding it in the kernel's buffer.</p><p>Figure <ref type="figure">23</ref> illustrates the differences among a commodity OS architecture, a kernel-bypass OS architecture, and a hardware-accelerated data-path architecture. Traditionally, the OS handles all network operations (e.g., implementing TCP congestion control, encapsulating packets, and checking protocol correctness) to facilitate data exchange between applications. On the other hand, bypassing the kernel significantly reduces overheads and permits applications to define how network processing is performed. Therefore, the processing time is shorter and more deterministic, which enables applications to achieve better performance. However, developers must implement all the network protocols in user space, which is not scalable. P4-programmable NICs can facilitate the implementation of TCP functionalities such as congestion control algorithms and permit transparent application development in user space. The hardware-accelerated data-path architecture enables target-specific applications that run with high performance. One of the most popular accelerated data-path architectures is the Data Plane Development Kit (DPDK) <ref type="bibr">[137]</ref>. DPDK is based on a run-to-completion model for packet processing, i.e., a scheduling model that runs a task until it finishes. In DPDK, execution units such as CPU cores and memory are reserved before calling data plane applications. However, this implies that DPDK applications require a dedicated CPU core to poll packets, limiting the number of programs running on the architecture.
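The run-to-completion model can be sketched as follows; this is a Python stand-in for a DPDK lcore loop (a real loop would poll rte_eth_rx_burst/tx_burst on a dedicated core), not DPDK's API:

```python
# Illustrative run-to-completion loop: the pinned core polls a burst of
# packets and processes each one fully before polling again; there are
# no interrupts and no handoff between cores.

def run_to_completion(rx_queue, tx_queue, process, budget=32):
    """Drain rx_queue in bursts of at most `budget` packets, running the
    whole processing pipeline for each packet on this one core."""
    while rx_queue:
        burst = [rx_queue.pop(0) for _ in range(min(budget, len(rx_queue)))]
        for pkt in burst:
            tx_queue.append(process(pkt))  # full pipeline on this core
    return tx_queue
```

The sketch makes the trade-off visible: the loop never yields, which is why a DPDK application monopolizes its polling core even when the queue is empty.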
</p><p>The rapid evolution of NICs has made it possible to support new features that offload the general-purpose CPU from network processing. The role of NICs in modern server systems has become more relevant than merely transferring data between the server's CPUs and the network link. Modern NICs comprise advanced features, namely protocol offloading, packet classification, rate limiting, virtualization, and cybersecurity. These devices are referred to as SmartNICs. Han et al. <ref type="bibr">[138]</ref> identified three factors that contributed to this trend: 1) new applications whose performance requirements cannot be achieved with legacy networking stacks, requiring hardware augmentation; 2) the increasing adoption of virtualization, which requires NICs to support switching functionalities; and 3) the growth of multi-tenant data centers and cloud computing, where network isolation is necessary. However, the rapid evolution of SmartNICs has presented some challenges, such as hardware complexity and the type of applications supported by the platforms.</p><p>The advent of P4-programmable NICs made it possible to achieve better network and processor utilization. P4-enabled NICs facilitate implementing schemes involving traffic encapsulation, load balancing, and security protocols (e.g., TLS handshaking, IPSec, and other encryption schemes) due to their flexibility and reduced complexity. Modern NICs such as Pensando <ref type="bibr">[139]</ref>, Netronome <ref type="bibr">[140]</ref>, Xilinx <ref type="bibr">[141]</ref> and Innovium <ref type="bibr">[142]</ref> have fully P4-programmable, energy-efficient, multi-core processors on which many packet processing functions, including a full-blown programmable switch, can run.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.2.1.">Literature review</head><p>Moon et al. <ref type="bibr">[65]</ref> identified two scenarios where TCP servers suffer from poor performance: when handling short-lived flows and when acting as a Layer 7 (L7) proxy. Short-lived flows suffer from severe overhead in processing small control packets, and L7 proxying demands extensive CPU usage and memory allocation for handling packets between two connections. To solve both issues, the authors presented a hardware-assisted TCP stack architecture implemented using programmable NICs. The system, called AccelTCP, offloads complex TCP operations, reducing the CPU cycles used by the application's processes. Results show that short-lived connections perform similarly to persistent connections. AccelTCP also significantly improves the performance of Redis <ref type="bibr">[143]</ref> and HAProxy <ref type="bibr">[144]</ref>, an in-memory key-value store and an L7 load balancer, respectively.</p><p>Yan et al. <ref type="bibr">[66]</ref> used P4-enabled NICs to provide a solution designed to meet 5G networking requirements. The authors described the design, implementation, and experimental evaluation of a P4-enabled SmartNIC used to perform network slicing. Network slicing is 5G terminology describing the sub-components of a network, which span the IP and optical networks and their means of transport. The authors implemented, on the SmartNIC, a P4 program that inserts a Segment Routing Multi-Protocol Label Switching (SR-MPLS) header at ingress and removes it at egress. Results show that the SmartNIC can achieve up to 84.8Gbps utilizing only one CPU core. With SR-MPLS header insertion on the P4 SmartNIC, bandwidth performance can be up to 30% higher than with legacy approaches.</p><p>Harkous et al.
<ref type="bibr">[67]</ref> present P8, a method to estimate the packet forwarding latency of P4-programmable NICs and DPDK-based software switches. The authors analyze the impact of different P4 constructs on packet processing latency for three state-of-the-art P4 devices: the Netronome SmartNIC, the NetFPGA SUME, and the T4P4S DPDK-based software switch. Results reveal that the forwarding latency varies depending on the P4 constructs that are adopted.</p><p>Qiu et al. <ref type="bibr">[68]</ref> developed a tool called Clara, which aims to help developers evaluate the performance of a Network Function (NF) before offloading it to a SmartNIC. Clara takes as input the original unsupported NF and predicts its performance after offloading. It can also determine whether or not to offload an NF and how to perform an effective port. The system helps developers evaluate offloading strategies, obtain performance insights, and identify suitable SmartNIC models for specific workloads. The downside is that the validation was limited to a particular SmartNIC (i.e., the Netronome Agilio CX).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.2.2.">Protocol Offloading, Comparison, Discussion and Limitations</head><p>Table <ref type="table">11</ref> compares the schemes described previously. Performing stateful TCP operations in the NIC requires maintaining the consistency of transmission control blocks between the NIC and the server. This issue arises from target constraints: the state kept by the NIC's stack can deviate from the state kept by the host, which means that the P4 program must consider the target hardware architecture. Moreover, stateful TCP processing increases the complexity and resource consumption of the implementation running in the NIC. For example, TCP SYN packets require flow state initialization and setup. Therefore, depending on the packet size, such an operation might take a non-deterministic amount of time. This variation in the processing time occurs due to the packet length and/or checksum computation <ref type="bibr">[68]</ref>. In <ref type="bibr">[145]</ref>, Grey et al. highlighted the gap between the expected and actual behavior in P4-programmable devices. The authors discussed the consequences of relying only on the network programming language while ignoring the limitations imposed by the target's hardware.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.2.3.">Comparison with Legacy Approaches</head><p>Legacy NICs perform TCP offloading operations such as TCP Segmentation Offload (TSO) <ref type="bibr">[146]</ref> and Generic Receive Offload (GRO) <ref type="bibr">[147]</ref>. However, these methods do not support complex policies such as TCP flow state tracking and Single Root Input/Output Virtualization (SR-IOV). SR-IOV leverages the NIC to bypass the hypervisor and send packets directly to the virtual machine <ref type="bibr">[148]</ref>. P4-programmable NICs can support custom implementations, and thus deliver more sophisticated features to perform advanced operations on packets. Therefore, P4-programmable NICs are appropriate for optimizing network performance in data centers. Consider the example of AccelTCP <ref type="bibr">[65]</ref>, which aims at offloading TCP connection setup and teardown. The system is fully implemented on a P4-programmable NIC rather than as a server-based appliance. Traditional NICs (including hardware and virtual NICs) do not have such flexibility, restricting the schemes that can be implemented.</p><p>Moreover, in some cases, the NIC becomes a bottleneck <ref type="bibr">[149]</ref> due to the physical input/output operations involved in copying packets from the NIC to the upper layers. Another drawback of traditional NICs that perform offloading operations is the lack of network visibility. This issue happens because the operations are performed outside the OS memory space. P4-programmable NICs solve this issue by providing a clear and well-developed interface with the control plane that allows programmers to query data from the data plane without degrading performance.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.3.">Application Offloading</head><p>In the last two decades, network speeds have grown significantly faster than CPU performance, forcing many data center applications to sacrifice application cycles for packet processing. The literature shows that there is a performance gap between a general-purpose CPU and a network processor ASIC. Exploiting SmartNIC capabilities to improve TCP performance is a challenging task due to the design of the network protocol stack: the protocol's properties must be preserved regardless of what the application does, which is usually a significant challenge when bypassing the kernel to send the application's data directly to the NIC.</p><p>As a result, developers have started to offload computation to programmable NICs, dramatically improving the performance and energy efficiency of many data center applications, such as search engines, key-value stores, real-time data analytics, and intrusion detection. However, implementing data center network applications in a combined CPU-NIC environment is challenging. It often requires many design-implement-test iterations before the accelerated application can outperform its CPU-only version. These iterations involve non-trivial changes: programmers may have to move portions of application code across the CPU-NIC boundary and manually refactor the program.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.3.1.">Literature Review</head><p>Choi et al. <ref type="bibr">[70]</ref> present &#955;-NIC, a system that aims at offloading serverless workloads (i.e., &#955;s) to SmartNICs. Serverless workloads are fine-grained functions with short service times and low memory usage. If correctly balanced and distributed, serverless applications can be more efficient and cost-saving than buying or renting a fixed quantity of servers, which generally involves significant periods of underutilization or idle time. Developers can take advantage of the flexibility provided by serverless computation, focusing solely on creating custom programs (i.e., &#955;s) without worrying about the infrastructure they use. However, many of these workloads, which are small functions, run on virtual machines. Virtualization technology becomes inefficient (i.e., processing delays and memory overheads) in these cases, where the service's cost depends on the usage. The architecture of a typical server CPU cannot handle thousands of simultaneous threads. Therefore, when a function interrupts the CPU, it has to store the current process's state and retrieve it back, wasting tens of milliseconds of CPU cycles.</p><p>Figure <ref type="figure">25</ref>: The nanoPU design. The NIC includes ingress and egress PISA pipelines, a hardware-terminated transport, and a core selector with global RX queues; each CPU core is augmented with a hardware thread scheduler and local RX/TX queues connected directly to the register file. The total wire-to-wire latency is 65ns <ref type="bibr">[74]</ref>.</p><p>The authors used SmartNICs, or more specifically ASIC-based NICs, to run these lambdas more efficiently. ASIC-based NICs consist of hundreds of Reduced Instruction Set Computer (RISC) processors (i.e., NPUs) with independent local instruction storage and memory.
Consequently, these SmartNICs, unlike GPUs and FPGAs that are optimized to accelerate specific workloads, are more suitable for running many discrete functions in parallel at high speed and low latency. As Figure <ref type="figure">24</ref> shows, &#955;-NIC implements a Match-Lambda programming abstraction similar to the Match-Action Table (MAT) abstraction, but in this case the &#955;s perform more complicated operations.</p><p>In this model, &#955;s are completely independent programs that do not share state and are isolated from each other. The matching stage operates as a scheduler that forwards packets to the matching lambdas or to the host OS. In the last stage, a parser handles packet operations (e.g., header identification), and lambdas operate directly on the parsed headers. Therefore, the abstract machine model provides developers with an optimal way to run serverless workloads in parallel without any other interference. The authors developed an open-source implementation of &#955;-NIC using P4-enabled SmartNICs. They also implemented methodologies to optimize &#955;s to utilize the SmartNIC resources efficiently. Results show that &#955;-NIC achieves up to 880x better response latency for workloads and 736x more throughput, while reducing main CPU and memory usage.</p><p>Ibanez et al. <ref type="bibr">[74]</ref> presented a scheme called nanoPU, which minimizes the tail latency of Remote Procedure Calls (RPCs). The system mitigates the causes of high RPC tail latency by bypassing the cache and memory hierarchy. The nanoPU directly places arriving messages into the CPU register file. Figure <ref type="figure">25</ref> shows the block diagram of the nanoPU. Notice that all the elements are implemented in hardware. The system implements NDP <ref type="bibr">[41]</ref> due to its low-latency performance. A message buffer serves packets at line rate, and a core selection algorithm is responsible for allocating incoming processes to idle cores.
The system provides a priority thread scheduling algorithm that supports up to four threads, and a register file interface. Finally, the scheme has a hardware/software interface to debug and test custom programs. Results show that the wire-to-wire latency through the application is just 65ns, about 13&#215; faster than current state-of-the-art approaches. The proposed hardware transport layer responds faster than software, leading to a tighter congestion control loop between endpoints. Additionally, nanoPU can handle many simultaneous RPC communications, scaling linearly with the number of outstanding messages rather than with the number of hosts in a data center.</p><p>Gao et al. <ref type="bibr">[71]</ref> proposed a system called OVS-CAB that applies efficient rule-caching to offload an Open Virtual Switch (OVS) to a SmartNIC. The system can handle traffic at a high line rate and offload rule-lookup functions to the SmartNIC while keeping CPU usage low. Figure <ref type="figure">26</ref> shows the system overview. The OVS-CAB hardware offloading system consists of three principal components: 1) a P4-based hardware table pipeline that serves as the hardware data plane, 2) a userspace control plane module that implements the OVS-CAB rule-caching logic, and 3) an RPC-based message system that communicates between the data plane and the control plane.</p></div>
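The rule-caching idea can be sketched as a small exact-match fast path backed by the full userspace table; the class, capacity, and eviction policy below are assumptions for illustration, not OVS-CAB's actual algorithm:

```python
# Illustrative sketch of hardware rule-caching: a small exact-match
# cache (standing in for the NIC table) answers most lookups; misses
# fall back to the full userspace table and install the missing rule.

class RuleCache:
    def __init__(self, full_table, capacity=2):
        self.full_table = full_table      # userspace (slow path) rules
        self.cache = {}                   # "hardware" exact-match table
        self.capacity = capacity
        self.hits = self.misses = 0

    def lookup(self, flow):
        if flow in self.cache:
            self.hits += 1                # fast path: stays on the NIC
            return self.cache[flow]
        self.misses += 1                  # slow path: consult userspace
        action = self.full_table.get(flow, "drop")
        if len(self.cache) >= self.capacity:
            self.cache.pop(next(iter(self.cache)))  # evict oldest entry
        self.cache[flow] = action
        return action
```

The interesting part of such systems is choosing *which* rules to install so that the small table absorbs most traffic; the FIFO eviction here is only a placeholder for that policy.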
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Figure <ref type="figure">27</ref>: Packet processing steps in mOS. Upon packets' arrival, mOS first updates its TCP context for the packet sender. Then, it sends the records of all flow events that must be triggered <ref type="bibr">[150]</ref>.</p><p>With the selected rules, the system achieves a high data plane hit rate while using low memory. Results show that OVS-CAB improves the throughput of OVS and significantly reduces the host CPU utilization. Kohler et al. <ref type="bibr">[72]</ref> developed an in-network computing system for Complex Event Processing (CEP). CEP is a method that consists of tracking and analyzing streams of information about events to infer conclusions from them. This method requires real-time processing to extract data from events such as text messages, social media posts, stock market feeds, traffic reports, weather reports, or other data types. The authors demonstrate that it is feasible to express CEP operations in P4 and develop a tool to compile CEP operations. Furthermore, they identify challenges and open problems to suggest future research directions for implementing full-fledged in-network CEP systems.</p><p>Mohammadkhan et al. <ref type="bibr">[73]</ref> offloaded NFVs by using P4-programmable NICs. NFVs are widely adopted in large-scale enterprise and data center networks to deliver more complex and diverse in-network processing. The scheme is called P4NFV and aims at determining the partitioning of the P4 tables and the optimal placement of NFs to minimize the overall delay and reduce resource utilization. P4NFV offers a unified host and NIC data plane environment to the SDN controller.</p><p>P4NFV's framework considers both host and NIC capabilities to provide a unified view of a P4-capable NFV processing engine.
For example, in <ref type="bibr">[151]</ref>, the authors describe how NFs can use high-performance userspace TCP stacks to simplify operations such as TCP byte stream reconstruction. This relatively heavyweight function involves copying packet data into a buffer, incurring unnecessary latency due to the processing delay of a chain of NFs that perform TCP byte stream reconstruction using mOS <ref type="bibr">[150]</ref>. mOS is a reusable networking stack for stateful flow processing in middlebox applications. Figure <ref type="figure">27</ref> shows the packet processing steps of mOS. Results with P4NFV demonstrate that as the chain length increases, the latency of the NFs performing TCP processing increases significantly compared to layer-2 NFs. This additional latency can be avoided by performing TCP processing only once and then exposing the resulting stream to the sequence of NFs.</p><p>Pontarelli et al. <ref type="bibr">[69]</ref> present FlowBlaze, an abstraction that extends match-action languages such as P4 to simplify the description of a large set of layer 2 to layer 4 stateful functions, making them available for implementation on FPGA-based SmartNICs. FlowBlaze adapts match-action tables to describe the evolution of a network flow's state using Extended Finite State Machines (EFSMs). The framework aims to provide a system that allows a programmer with little hardware design expertise to quickly implement and update stateless and stateful packet processing functions at high speed on FPGA-based SmartNICs.</p></div>
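The EFSM abstraction can be sketched as per-flow state plus registers driven by a transition table; the TCP-handshake states below are an illustrative example, not FlowBlaze's actual tables:

```python
# Minimal EFSM sketch in the spirit of FlowBlaze: each flow carries a
# state and registers, and a (state, event) table drives transitions.
# The TCP-handshake states are illustrative only.

TRANSITIONS = {
    ("CLOSED", "SYN"): "SYN_SEEN",
    ("SYN_SEEN", "SYNACK"): "SYNACK_SEEN",
    ("SYNACK_SEEN", "ACK"): "ESTABLISHED",
    ("ESTABLISHED", "FIN"): "CLOSED",
}

def efsm_step(flow_table, flow_id, event):
    """Advance one flow's state machine; unknown events keep the state."""
    entry = flow_table.setdefault(flow_id, {"state": "CLOSED", "pkts": 0})
    entry["pkts"] += 1                      # a FlowBlaze-style register
    entry["state"] = TRANSITIONS.get((entry["state"], event),
                                     entry["state"])
    return entry["state"]
```

On the FPGA, the transition table and register updates compile to match-action stages, which is what lets a developer express stateful L2-L4 functions without hardware-design expertise.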
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.3.2.">Application Offloading Comparison, Discussions, and Limitations</head><p>Table <ref type="table">12</ref> compares the schemes described in the previous sections. P4-based NICs are continually evolving to support new features, and the role of NICs in modern server systems has changed. NICs can now host advanced features, such as protocol offloading, packet classification, rate limiting, cryptographic functions, and virtualization. As noted earlier <ref type="bibr">[138]</ref>, three factors contributed to this trend: new applications whose performance requirements cannot be met by legacy networking stacks, the increasing popularity of virtualization, and the rise of multi-tenant data centers and cloud computing, for which NICs must provide isolation mechanisms. However, a drawback of P4-based NICs is the complexity of the platforms. High-level programmable platforms support limited network functions, and existing NICs do not offer a systematic methodology for chaining multiple offload capabilities. According to Le et al. <ref type="bibr">[152]</ref>, they cannot fully implement complex network functions required by some encryption schemes or deep packet inspection. Nevertheless, P4-programmable NICs present a higher level of flexibility and programmability, and can offload almost any packet processing function from the server. Moreover, they employ processors that are more energy-efficient than x86-based server processors, achieving higher energy efficiency in packet processing.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.3.3.">Comparison with Legacy Approaches</head><p>P4-programmable NICs offer more flexibility in offloading TCP functionalities than traditional NICs, which have hardwired offloading modules (e.g., TSO, GSO, checksums). P4-programmable NICs present specialized packet engines that allow the implementation of a variety of protocol accelerators. Therefore, P4-programmable NICs can be considered a general-purpose offloading platform for developing key/value store applications, microservices, and other types of network functions, as well as network accelerators capable of offloading communication routines and computational kernels from the CPU. On the other hand, programmable NICs that do not use the P4 language (e.g., Mellanox BlueField SmartNICs <ref type="bibr">[153]</ref>) present a lower degree of flexibility, making the development process more difficult and tied to a vendor's technology. P4-programmable NICs offer similar performance and a scalable solution for large and diverse deployments. Moreover, with P4, the NIC's functionalities can be extended to support applications, thus minimizing CPU intervention.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.4.">Summary and Lessons Learned</head><p>P4-programmable NICs complement server operations by doing more than simply forwarding packets: they can perform complex operations and speed up the delivery of services that would otherwise be slowed by the general-purpose CPU. Offloading computation from a server's CPU to a programmable NIC releases a substantial amount of the server's CPU resources, making programmable NICs attractive for cloud operators seeking to save costs. Firestone <ref type="bibr">[154]</ref> considers that programmable data planes improve cloud computation due to their high performance and low overhead, two essential characteristics for data centers to reduce costs and increase performance. In the last decade, network speeds increased to around 100 Gbps while new services and applications emerged. This increase demands an efficient solution, since processing traffic with the general-purpose CPU consumes cycles, raises the customer's overall cost, and reduces the available processing power. To this extent, ensuring desirable TCP behavior usually entails performance issues, which widens the gap between CPU capacity and network bandwidth; McKeown <ref type="bibr">[155]</ref> remarked on this difference. Future work should consider the security implications of bypassing kernel functionalities that involve checksum computation and firewall policies.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8.">Network Measurements</head><p>Applications that use TCP as their transport protocol demand high throughput and low latency. Eventually, these applications can experience performance problems that are hard to diagnose due to the complexity of the TCP stack. Such complexity implies using heuristics to understand network conditions and application behavior. Therefore, there is no single optimal setting for handling and optimizing TCP traffic; the appropriate configuration usually depends on application requirements.</p><p>Events that affect TCP performance are getting harder to diagnose because of the increasing link speeds used in environments such as data centers. Moreover, the tools used to diagnose TCP performance are still the same as ten years ago (e.g., capturing packet traces and tracing TCP execution). It has been shown that these tools <ref type="bibr">[156]</ref><ref type="bibr">[157]</ref><ref type="bibr">[158]</ref> are effective for diagnosing individual TCP connections. However, they do not scale to diagnosing problems in large data centers, where the number of flows can easily surpass one million.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8.1.">Challenges on TCP Performance Monitoring</head><p>A TCP connection might encounter performance issues at the sender, at the receiver, or in the network. Table <ref type="table">13</ref> summarizes examples of performance problems corresponding to each element of the network. Considering the significant number of applications that use TCP, it is hard to select the right metrics to identify TCP performance issues.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8.1.1.">Identifying Issues in the Sender</head><p>TCP performance problems at the sender may occur due to limited resources. For example, a slow disk or a slow CPU produces data at a lower rate than the maximum capacity the network can handle; this limitation is also described as the application being non-backlogged. To identify this issue, measurement schemes can count sent packets and compare them to the estimated sending rate, which is defined by the receive and congestion windows. Notice that resource limitations on the receiver side can affect performance as well.</p></div>
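The check described above can be sketched as follows; this is an illustrative Python approximation (the 80% threshold and function names are assumptions, not taken from a specific surveyed scheme):

```python
# Sketch: flag an application-limited ("non-backlogged") sender by comparing
# what it actually sent in one RTT against the window the stack allowed.
# The 80% threshold and all names are illustrative, not from the survey.

def effective_window(cwnd_bytes: int, rwnd_bytes: int) -> int:
    """TCP may keep at most min(cwnd, rwnd) bytes in flight."""
    return min(cwnd_bytes, rwnd_bytes)

def is_app_limited(bytes_sent_in_rtt: int, cwnd_bytes: int,
                   rwnd_bytes: int, threshold: float = 0.8) -> bool:
    """The sender is app-limited if it used well under its allowed window."""
    budget = effective_window(cwnd_bytes, rwnd_bytes)
    return bytes_sent_in_rtt < threshold * budget

# A sender allowed 100 KB that produced only 20 KB is application-limited.
print(is_app_limited(20_000, cwnd_bytes=100_000, rwnd_bytes=200_000))  # True
print(is_app_limited(95_000, cwnd_bytes=100_000, rwnd_bytes=200_000))  # False
```

A data plane implementation would keep the sent-byte counter and window estimates in per-flow registers rather than function arguments.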
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8.1.2.">Inferring Network Statistics</head><p>Network problems degrade TCP's performance. For example, TCP reduces its sending rate as a result of packet losses and high latency. The sender determines its sending rate based on its congestion control algorithm, which comes in different flavors; the dynamics of a TCP flow therefore vary depending on the congestion control algorithm in use. Identifying the congestion control algorithm, and the number of TCP flows using a specific algorithm, provides relevant information for notifying network elements (i.e., senders, receivers, routers, and switches) about the performance. Experimental evaluations <ref type="bibr">[159,</ref><ref type="bibr">160]</ref> reported that BBRv1 presents poor coexistence with algorithms such as CUBIC. However, legacy devices are unable to provide such information. With programmable data planes, developers can create schemes to infer the congestion control algorithm <ref type="bibr">[47]</ref> as well as many other metrics, such as the RTT, retransmission rate, and link utilization. Programmable data planes can provide real-time monitoring by extracting packet header information (i.e., packet parsing), observing the information carried by packets across multiple stages (i.e., P4 metadata), and tracking the state of TCP flows using registers.</p><p>Figure <ref type="figure">28</ref> (caption): before the failure, the primary path is through AS1. Blink identifies the failure in AS4 by measuring retransmissions and immediately reroutes TCP flows to both AS2 and AS3. The Blink switch then determines that AS2 is not working and forwards all the traffic through AS3 <ref type="bibr">[77]</ref>.</p></div>
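As a rough illustration of how a data plane tracks per-flow TCP state in registers, the following Python sketch mimics hash-indexed register arrays to count suspected retransmissions. The register size, CRC hash, and the sequence-number heuristic are simplifying assumptions, not a specific scheme from the survey:

```python
# Sketch of per-flow state tracking in data-plane-style register arrays:
# a segment whose sequence number does not advance past the highest one seen
# is counted as a suspected retransmission. Sizes/hashing are illustrative;
# a real P4 pipeline would use hash-indexed register externs.
import zlib

REG_SIZE = 1024                      # number of register slots
highest_seq = [0] * REG_SIZE         # per-slot highest sequence number seen
retx_count = [0] * REG_SIZE          # per-slot retransmission counter

def flow_index(src: str, dst: str, sport: int, dport: int) -> int:
    key = f"{src}:{sport}->{dst}:{dport}".encode()
    return zlib.crc32(key) % REG_SIZE   # CRC hash, as P4 targets commonly offer

def observe_segment(src, dst, sport, dport, seq: int) -> bool:
    """Update registers for one segment; return True if it looks retransmitted."""
    i = flow_index(src, dst, sport, dport)
    if seq <= highest_seq[i] and highest_seq[i] != 0:
        retx_count[i] += 1
        return True
    highest_seq[i] = max(highest_seq[i], seq)
    return False

observe_segment("10.0.0.1", "10.0.0.2", 4444, 80, seq=1000)   # first: not retx
observe_segment("10.0.0.1", "10.0.0.2", 4444, 80, seq=2460)   # advances
print(observe_segment("10.0.0.1", "10.0.0.2", 4444, 80, seq=1000))  # True
```

Real schemes must also handle sequence-number wraparound and hash collisions, which this toy ignores.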
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8.1.3.">Tuning Receiver's Parameters</head><p>The TCP receive window determines the amount of data the receiver can hold. The larger the buffer, the more data can be in flight between two hosts. If the TCP receive buffer is smaller than the BDP, the sender waits longer for the receiver to acknowledge the data, resulting in degraded performance. Identifying the limitations of a TCP connection on the receiver side is essential to tuning the buffers correctly.</p></div>
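As a quick illustration of the rule above, the receive buffer must be at least the bandwidth-delay product (BDP) for a single flow to keep the pipe full; a minimal Python sketch with illustrative link parameters:

```python
# Sketch: a receive buffer below the BDP (bandwidth x RTT) caps a single
# flow's throughput below link capacity. Numbers are illustrative.

def bdp_bytes(bandwidth_bps: float, rtt_s: float) -> float:
    """BDP = bandwidth x RTT, converted from bits to bytes."""
    return bandwidth_bps * rtt_s / 8

def receiver_limited(rcv_buf_bytes: float, bandwidth_bps: float,
                     rtt_s: float) -> bool:
    return rcv_buf_bytes < bdp_bytes(bandwidth_bps, rtt_s)

# A 10 Gbps path with a 20 ms RTT needs ~25 MB of receive buffer.
print(bdp_bytes(10e9, 0.020))                      # 25000000.0 bytes
print(receiver_limited(4 * 1024**2, 10e9, 0.020))  # 4 MB buffer -> True
```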
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8.1.4.">Network Diagnosis</head><p>Network operators need to know how well their network performs to determine what services they can offer their customers. However, visualizing network behavior is not an easy task. Legacy measurement schemes rely on polling and sampling methods; thus, they cannot provide a granular view of the events that affect performance. As the volume of TCP traffic keeps increasing, reporting network performance and detecting failures require efficient schemes capable of processing, characterizing, and modifying TCP dynamics.</p><p>According to <ref type="bibr">[161]</ref>, the global IP traffic was 1.2 Zettabytes (ZB) (i.e., 1.2 trillion Gigabytes (GB)) per year in 2016. They estimated that by 2021 the global IP traffic would reach 3.3 ZB per year, meaning that the global IP traffic in 2021 is 127&#215; larger than in 2005. Recently, during the COVID-19 pandemic, Internet traffic increased between 25% and 35% in March 2020 due to the worldwide lockdowns and the shift to remote working policies <ref type="bibr">[162]</ref>.</p><p>The abrupt increase of Internet traffic demands an efficient way to measure network performance and detect failures. P4-programmable devices emerge as a promising solution for implementing novel measurement schemes by addressing the inherent shortcomings of legacy networks. P4-based schemes can accurately report network issues and take countermeasures to mitigate network problems. Such failures might have diverse sources (e.g., traffic bursts, resource allocation, protocol limitations, malicious activity, or misconfigurations), which makes implementing an efficient network diagnosis scheme a challenge.</p><p>Sampling and polling methods do not have the resolution to detect events on the order of microseconds (e.g., typically, only 1 out of 30,000 packets is sampled), which provides coarse-grained visibility. 
This coarse granularity thwarts the development of measurement applications; for instance, it is impossible to measure the dynamics of TCP parameters such as the congestion window, receive window, and sending rate. P4-programmable data planes make it possible to perform fine-grained measurements in the data plane at line rate (e.g., &#8764;100 Gbps). Programmable switches provide the flexibility to inspect traffic with high accuracy and allow programmers to take actions almost in real time (e.g., dropping a packet after a threshold is surpassed, as done in AQMs).</p></div>
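To see why such sampling rates provide only coarse-grained visibility, consider the probability that a short burst is observed at all; a small Python sketch with illustrative burst sizes:

```python
# Sketch: probability that 1-in-30,000 packet sampling observes at least one
# packet of a short burst, assuming independent per-packet sampling.
# Burst sizes are illustrative.

def p_detect(burst_packets: int, sample_rate: float = 1 / 30_000) -> float:
    """P(at least one packet sampled) = 1 - (1 - rate)^n."""
    return 1 - (1 - sample_rate) ** burst_packets

for n in (100, 1_000, 30_000):
    print(n, round(p_detect(n), 4))
# Even a 1,000-packet microburst is seen with only ~3% probability,
# which is why sampling cannot resolve microsecond-scale events.
```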
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8.1.5.">Literature Review</head><p>Holterbach et al. <ref type="bibr">[163]</ref> designed a system, called Blink, that recognizes disruptions in TCP behavior, facilitating their tracking. The system is based on the predictable behavior of a TCP connection upon a packet loss. Figure <ref type="figure">28</ref> shows a scenario where a Blink switch redirects traffic upon detecting a failure. Whenever Blink detects a failure, backup paths that are pre-populated by the control plane are immediately activated. Blink also permits policy-based rerouting, so network operators can adapt the system to their needs. The authors also considered security aspects: for instance, if an attacker generates significant traffic to redirect all the traffic through a malicious path, Blink can differentiate whether the traffic is legitimate and decide accordingly. The authors evaluated Blink's performance considering a large number of flows. Results indicate that the system achieves a fast reaction time when redirecting Internet traffic and prevents unnecessary traffic shifts even in noisy scenarios. Ghasemi et al. <ref type="bibr">[75]</ref> presented a system, called Dapper, to diagnose cloud performance using programmable data planes. Dapper analyzes TCP performance in real time at the hypervisor, NIC, or top-of-rack switch, and identifies when a connection is limited by the sender, the network, or the receiver, for instance, when a slow server competes for shared resources, the network experiences congestion, or the receiver has a small TCP receive buffer. The authors evaluated the accuracy of the system by emulating three scenarios: 1) the sender presents low performance (e.g., slow disk or busy CPU); 2) the receiver has limited resources, such as a small TCP receive buffer; and 3) the network experiences congestion. Figure <ref type="figure">29</ref> depicts the system's architecture. 
Dapper measures the metrics produced by the sender, network, and receiver to detect end-to-end performance issues and helps identify which entity is responsible for poor performance. Such a procedure is usually the most challenging part of failure detection and can take from hours to days in data centers <ref type="bibr">[164]</ref>. Once the system identifies the bottleneck, specialized tools within that component can pinpoint the root cause. Dapper infers key TCP metrics, such as the number of bytes and packets sent or received and the congestion and receive windows. Dapper achieves an average accuracy of around 94% in detecting problems regardless of their nature; the authors also mention that the accuracy is proportional to the severity of the problem. Liu et al. <ref type="bibr">[165]</ref> propose a sketch-based performance monitoring scheme that allocates sublinear memory as a function of the number of flows. A sketch is a compact data structure that summarizes the main characteristics of a larger set of data. Typically, performance monitoring requires combining information across pairs of packets (e.g., matching a packet with its acknowledgment to compute the RTT). The authors tested the system on six platforms (e.g., P4, FPGA, GPU, multi-core CPU, and OVS), with the P4 implementation being the fastest. Results show that the system detects &#8764;82% of the top 100 problematic flows in real-world packet traces using just 40 KB of memory on the Tofino ASIC.</p><p>Wang et al. <ref type="bibr">[79]</ref> proposed a P4-based TCP-friendly meter that does not degrade TCP performance. They found that meters in legacy switches interact with TCP flows in such a way that these flows can only reach about 10% of the target rate. Based on their study, the authors found that the TCP performance degradation is caused by the harmful interaction between the metering algorithm and the TCP congestion control. The system integrates a single-rate two-color meter algorithm, ECN, RED <ref type="bibr">[8]</ref>, and an adaptive control scheme <ref type="bibr">[166]</ref>. Figure <ref type="figure">30</ref> describes the flow chart of the single-rate two-color meter algorithm used in the scheme. The meter uses a token bucket to accumulate tokens that are generated at the rate of the meter and marks incoming packets red or green depending on the packet's size: if the size of a packet exceeds the current level of tokens in the bucket, the packet is marked red and then dropped; otherwise, it is marked green and passes the meter. Experimental evaluations show that the proposed system achieves rates of up to 85% of the target rate.</p><p>Chen et al. <ref type="bibr">[78]</ref> leverage programmable data planes to implement a passive tool that measures the RTT. The proposed algorithm calculates the RTT by matching the sequence number (Seq) of each packet with its corresponding acknowledgment (ACK). Compared to traditional measurement systems, which are mainly based on active probing, the proposed system produces many samples for long TCP connections. Due to memory limitations, the authors implemented the records in a multi-stage hash table that assigns an expiration time to each connection. Entries that match incoming packets produce RTT samples and are then deleted. 
On the other hand, packets that are never matched with their corresponding ACK are tagged as expired based on their timestamps and are overwritten when hash collisions occur. Results report that the system captures over 99% of all RTT samples in a campus network. The evaluation also reports that the data structure employed to store the sequence numbers achieves the best performance for RTT monitoring. Future work considers including anomaly detection.</p><p>Wang et al. <ref type="bibr">[76]</ref> contributed a similar work that addresses a limitation of the previous system: detecting the root cause of performance degradation in data center networks. The authors analyzed similar solutions and pointed out weaknesses such as the overhead produced by data collection, the difficulty of locating the point of failure, and the tuning requirements of query-driven solutions. The proposed solution, called SpiderMon, is composed of two phases. The first phase keeps track of the performance of every flow until it detects a problem. When this happens, the system passes to the second phase and triggers a debugging process that uses a causality analyzer to find the root cause of the performance degradation. Figure <ref type="figure">31</ref> shows the system architecture. SpiderMon precisely collects diagnostic information without producing excessive overhead, i.e., it can accurately find the root cause with a limited amount of telemetry data.</p><p>Chen et al. <ref type="bibr">[80]</ref> presented ConQuest, a compact data structure implemented in the data plane that identifies the flows making a significant contribution to the queue. Excessive queueing leads to higher delay and eventually to packet drops. This problem escalates in the presence of microbursts, a phenomenon that abruptly increases the queue in a short period (i.e., on a sub-millisecond scale), exhausting the egress buffer. 
ConQuest can identify contributing flows with 90% precision on a 1 ms timescale using less than 65 KB of memory. The system is implemented using a measurement data structure that operates at line rate to detect and reduce even short-lived queue buildup as it forms. The authors determined the optimal queue length that allows a single TCP connection to reach line rate, based on the minimum congestion window size required for the measured RTT.</p><p>Kim et al. <ref type="bibr">[81]</ref> introduce Meta4, a framework for monitoring network traffic by domain name in the data plane, collecting and organizing information from DNS responses into match-action tables. The system aims at reporting statistics, such as traffic volume, based on a user-defined policy. As shown in Figure <ref type="figure">32</ref>, the system is divided into three levels: the management plane, the control plane, and the data plane. In the management plane, the network operator defines high-level policies that the control plane converts into match-action rules that run in the data plane <ref type="bibr">[81]</ref>. Therefore, the network operator can measure policy-based metrics such as the traffic volume per domain. Additionally, the scheme ensures privacy by not exposing individual IP addresses. The scheme can be expanded to cover other applications, including custom traffic monitoring policies, firewalls, rate limiting, and rerouting. In another work, Chen et al. <ref type="bibr">[167]</ref> developed a system that uses programmable data planes to monitor the queueing latency of a legacy switch. The programmable data plane taps both ingress and egress ports and computes the time difference between incoming and outgoing packets to infer the queue length. 
Results collected in a campus network demonstrated the heterogeneous nature of the events that lead to high queueing delay. Such events result from: 1) steady-state congestion, which consists of high queueing delays sustained for a long time; 2) bursty traffic, which makes the queueing delay oscillate due to concurrent large flows competing for bandwidth; notice that bursty events diminish the capacity of TCP congestion control algorithms to probe the bottleneck bandwidth correctly, thereby also affecting other TCP flows and leading to poor link utilization; and 3) outlier high-delay packets; the authors also detected link underutilization and packets with a low queueing delay caused by a full queueing buffer, which requires further research.</p><p>In a recent work, Kfoury et al. <ref type="bibr">[168]</ref> designed a P4-based scheme that dynamically adjusts the buffer size of legacy routers. The programmable switch is employed as a passive tool that measures the RTT and the number of flows to set the buffer size of a legacy router and reduce unnecessary delays. Figure <ref type="figure">33</ref> shows a high-level overview of the proposed system. The dynamic buffer adjustment process consists of three steps: (1) a copy of the traffic is received by the P4 switch; (2) the RTT of individual TCP flows is computed in the P4 switch's data plane; and (3) the P4 switch modifies the router's buffer size according to the rule RTT &#215; C / &#8730;N <ref type="bibr">[117]</ref>, where C is the link capacity and N is the number of long flows. Results show an improvement in metrics such as the RTT, fairness, and the loss rate. Moreover, the FCT of short flows is also improved compared to a scenario where the buffer size remains fixed.</p></div>
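Several of the surveyed schemes (e.g., Chen et al. [78] and Kfoury et al. [168]) measure RTT passively by recording each data segment's expected acknowledgment number and matching it against returning ACKs in a size-bounded, expiring table. A minimal Python sketch of that matching logic, with illustrative names and expiry; a plain dict stands in for the switch's multi-stage hash table:

```python
# Sketch of passive RTT measurement: when a data segment passes, record its
# expected ACK number with a timestamp; when the matching ACK returns, the
# time difference is one RTT sample. Expiry and names are illustrative.

EXPIRY_S = 2.0                      # ignore stale entries instead of matching
pending = {}                        # (flow, expected_ack) -> send timestamp

def on_data(flow, seq: int, payload_len: int, now: float) -> None:
    pending[(flow, seq + payload_len)] = now   # the ACK will carry seq + len

def on_ack(flow, ack: int, now: float):
    """Return an RTT sample in seconds, or None if there is no fresh match."""
    sent = pending.pop((flow, ack), None)
    if sent is None or now - sent > EXPIRY_S:
        return None
    return now - sent

f = ("10.0.0.1", "10.0.0.2", 4444, 80)
on_data(f, seq=1000, payload_len=1460, now=0.000)
print(on_ack(f, ack=2460, now=0.032))    # one 32 ms RTT sample
```

A hardware table additionally overwrites expired entries on hash collisions, as described for Chen et al.'s multi-stage design.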
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8.1.6.">Network Performance Measurement Schemes Comparison, Discussions, and Limitations</head><p>Table <ref type="table">14</ref> compares the schemes described in the previous section. Considering that TCP is the most widely used transport protocol, the surveyed schemes must track metrics that capture its dynamics while keeping memory consumption manageable. Sketch-based monitoring <ref type="bibr">[165]</ref> allocates sublinear memory as a function of the number of flows, which makes the system more scalable. Similarly, Blink considers metrics such as the retransmission rate, packet loss rate, round-trip time, and out-of-order packets to identify the top-k problematic flows. Moreover, Blink identifies failures by exploiting the predictable behavior of TCP, which retransmits packets at epochs exponentially spaced in time in the presence of a failure. For instance, Blink promptly reroutes traffic whenever failure signals are generated by the data plane, whereas SpiderMon localizes the root cause, such as a host's sending rate. Schemes such as SpiderMon base failure identification on latency increases, while other schemes (e.g., Dapper) leverage reactive processing to identify and mitigate failures. Finally, it is worth mentioning that some systems (e.g., Blink, Dapper) used real-world data, such as the traces provided by CAIDA, to evaluate their systems. Using real-world traces helps to validate the proposed solutions under production network conditions.</p></div>
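Blink's failure signal, as described above, fires when many monitored flows retransmit within the same short time window. A minimal Python sketch of that majority-vote logic; the 64-flow pool and 50% threshold are illustrative assumptions, not Blink's actual constants:

```python
# Sketch of Blink-style data-plane failure inference: a failure is suspected
# when a large share of monitored flows retransmit in the current window.
# Pool size and threshold are illustrative, not from the paper.

MONITORED_FLOWS = 64
THRESHOLD = 0.5          # fraction of flows that must retransmit

def failure_suspected(retransmitting_flags: list[bool]) -> bool:
    """retransmitting_flags[i] is True if monitored flow i retransmitted
    in the current time window."""
    assert len(retransmitting_flags) == MONITORED_FLOWS
    share = sum(retransmitting_flags) / MONITORED_FLOWS
    return share >= THRESHOLD

quiet = [False] * MONITORED_FLOWS
outage = [True] * 40 + [False] * 24       # 40/64 flows retransmitting
print(failure_suspected(quiet))    # False
print(failure_suspected(outage))   # True -> activate a pre-populated backup path
```

Voting over a pool of flows, rather than reacting to a single flow, is what keeps the signal robust to noise and to adversarial traffic.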
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8.1.7.">Comparison with Legacy Approaches</head><p>The principal difference between P4-programmable and legacy network diagnosis schemes is the granularity provided by the data plane. Such granularity enables diagnosis schemes that can detect events and execute actions at line rate. Moreover, P4-programmable data planes can process failure events before informing the control plane, as in <ref type="bibr">[163]</ref>. A drawback of P4 schemes compared with legacy devices is the lack of a standard protocol that defines the interaction between the control and management planes. Vendors have defined solutions such as NetFlow and JFlow, and there are Internet standards such as the Simple Network Management Protocol (SNMP); however, these schemes lack granularity. Lastly, P4-based diagnosis schemes can take custom actions (e.g., rerouting) to react to failure events, whereas in legacy devices packet manipulation is restricted, limiting the control plane's capacity to perform actions.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8.2.">Summary and Lessons Learned</head><p>Network monitoring and measurements are valuable tools for diagnosing TCP performance degradation and locating the point of failure. The measurement works presented and discussed in the previous sections try to minimize overhead, evaluate the severity of problems, and identify where failures occur. Future works should consider integration with legacy networks, reducing storage requirements, improving the accuracy of issue evaluation, extending monitoring primitives, and minimizing controller intervention. Some schemes focus on providing a response mechanism upon failure detection to improve QoS by applying traffic policing and management. Techniques adopted include application-layer inspection, traffic metering, traffic separation, and bandwidth management.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="9.">Challenges and Future Trends</head><p>This section presents the challenges and future trends of P4-based schemes that improve TCP performance. The challenges are inferred from the limitations of the surveyed works. The trends include the initiatives that aim at solving those challenges. Figure <ref type="figure">34</ref> shows the challenges and their corresponding future trends.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="9.1.">Arithmetic Computation</head><p>Programmable data planes are limited in arithmetic computation; for example, floating-point operations consume more hardware resources than integer operations and process inputs in a highly sequential, multi-stage manner. Therefore, programmers resort to approximation techniques that involve integers, which produce an output at a predictable pace. Furthermore, there are data plane architectures that use approximations to implement filters <ref type="bibr">[179]</ref>; for instance, a scheme can use a low-pass filter to smooth the queue occupancy. However, approximating arithmetic operations in the data plane consumes significant hardware resources and restricts the concurrent execution of data plane programs. Operations demanded by AQMs (e.g., the square root function in the CoDel algorithm) are challenging to implement in P4 <ref type="bibr">[11,</ref><ref type="bibr">53,</ref><ref type="bibr">54]</ref>. Extending the supported arithmetic operations will lead to a more accurate estimation of the traffic entropy and computation of TCP statistics (e.g., RTT calculation, CWND size estimation, and congestion control inference).</p><p>Current and Future Initiatives. Approximation and pre-computation methods have been implemented to overcome such limitations. Schemes that use approximations rely on a small set of supported operations to approximate the desired value with a precision penalty. For instance, the square root function can be approximated by counting the number of leading zeros through a longest-prefix match <ref type="bibr">[53]</ref>. The pre-computation method involves the participation of the control plane to provide the set of values. 
The results are stored in registers and match-action tables, which have limited storage and incur additional processing time when the control plane intervenes to update them.</p><p>Current initiatives such as the one proposed by Swamy et al. <ref type="bibr">[169]</ref> exploit parallelism based on a map-reduce abstraction. This approach will help programmable network devices, such as switches and NICs, identify and offload complex computations that can be pre-evaluated in the control plane, reducing computation and communication overhead simultaneously. Ding et al. <ref type="bibr">[180]</ref> implemented the logarithmic and exponential functions in P4 entirely in the data plane without using TCAMs, which are scarce resources. Both functions aim at calculating the entropy, which measures the traffic distribution <ref type="bibr">[181]</ref>. The entropy calculation helps determine performance issues derived from TCP congestion control algorithms <ref type="bibr">[182]</ref>, load balancing <ref type="bibr">[183]</ref>, and network anomaly detection <ref type="bibr">[184,</ref><ref type="bibr">185]</ref>.</p></div>
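The leading-zeros approximation of the square root mentioned above can be sketched in a few lines. Here `int.bit_length` stands in for the longest-prefix-match lookup a switch would use, and the power-of-two result is accurate only to within roughly a factor of two:

```python
# Sketch of the leading-zeros trick: approximate sqrt(x) as 2^(msb/2), where
# msb is the position of the highest set bit (found in hardware via a
# longest-prefix match, emulated here with int.bit_length). The power-of-two
# estimate is within roughly a factor of two of the true square root.
import math

def approx_sqrt(x: int) -> int:
    if x <= 0:
        return 0
    msb = x.bit_length() - 1          # highest set bit: 2^msb <= x < 2^(msb+1)
    return 1 << (msb // 2)            # 2^(msb/2) ~ sqrt(2^msb) <= sqrt(x)

for x in (16, 100, 1024, 5000):
    print(x, approx_sqrt(x), round(math.sqrt(x), 1))
```

The precision penalty is visible in the output: powers of four are exact, while other inputs are underestimated, which is acceptable for threshold-style AQM decisions.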
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="9.2.">Fast Loss Detection</head><p>Packet losses are expected in networks, and they can happen for several reasons. In <ref type="bibr">[186]</ref>, the author quantifies the impact of packet losses in data center environments. Packet losses significantly affect TCP performance; consequently, operators are interested in identifying the root cause of losses in order to recover from them. It is therefore essential to detect losses quickly, independently of their origin, to diagnose the network and mitigate the issue. However, legacy monitoring schemes often fail to capture losses quickly, with enough detail, and with low overhead. Current P4 implementations <ref type="bibr">[43,</ref><ref type="bibr">44,</ref><ref type="bibr">187]</ref> do not infer the root cause of packet losses and only react to congestion.</p><p>Current and Future Initiatives. P4-programmable data planes provide the flexibility to use custom data structures, enabling the implementation of loss detection schemes. Li et al. <ref type="bibr">[170]</ref> implemented LossRadar, a fast packet loss detection scheme built on programmable data planes that identifies the location of packet losses. The system compares traffic digests collected at different meters: digests corresponding to the same packet cancel each other, whereas a digest that persists indicates that a loss has occurred. The system presents low overhead, which is proportional to the number of packet losses. Additionally, the digest includes flow-level information and a packet identifier, which helps identify the source of packet losses. Future works should focus on improving the root cause inference algorithm; one option is adding machine learning techniques instead of threshold-based approaches, so that the mechanism classifies the type of losses with more accuracy. 
Furthermore, it could also categorize the type of losses and react according to the event. Future work should consider a scheme that tolerates losses induced by non-loss-based congestion control algorithms (e.g., BBR), since such losses do not affect throughput significantly.</p></div>
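The digest-cancellation idea can be sketched as follows; this toy Python version uses a single XOR/count pair and decodes only one loss, whereas LossRadar uses Invertible-Bloom-filter-like digests to decode several losses at once:

```python
# Sketch of digest cancellation: an upstream and a downstream meter each fold
# packet IDs into a tiny digest (XOR + count). Packets seen by both cancel
# out; the difference reveals what was lost. This toy decodes only a single
# loss; the real system uses richer, multi-cell digests.

class Digest:
    def __init__(self):
        self.xor_ids = 0
        self.count = 0

    def add(self, packet_id: int) -> None:
        self.xor_ids ^= packet_id
        self.count += 1

def lost_packet(upstream: "Digest", downstream: "Digest"):
    """Return the lost packet ID if exactly one packet went missing."""
    if upstream.count - downstream.count == 1:
        return upstream.xor_ids ^ downstream.xor_ids
    return None

up, down = Digest(), Digest()
for pid in (0xAAA1, 0xAAA2, 0xAAA3):
    up.add(pid)
for pid in (0xAAA1, 0xAAA3):          # 0xAAA2 was dropped in between
    down.add(pid)
print(hex(lost_packet(up, down)))     # 0xaaa2
```

Because the digest size is independent of the traffic volume, the reporting overhead is proportional to the number of losses, as the text notes.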
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="9.3.">Enhancing TCP Congestion Feedback</head><p>Traditional TCP congestion control algorithms do not differentiate the cause of packet drops. The sender considers the expiration of the retransmission timer or the receipt of three duplicate ACKs as signals of congestion and reacts by reducing its rate, usually according to a multiplicative decrease function. However, packet drops can have different causes, such as packet corruption, excessive delay due to bufferbloat, a small TCP receive buffer, and other misconfigurations. Therefore, an improper reaction to losses can lead to the underutilization of network resources. Moreover, with the advent of 5G deployments, there is a need for enhanced mechanisms that reduce packet losses to fulfill QoS requirements. Although P4-programmable switches make it possible to enhance TCP congestion feedback, the sender must cooperate by understanding the congestion signals. However, the literature reports that even RFC-standardized congestion feedback such as ECN is rarely enabled in production networks <ref type="bibr">[83]</ref>.</p><p>Current and Future Initiatives. Many schemes leveraging P4-programmable data planes aim at enhancing TCP congestion feedback <ref type="bibr">[43]</ref><ref type="bibr">[44]</ref><ref type="bibr">[45]</ref>. They propose novel schemes that use, for example, INT, EECN, and custom headers to promptly notify the sender about the cause of the congestion. The main goal of such schemes is to make congestion control algorithms take agile control decisions and improve robustness. Moreover, P4-based schemes are not restricted to observing local failures; rather, they can provide a global view of the congestion event.</p><p>Schemes that aim at enhancing TCP congestion feedback have demonstrated better performance than legacy approaches. However, many of these schemes are designed and implemented for data center networks, where delays are lower than in WANs. 
Recently, Chen and Nagaoka <ref type="bibr">[188]</ref> reported that ECN is not widely adopted on the Internet. They highlight the need to incentivize ISPs to enable ECN in their deployments. Therefore, future works can explore how P4-based schemes that implement an enhanced feedback mechanism affect TCP performance in a WAN, where conditions (e.g., delay, loss rate, number of flows, type of traffic) differ from data center environments. Moreover, implementing congestion control mechanisms such as the eXplicit Control Protocol (XCP) <ref type="bibr">[189]</ref> and the Rate Control Protocol (RCP) <ref type="bibr">[190]</ref> in P4 could show how enhanced feedback impacts networks with a large BDP.</p></div>
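As an illustration of the idea above, the following Python sketch shows a sender-side reaction policy that differentiates explicit feedback signals instead of treating every loss as congestion. The signal names and reactions are hypothetical and not taken from any of the surveyed schemes:

```python
# Hypothetical sketch: a sender's reaction policy that differentiates
# congestion signals, rather than treating every loss as congestion.
# Signal names and reactions are illustrative, not from a cited scheme.

def react_to_feedback(cwnd: float, signal: str) -> float:
    """Return the new congestion window given an explicit feedback signal."""
    if signal == "ecn_mark":          # network marked the packet: real congestion
        return max(cwnd * 0.5, 1.0)   # classic multiplicative decrease
    if signal == "corruption":        # loss caused by bit errors, not congestion
        return cwnd                   # retransmit, but keep the sending rate
    if signal == "timeout":           # severe congestion or path failure
        return 1.0                    # fall back to slow start
    return cwnd                       # unknown signal: no change
```

The point of such a policy is that only the `ecn_mark` branch sacrifices throughput; corruption losses, which legacy TCP would also treat as congestion, leave the window untouched.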
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="9.4.">TCP State Synchronization</head><p>The processing speed mismatch and communication overhead between the data and control planes pose a challenge to achieving TCP state synchronization. The data plane operates at line rate and constantly updates the state of packets in the switch. The control plane, on the other hand, executes operations to read and manipulate data plane state. Therefore, applications that require state synchronization demand a mechanism that provides a better view of the network. Nevertheless, it is challenging to reach accurate state synchronization in programmable networks because legacy implementations usually cannot take actions at line rate. For example, tracking the dynamics of TCP flow characteristics is an application that requires accurate state synchronization <ref type="bibr">[191]</ref>. Another practical example involves ensuring per-flow consistency, which occurs when load balancers ensure that all TCP packets are delivered to the receiver before changing the route. Such an action can only be achieved if the switches along the path have precise information about the state of a TCP connection.</p><p>Current and Future Initiatives. Luo et al. <ref type="bibr">[171]</ref> proposed Swing State, a data plane framework to consistently move/replicate states among devices. The system uses each packet to record state values and pass them to the next P4 switch. Once the states are synchronized, flows can be migrated using any existing network update technique such as <ref type="bibr">[172,</ref><ref type="bibr">173]</ref>. State synchronization contributes to the accurate detection of events such as microbursts and TCP incast. Synchronized measurements help applications find issues, especially when TCP incast and microbursts are present in the network. Yaseen et al. 
<ref type="bibr">[172]</ref> highlighted that current schemes that detect such events are empirical (e.g., packet counting, TCP timeouts, and packet drop detection). Consequently, with accurate TCP state synchronization, operators can have a global view of the dynamics of the network.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="9.5.">Expanding Memory Capacity</head><p>P4 allows programmers to store and retrieve packet data using registers, counters, and meters. This feature enables the stateful processing needed to implement features such as queueing algorithms, classifiers, and policers, which are essential for implementing programs in the data plane. Future works aim to expand the memory capacity to handle a more significant number of flows. However, </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>RDMA Initialization</head><p>Figure <ref type="figure">35</ref>: External memory for the data plane scheme. The data plane can utilize remote memory regions registered by the RNICs of servers connected to the switch. The system comprises a remote lookup table as an example; other data structures can also be used <ref type="bibr">[174]</ref>.</p><p>stateful processing is limited by the memory capacity of the switch. For example, schemes like Dapper <ref type="bibr">[75]</ref> cannot scale to handle more than 10,000 flows because each flow consumes 67 bytes (i.e., sixteen four-byte registers to keep the flow state, a two-byte register to track the MSS, and a one-byte register for the window scale), in addition to 40 bytes of metadata used to carry packet information. Therefore, to scale a scheme like Dapper, additional fast-access memory is required to reduce collisions in the table.</p><p>Expanding the switch's memory capacity will allow integrating more features into the device. Another example is BurstRadar <ref type="bibr">[192]</ref>, which, with 300 table entries, can handle up to ten simultaneous microbursts with a packet loss rate below 1%. The authors mention that 1000 entries are required to reduce the miss rate to 0%. Microbursts are network events that last from 10 µs to 100 µs and produce intermittent congestion, which increases latency and causes jitter and packet losses in data center networks <ref type="bibr">[193]</ref>. Events that produce microbursts include TCP incast and TCP segment offloading. To conclude, the authors remark that microbursts are becoming more common as link speeds increase beyond 10 Gbps, while switch buffer sizes do not change.</p><p>Current and Future Initiatives. Kim et al. 
<ref type="bibr">[174,</ref><ref type="bibr">175]</ref> proposed expanding the memory capacity of the switch by accessing Dynamic Random Access Memory (DRAM) installed on data center servers directly from the data plane. This approach is flexible since it uses existing resources available in commodity hardware without adding costs. The approach, presented in Figure <ref type="figure">35</ref>, is implemented using RoCE and only suffers a latency penalty of an extra 1-2 µs. Tierney et al. <ref type="bibr">[176]</ref> tested TCP performance using a similar infrastructure. The experiment consisted of using a dedicated path for RoCE traffic while simultaneously generating TCP traffic. The authors concluded that the ability of RoCE to provide low latency and low system overhead makes TCP a compelling technology for high-resolution data transfers. Future works should also consider supporting common data structures in remote memory. The scheme discussed in this section only implements address-based memory access, meaning it does not support ternary matching or other data layouts in remote memory.</p></div>
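The per-flow cost cited for Dapper can be checked with simple arithmetic. The Python sketch below reproduces the 67-byte register figure from the section above and scales it to a given flow count; the SRAM observation in the final comment is an illustrative motivation, not a vendor figure:

```python
# Back-of-the-envelope check of the per-flow state cost cited for Dapper:
# sixteen 4-byte registers for flow state, a 2-byte register for the MSS,
# and a 1-byte register for the window scale, plus 40 bytes of metadata
# carried with each packet.

REGISTER_BYTES = 16 * 4 + 2 + 1   # 67 bytes of register state per flow
METADATA_BYTES = 40               # per-packet metadata (not stored per flow)

def sram_needed(flows: int) -> int:
    """Total bytes of register memory needed to track `flows` flows."""
    return flows * REGISTER_BYTES

# At 10,000 flows, register state alone is already hundreds of kilobytes,
# a significant share of a programmable ASIC stage's SRAM budget; this is
# what motivates the DRAM-backed remote memory designs discussed above.
```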
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="9.6.">Improving TCP Monitoring Schemes</head><p>Monitoring and diagnosing TCP performance issues is essential for efficient network operation. Current monitoring systems are mainly query-based <ref type="bibr">[194]</ref><ref type="bibr">[195]</ref><ref type="bibr">[196]</ref> or sample-based <ref type="bibr">[23,</ref><ref type="bibr">38,</ref><ref type="bibr">39]</ref>. However, such schemes lack the accuracy to correctly track the dynamics of TCP flows (e.g., microburst detection). Network telemetry provides the foundation for network management applications such as performance monitoring, debugging, failure localization, load balancing, and security. INT is a network monitoring framework available in programmable data planes that enhances network visibility, making it attractive for production networks <ref type="bibr">[43]</ref>. However, a major drawback is the telemetry overhead added to packets: each switch appends telemetry information, which reduces the MSS linearly. Consequently, a packet may end up carrying more telemetry information than payload, forcing the application to fragment a payload into multiple packets and consequently experience reduced performance.</p><p>Current and Future Initiatives. Future initiatives should focus on reducing telemetry overhead. A solution could be limiting the telemetry information that each packet can carry, letting developers establish the trade-off between the MSS and the telemetry headers. This is supported by the fact that applications can achieve good network visibility with approximate telemetry information. For instance, checking a flow's path conformance is possible by analyzing a collection of packets. 
Another consideration regarding TCP congestion control algorithms is that they can achieve their goal if packets convey information about the path's bottleneck, without requiring knowledge about all hops. Lin et al. <ref type="bibr">[177]</ref> tackled this problem differently by proposing a fine-grained real-time telemetry scheme. The scheme relies on the observation and recording (O&amp;R) of the states of packets being forwarded by the data plane. The scheme does not measure conventional performance parameters (i.e., end-to-end delay, jitter, throughput, and packet loss rate). However, it includes a clock offset elimination algorithm to reduce the time synchronization error between two adjacent switches. Future works could integrate various monitoring features (e.g., RTT, packet loss, link utilization, failures) implemented in P4 to verify whether the traffic meets Service Level Agreements (SLAs).</p></div>
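The linear MSS erosion described above can be made concrete with a small sketch. The header sizes used here are assumptions for illustration only; actual per-hop INT metadata length depends on the INT instruction set configured:

```python
# Illustrative sketch of how per-hop INT metadata erodes the usable MSS.
# INT_SHIM and PER_HOP_METADATA are assumed values, not from the INT spec.

MSS = 1460              # typical MSS for a 1500-byte MTU (bytes)
INT_SHIM = 12           # assumed fixed INT shim/header overhead (bytes)
PER_HOP_METADATA = 24   # assumed metadata appended by each switch (bytes)

def usable_payload(hops: int) -> int:
    """Payload bytes left for the application after telemetry headers."""
    return MSS - INT_SHIM - hops * PER_HOP_METADATA

# Under these assumptions, a 10-hop path costs the application 252 bytes
# of every segment; bounding the number of reporting hops (or sampling
# hops probabilistically) bounds this overhead.
```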
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="9.7.">Reducing Control Plane Intervention</head><p>Control-plane operations are essential to support and complement network protocols. However, control plane intervention can result in additional latency and consequently affect TCP performance. For example, rerouting-based monitoring queries often need more storage or access to complex computations. The data plane hardware is limited in storage, and it is not designed to perform complex operations. Therefore, the control plane intervenes at the cost of performance, a cost that should be minimized whenever possible.</p><p>Current and Future Initiatives. Control plane intervention can be reduced by optimizing the P4 program to reduce the communication overhead between the control and data planes. This task is performed by programmers with enough background to create schemes that rely more on the data plane for taking actions. Although schemes that aim at improving TCP can be implemented in the data plane, actions such as rerouting <ref type="bibr">[50,</ref><ref type="bibr">163]</ref> and complex arithmetic operations <ref type="bibr">[54]</ref> require control plane intervention. Future initiatives should consider developing tools that detect the interaction between the control and data planes and suggest alternative workflows and code optimizations to minimize it <ref type="bibr">[178]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="9.8.">Improving the QoS of TCP-based Applications</head><p>Providing QoS requires metering the traffic and selectively dropping packets when the rate exceeds the target. Major vendors offer meter functions. Meters measure the arrival rate of a flow and determine its priority as a function of the predefined rate. Packets that exceed the specified rate for the flow are marked with a lower priority and eventually discarded. Nevertheless, meters are usually not TCP-friendly, meaning that they do not regulate the rate properly or that they increase the loss rate. As a result, the achieved rate of a TCP flow is usually lower than the meter's target rate.</p><p>Current and Future Initiatives. Wang et al. <ref type="bibr">[79]</ref> designed and implemented TCP-friendly meters using programmable data planes. In this work, the authors mitigated the adverse interaction between the metering algorithm and TCP congestion control. The scheme ensures that the achieved rate of a TCP flow is close to the target rate established by the meter. The authors mention that future works in this area should consider optimizing data plane resource utilization and testing the scheme with different TCP variants. The authors also highlighted that the scheme's performance could be further improved if the data plane supported floating-point operations.</p></div>
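To make the metering discussion concrete, the following is a minimal single-rate token-bucket meter sketch, a deliberate simplification of the two-rate three-color meters offered by vendors; class and parameter names are illustrative. A TCP-friendly design, as in the work above, would additionally smooth the "red" markings to avoid synchronized window collapses:

```python
# Minimal single-rate, two-color token-bucket meter (a simplification of
# vendor trTCM meters). All names and parameters are illustrative.

class Meter:
    def __init__(self, rate_bps: float, burst_bytes: float):
        self.rate = rate_bps / 8.0    # token fill rate in bytes/second
        self.burst = burst_bytes      # bucket depth in bytes
        self.tokens = burst_bytes     # bucket starts full
        self.last = 0.0               # timestamp of the last packet

    def color(self, now: float, pkt_bytes: int) -> str:
        # Refill tokens for the elapsed time, capped at the bucket depth.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= pkt_bytes:
            self.tokens -= pkt_bytes
            return "green"            # conforming: forward with normal priority
        return "red"                  # exceeding: mark low priority / drop
```

For example, a `Meter(8000, 1000)` (8 kbps, 1000-byte burst) passes a 500-byte packet at t=0, rejects a second 600-byte packet at the same instant, and accepts it again one second later once tokens have refilled.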
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="9.9.">Expanding In-network Computation</head><p>With the increasing adoption of high-speed networks and cloud-based applications, server-based processing has become a bottleneck due to its high processing and management overhead <ref type="bibr">[197]</ref><ref type="bibr">[198]</ref><ref type="bibr">[199]</ref>. Kernel bypassing is a solution that reduces these overheads and improves performance and flexibility by providing direct communication between the NIC and the application. In this approach, the kernel communication overhead is reduced, and the application performs TCP handling, so multiple applications can rapidly access the NIC. However, each application must then enforce the correctness of the underlying protocols such as TCP, which implies that TCP congestion control algorithms are implemented in the application. Moreover, since the kernel is bypassed, the OS cannot perform additional packet processing, such as demultiplexing packets, egress packet filtering, and resource allocation <ref type="bibr">[134]</ref>.</p><p>Current and Future Initiatives. P4-programmable NICs offer the tools to offload the server's operations, thus performing the computation closer to the network.</p><p>With the granularity provided by P4, cloud providers could offer fast and reliable services. Moreover, cloud providers can use data plane resources to measure resource usage accurately. Current initiatives in P4 propose serverless frameworks to offload workloads <ref type="bibr">[65,</ref><ref type="bibr">72,</ref><ref type="bibr">200]</ref>, including TCP and other applications. An example of a P4 serverless implementation is &#955;-NIC, which runs microservices entirely in the NIC. 
Future works could consider expanding the types of applications that can be handled by the NIC, considering that the migration from server-based to serverless applications increases performance at a marginal cost <ref type="bibr">[201]</ref>.</p><p>Atutxa et al. <ref type="bibr">[202]</ref> performed in-network computation using P4-programmable data planes. The system aims to reduce the response times of time-constrained applications by performing the computation in the network. It processes Message Queueing Telemetry Transport (MQTT) packets originating from an IoT device and generates an alarm when a threshold is exceeded. Results show that response times are reduced by 74% by offloading the computation to the P4 switch.</p></div>
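The threshold check described above can be approximated, at the logic level, by the following Python sketch. The field names and the threshold value are assumptions for illustration; the actual scheme parses MQTT packets in P4 on the switch rather than in software:

```python
# Hedged sketch of an in-network threshold alarm for IoT sensor readings,
# modeled after the MQTT processing described above. Field names ("id",
# "value") and the threshold are assumptions, not from the cited work.
from typing import Optional

THRESHOLD = 80.0  # e.g., a temperature limit; illustrative only

def process_reading(payload: dict) -> Optional[str]:
    """Return an alarm message if the sensor value exceeds the threshold."""
    value = payload.get("value")
    if value is not None and value > THRESHOLD:
        return f"ALARM: sensor {payload.get('id', '?')} reported {value}"
    return None   # within bounds: forward the packet unmodified
```

Raising the alarm at the switch is what removes the round trip to the server and yields the response-time reduction reported by the authors.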
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="10.">Conclusion</head><p>This paper presents a survey on TCP enhancements using P4-programmable devices. It summarizes and discusses recent works on P4-programmable devices, focusing on schemes aimed at enhancing TCP performance. It provides a taxonomy classifying the aspects that affect TCP's behavior, such as congestion control, Active Queue Management (AQM) algorithms, TCP offloading, and network measurement schemes. Then, it compares the works in each category and contrasts them with legacy implementations. Lastly, it discusses challenges and future trends.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="11.">Acknowledgement</head><p>This work was supported by the U.S. National Science Foundation under grant number 2118311, funded by the Office of Advanced Cyberinfrastructure (OAC). </p></div></body>
		</text>
</TEI>
