

Title: Practical Packet Deflection in Datacenters
Bursts, sudden surges in network utilization, are a significant root cause of packet loss and high latency in datacenters. Packet deflection, re-routing packets that arrive at a local hotspot to neighboring switches, has been shown to be a potent countermeasure against bursts. Unfortunately, existing deflection techniques cannot be implemented in today's datacenter switches: to minimize packet drops and remain effective under extreme load, they rely on hardware primitives (e.g., extracting packets from arbitrary locations in the queue) that datacenter switches do not support. In this paper, we address the implementability hurdles of packet deflection by proposing heuristics that approximate state-of-the-art deflection techniques in programmable switches. We introduce Simple Deflection, which deflects excess traffic to randomly selected, non-congested ports, and Preemptive Deflection (PD), in which switches identify the packets likely to be selected for deflection and preemptively deflect them before they are enqueued. We implement and evaluate our techniques on a testbed with Intel Tofino switches. Our testbed evaluations show that Simple and Preemptive Deflection improve the 99th percentile response times by 8× and 425×, respectively, compared to a baseline drop-tail queue under 90% load. Using large-scale network simulations, we show that the performance of our algorithms is close to that of the deflection techniques they approximate; e.g., PD achieves 4% lower 99th percentile query completion times (QCT) than Vertigo, a recent deflection technique that cannot be implemented in off-the-shelf switches, and 2.5× lower QCT than ECMP under 95% load.
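The Simple Deflection heuristic described above can be illustrated with a toy model. This is a sketch only: the class name, the simple queue-length congestion test, and the drop fallback are illustrative assumptions, not the paper's Tofino/P4 implementation.

```python
import random

class DeflectingSwitch:
    """Toy model of Simple Deflection: a packet arriving at a congested
    egress port is deflected to a randomly chosen non-congested port
    instead of being dropped."""

    def __init__(self, num_ports=8, queue_capacity=16):
        self.capacity = queue_capacity
        self.queues = [[] for _ in range(num_ports)]

    def congested(self, port):
        return len(self.queues[port]) >= self.capacity

    def enqueue(self, packet, port):
        if not self.congested(port):
            self.queues[port].append(packet)
            return ("forwarded", port)
        # Deflect: pick a random non-congested alternative port.
        candidates = [p for p in range(len(self.queues))
                      if p != port and not self.congested(p)]
        if not candidates:
            return ("dropped", None)  # every port congested: fall back to drop
        alt = random.choice(candidates)
        self.queues[alt].append(packet)
        return ("deflected", alt)
```

Preemptive Deflection would go one step further than this sketch, predicting which packets would be chosen for deflection and diverting them before they are ever enqueued.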
Award ID(s):
2313164
PAR ID:
10543923
Author(s) / Creator(s):
Publisher / Repository:
ACM
Date Published:
Journal Name:
Proceedings of the ACM on Networking
Volume:
1
Issue:
CoNEXT3
ISSN:
2834-5509
Page Range / eLocation ID:
1 to 25
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1.
    Core-Stateless Fair Queueing (CSFQ) is a scalable algorithm proposed more than two decades ago to achieve fair queueing without keeping per-flow state in the network. Unfortunately, CSFQ did not take off, in part because it required protocol changes (i.e., adding new fields to the packet header), and hardware support to process packets at line rate. In this paper, we argue that two emerging trends are making CSFQ relevant again: (1) cloud computing which makes it feasible to change the protocol within the same datacenter or across datacenters owned by the same provider, and (2) programmable switches which can implement sophisticated packet processing at line rate. To this end, we present the first realization of CSFQ using programmable switches. In addition, we generalize CSFQ to a multi-level hierarchy, which naturally captures the traffic in today's datacenters, e.g., tenants at the first level and flows of each tenant at the second level of the hierarchy. We call this scheduler Hierarchical Core-Stateless Fair Queueing (HCSFQ), and show that it is able to accurately approximate hierarchical fair queueing. HCSFQ is highly scalable: it uses just a single FIFO queue, does not perform per-packet scheduling, and only needs to maintain state for the interior nodes of the hierarchy. We present analytical results to prove the lower bounds of HCSFQ. Our testbed experiments and large-scale simulations show that CSFQ and HCSFQ can provide fair bandwidth allocation and ensure isolation. 
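The classic CSFQ rule that this work realizes on programmable switches is compact enough to sketch: edge switches label each packet with its flow's arrival rate, and a core switch drops the packet with probability max(0, 1 − α/rate), where α is the estimated fair-share rate. The sketch below shows that rule only; it is not the authors' switch implementation, and HCSFQ applies the rule at each level of the hierarchy.

```python
import random

def csfq_drop_prob(flow_rate, fair_share):
    """CSFQ core drop probability: flows at or below the fair share
    alpha lose nothing; faster flows are trimmed down to alpha in
    expectation via probabilistic drops."""
    if flow_rate <= fair_share:
        return 0.0
    return 1.0 - fair_share / flow_rate

def admit(packet_label_rate, fair_share):
    """Probabilistically admit a packet based on its edge-assigned rate label."""
    return random.random() >= csfq_drop_prob(packet_label_rate, fair_share)
```

For example, a flow sending at twice the fair share sees a drop probability of 0.5, so its admitted rate converges to the fair share without any per-flow state in the core.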
  2. Virtual switches, used for end-host networking, drop packets when the receiving application is not fast enough to consume them. This is called the slow receiver problem, and it is important because packet loss hurts tail communication latency and wastes CPU cycles, resulting in application-level performance degradation. Further, solving this problem is challenging because application throughput is highly variable over short timescales as it depends on workload, memory contention, and OS thread scheduling. This paper presents Backdraft, a new lossless virtual switch that addresses the slow receiver problem by combining three new components: (1) Dynamic Per-Flow Queuing (DPFQ) to prevent HOL blocking and provide on-demand memory usage; (2) Doorbell queues to reduce CPU overheads; (3) A new overlay network to avoid congestion spreading. We implemented Backdraft on top of BESS and conducted experiments with real applications on a 100 Gbps cluster with both DCTCP and Homa, a state-of-the-art congestion control scheme. We show that an application with Backdraft can achieve up to 20x lower tail latency at the 99th percentile. 
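The Dynamic Per-Flow Queuing (DPFQ) idea can be sketched minimally: queues are allocated on demand per flow so one slow receiver cannot head-of-line block other flows, and a full queue signals back-pressure instead of dropping. The API below is hypothetical; the real Backdraft sits inside BESS and adds doorbell queues and an overlay network, which this omits.

```python
from collections import defaultdict, deque

class DynamicPerFlowQueues:
    """Sketch of DPFQ: per-flow queues created lazily (on-demand memory),
    with a lossless 'pause' back-pressure signal instead of packet drops."""

    def __init__(self, per_flow_limit=64):
        self.limit = per_flow_limit
        self.queues = defaultdict(deque)  # flow_id -> queue, allocated on demand

    def enqueue(self, flow_id, packet):
        q = self.queues[flow_id]
        if len(q) >= self.limit:
            return "pause"  # back-pressure the sender rather than drop
        q.append(packet)
        return "enqueued"

    def dequeue(self, flow_id):
        q = self.queues.get(flow_id)
        return q.popleft() if q else None
```

Because each flow has its own queue, a slow receiver only pauses its own flow; fast receivers on other flows keep draining theirs.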
  3. The microservices architecture simplifies application development by breaking monolithic applications into manageable microservices. However, this distributed microservice “service mesh” leads to new challenges due to the more complex application topology. In particular, each service component scales up and down independently, creating load imbalance problems on shared backend services accessed by multiple components. Traditional load balancing algorithms do not port over well to a distributed microservice architecture where load balancers are deployed client-side. In this article, we propose a self-managing load balancing system, BLOC, which provides consistent response times to users without using a centralized metadata store or explicit messaging between nodes. BLOC uses overload control approaches to provide feedback to the load balancers. We show that this performs significantly better in solving the incast problem in microservice architectures. A critical component of BLOC is the dynamic capacity estimation algorithm. We show that a well-tuned capacity estimate can outperform even join-the-shortest-queue, a nearly optimal algorithm, while a reasonable dynamic estimate still outperforms Least Connection, a distributed implementation of join-the-shortest-queue. Evaluating this framework, we found that BLOC improves the response time distribution range, between the 10th and 90th percentiles, by 2–4 times and the tail (99th percentile) latency by 2 times.
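A client-side balancer of the kind described can be sketched as follows: each client keeps a per-server capacity estimate, routes to the server with the most estimated headroom, and shrinks an estimate when that server signals overload. The class name and the AIMD-style update rule are illustrative assumptions, not BLOC's actual capacity estimation algorithm.

```python
class ClientSideBalancer:
    """Sketch of feedback-driven, client-side load balancing: no central
    metadata store and no messaging between balancers; each client learns
    server capacity purely from overload signals on its own responses."""

    def __init__(self, servers, initial_capacity=10.0):
        self.capacity = {s: initial_capacity for s in servers}
        self.outstanding = {s: 0 for s in servers}

    def pick(self):
        # Route to the server with the most estimated spare capacity.
        return max(self.capacity,
                   key=lambda s: self.capacity[s] - self.outstanding[s])

    def send(self, server):
        self.outstanding[server] += 1

    def on_response(self, server, overloaded):
        self.outstanding[server] -= 1
        if overloaded:
            self.capacity[server] *= 0.8  # multiplicative decrease on overload
        else:
            self.capacity[server] += 0.1  # slow additive recovery
```

Since the overload signal rides on responses the client already receives, the estimate adapts without any coordination between load balancers.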
  4.
    We previously proposed a method to locate high packet-delay variance links in OpenFlow networks by probing multicast measurement packets along a designed route and by collecting flow-stats of the probe packets from selected OpenFlow switches (OFSs). It is worth noting that the packet-delay variance of a link is estimated based on arrival time intervals of probe packets, without measuring delay times over the link. However, the previously used route scheme based on the shortest path tree may generate a probing route with many branches in a large network, resulting in many accesses to OFSs to locate all high delay variance links. In this paper, therefore, we apply an Eulerian cycle-based scheme, which we previously developed, to control the number of branches in a multicast probing route. Our proposal can reduce the load on the control plane (i.e., the number of accesses to OFSs) while maintaining acceptable measurement accuracy with a light load on the data plane. Additionally, we investigate the impacts of packet losses and correlated delays over links on those different types of loads. By comparing our proposal with the shortest path tree-based and unicursal route schemes through numerical simulations, we evaluate the advantage of our proposal.
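The arrival-interval estimation idea can be sketched in a few lines: if probes are injected at a constant interval T, each received gap equals T plus the difference of two consecutive one-way delays, so for independent delays the variance of the gaps is roughly twice the delay variance, recoverable without measuring any one-way delay. The helper name is ours, not from the paper.

```python
from statistics import variance

def interarrival_variance(arrival_times):
    """Estimate delay variability from probe arrival times alone.
    With probes sent at a constant interval, gap_i = T + d_{i+1} - d_i,
    so Var(gap) reflects the delay variance accumulated on the path."""
    gaps = [t2 - t1 for t1, t2 in zip(arrival_times, arrival_times[1:])]
    return variance(gaps)
```

A link with stable delay yields near-zero gap variance; jittery links stand out, which is what lets selected flow-stat reads localize them.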
  5. Recent work shows that programmable switches can effectively detect attack traffic, such as denial-of-service attacks, in the midst of high-volume network traffic. However, these techniques primarily rely on sampling or sketch-based data structures, which can only be used to approximate the characteristics of dominant flows in the network. As a result, such techniques are unable to effectively detect low-volume attacks that stealthily add only a few packets to the network. Our work explores how the combination of programmable switches, smart network interface cards (SmartNICs), and hosts can enable fine-grained analysis of every flow in a network, even those with only a small number of packets. We focus on analyzing packets at the start of each flow, as those packets often can help indicate whether a flow is benign or suspicious. We propose a unified architecture that spans the full programmable dataplane to take advantage of the strengths of each type of device. We are developing new filter data structures to efficiently track flows on the switch, dataplane-based communication protocols to quickly coordinate between devices, and caching approaches on the SmartNIC that help minimize the traffic load reaching the host. Our preliminary prototype can handle the full pipe bandwidth of 1.4 Tbps of traffic entering the Tofino switch, forward only 20 Gbps to the SmartNIC, and minimize the traffic load to 5 Gbps reaching the host due to our efficient flow filter, packet batching, and SmartNIC-based cache.
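The filtering logic that sheds load at each tier can be sketched simply: forward only the first N packets of each flow downstream for fine-grained analysis and drop the rest from the analysis path. A real Tofino implementation would use hash-indexed register arrays rather than an exact dictionary; this sketch shows the logic only, and the names are illustrative.

```python
class FirstNPacketFilter:
    """Sketch of a per-flow filter: only the first n packets of each
    flow are forwarded to the SmartNIC/host analysis path, since the
    start of a flow is most indicative of benign vs. suspicious."""

    def __init__(self, n=8):
        self.n = n
        self.counts = {}  # flow 5-tuple -> packets seen so far

    def should_forward(self, flow_id):
        seen = self.counts.get(flow_id, 0)
        self.counts[flow_id] = seen + 1
        return seen < self.n  # forward only the flow's first n packets
```

This captures why the traffic volume shrinks at each tier: every flow is observed, but only a bounded prefix of each flow travels past the switch.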