skip to main content


Title: zD: a scalable zero-drop network stack at end hosts
Modern end-host network stacks have to handle traffic from tens of thousands of flows and hundreds of virtual machines per single host, to keep up with the scale of modern clouds. This can cause congestion for traffic egressing from the end host. The effects of this congestion have received little attention. Currently, an overflowing queue, like a kernel queuing discipline, will drop incoming packets. Packet drops lead to worse network and CPU performance by inflating the time to transmit the packet as well as spending extra effort on retransmissions. In this paper, we show that current end-host mechanisms can lead to high CPU utilization, high tail latency, and low throughput in cases of congestion of egress traffic within the end host. We present zD, a framework for applying backpressure from a congested queue to traffic sources at end hosts that can scale to thousands of flows. We implement zD to apply backpressure in two settings: i) between TCP sources and kernel queuing discipline, and ii) between VMs as traffic sources and kernel queuing discipline in the hypervisor. zD improves throughput by up to 60%, and improves tail RTT by at least 10x at high loads, compared to standard kernel implementation.  more » « less
Award ID(s):
1816331
NSF-PAR ID:
10166577
Author(s) / Creator(s):
; ; ;
Date Published:
Journal Name:
CoNEXT '19: Proceedings of the 15th International Conference on Emerging Networking Experiments And Technologies
Page Range / eLocation ID:
220 to 232
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Virtual switches, used for end-host networking, drop packets when the receiving application is not fast enough to consume them. This is called the slow receiver problem, and it is important because packet loss hurts tail communication latency and wastes CPU cycles, resulting in application-level performance degradation. Further, solving this problem is challenging because application throughput is highly variable over short timescales as it depends on workload, memory contention, and OS thread scheduling. This paper presents Backdraft, a new lossless virtual switch that addresses the slow receiver problem by combining three new components: (1) Dynamic Per-Flow Queuing (DPFQ) to prevent HOL blocking and provide on-demand memory usage; (2) Doorbell queues to reduce CPU overheads; (3) A new overlay network to avoid congestion spreading. We implemented Backdraft on top of BESS and conducted experiments with real applications on a 100 Gbps cluster with both DCTCP and Homa, a state-of-the-art congestion control scheme. We show that an application with Backdraft can achieve up to 20x lower tail latency at the 99th percentile. 
    more » « less
  2. Advanced high-speed network cards have made packet processing in host operating systems a major performance bottleneck. The kernel network stack gives rise to various sources of overheads that limit the throughput and lengthen the per-packet processing latency. The problem is further exacerbated for short-lived, latency-sensitive network flows such as control packets, online gaming, database requests, etc. — in a highly utilized system, especially in virtualized (containerized) cloud environments, short flows can experience excessively long in-kernel queuing delays. As a consequence, recent research works propose to bypass the kernel network stack to enable lightweight, custom userspace network stacks for improved performance, but at a heavy cost of compatibility and security. In this paper, we take a different approach: We first analyze various sources of inefficiencies in the kernel network stack and propose ways to mitigate them without compromising systems compatibility, security, or flexibility. Further, we propose PRISM, a novel mechanism in the kernel network stack to differentiate incoming packets based on their performance requirements and streamline the processing stages of multi-stage packet processing pipelines (e.g., in container overlay networks). Our evaluation demonstrates that PRISM can significantly improve the latency of high-priority flows in container overly networks in the presence of heavy low-priority background traffic. 
    more » « less
  3. null (Ed.)
    To keep up with demand, servers will scale up to handle hundreds of thousands of clients simultaneously. Much of the focus of the community has been on scaling servers in terms of aggregate traffic intensity (packets transmitted per second). However, bottlenecks caused by the increasing number of concurrent clients, resulting in a large number of concurrent flows, have received little attention. In this work, we focus on identifying such bottlenecks. In particular, we define two broad categories of problems; namely, admitting more packets into the network stack than can be handled efficiently, and increasing per-packet overhead within the stack. We show that these problems contribute to high CPU usage and network performance degradation in terms of aggregate throughput and RTT. Our measurement and analysis are performed in the context of the Linux networking stack, the most widely used publicly available networking stack. Further, we discuss the relevance of our findings to other network stacks. The goal of our work is to highlight considerations required in the design of future networking stacks to enable efficient handling of large numbers of clients and flows 
    more » « less
  4. null (Ed.)
    This paper demonstrates that it is possible to achieve μs-scale latency using Linux kernel storage stack, even when tens of latency-sensitive applications compete for host resources with throughput-bound applications that perform read/write operations at throughput close to hardware capacity. Furthermore, such performance can be achieved without any modification in applications, network hardware, kernel CPU schedulers and/or kernel network stack. We demonstrate the above using design, implementation and evaluation of blk-switch, a new Linux kernel storage stack architecture. The key insight in blk-switch is that Linux's multi-queue storage design, along with multi-queue network and storage hardware, makes the storage stack conceptually similar to a network switch. blk-switch uses this insight to adapt techniques from the computer networking literature (e.g., multiple egress queues, prioritized processing of individual requests, load balancing, and switch scheduling) to the Linux kernel storage stack. blk-switch evaluation over a variety of scenarios shows that it consistently achieves μs-scale average and tail latency (at both 99th and 99.9th percentiles), while allowing applications to near-perfectly utilize the hardware capacity. 
    more » « less
  5. null (Ed.)
    This paper demonstrates that it is possible to achieve µs-scale latency using Linux kernel storage stack, even when tens of latency-sensitive applications compete for host resources with throughput-bound applications that perform read/write operations at throughput close to hardware capacity. Furthermore, such performance can be achieved without any modification in applications, network hardware, kernel CPU schedulers and/or kernel network stack. We demonstrate the above using design, implementation and evaluation of blk-switch, a new Linux kernel storage stack architecture. The key insight in blk-switch is that Linux’s multi-queue storage design, along with multi-queue network and storage hardware, makes the storage stack conceptually similar to a network switch. blk-switch uses this insight to adapt techniques from the computer networking literature (e.g., multiple egress queues, prioritized processing of individual requests, load balancing, and switch scheduling) to the Linux kernel storage stack. blk-switch evaluation over a variety of scenarios shows that it consistently achieves µs-scale average and tail latency (at both 99th and 99.9th percentiles), while allowing applications to near-perfectly utilize the hardware capacity. 
    more » « less