Title: Annulus: A Dual Congestion Control Loop for Datacenter and WAN Traffic Aggregates
Cloud services are deployed in datacenters connected through high-bandwidth Wide Area Networks (WANs). We find that WAN traffic negatively impacts the performance of datacenter traffic, increasing tail latency by 2.5x, despite its small bandwidth demand. This behavior is caused by the long round-trip time (RTT) of WAN traffic, combined with limited buffering in datacenter switches. The long WAN RTT forces datacenter traffic to take the full burden of reacting to congestion. Furthermore, datacenter traffic changes on a faster time-scale than the WAN RTT, making it difficult for WAN congestion control to estimate available bandwidth accurately. We present Annulus, a congestion control scheme that relies on two control loops to address these challenges. One control loop leverages existing congestion control algorithms for bottlenecks where there is only one type of traffic (i.e., WAN or datacenter). The other loop handles bottlenecks shared between WAN and datacenter traffic near the traffic source, using direct feedback from the bottleneck. We implement Annulus on a testbed and in simulation. Compared to baselines using BBR for WAN congestion control and DCTCP or DCQCN for datacenter congestion control, Annulus increases bottleneck utilization by 10% and lowers datacenter flow completion time by 1.3-3.5x.
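The two-loop idea in the abstract can be illustrated with a toy sketch. This is not the algorithm from the paper: the class `DualLoopSender`, its method names, the rule of taking the minimum of the two loops' rates, and the additive-increase/multiplicative-decrease constants are all assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class DualLoopSender:
    """Toy sender whose rate is capped by two independent control loops.

    All names and constants here are hypothetical, not taken from the paper.
    """
    line_rate_gbps: float = 10.0      # assumed NIC line rate
    e2e_rate_gbps: float = 10.0       # rate chosen by the end-to-end loop (e.g. BBR or DCTCP)
    near_rate_gbps: float = 10.0      # rate chosen by the near-source loop
    decrease_factor: float = 0.5      # assumed multiplicative decrease on direct feedback
    increase_step_gbps: float = 0.5   # assumed additive increase when no congestion is reported

    def on_end_to_end_update(self, rate_gbps: float) -> None:
        # The existing congestion control algorithm reports its own rate estimate.
        self.e2e_rate_gbps = min(rate_gbps, self.line_rate_gbps)

    def on_near_source_feedback(self, congested: bool) -> None:
        # Direct feedback from a shared bottleneck close to the source.
        if congested:
            self.near_rate_gbps *= self.decrease_factor
        else:
            self.near_rate_gbps = min(
                self.line_rate_gbps,
                self.near_rate_gbps + self.increase_step_gbps,
            )

    def sending_rate_gbps(self) -> float:
        # Assumption: the sender obeys whichever loop is more restrictive.
        return min(self.e2e_rate_gbps, self.near_rate_gbps)


sender = DualLoopSender()
sender.on_end_to_end_update(8.0)
sender.on_near_source_feedback(congested=True)
print(sender.sending_rate_gbps())
```

Taking the minimum means neither loop can push the sender past what the other considers safe, which is one simple way to combine a slow end-to-end signal with a fast local one.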
Award ID(s):
1816331
NSF-PAR ID:
10166612
Journal Name:
Proceedings of ACM SIGCOMM
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like This
  1. The alpha version of Bottleneck Bandwidth and Round-trip Time version 2 (BBRv2) has recently been presented; it aims to mitigate the shortcomings of its predecessor, BBR version 1 (BBRv1). Previous studies show that BBRv1 provides high link utilization and low queuing delay by estimating the available bottleneck bandwidth. However, its aggressiveness induces unfairness when flows i) use different congestion control algorithms, such as CUBIC, and ii) have distinct round-trip times (RTTs). This paper presents an experimental evaluation of BBRv2 using Mininet. Results show that the coexistence of BBRv2 and CUBIC is improved relative to that of BBRv1 and CUBIC, as measured by the fairness index. They also show that BBRv2 mitigates the RTT unfairness problem observed in BBRv1. Additionally, BBRv2 achieves a fairer share of the bandwidth than its predecessor when network conditions such as bandwidth and latency change dynamically. Results also indicate that the average flow completion time of concurrent flows is reduced when BBRv2 is used.
  2. This work investigates traffic control via controlled connected and automated vehicles (CAVs) using novel controllers derived from linear-quadratic regulator (LQR) theory. CAV platoons are modeled as moving bottlenecks that impact the surrounding traffic, with their speeds as control inputs. An iterative controller algorithm based on LQR theory is proposed, along with a variant that allows for penalizing abrupt changes in platoon speeds. The controllers use the Lighthill-Whitham-Richards (LWR) model, implemented using an extended cell transmission model (CTM) that considers the capacity drop phenomenon for a realistic representation of traffic in congestion. The impact of various parameters of the proposed controller on the control performance is analyzed. The effectiveness of the proposed traffic control algorithms is tested using a traffic control example and compared with existing proportional-integral (PI) and model predictive control (MPC) controllers from the literature. A case study using the TransModeler traffic microsimulation software is conducted to test the usability of the proposed controller as well as existing controllers in a realistic setting and to derive qualitative insights. It is observed that the proposed controller works well in both settings to mitigate the impact of the jam caused by a fixed bottleneck. The computation time required by the controller is also small, making it suitable for real-time control.
  3. BBR is a new congestion control algorithm proposed by Google that governs its sending rate using a model of the network path, consisting of its bottleneck bandwidth and RTT, rather than packet loss (as CUBIC and many other popular congestion control algorithms do). Loss-based congestion control has been shown to be vulnerable to acknowledgment manipulation attacks. However, no prior work has investigated how to design such attacks for BBR, nor how effective they are in practice. In this paper we systematically analyze the vulnerability of BBR to acknowledgment manipulation attacks. We create the first detailed BBR finite state machine and a novel algorithm for inferring its current BBR state at runtime by passively observing network traffic. We then adapt and apply a TCP fuzzer to the Linux TCP BBR v1.0 implementation. Our approach generated 30,297 attack strategies, of which 8,859 misled BBR about actual network conditions. From these, we identify 5 classes of attacks causing BBR to send faster, send slower, or stall. We also found that BBR is immune to acknowledgment burst, division, and duplication attacks that were previously shown to be effective against loss-based congestion control such as TCP New Reno.
  4. Despite advances in network security, attacks targeting mission-critical systems and applications remain a significant problem for network and datacenter providers. Existing telemetry platforms detect volumetric attacks at terabit scales using approximation techniques and coarse-grained analysis. However, the prevalence of low and slow attacks, which require very little bandwidth, makes flow-state tracking critical to overall attack mitigation. Traffic queries deployed on network switches are often limited by hardware constraints, preventing them from carrying out the flow tracking features required to detect stealthy attacks. Such attacks can go undetected in the midst of high traffic volumes. We design SmartWatch, a novel flow state tracking and flow logging system that operates at line rate, using SmartNICs to optimize performance and simultaneously detect a number of stealthy attacks. SmartWatch leverages advances in switch-based network telemetry platforms to process the bulk of the traffic and forward only suspicious traffic subsets to the SmartNIC. The programmable network switches perform coarse-grained traffic analysis, while the SmartNIC conducts the finer-grained analysis, which involves additional processing of the packet as a 'bump-in-the-wire'. A control loop between the SmartNIC and the programmable switch tunes the queries performed in the switch to direct the most appropriate traffic subset to the SmartNIC. SmartWatch's cooperative monitoring approach yields a 2.39 times better detection rate compared to existing platforms deployed on programmable switches. SmartWatch can detect covert timing channels and perform website fingerprinting more efficiently than standalone programmable switch solutions, relieving switch memory and control-plane processor resources. Compared to host-based approaches, SmartWatch can reduce the packet processing latency by 72.32%.
  5. Recent years have seen a slew of papers on datacenter congestion control mechanisms. In this editorial, we ask whether the bulk of this research is needed for the common case where congestion control involves hosts responding to simple congestion signals from the network and the performance goal is reducing some average measure of Flow Completion Time. We raise this question because we find that, out of all the possible variations one could make in congestion control algorithms, the most essential feature is the switch scheduling algorithm. More specifically, we find that congestion control mechanisms that use Shortest-Remaining-Processing-Time (SRPT) achieve superior performance as long as the rate-setting algorithm at the host is reasonable. We further find that while SRPT's performance is quite robust to host behaviors, the performance of schemes that use scheduling algorithms like FIFO or Fair Queuing depends far more crucially on the rate-setting algorithm, and their performance is typically worse than what can be achieved with SRPT. Given these findings, we then ask whether it is practical to realize SRPT in switches without requiring custom hardware. We observe that approximate and deployable SRPT (ADS) designs exist, which leverage the small number of priority queues supported in almost all commodity switches and require only software changes in the hosts and the switches. Our evaluations with one very simple ADS design show that it can achieve performance close to true SRPT and is significantly better than FIFO. Thus, the answer to our basic question, whether the bulk of recent research on datacenter congestion control algorithms is needed for the common case, is no.
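The last item's claim that scheduling order matters for average Flow Completion Time can be seen in a small worked example. The sketch below is an idealized single-link model, not taken from any of the listed papers: the function `mean_fct`, the flow sizes, and the unit link rate are hypothetical, and all flows are assumed to arrive at time zero, in which case SRPT reduces to serving the shortest flow first.

```python
def mean_fct(sizes, link_rate=1.0):
    """Mean flow completion time when flows are served one at a time, in list order.

    Assumes every flow is already waiting at time zero and the link is never idle.
    """
    clock = 0.0
    total = 0.0
    for size in sizes:
        clock += size / link_rate   # this flow finishes once its bytes are sent
        total += clock
    return total / len(sizes)


# Hypothetical flow sizes (arbitrary units), listed in arrival order.
flows = [10.0, 1.0, 2.0]

fifo = mean_fct(flows)            # serve in arrival order
srpt = mean_fct(sorted(flows))    # shortest first; equals SRPT with simultaneous arrivals

print(f"FIFO mean FCT: {fifo:.2f}")
print(f"SRPT mean FCT: {srpt:.2f}")
```

Under FIFO the two short flows wait behind the long one, so every completion time is inflated; serving short flows first lowers the mean without changing when the last byte leaves the link. A real SRPT scheduler also preempts when a shorter flow arrives mid-transfer, which this sketch does not model.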