This content will become publicly available on February 3, 2026

Title: EDM: An Ultra-Low Latency Ethernet Fabric for Memory Disaggregation
Achieving low remote memory access latency remains the primary challenge in realizing memory disaggregation over Ethernet within datacenters. We present EDM, which attempts to overcome this challenge using two key ideas. First, while existing network protocols for remote memory access over Ethernet, such as TCP/IP and RDMA, are implemented on top of the Ethernet MAC layer, EDM takes a radical approach by implementing the entire network protocol stack for remote memory access within the Physical layer (PHY) of Ethernet. This overcomes fundamental latency and bandwidth overheads imposed by the MAC layer, especially for small memory messages. Second, EDM implements a centralized, fast, in-network scheduler for memory traffic within the PHY of the Ethernet switch. Inspired by the classic Parallel Iterative Matching (PIM) algorithm, the scheduler dynamically reserves bandwidth between compute and memory nodes by creating virtual circuits in the PHY, thus eliminating queuing delay and layer 2 packet processing delay at the switch for memory traffic, while maintaining high bandwidth utilization. Our FPGA testbed demonstrates that EDM's network fabric incurs a latency of only ~300 ns for remote memory access in an unloaded network, which is an order of magnitude lower than state-of-the-art Ethernet-based solutions such as RoCEv2 and comparable to emerging PCIe-based solutions such as CXL. Larger-scale network simulations indicate that even at high network loads, EDM's average latency remains within 1.3x its unloaded latency.
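The abstract describes the switch scheduler only as inspired by classic Parallel Iterative Matching (PIM). As a rough illustration of that classic algorithm, and not of EDM's in-PHY hardware scheduler, the following Python sketch runs PIM's request, grant, and accept rounds over a hypothetical demand list. The function name `pim_match`, its parameters, and the example demand are assumptions made for illustration.

```python
import random


def pim_match(demand, iterations=4, rng=None):
    """Classic PIM sketch (hypothetical helper, not EDM's scheduler).

    demand[i] is the set of output ports that input port i has traffic for.
    Returns a dict mapping each matched input port to its output port.
    """
    rng = rng or random.Random()
    match = {}               # input port -> output port
    matched_outputs = set()  # output ports already reserved

    for _ in range(iterations):
        # Request: every unmatched input asks every still-free output it needs.
        requests = {}
        for i, outs in enumerate(demand):
            if i in match:
                continue
            for o in outs:
                if o not in matched_outputs:
                    requests.setdefault(o, []).append(i)
        if not requests:
            break  # nothing left to match

        # Grant: each requested output picks one requesting input at random.
        grants = {}
        for o, inputs in requests.items():
            grants.setdefault(rng.choice(inputs), []).append(o)

        # Accept: each input that received grants picks one at random.
        for i, outs in grants.items():
            o = rng.choice(outs)
            match[i] = o
            matched_outputs.add(o)

    return match


# Example: inputs 0 and 1 contend for output 0; input 2 only wants output 2.
print(pim_match([{0, 1}, {0}, {2}], rng=random.Random(0)))
```

Each round matches at least one more input whenever any request remains, so in this sketch running as many rounds as there are inputs is enough to reach a maximal matching.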
Award ID(s):
2239829 2346605
PAR ID:
10570726
Author(s) / Creator(s):
;
Publisher / Repository:
ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS)
Date Published:
ISBN:
979-8-4007-0698-1
Page Range / eLocation ID:
377-394
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Future tactical communications involve high data rate best effort traffic working alongside real-time traffic for time-critical applications with hard deadlines. Unavailable bandwidth and/or untimely responses may lead to undesired or even catastrophic outcomes. Ethernet-based communication systems are one of the major tactical network standards due to the higher bandwidth, better utilization, and ability to handle heterogeneous traffic. However, Ethernet suffers from inconsistent performance for jitter, latency and bandwidth under heavy loads. The emerging Time-Triggered Ethernet (TTE) solutions promise deterministic Ethernet performance, fault-tolerant topologies and real-time guarantees for critical traffic. In this paper we study the TTE protocol and build a TTTech TTE test bed to evaluate its performance. Through experimental study, the TTE protocol was observed to provide consistent high data rates for best effort messages, determinism with very low jitter for time-triggered messages, and fault-tolerance for minimal packet loss using redundant networking topologies. In addition, challenges were observed that presented a trade-off between the integration cycle and the synchronization overhead. It is concluded that TTE is a capable solution to support heterogeneous traffic in time-critical applications, such as aerospace systems (e.g., airplanes, spacecraft, etc.), ground-based vehicles (e.g., trains, buses, cars, etc.), and cyber-physical systems (e.g., smart-grids, IoT, etc.).
  2. Despite advances in network security, attacks targeting mission critical systems and applications remain a significant problem for network and datacenter providers. Existing telemetry platforms detect volumetric attacks at terabit scales using approximation techniques and coarse-grained analysis. However, the prevalence of low-and-slow attacks, which require very little bandwidth, makes flow-state tracking critical to overall attack mitigation. Traffic queries deployed on network switches are often limited by hardware constraints, preventing them from carrying out flow tracking features required to detect stealthy attacks. Such attacks can go undetected in the midst of high traffic volumes. We design SmartWatch, a novel flow state tracking and flow logging system at line rate, using SmartNICs to optimize performance and simultaneously detect a number of stealthy attacks. SmartWatch leverages advances in switch-based network telemetry platforms to process the bulk of the traffic and only forward suspicious traffic subsets to the SmartNIC. The programmable network switches perform coarse-grained traffic analysis while the SmartNIC conducts the finer-grained analysis which involves additional processing of the packet as a 'bump-in-the-wire'. A control loop between the SmartNIC and programmable switch tunes the queries performed in the switch to direct the most appropriate traffic subset to the SmartNIC. SmartWatch's cooperative monitoring approach yields 2.39 times better detection rate compared to existing platforms deployed on programmable switches. SmartWatch can detect covert timing channels and perform website fingerprinting more efficiently compared to standalone programmable switch solutions, relieving switch memory and control-plane processor resources. Compared to host-based approaches, SmartWatch can reduce the packet processing latency by 72.32%.
  3. Three-dimensional (3D)-stacked memories, such as the Hybrid Memory Cube (HMC), provide a promising solution for overcoming the bandwidth wall between processors and memory by integrating memory and logic dies in a single stack. Such memories also utilize a network-on-chip (NoC) to connect their internal structural elements and to enable scalability. This novel usage of NoCs enables numerous benefits such as high bandwidth and memory-level parallelism and creates future possibilities for efficient processing-in-memory techniques. However, the implications of such NoC integration on the performance characteristics of 3D-stacked memories in terms of memory access latency and bandwidth have not been fully explored. This paper addresses this knowledge gap (i) by characterizing an HMC prototype using Micron's AC-510 accelerator board and by revealing its access latency and bandwidth behaviors; and (ii) by investigating the implications of such behaviors on system- and software-level designs. Compared to traditional DDR-based memories, our examinations reveal the performance impacts of NoCs for current and future 3D-stacked memories and demonstrate how the packet-based protocol, internal queuing characteristics, traffic conditions, and other unique features of the HMC affect the performance of applications.
  4. Recently, switched Ethernet has become increasingly popular in networked cyber-physical systems (NCPS). In an Ethernet-based NCPS, network-connected devices (e.g., sensors and actuators) realize time-critical tasks by exchanging miscellaneous information, such as sensor readings and control commands. To ensure reliable control and operation, network-induced delays for time-critical NCPS applications must be carefully examined. In this work, we propose a framework combining network delay measurements and network-calculus-based delay performance analysis to obtain accurate, deterministic worst-case delay bounds for NCPS. By modeling traffic sources and networking devices (e.g., Ethernet switches) through measurements, we establish accurate traffic and device models for network-calculus-based analysis. To obtain worst-case delay bounds, different network-calculus-based analytical methods can be leveraged, allowing CPS architects to customize the proposed delay analysis framework to suit application-specific needs. Our evaluation results show that the proposed approach derives accurate delay bounds, making it a valuable tool for architects designing NCPSs supporting time-critical applications. 
  5. The growing number of network functions drives the need to install increasing numbers of fine-grained packet classification rules in the network switches. However, this demand for rules is outstripping the size of switch memory. While much work has focused on compressing classification rules, most of this work proposes solutions operating in the memory of a single switch. This paper proposes, instead, a collaborative approach encompassing switches and network functions: we couple approximate classification at switches with fine-grained filtering where needed at network functions to accomplish overall classification. This architecture enables a trade-off between usage of (expensive) switch memory and (cheaper) downstream network bandwidth and network function resources. Our implementation uses approximate classification and Prefiltering to reduce switch memory usage. Our system can reduce memory significantly compared to a strawman approach, as shown by simulations of real traffic traces and rules.