Compute and memory are tightly coupled within each server in traditional datacenters. Large-scale datacenter operators have identified this coupling as a root cause behind fleetwide resource underutilization and increasing Total Cost of Ownership (TCO). With the advent of ultra-fast networks and cache-coherent interfaces, memory disaggregation has emerged as a potential solution, whereby applications can leverage available memory even outside server boundaries. This paper summarizes the growing research landscape of memory disaggregation from a software perspective and introduces the challenges toward making it practical under current and future hardware trends. We also reflect on our seven-year journey in the SymbioticLab to build a comprehensive disaggregated memory system over ultra-fast networks. We conclude with some open challenges toward building next-generation memory disaggregation systems leveraging emerging cache-coherent interconnects.
more »
« less
Evaluating Hardware Memory Disaggregation under Delay and Contention
Hardware memory disaggregation is an emerging trend in datacenters that provides access to remote memory as part of a shared pool or unused memory on machines across the network. Memory disaggregation aims to improve memory utilization and scale memory-intensive applications. Current state-of-the-art prototypes have shown that hardware disaggregated memory is a reality at the rack-scale. However, the memory utilization benefits of memory disaggregation can only be fully realized at larger scales enabled by a datacenter-wide network. Introduction of a datacenter network results in new performance and reliability failures that may manifest as higher network latency. Additionally, sharing of the network introduces new points of contention between multiple applications. In this work, we characterize the impact of variable network latency and contention in an open-source hardware disaggregated memory prototype - ThymesisFlow. To support our characterization, we have developed a delay injection framework that introduces delays in remote memory access to emulate network latency. Based on the characterization results, we develop insights into how reliability and resource allocation mechanisms should evolve to support hardware memory disaggregation beyond rack-scale in datacenters.
more »
« less
- Award ID(s):
- 2029049
- PAR ID:
- 10358763
- Date Published:
- Journal Name:
- 2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
- Page Range / eLocation ID:
- 1221 to 1227
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
null (Ed.)Passive remote memory remains the holy grail of disaggregation. Most existing systems for disaggregated memory either use remote memory simply as a backing store, or design special-purpose data structures that require some amount of processing co-resident with the remote memory to manage and apply updates. The few proposals for truly passive remote memory perform well only with read-mostly workloads, rapidly deteriorating in the face of even low levels of write contention. We propose to leverage in-network devices (specifically, a programmable top-of-rack switch) to serialize remote memory accesses and resolve any write conflicts in flight. Our prototype is able to completely avoid write contention in the recently published Clover disaggregated key/value store, delivering a performance boost of almost 50% on our testbed under a mixed read/write workload.more » « less
-
Achieving low remote memory access latency remains the primary challenge in realizing memory disaggregation over Ethernet within the datacenters. We present EDM that attempts to overcome this challenge using two key ideas. First, while existing network protocols for remote memory access over the Ethernet, such as TCP/IP and RDMA, are implemented on top of the Ethernet MAC layer, EDM takes a radical approach by implementing the entire network protocol stack for remote memory access within the Physical layer (PHY) of the Ethernet. This overcomes fundamental latency and bandwidth overheads imposed by the MAC layer, especially for small memory messages. Second, EDM implements a centralized, fast, in-network scheduler for memory traffic within the PHY of the Ethernet switch. Inspired by the classic Parallel Iterative Matching (PIM) algorithm, the scheduler dynamically reserves bandwidth between compute and memory nodes by creating virtual circuits in the PHY, thus eliminating queuing delay and layer 2 packet processing delay at the switch for memory traffic, while maintaining high bandwidth utilization. Our FPGA testbed demonstrates that EDM's network fabric incurs a latency of only ~300 ns for remote memory access in an unloaded network, which is an order of magnitude lower than state-of-the-art Ethernet-based solutions such as RoCEv2 and comparable to emerging PCIe-based solutions such as CXL. Larger-scale network simulations indicate that even at high network loads, EDM's average latency remains within 1.3x its unloaded latency.more » « less
-
Datacenter capacity is growing exponentially to satisfy the increasing demand for many emerging computationally-intensive applications, such as deep learning. This trend has led to concerns over datacenters’ increasing energy consumption and carbon footprint. The most basic prerequisite for optimizing a datacenter’s energy- and carbon-efficiency is accurately monitoring and attributing energy consumption to specific users and applications. Since datacenter servers tend to be multi-tenant, i.e., they host many applications, server- and rack-level power monitoring alone does not provide insight into the energy usage and carbon emissions of their resident applications. At the same time, current application-level energy monitoring and attribution techniques are intrusive: they require privileged access to servers and necessitate coordinated support in hardware and software, neither of which is always possible in cloud environments. To address the problem, we design WattScope, a system for non-intrusively estimating the power consumption of individual applications using external measurements of a server’s aggregate power usage and without requiring direct access to the server’s operating system or applications. Our key insight is that, based on an analysis of production traces, the power characteristics of datacenter workloads, e.g., low variability, low magnitude, and high periodicity, are highly amenable to disaggregation of a server’s total power consumption into application-specific values. WattScope adapts and extends a machine learning-based technique for disaggregating building power and applies it to server- and rack-level power meter measurements that are already available in data centers. We evaluate WattScope’s accuracy on a production workload and show that it yields high accuracy, e.g., often 10% normalized mean absolute error, and is thus a potentially useful tool for datacenters in externally monitoring application-level power usage.more » « less
-
null (Ed.)Datacenter disaggregation provides numerous benefits to both the datacenter operator and the application designer. However switching from the server-centric model to a disaggregated model requires developing new programming abstractions that can achieve high performance while benefiting from the greater elasticity. To explore the limits of datacenter disaggregation, we study an application area that near-maximally benefits from current server-centric datacenters: dense linear algebra. We build NumPyWren, a system for linear algebra built on a disaggregated serverless programming model, and LAmbdaPACK, a companion domain-specific language designed for serverless execution of highly parallel linear algebra algorithms. We show that, for a number of linear algebra algorithms such as matrix multiply, singular value decomposition, Cholesky decomposition, and QR decomposition, NumPyWren's performance (completion time) is within a factor of 2 of optimized server-centric MPI implementations, and has up to 15% greater compute efficiency (total CPU-hours), while providing fault tolerance.more » « less
An official website of the United States government

