As scaling of conventional memory devices has stalled, many high-end computing systems have begun to incorporate alternative memory technologies to meet performance goals. Since these technologies present distinct advantages and tradeoffs compared to conventional DDR* SDRAM, such as higher bandwidth with lower capacity or vice versa, they are typically packaged alongside conventional SDRAM in a heterogeneous memory architecture. To utilize the different types of memory efficiently, new data management strategies are needed to match application usage to the best available memory technology. However, current proposals for managing heterogeneous memories are limited because they either (1) do not consider high-level application behavior when assigning data to different types of memory, or (2) require a separate program execution (with a representative input) to collect information about how the application uses memory resources. This work presents a new data management toolset that addresses the limitations of existing approaches for managing complex memories. It extends the application runtime layer with automated monitoring and management routines that assign application data to the best tier of memory based on previous usage, without any need for source code modification or a separate profiling run. We evaluate this approach on a state-of-the-art server platform with both conventional DDR4 SDRAM and non-volatile Intel Optane DC memory, using memory-intensive high-performance computing (HPC) applications as well as standard benchmarks. Overall, the results show that this approach significantly improves program performance compared to a standard unguided approach across a variety of workloads and system configurations. The HPC applications exhibit the largest benefits, with speedups ranging from 1.4× to 7× in the best cases. Additionally, we show that, after a short startup period, this approach achieves performance similar to that of a comparable offline profiling-based approach, without requiring a separate program execution or offline analysis steps.
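To illustrate the kind of usage-guided placement such a runtime performs, the sketch below ranks allocation sites by observed access density and fills the fast tier first. The site names, capacities, and the simple greedy policy are illustrative assumptions, not the toolset's actual implementation.

```python
from collections import defaultdict

class TierGuide:
    """Assigns allocation sites to a memory tier based on previously observed usage."""

    def __init__(self, fast_capacity_mb):
        self.fast_capacity_mb = fast_capacity_mb          # e.g., available DDR4
        self.stats = defaultdict(lambda: {"accesses": 0, "bytes": 0})

    def record_usage(self, site, accesses, bytes_alloc):
        # In a real runtime, sampled access counts per allocation site would arrive here.
        self.stats[site]["accesses"] += accesses
        self.stats[site]["bytes"] += bytes_alloc

    def placement_plan(self):
        # Rank sites by access density (accesses per byte) and fill the fast tier greedily.
        ranked = sorted(self.stats.items(),
                        key=lambda kv: kv[1]["accesses"] / max(kv[1]["bytes"], 1),
                        reverse=True)
        plan, used_mb = {}, 0.0
        for site, s in ranked:
            size_mb = s["bytes"] / (1024 * 1024)
            if used_mb + size_mb <= self.fast_capacity_mb:
                plan[site] = "DDR4"
                used_mb += size_mb
            else:
                plan[site] = "Optane"
        return plan

guide = TierGuide(fast_capacity_mb=4096)
guide.record_usage("matrix_buffers", accesses=5_000_000, bytes_alloc=2 * 1024**3)
guide.record_usage("checkpoint_log", accesses=2_000, bytes_alloc=8 * 1024**3)
print(guide.placement_plan())  # {'matrix_buffers': 'DDR4', 'checkpoint_log': 'Optane'}
```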
Performance Modeling and Evaluation of a Production Disaggregated Memory System
High-performance computers rely on large memories to cache data and improve performance, but managing the ever-growing number of levels in the memory hierarchy is increasingly difficult. The Disaggregated Memory System (DMS) architecture was introduced in recent years to improve memory utilization: DMS provides a global memory pool that sits between the local memories and storage. To leverage DMS, we need a better understanding of its performance and of how to exploit its full potential. In this study, we first present a DMS performance model for performance evaluation and analysis. We then conduct a thorough performance evaluation to identify application-DMS characteristics under different system configurations, using the RAM Area Network (RAN), a DMS implementation available at Argonne National Laboratory. The experimental results are presented along with an analysis of the pros and cons of the RAN-DMS design and implementation, and the counterintuitive performance results for the K-means application are analyzed at the code level to illustrate DMS behavior. Finally, based on our findings, we discuss future DMS designs and their potential for AI applications.
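As a rough illustration of the kind of analytical model such an evaluation builds on, the sketch below estimates an effective memory access time from the fraction of accesses served by the local tier versus the remote DMS pool. The latency, bandwidth, and hit-ratio values are placeholder assumptions, not measurements from the RAN deployment.

```python
def effective_access_time(local_hit_ratio, local_latency_ns, remote_latency_ns,
                          transfer_bytes=0, network_bw_gbps=100.0):
    """Average-cost model: accesses hit local memory with probability local_hit_ratio;
    misses pay the remote pool's latency plus the network transfer time."""
    transfer_ns = (transfer_bytes * 8) / (network_bw_gbps * 1e9) * 1e9  # bits / (bits/s) -> ns
    remote_cost_ns = remote_latency_ns + transfer_ns
    return local_hit_ratio * local_latency_ns + (1.0 - local_hit_ratio) * remote_cost_ns

# Placeholder numbers: 100 ns local DRAM, 2 us remote pool, 4 KiB moved per miss.
for hit in (0.5, 0.9, 0.99):
    t = effective_access_time(hit, local_latency_ns=100, remote_latency_ns=2000,
                              transfer_bytes=4096)
    print(f"local hit ratio {hit:.2f}: effective access time {t:7.1f} ns")
```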
- PAR ID: 10251076
- Journal Name: The International Symposium on Memory Systems
- Page Range / eLocation ID: 223 to 232
- Sponsoring Org: National Science Foundation
More Like this
NextG cellular networks are designed to meet Quality of Service requirements for various applications in and beyond smartphones and mobile devices. However, lacking introspection into the 5G Radio Access Network (RAN), application and transport layer designers are ill-poised to cope with the vagaries of the wireless last hop to a mobile client, while 5G network operators run mostly closed networks, limiting their potential for co-design with the wider internet and user applications. This paper presents NR-Scope, a passive, incrementally- and independently-deployable Standalone 5G network telemetry system that can stream fine-grained RAN capacity, latency, and retransmission information to application servers, enabling better millisecond-scale, application-level decisions on offered load and bit rate adaptation than end-to-end latency measurements or end-to-end packet losses currently permit. Our experimental evaluation on various 5G Standalone base stations demonstrates that NR-Scope achieves a throughput-estimation error of less than 0.1% for every UE in a RAN. The code is available at https://github.com/PrincetonUniversity/NR-Scope.
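As a hedged sketch of how an application server might act on such streamed telemetry, the example below picks a video bitrate from the reported per-UE capacity and backs off when the RAN reports high latency or heavy retransmissions. The field names, thresholds, and bitrate ladder are illustrative assumptions, not NR-Scope's actual interface.

```python
from dataclasses import dataclass

@dataclass
class RanSample:
    """One per-UE telemetry sample as a server-side consumer might receive it."""
    capacity_mbps: float    # estimated downlink capacity for this UE
    rtt_ms: float           # RAN-side latency estimate
    retransmissions: int    # recent retransmission count

BITRATE_LADDER_MBPS = [1.0, 2.5, 5.0, 8.0, 16.0, 35.0]  # example video encodings

def pick_bitrate(sample: RanSample, headroom: float = 0.8) -> float:
    """Choose the highest ladder rung that fits under the reported capacity,
    backing off further when the last hop looks unstable."""
    budget = sample.capacity_mbps * headroom
    if sample.retransmissions > 10 or sample.rtt_ms > 100:
        budget *= 0.5
    eligible = [r for r in BITRATE_LADDER_MBPS if r <= budget]
    return eligible[-1] if eligible else BITRATE_LADDER_MBPS[0]

print(pick_bitrate(RanSample(capacity_mbps=12.0, rtt_ms=20.0, retransmissions=1)))    # 8.0
print(pick_bitrate(RanSample(capacity_mbps=12.0, rtt_ms=150.0, retransmissions=30)))  # 2.5
```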
5G New Radio cellular networks are designed to provide high Quality of Service for applications on wirelessly connected devices. However, changing conditions on the wireless last hop can degrade application performance, and applications have no visibility into the 5G Radio Access Network (RAN). Most 5G network operators run closed networks, limiting the potential for co-design with the wider internet and user applications. This paper demonstrates NR-Scope, a passive, incrementally- and independently-deployable Standalone 5G network telemetry system that measures fine-grained RAN capacity, latency, and retransmission information. Application servers can use these measurements to make better millisecond-scale, application-level decisions on offered load and bit rate adaptation than end-to-end latency measurements or end-to-end packet losses currently permit. We demonstrate the performance of NR-Scope by decoding the downlink control information (DCI) for the downlink and uplink traffic of a 5G Standalone base station in real time.
To process real-world datasets, modern data-parallel systems often require extremely large amounts of memory, which are both costly and energy-inefficient. Emerging non-volatile memory (NVM) technologies offer high capacity compared to DRAM and low energy compared to SSDs. Hence, NVMs have the potential to fundamentally change the dichotomy between DRAM and durable storage in Big Data processing. However, most Big Data applications are written in managed languages and executed on top of a managed runtime that already performs various dimensions of memory management. Supporting hybrid physical memories adds a new dimension, creating unique challenges in data replacement. This article proposes Panthera, a semantics-aware, fully automated memory management technique for Big Data processing over hybrid memories. Panthera analyzes user programs on a Big Data system to infer their coarse-grained access patterns, which are then passed to the Panthera runtime for efficient data placement and migration. For Big Data applications, this coarse-grained data division information is accurate enough to guide the GC for data layout, while incurring little overhead for data monitoring and movement. We implemented Panthera in OpenJDK and Apache Spark. Based on Big Data applications' memory access patterns, we also implemented a new profiling-guided optimization strategy, which is transparent to applications. With this optimization, our extensive evaluation demonstrates that Panthera reduces energy by 32–53% with less than 1% time overhead on average. To show Panthera's applicability, we extended it to QuickCached, a pure Java implementation of Memcached. Our evaluation results show that Panthera reduces energy by 28.7% at 5.2% time overhead on average.
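To make the energy tradeoff concrete, here is a small back-of-the-envelope model comparing an all-DRAM layout with a placement guided by coarse-grained hot/cold labels. The per-access and per-GB energy numbers and the division labels are made-up assumptions for illustration; Panthera itself infers such labels from program analysis and applies them during GC.

```python
# Made-up energy costs (arbitrary units): DRAM is cheap per access but expensive
# to keep refreshed for the whole run; NVM is the opposite.
DRAM_ACCESS, NVM_ACCESS = 1.0, 4.0                   # energy per access
DRAM_HOLD_PER_GB, NVM_HOLD_PER_GB = 50_000, 1_000    # energy to hold 1 GB for the run

# Coarse-grained divisions of a Spark-like job: (size_gb, accesses, inferred label).
divisions = {
    "shuffle_rdd":  (8.0, 5_000_000, "hot"),
    "cached_input": (32.0,    50_000, "cold"),
    "lineage_meta": (4.0,      5_000, "cold"),
}

def energy(placement):
    total = 0.0
    for name, (size_gb, accesses, _) in divisions.items():
        if placement[name] == "DRAM":
            total += accesses * DRAM_ACCESS + size_gb * DRAM_HOLD_PER_GB
        else:
            total += accesses * NVM_ACCESS + size_gb * NVM_HOLD_PER_GB
    return total

all_dram = {name: "DRAM" for name in divisions}
guided = {name: ("DRAM" if label == "hot" else "NVM")
          for name, (_, _, label) in divisions.items()}

print(f"all-DRAM energy : {energy(all_dram):,.0f}")   # baseline
print(f"guided placement: {energy(guided):,.0f}")     # cold divisions moved to NVM
```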
For hosting data-serving and caching workloads based on key-value stores in clouds, the cost of memory represents a significant portion of the hosting expenses. The emergence of cheaper, but slower, types of memory, such as NVDIMMs, opens opportunities to reduce the hosting costs of such workloads. The question explored in this paper is how to determine adequate allocations of different memory types in future systems with heterogeneous memory components, so as to retain desired performance SLOs and maximize the cost efficiency of the memory resource. We develop Mnemo, a memory sizing and data tiering consultant that permits quick exploration of the cost-benefit tradeoffs associated with different configurations of the hybrid memory components used by key-value store workloads. In experimental evaluations with different workload patterns, Mnemo affords applications such as Redis, Memcached, and DynamoDB substantial reductions in their hosting costs with negligible impact on application performance, thus improving overall system memory cost efficiency.
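A minimal sketch of the cost-benefit search such a consultant performs might look like the following; the prices, hit-rate curve, latencies, and SLO threshold are all assumed values for illustration, not Mnemo's actual model.

```python
# Assumed monthly prices per GB, device latencies, and an SLO on average latency.
DRAM_PRICE_PER_GB, NVDIMM_PRICE_PER_GB = 8.0, 2.0    # $/GB/month (made up)
DRAM_LATENCY_US, NVDIMM_LATENCY_US = 1.0, 4.0        # per-request service time
SLO_AVG_LATENCY_US = 1.8
TOTAL_DATASET_GB = 256

def hit_rate(dram_gb):
    # Assumed diminishing-returns curve: most of the hot set fits in a small DRAM slice.
    return min(1.0, (dram_gb / TOTAL_DATASET_GB) ** 0.3)

def avg_latency_us(dram_gb):
    h = hit_rate(dram_gb)
    return h * DRAM_LATENCY_US + (1 - h) * NVDIMM_LATENCY_US

def monthly_cost(dram_gb):
    return dram_gb * DRAM_PRICE_PER_GB + (TOTAL_DATASET_GB - dram_gb) * NVDIMM_PRICE_PER_GB

# Sweep DRAM sizes and keep the cheapest configuration that still meets the SLO.
candidates = [(monthly_cost(gb), gb) for gb in range(16, TOTAL_DATASET_GB + 1, 16)
              if avg_latency_us(gb) <= SLO_AVG_LATENCY_US]
cost, dram_gb = min(candidates)
print(f"Cheapest SLO-compliant split: {dram_gb} GB DRAM / "
      f"{TOTAL_DATASET_GB - dram_gb} GB NVDIMM at ${cost:.0f}/month")
```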