skip to main content

Title: Semeru: A Memory-Disaggregated Managed Runtime
Resource-disaggregated architectures have risen in popularity for large datacenters. However, prior disaggregation systems are designed for native applications; in addition, all of them require applications to possess excellent locality to be efficiently executed. In contrast, programs written in managed languages are subject to periodic garbage collection (GC), which is a typical graph workload with poor locality. Although most datacenter applications are written in managed languages, current systems are far from delivering acceptable performance for these applications. This paper presents Semeru, a distributed JVM that can dramatically improve the performance of managed cloud applications in a memory-disaggregated environment. Its design possesses three major innovations: (1) a universal Java heap, which provides a unified abstraction of virtual memory across CPU and memory servers and allows any legacy program to run without modifications; (2) a distributed GC, which offloads object tracing to memory servers so that tracing is performed closer to data; and (3) a swap system in the OS kernel that works with the runtime to swap page data efficiently. An evaluation of Semeru on a set of widely-deployed systems shows very promising results.
; ; ; ;
Award ID(s):
Publication Date:
Journal Name:
14th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2020, Virtual Event, November 4-6, 2020
Sponsoring Org:
National Science Foundation
More Like this
  1. Far-memory techniques that enable applications to use remote memory are increasingly appealing in modern datacenters, supporting applications’ large memory footprint and improving machines’ resource utilization. Unfortunately, most far-memory techniques focus on OS-level optimizations and are agnostic to managed runtimes and garbage collections (GC) underneath applications written in high-level languages. With different object-access patterns from applications, GC can severely interfere with existing far-memory techniques, breaking prefetching algorithms and causing severe local-memory misses. We developed MemLiner, a runtime technique that improves the performance of far-memory systems by “lining up” memory accesses from the application and the GC so that they follow similar memory access paths, thereby (1)reducing the local-memory working set and (2) improving remote-memory prefetching through simplified memory access patterns. We implemented MemLiner in two widely-used GCs in OpenJDK: G1 and Shenandoah. Our evaluation with a range of widely-deployed cloud systems shows MemLiner improves applications’ end-to-end performance by up to 2.5x.
  2. To process real-world datasets, modern data-parallel systems often require extremely large amounts of memory, which are both costly and energy inefficient. Emerging non-volatile memory (NVM) technologies offer high capacity compared to DRAM and low energy compared to SSDs. Hence, NVMs have the potential to fundamentally change the dichotomy between DRAM and durable storage in Big Data processing. However, most Big Data applications are written in managed languages and executed on top of a managed runtime that already performs various dimensions of memory management. Supporting hybrid physical memories adds a new dimension, creating unique challenges in data replacement. This article proposes Panthera, a semantics-aware, fully automated memory management technique for Big Data processing over hybrid memories. Panthera analyzes user programs on a Big Data system to infer their coarse-grained access patterns, which are then passed to the Panthera runtime for efficient data placement and migration. For Big Data applications, the coarse-grained data division information is accurate enough to guide the GC for data layout, which hardly incurs overhead in data monitoring and moving. We implemented Panthera in OpenJDK and Apache Spark. Based on Big Data applications’ memory access pattern, we also implemented a new profiling-guided optimization strategy, which is transparent tomore »applications. With this optimization, our extensive evaluation demonstrates that Panthera reduces energy by 32–53% at less than 1% time overhead on average. To show Panthera’s applicability, we extend it to QuickCached, a pure Java implementation of Memcached. Our evaluation results show that Panthera reduces energy by 28.7% at 5.2% time overhead on average.« less
  3. Managed programming languages including Java and Scala are very popular for data analytics and mobile applications. However, they often face challenging issues due to the overhead caused by the automatic memory management to detect and reclaim free available memory. It has been observed that during their Garbage Collection (GC), excessively long pauses can account for up to 40 % of the total execution time. Therefore, mitigating the GC overhead has been an active research topic to satisfy today's application requirements. This paper proposes a new technique called SwapVA to improve data copying in the copying/moving phases of GCs and reduce the GC pause time, thereby mitigating the issue of GC overhead. Our contribution is twofold. First, a SwapVA system call is introduced as a zero-copy technique to accelerate the GC copying/moving phase. Second, for the demonstration of its effectiveness, we have integrated SwapVA into SVAGC as an implementation of scalable Full GC on multi-core systems. Based on our results, the proposed solutions can dramatically reduce the GC pause in applications with large objects by as much as 70.9% and 97%, respectively, in the Sparse.large/4 (one quarter of the default input size) and Sigverify benchmarks.
  4. While permissioned blockchains enable a family of data center applications, existing systems suffer from imbalanced loads across compute and memory, exacerbating the underutilization of cloud resources. This paper presents FlexChain , a novel permissioned blockchain system that addresses this challenge by physically disaggregating CPUs, DRAM, and storage devices to process different blockchain workloads efficiently. Disaggregation allows blockchain service providers to upgrade and expand hardware resources independently to support a wide range of smart contracts with diverse CPU and memory demands. Moreover, it ensures efficient resource utilization and hence prevents resource fragmentation in a data center. We have explored the design of XOV blockchain systems in a disaggregated fashion and developed a tiered key-value store that can elastically scale its memory and storage. Our design significantly speeds up the execution stage. We have also leveraged several techniques to parallelize the validation stage in FlexChain to further improve the overall blockchain performance. Our evaluation results show that FlexChain can provide independent compute and memory scalability, while incurring at most 12.8% disaggregation overhead. FlexChain achieves almost identical throughput as the state-of-the-art distributed approaches with significantly lower memory and CPU consumption for compute-intensive and memory-intensive workloads respectively.
  5. Managed languages such as Java and Scala are prevalently used in development of large-scale distributed systems. Under the managed runtime, when performing data transfer across machines, a task frequently conducted in a Big Data system, the system needs to serialize a sea of objects into a byte sequence before sending them over the network. The remote node receiving the bytes then deserializes them back into objects. This process is both performance-inefficient and labor-intensive: (1) object serialization/deserialization makes heavy use of reflection, an expensive runtime operation and/or (2) serialization/deserialization functions need to be hand-written and are error-prone. This paper presents Skyway, a JVM-based technique that can directly connect managed heaps of different (local or remote) JVM processes. Under Skyway, objects in the source heap can be directly written into a remote heap without changing their formats. Skyway provides performance benefits to any JVM-based system by completely eliminating the need (1) of invoking serialization/deserialization functions, thus saving CPU time, and (2) of requiring developers to hand-write serialization functions.