
Title: Understanding and optimizing persistent memory allocation
The proliferation of fast, dense, byte-addressable nonvolatile memory suggests that data might be kept in pointer-rich “in-memory” format across program runs and even process and system crashes. For full generality, such data requires dynamic memory allocation, and while the allocator could in principle be “rolled into” each data structure, it is desirable to make it a separate abstraction. Toward this end, we introduce _recoverability_, a correctness criterion for persistent allocators, together with a nonblocking allocator, _Ralloc_, that satisfies this criterion. Ralloc is based on the _LRMalloc_ of Leite and Rocha, with four key innovations. First, we persist just enough information during normal operation to permit a garbage collection (GC) pass to correctly reconstruct the heap in the wake of a full-system crash. Second, we introduce the notion of _filter functions_, which identify the locations of pointers within persistent blocks to mitigate the limitations of conservative GC. Third, we reorganize the layout of the heap to facilitate the incremental allocation of physical space. Fourth, we employ position-independent (offset-based) pointers to allow persistent regions to be mapped at an arbitrary address. Experiments show Ralloc to be performance-competitive with both _Makalu_, the state-of-the-art lock-based persistent allocator, and such transient allocators as LRMalloc and JEMalloc. In particular, reliance on GC and offline metadata reconstruction allows Ralloc to pay almost nothing for persistence during normal operation.
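To make the fourth innovation concrete, the sketch below shows one common way to realize a position-independent (self-relative, offset-based) pointer in C++. It is illustrative only; the class name and encoding are assumptions, not Ralloc's actual representation.

```cpp
#include <cstdint>

// Minimal sketch of an offset-based ("position-independent") pointer.
// Instead of an absolute address, the field stores the signed distance
// from its own location to the target, so the containing persistent
// region can be mapped at an arbitrary base address.
// (A full implementation would also re-encode the offset when one
// OffsetPtr is copied or assigned to another.)
template <typename T>
class OffsetPtr {
    std::intptr_t off_ = 0;  // 0 encodes null; otherwise target - &off_

public:
    OffsetPtr& operator=(T* p) {
        off_ = p ? reinterpret_cast<std::intptr_t>(p) -
                   reinterpret_cast<std::intptr_t>(&off_)
                 : 0;
        return *this;
    }

    T* get() const {
        return off_ ? reinterpret_cast<T*>(
                          reinterpret_cast<std::intptr_t>(&off_) + off_)
                    : nullptr;
    }

    T& operator*() const { return *get(); }
    T* operator->() const { return get(); }
};
```

Because the field stores a distance rather than an address, mapping (or copying) the whole region at a different base leaves intra-region pointers valid.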
Award ID(s):
1422649 1717712 1900803
PAR ID:
10187404
Author(s) / Creator(s):
Date Published:
Journal Name:
International Symposium on Memory Management
Page Range / eLocation ID:
60 to 73
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. The proliferation of fast, dense, byte-addressable nonvolatile memory suggests the possibility of keeping data in pointer-rich “in-memory” format across program runs and even crashes. For full generality, such data requires dynamic memory allocation. Toward this end, we introduce _recoverability_, a correctness criterion for persistent allocators, together with a nonblocking allocator, _Ralloc_, that satisfies this criterion. Ralloc is based on _LRMalloc_, with three key innovations. First, we persist just enough information during normal operation to permit reconstruction of the heap after a full-system crash. Our reconstruction mechanism performs garbage collection (GC) to identify and remedy any failure-induced memory leaks. Second, in support of GC, we introduce the notion of _filter functions_, which identify the locations of pointers within persistent blocks. Third, to allow persistent regions to be mapped at an arbitrary address, we employ the position-independent pointer representation of Chen et al., both in data and in allocator metadata. Experiments show that Ralloc provides scalable performance competitive with that of both _Makalu_, the state-of-the-art lock-based persistent allocator, and the best transient allocators (e.g., _JEMalloc_).
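As an illustration of the filter-function idea from the two Ralloc abstracts above, the sketch below shows what a per-type filter might look like: a callback that reports where the pointer fields live inside a block. The type, signature, and names are hypothetical, not Ralloc's actual API.

```cpp
#include <functional>
#include <vector>

// Illustrative only: a "filter function" tells the recovery-time GC where
// the pointers live inside a block of a given user type, so the collector
// need not conservatively treat every word as a potential pointer.
struct PersistentNode {
    long            key;
    PersistentNode* next;        // the only pointer field in this type
    char            payload[48];
};

// Given a block, report the addresses of its embedded pointer fields.
using FilterFn = std::function<void(void* block,
                                    std::vector<void**>& out_ptrs)>;

FilterFn node_filter = [](void* block, std::vector<void**>& out) {
    auto* n = static_cast<PersistentNode*>(block);
    out.push_back(reinterpret_cast<void**>(&n->next));
};
```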
  2. Persistent memory allocation is a fundamental building block for developing high-performance and in-memory applications. Existing persistent memory allocators suffer from many performance issues. First, they may introduce repeated cache line flushes and small random accesses in persistent memory because of their poor heap metadata management. Second, they use static slab segregation, resulting in a dramatic increase in memory consumption when the allocation request size changes. Third, they are not aware of the NUMA effect, leading to remote persistent-memory accesses during allocation and deallocation. In this article, we design a novel allocator, named PMAlloc, to solve the above issues simultaneously. (1) PMAlloc eliminates cache line reflushes by mapping contiguous data blocks in slabs to interleaved metadata entries stored in different cache lines. (2) It writes small metadata units to a persistent bookkeeping log in a sequential pattern to remove random heap metadata accesses in persistent memory. (3) Instead of using static slab segregation, it supports slab morphing, which allows slabs to be transformed between size classes to significantly improve slab usage. (4) It uses a local-first allocation policy to avoid allocating remote memory blocks, and it supports a two-phase deallocation mechanism, including recording and synchronization, to minimize the number of remote memory accesses during deallocation. PMAlloc is complementary to existing consistency models. Results on six benchmarks demonstrate that PMAlloc improves the performance of state-of-the-art persistent memory allocators by up to 6.4× and 57× for small and large allocations, respectively. PMAlloc with NUMA optimizations brings a 2.9× speedup in multi-socket evaluation and is up to 36× faster than other persistent memory allocators. Using PMAlloc reduces memory usage by up to 57.8%. In addition, we integrate PMAlloc into a persistent FPTree. Compared to the state-of-the-art allocators, PMAlloc improves the performance of this application by up to 3.1×.
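A conceptual sketch of the interleaved-metadata idea in point (1) above: consecutive data blocks map to metadata entries in different cache lines, so a run of allocations does not repeatedly flush the same metadata line. The sizes and the index formula below are assumptions chosen for illustration, not PMAlloc's actual layout.

```cpp
#include <cstddef>

// Scatter the metadata entries of consecutive data blocks across
// different cache lines (illustration only).
constexpr std::size_t kCacheLine      = 64;
constexpr std::size_t kEntrySize      = 8;                       // bytes per metadata entry
constexpr std::size_t kEntriesPerLine = kCacheLine / kEntrySize;  // 8 entries per line
constexpr std::size_t kBlocksPerSlab  = 512;
constexpr std::size_t kMetaLines      = kBlocksPerSlab / kEntriesPerLine;

// Block i's metadata goes to line (i % kMetaLines), slot (i / kMetaLines),
// so blocks i and i+1 land in different cache lines.
std::size_t metadata_index(std::size_t block_idx) {
    std::size_t line = block_idx % kMetaLines;
    std::size_t slot = block_idx / kMetaLines;
    return line * kEntriesPerLine + slot;
}
```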
  3. Serverless computing is an increasingly attractive paradigm in the cloud due to its ease of use and fine-grained pay-for-what-you-use billing. However, serverless computing poses new challenges to system design due to its short-lived function execution model. Our detailed analysis reveals that memory management accounts for a significant fraction of function execution cycles. This is because functions pay the full critical-path costs of memory management in both userspace and the operating system without the opportunity to amortize these costs over their short lifetimes. To address this problem, we propose Memento, a new hardware-centric memory management design based on our insight that memory allocations in serverless functions are typically small and are either freed quickly after allocation or freed when the function exits. Memento alleviates the overheads of serverless memory management by introducing two key mechanisms: (i) a hardware object allocator that performs in-cache memory allocation and free operations based on arenas, and (ii) a hardware page allocator that manages a small pool of physical pages used to replenish arenas of the object allocator. Together these mechanisms alleviate memory management overheads and bypass costly userspace and kernel operations. Memento integrates naturally with existing software stacks through a set of ISA extensions that support multiple language runtimes. Finally, Memento leverages the newly exposed memory allocation semantics in hardware to introduce a main memory bypass mechanism and avoid unnecessary DRAM accesses for newly allocated objects. We evaluate Memento with full-system simulations across a diverse set of containerized serverless workloads and language runtimes. The results show that Memento achieves function execution speedups ranging from 8% to 28%, with 16% on average. Furthermore, Memento's hardware allocators and main-memory bypass mechanism reduce main memory traffic by 30% on average. The combined effects of Memento reduce the pricing cost of function execution by 29%. Finally, we demonstrate the applicability of Memento beyond functions, to major serverless platform operations and long-running data processing applications.
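The arena mechanism that Memento moves into hardware can be pictured with a small software analogue: objects are bump-allocated from an arena, and an exhausted arena is replenished from a pool of pages. The sketch below is only that analogue; the constants and helper names are assumptions, and Memento itself performs these steps in hardware.

```cpp
#include <cstddef>
#include <cstdlib>

// Software analogue (illustration only) of arena-based allocation:
// bump-allocate small objects from an arena and replenish the arena
// from a page pool when it is exhausted.
constexpr std::size_t kArenaSize = 4096;

struct Arena {
    char*       base = nullptr;
    std::size_t used = 0;
};

char* arena_alloc(Arena& a, std::size_t n) {
    n = (n + 15) & ~std::size_t{15};          // 16-byte alignment
    if (n > kArenaSize) return nullptr;       // large objects bypass the arena
    if (a.base == nullptr || a.used + n > kArenaSize) {
        // Replenish: grab a fresh page-sized arena from the pool.
        a.base = static_cast<char*>(std::aligned_alloc(kArenaSize, kArenaSize));
        a.used = 0;
    }
    char* p = a.base + a.used;
    a.used += n;
    return p;
}
```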
  4. The memory allocator plays a key role in the performance of applications, but none of the existing profilers can pinpoint performance slowdowns caused by a memory allocator. Consequently, programmers may spend time improving application code incorrectly or unnecessarily, achieving little or no performance improvement. This paper presents MemPerf, the first profiler to identify allocator-induced performance slowdowns without comparing against another allocator. Based on the key observation that an allocator may impact the whole life-cycle of heap objects, including the accesses (or uses) of these objects, MemPerf proposes a life-cycle-based detection scheme that separately identifies slowdowns caused by slow memory-management operations and by slow memory accesses. For the former, MemPerf uses thread-aware and type-aware performance modeling to identify slow management operations. For the latter, MemPerf uses a top-down approach to identify all possible allocator-induced causes of slow memory accesses, mainly cache and TLB misses, and further proposes a unified method to identify them correctly and efficiently. In our extensive evaluation, MemPerf correctly reports 98% of medium and large allocator-induced slowdowns (larger than 5%) without reporting any false positives. MemPerf also pinpoints multiple known and unknown design issues in widely used allocators.
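A greatly simplified picture of one half of MemPerf's life-cycle view, attributing time spent inside memory-management operations, is a wrapper that times each allocation per thread. This is an assumption-laden illustration of the idea only, not MemPerf's actual thread- and type-aware modeling.

```cpp
#include <chrono>
#include <cstdio>
#include <cstdlib>

// Illustration only: attribute time to the allocator by timing each
// allocation call, accumulated per thread.
thread_local unsigned long long g_alloc_calls = 0;
thread_local unsigned long long g_alloc_ns    = 0;

void* timed_malloc(std::size_t n) {
    auto t0 = std::chrono::steady_clock::now();
    void* p = std::malloc(n);
    auto t1 = std::chrono::steady_clock::now();
    ++g_alloc_calls;
    g_alloc_ns += static_cast<unsigned long long>(
        std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count());
    return p;
}

void report_thread_alloc_time() {
    std::printf("allocations: %llu, mean latency: %llu ns\n",
                g_alloc_calls,
                g_alloc_calls ? g_alloc_ns / g_alloc_calls : 0ULL);
}
```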
  5. Garbage collection (GC) support for unmanaged languages can reduce the programming burden of reasoning about the liveness of dynamic objects. It also avoids temporal memory safety violations and memory leaks. Sound GC for weakly typed languages such as C/C++, however, remains an unsolved problem. Current value-based GC solutions examine the values of memory locations to discover pointers and the objects they point to. The approach is inherently unsound in the presence of arbitrary type casts and pointer manipulations, which are legal in C/C++. Such language features are regularly used, especially in low-level systems code. In this paper, we propose Dynamic Pointer Provenance Tracking to realize sound GC. We observe that pointers cannot be created out of thin air; they must have provenance to at least one valid allocation. Therefore, by tracking pointer provenance from the source (e.g., malloc) through both explicit data flow and implicit control flow, our GC has sound and precise information to compute the set of all reachable objects at any program state. We discuss several static analysis optimizations that can be employed during compilation, aided by profiling, to reduce the overhead of dynamic provenance tracking from nearly 8× to 16% for well-behaved programs that adhere to the C standard. Provenance-based sound GC invocation is also 13% faster and reclaims 6% more memory on average, compared to an unsound value-based GC.
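To illustrate the source side of provenance tracking, the sketch below records each allocation's extent at malloc time and maps any derived address back to its allocation. It is a conceptual illustration only; the real system propagates provenance dynamically through data and control flow rather than by range lookup, and all names here are assumptions.

```cpp
#include <cstdint>
#include <cstdlib>
#include <map>

// Record provenance at the allocation source (illustration only): every
// tracked allocation registers its [base, base+size) extent, and any
// address derived from it can be mapped back to that allocation.
static std::map<std::uintptr_t, std::size_t> g_allocs;  // base -> size

void* tracked_malloc(std::size_t n) {
    void* p = std::malloc(n);
    if (p) g_allocs[reinterpret_cast<std::uintptr_t>(p)] = n;
    return p;
}

// Return the base of the allocation that 'addr' was derived from, or 0.
std::uintptr_t provenance_of(const void* addr) {
    std::uintptr_t a = reinterpret_cast<std::uintptr_t>(addr);
    auto it = g_allocs.upper_bound(a);   // first base strictly greater than a
    if (it == g_allocs.begin()) return 0;
    --it;                                // candidate allocation at or below a
    return (a < it->first + it->second) ? it->first : 0;
}
```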