The NUMA architecture accommodates the hardware trend of an increasing number of CPU cores. It requires the coop- eration of memory allocators to achieve good performance for multithreaded applications. Unfortunately, existing allo- cators do not support NUMA architecture well. This paper presents a novel memory allocator – NUMAlloc , that is de- signed for the NUMA architecture. NUMAlloc is centered on a binding-based memory management. On top of it, NUMAl- loc proposes an “origin-aware memory management” to ensure the locality of memory allocations and deallocations, as well as a method called “incremental sharing” to balance the performance benefits and memory overhead of using transparent huge pages. According to our extensive evalua- tion, NUMAlloc hasthebestperformanceamongallevaluated allocators, running 15.7% faster than the second-best allo- cator (mimalloc), and 20.9% faster than the default Linux allocator with reasonable memory overhead. NUMAlloc is also scalable to 128 threads and is ready for deployment.
more »
« less
NUMAlloc: A Faster NUMA Memory Allocator
The NUMA architecture accommodates the hardware trend of an increasing number of CPU cores. It requires the cooperation of memory allocators to achieve good performance for multithreaded applications. Unfortunately, existing allocators do not support NUMA architecture well. This paper presents a novel memory allocator – NUMAlloc, that is designed for the NUMA architecture. is centered on a binding-based memory management. On top of it, proposes an “origin-aware memory management” to ensure the locality of memory allocations and deallocations, as well as a method called “incremental sharing” to balance the performance benefits and memory overhead of using transparent huge pages. According to our extensive evaluation, NUMAlloc has the best performance among all evaluated allocators, running 15.7% faster than the second-best allocator (mimalloc), and 20.9% faster than the default Linux allocator with reasonable memory overhead. NUMAlloc is also scalable to 128 threads and is ready for deployment.
more »
« less
- Award ID(s):
- 2215193
- PAR ID:
- 10514923
- Publisher / Repository:
- ACM
- Date Published:
- ISBN:
- 9798400701795
- Page Range / eLocation ID:
- 97 to 110
- Format(s):
- Medium: X
- Location:
- Orlando FL USA
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
The memory allocator plays a key role in the performance of applications, but none of the existing profilers can pinpoint performance slowdowns caused by a memory allocator. Consequently, programmers may spend time improving application code incorrectly or unnecessarily, achieving low or no performance improvement. This paper designs the first profiler—MemPerf—to identify allocator-induced performance slowdowns without comparing against another allocator. Based on the key observation that an allocator may impact the whole life-cycle of heap objects, including the accesses (or uses) of these objects, MemPerf proposes a life-cycle based detection to identify slowdowns caused by slow memory management operations and slow accesses separately. For the prior one, MemPerf proposes a thread-aware and type-aware performance modeling to identify slow management operations. For slow memory accesses, MemPerf utilizes a top-down approach to identify all possible reasons for slow memory accesses introduced by the allocator, mainly due to cache and TLB misses, and further proposes a unified method to identify them correctly and efficiently. Based on our extensive evaluation, MemPerf reports 98% medium and large allocator-reduced slowdowns (larger than 5%) correctly without reporting any false positives. MemPerf also pinpoints multiple known and unknown design issues in widely-used allocators.more » « less
-
The proliferation of fast, dense, byte-addressable nonvolatile memory suggests the possibility of keeping data in pointer-rich “in-memory” format across program runs and even crashes. For full generality, such data requires dynamic memory allocation. Toward this end, we introduce _recoverability_, a correctness criterion for persistent allocators, together with a nonblocking allocator, _Ralloc_, that satisfies this criterion. Ralloc is based on _LRMalloc_, with three key innovations. First, we persist just enough information during normal operation to permit reconstruction of the heap after a full-system crash. Our reconstruction mechanism performs garbage collection (GC) to identify and remedy any failure-induced memory leaks. Second, in support of GC, we introduce the notion of _filter functions_, which identify the locations of pointers within persistent blocks. Third, to allow persistent regions to be mapped at an arbitrary address, we employ the position-independent pointer representation of Chen et al., both in data and in allocator metadata. Experiments show that Ralloc provides scalable performance competitive to that of both _Makalu_, the state-of-the-art lock-based persistent allocator, and the best transient allocators (e.g.,_JEMalloc_).more » « less
-
The proliferation of fast, dense, byte-addressable nonvolatile memory suggests that data might be kept in pointer-rich “in-memory” format across program runs and even process and system crashes. For full generality, such data requires dynamic memory allocation, and while the allocator could in principle be “rolled into” each data structure, it is desirableto make it a separate abstraction. Toward this end, we introduce _recoverability_, a correctness criterion for persistent allocators, together with a nonblocking allocator, _Ralloc,_ that satisfies this criterion. Ralloc is based on the _LRMalloc_ of Leite and Rocha, with four key innovations: First, we persist just enough information during normal operation to permit a garbage collection (GC) pass to correctly reconstruct the heap in the wake of a full-system crash. Second, we introduce the notion of _filter functions_, which identify the locations of pointers within persistent blocks to mitigate the limitations of conservative GC. Third, we reorganize the layout of the heap to facilitate the incremental allocation of physical space. Fourth, we employ position-independent (offset-based) pointers to allow persistent regions to be mapped at an arbitrary address. Experiments show Ralloc to be performance-competitive with both _Makalu_, the state-of-the-art lock-based persistent allocator, and such transient allocators as LRMalloc and JEMalloc. In particular, reliance on GC and offline metadata reconstruction allows Ralloc to pay almost nothing for persistence during normal operation.more » « less
-
Memory allocation is increasingly important to parallel performance, yet it is challenging because a program has data of many sizes, and the demand differs from thread to thread. Modern allocators use highly tuned heuristics but do not provide uniformly good performance when the level of concurrency increases from a few threads to hundreds of threads. This paper presents a new timescale theory to model the memory demand in real time. Using the new theory, an allocator can ad- just its synchronization frequency using a single parameter called allocations per fetch (apf ). The paper presents the timescale the- ory, the design and implementation of APF tuning in an existing allocator, and evaluation of the effect on program speed and mem- ory efficiency. APF tuning improves the throughput of MongoDB by 55%, reduces the tail latency of a Web server by over 60%, and increases the speed of a selection of synthetic benchmarks by up to 24× while using the same amount of memory.more » « less