Timescale functions for parallel memory allocation

https://doi.org/10.1145/3315573.3329987

Li, Pengcheng; Luo, Hao; Ding, Chen (January 2019, Proceedings of the 2019 ACM SIGPLAN International Symposium on Memory Management, ISMM 2019)

Memory allocation is increasingly important to parallel performance, yet it is challenging because a program has data of many sizes, and the demand differs from thread to thread. Modern allocators use highly tuned heuristics but do not provide uniformly good performance when the level of concurrency increases from a few threads to hundreds of threads. This paper presents a new timescale theory to model the memory demand in real time. Using the new theory, an allocator can adjust its synchronization frequency using a single parameter called allocations per fetch (apf). The paper presents the timescale theory, the design and implementation of APF tuning in an existing allocator, and evaluation of the effect on program speed and memory efficiency. APF tuning improves the throughput of MongoDB by 55%, reduces the tail latency of a Web server by over 60%, and increases the speed of a selection of synthetic benchmarks by up to 24× while using the same amount of memory.

Modern software executes a large amount of code. Previous techniques of code layout optimization were developed one or two decades ago and have become inadequate to cope with the scale and complexity of new types of applications such as compilers, browsers, interpreters, language VMs and shared libraries. This paper presents Codestitcher, an inter-procedural basic block code layout optimizer which reorders basic blocks in an executable to benefit from better cache and TLB performance. Codestitcher provides a hierarchical framework which can be used to improve locality in various layers of the memory hierarchy. Our evaluation shows that Codestitcher improves the performance of the original program by 3% to 25% (on average, by 10%) on 5 widely used applications with large code sizes: MySQL, Clang, Firefox, Apache, and Python. It gives an additional improvement of 4% over LLVM’s PGO and 3% over PGO combined with the best function reordering technique.
