

Title: Optimizing for KNL Usage Modes When Data Doesn’t Fit in MCDRAM
Technologies such as Multi-Channel DRAM (MCDRAM) or High Bandwidth Memory (HBM) provide significantly more bandwidth than conventional memory. This trend has raised questions about how applications should manage data transfers between memory levels. This paper focuses on evaluating different usage modes of the MCDRAM in Intel Knights Landing (KNL) manycore processors. We evaluate these usage modes with a sorting kernel and a sorting-based streaming benchmark. We develop a performance model for the benchmark and use experimental evidence to demonstrate the correctness of the model. The model projects near-optimal numbers of copy threads for memory-bandwidth-bound computations. On KNL, when the problem does not fit in MCDRAM, we demonstrate up to a 1.9X speedup for sort over an OpenMP GNU sort that does not use MCDRAM.
Award ID(s):
1642280
NSF-PAR ID:
10064736
Author(s) / Creator(s):
; ; ; ;
Date Published:
Journal Name:
International Conference on Parallel Processing
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Many high-performance systems now include different types of memory devices within the same compute platform to meet strict performance and cost constraints. Such heterogeneous memory systems often include an upper-level tier with better performance but limited capacity, and lower-level tiers with higher capacity but less bandwidth and longer latencies for reads and writes. To utilize the different memory layers efficiently, current systems rely on hardware-directed, memory-side caching, or they provide facilities in the operating system (OS) that allow applications to make their own data-tier assignments. Since these data management options each come with their own set of trade-offs, many systems also include mixed data management configurations that allow applications to employ hardware- and software-directed management simultaneously, but for different portions of their address space. Despite the opportunity to address limitations of stand-alone data management options, such mixed management modes are under-utilized in practice and have not been evaluated in prior studies of complex memory hardware. In this work, we develop custom program profiling, configurations, and policies to study the potential of mixed data management modes to outperform hardware- or software-based management schemes alone. Our experiments, conducted on an Intel® Knights Landing platform with high-bandwidth memory, demonstrate that the mixed data management mode achieves the same or better performance than the best stand-alone option for five memory-intensive benchmark applications (run separately and in isolation), with an average speedup of over 10% compared to the best stand-alone policy.
  2. SpMV, the product of a sparse matrix and a dense vector, is emblematic of a new class of applications that are driven by memory bandwidth and communication rather than flops. Sparsity and randomness in such computations play havoc with performance, especially when strong, instead of weak, scaling is attempted. In this study we develop and evaluate a hybrid implementation for strong scaling of the Compressed Vectorization-oriented sparse Row (CVR) approach to SpMV on a cluster of Intel Xeon Phi Knights Landing (KNL) processors. We show how our hybrid SpMV implementation achieves increased computational performance, yet does not address the dominant communication overhead factor at extreme scale. Issues with workload distribution, data placement, and remote reductions are assessed over a range of matrix characteristics. Our results indicate that as P → ∞, communication overhead is by far the dominant factor despite improved computational performance.
  3. SpMV, the product of a sparse matrix and a dense vector, is emblematic of a new class of applications that are driven by memory bandwidth and communication rather than flops. Sparsity and randomness in such computations play havoc with performance, especially when strong, instead of weak, scaling is attempted. In this study we develop and evaluate a hybrid implementation for strong scaling of the Compressed Vectorization-oriented sparse Row (CVR) approach to SpMV on a cluster of Intel Xeon Phi Knights Landing (KNL) processors. We show how our hybrid SpMV implementation achieves increased computational performance, yet does not address the dominant communication overhead factor at extreme scale. Issues with workload distribution, data placement, and remote reductions are assessed over a range of matrix characteristics. Our results indicate that as P → ∞, communication overhead is by far the dominant factor despite improved computational performance.
  4. The Intel Knights Landing (KNL) manycore chip includes 3D-stacked memory named MCDRAM, also known as High Bandwidth Memory (HBM), for parallel applications that need to scale to high thread counts. In this paper, we provide a quantitative study of the KNL for HPC proxy applications, including Lulesh, HPCG, AMG, and Hotspot, when using DDR4 and MCDRAM. The results indicate that HBM significantly improves the performance of memory-intensive applications, by as much as three times over DDR4 in HPCG, and by as much as 40% and 200% in Lulesh and HPCG, respectively. For the selected compute-intensive applications, the performance advantage of MCDRAM over DDR4 varies from 2% to 28%. We also observed that the crossover points, where MCDRAM starts outperforming DDR4, are around 8 to 16 threads.
  5. The process of adapting code to take advantage of the latest innovations in a supercomputing platform begins with learning the details of the platform's underlying hardware. It can be challenging for many users to spend the time and effort needed to understand the innovative features of a supercomputing platform, such as deep memory hierarchies, and to harness their maximum possible performance by manually modernizing their applications. To mitigate this challenge, we are developing an Interactive Code Adaptation Tool (ICAT). In its current form, ICAT can assist users in modifying, compiling, and optimally running their applications on the latest HPC platforms equipped with Intel Knights Landing (KNL) processors. ICAT detects a given application's characteristics, such as memory usage pattern, type of memory allocation, and execution time. Depending on these characteristics, it advises the user on optimal ways to take advantage of the KNL processor and its memory hierarchy.