MLPerf™ HPC: A Holistic Benchmark Suite for Scientific Machine Learning on HPC Systems

Farrell, Steven; Emani, Murali; Balma, Jacob; Drescher, Lukas; Drozd, Aleksandr; Fink, Andreas; Fox, Geoffrey; Kanter, David; Kurth, Thorsten; Mattson, Peter; Mu, Dawei; Ruhela, Amit; Sato, Kento; Shirahata, Koichi; Tabaru, Tsuguchika; Tsaris, Aristeidis; Balewski, Jan; Cumming, Ben; Danjo, Takumi; Domke, Jens; Fukai, Takaaki; Fukumoto, Naoto; Fukushi, Tatsuya; Gerofi, Balazs; Honda, Takumi; Imamura, Toshiyuki; Kasagi, Akihiko; Kawakami, Kentaro; Kudo, Shuhei; Kuroda, Akiyoshi; Martinasso, Maxime; Matsuoka, Satoshi; Mendonca, Henrique; Minami, Kazuki; Ram, Prabhat; Sawada, Takashi; Shankar, Mallikarjun; St. John, Tom; Tabuchi, Akihiro; Vishwanath, Venkatram; Wahib, Mohamed; Yamazaki, Masafumi; Yin, Junqi

doi:10.1109/MLHPC54614.2021.00009

Scientific communities are increasingly adopting machine learning and deep learning models in their applications to accelerate scientific insights. High performance computing systems are pushing the frontiers of performance with a rich diversity of hardware resources and massive scale-out capabilities. There is a critical need to understand fair and effective benchmarking of machine learning applications that are representative of real-world scientific use cases. MLPerf™ is a community-driven standard to benchmark machine learning workloads, focusing on end-to-end performance metrics. In this paper, we introduce MLPerf HPC, a benchmark suite of large-scale scientific machine learning training applications, driven by the MLCommons™ Association. We present the results from the first submission round including a diverse set of some of the world’s largest HPC systems. We develop a systematic framework for their joint analysis and compare them in terms of data staging, algorithmic convergence and compute performance. As a result, we gain a quantitative understanding of optimizations on different subsystems such as staging and on-node loading of data, compute-unit utilization and communication scheduling enabling overall >10× (end-to-end) performance improvements through system scaling. Notably, our analysis shows a scale-dependent interplay between the dataset size, a system’s memory hierarchy and training convergence that underlines the importance of near-compute storage. To overcome the data-parallel scalability challenge at large batch-sizes, we discuss specific learning techniques and hybrid data-and-model parallelism that are effective on large systems. We conclude by characterizing each benchmark with respect to low-level memory, I/O and network behaviour to parameterize extended roofline performance models in future rounds.
