Delay sensitivity-driven congestion mitigation for HPC systems

Patke, Archit; Jha, Saurabh; Qiu, Haoran; Brandt, Jim; Gentile, Ann; Greenseid, Joe; Kalbarczyk, Zbigniew; Iyer, Ravishankar K.

doi:10.1145/3447818.3460362

Citation Details

Modern high-performance computing (HPC) systems concurrently execute multiple distributed applications that contend for the high-speed network, leading to congestion. Consequently, application runtime variability and suboptimal system utilization are observed in production systems. To address these problems, we propose Netscope, a congestion mitigation framework based on a novel delay sensitivity metric. Delay sensitivity of an application is used to quantify the impact of congestion on its runtime. Netscope uses delay sensitivity estimates to drive a congestion mitigation mechanism to selectively throttle applications that are less susceptible to congestion. We evaluate Netscope on two Cray Aries systems, including a production supercomputer, on common scientific applications. Our evaluation shows that Netscope has a low training cost and accurately estimates the impact of congestion on application runtime with a correlation between 0.7 and 0.9. Moreover, Netscope reduces application tail runtime increase by up to 16.3x while improving the median system utility by 12%.

Award ID(s): 2029049

PAR ID: 10292980

Author(s) / Creator(s): Patke, Archit; Jha, Saurabh; Qiu, Haoran; Brandt, Jim; Gentile, Ann; Greenseid, Joe; Kalbarczyk, Zbigniew; Iyer, Ravishankar K.

Date Published: 2021-06-03

Journal Name: Proceedings of the 35th ACM International Conference on Supercomputing (ICS ’21)

Page Range / eLocation ID: 342 to 353

Sponsoring Org: National Science Foundation

Conference Paper: https://doi.org/10.1145/3447818.3460362
