HAL: Computer System for Scalable Deep Learning

Kindratenko, Volodymyr; Mu, Dawei; Zhan, Yan; Maloney, John; Hashemi, Sayed Hadi; Rabe, Benjamin; Xu, Ke; Campbell, Roy; Peng, Jian; Gropp, William

doi:10.1145/3311790.3396649



We describe the design, deployment, and operation of a computer system built to efficiently run deep learning frameworks. The system consists of 16 IBM POWER9 servers with 4 NVIDIA V100 GPUs each, interconnected with a Mellanox EDR InfiniBand fabric, and a DDN all-flash storage array. The system is tailored towards efficient execution of the IBM Watson Machine Learning enterprise software stack, which combines popular open-source deep learning frameworks. We build a custom management software stack to enable efficient use of the system by a diverse community of users, and we provide guides and recipes for running deep learning workloads at scale utilizing all available GPUs. We demonstrate scaling of PyTorch- and TensorFlow-based deep neural networks to produce state-of-the-art performance results.

Award ID(s): 1725729

PAR ID: 10190049

Author(s) / Creator(s): Kindratenko, Volodymyr; Mu, Dawei; Zhan, Yan; Maloney, John; Hashemi, Sayed Hadi; Rabe, Benjamin; Xu, Ke; Campbell, Roy; Peng, Jian; Gropp, William

Date Published: 2020-01-01

Journal Name: PEARC '20: Practice and Experience in Advanced Research Computing

Page Range / eLocation ID: 41 to 48

Format(s): Medium: X

Sponsoring Org: National Science Foundation

Conference Paper:
https://doi.org/10.1145/3311790.3396649
