%AKindratenko, Volodymyr
%AMu, Dawei
%AZhan, Yan
%AMaloney, John
%AHashemi, Sayed
%ARabe, Benjamin
%AXu, Ke
%ACampbell, Roy
%APeng, Jian
%AGropp, William
%D2020
%I
%K
%MOSTI ID: 10190049
%PMedium: X
%THAL: Computer System for Scalable Deep Learning
%XWe describe the design, deployment, and operation of a computer system built to efficiently run deep learning frameworks. The system consists of 16 IBM POWER9 servers with 4 NVIDIA V100 GPUs each, interconnected with a Mellanox EDR InfiniBand fabric, and a DDN all-flash storage array. The system is tailored towards efficient execution of the IBM Watson Machine Learning enterprise software stack, which combines popular open-source deep learning frameworks. We build a custom management software stack to enable efficient use of the system by a diverse community of users, and we provide guides and recipes for running deep learning workloads at scale utilizing all available GPUs. We demonstrate scaling of PyTorch- and TensorFlow-based deep neural networks to produce state-of-the-art performance results.
Country unknown/Code not available
https://doi.org/10.1145/3311790.3396649
OSTI-MSA