Efficient User-Level Storage Disaggregation for Deep Learning

Zhu, Yue; Yu, Weikuan; Jiao, Bing; Mohror, Kathryn; Moody, Adam; Chowdhury, Fahim

doi:10.1109/CLUSTER.2019.8891023

On large-scale high-performance computing (HPC) systems, applications are provisioned with aggregated resources to meet their peak demands for brief periods. This results in resource underutilization because application requirements vary widely during execution. The problem is particularly pronounced for deep learning applications running on leadership HPC systems with a large pool of burst buffers in the form of flash or non-volatile memory (NVM) devices. In this paper, we examine the I/O patterns of deep neural networks and reveal their critical need to load many small samples randomly for successful training. We have designed a specialized Deep Learning File System (DLFS) that provides a thin set of APIs. In particular, we design the metadata management of DLFS through an in-memory tree-based sample directory and its file services through the user-level SPDK protocol, which can disaggregate the capabilities of NVM Express (NVMe) devices to parallel training tasks. Our experimental results show that DLFS can dramatically improve the throughput of training for deep neural networks on NVMe over Fabric, compared with the kernel-based Ext4 file system. Furthermore, DLFS achieves efficient user-level storage disaggregation with very little CPU utilization.

Award ID(s): 1744336 1763547 1822737 1564647 1561041

PAR ID: 10156300

Author(s) / Creator(s): Zhu, Yue; Yu, Weikuan; Jiao, Bing; Mohror, Kathryn; Moody, Adam; Chowdhury, Fahim

Date Published: 2019-09-01

Journal Name: 2019 IEEE International Conference on Cluster Computing (CLUSTER)

Page Range / eLocation ID: 1 to 12

Format(s): Medium: X

Sponsoring Org: National Science Foundation

Free Publicly Accessible Full Text
Accepted Manuscript 1.0
Conference Paper: https://doi.org/10.1109/CLUSTER.2019.8891023
