
Efficient User-Level Storage Disaggregation for Deep Learning

https://doi.org/10.1109/CLUSTER.2019.8891023

Zhu, Yue; Yu, Weikuan; Jiao, Bing; Mohror, Kathryn; Moody, Adam; Chowdhury, Fahim (September 2019, 2019 IEEE International Conference on Cluster Computing (CLUSTER))

On large-scale high-performance computing (HPC) systems, applications are provisioned with aggregated resources to meet peak demands that last only briefly. This results in resource underutilization because application requirements vary considerably during execution. The problem is particularly pronounced for deep learning applications running on leadership HPC systems with a large pool of burst buffers in the form of flash or non-volatile memory (NVM) devices. In this paper, we examine the I/O patterns of deep neural networks and reveal their critical need to load many small samples in random order for successful training. We have designed a specialized Deep Learning File System (DLFS) that provides a thin set of APIs. In particular, we design the metadata management of DLFS around an in-memory tree-based sample directory, and its file services around the user-level SPDK protocol, which can disaggregate the capabilities of NVM Express (NVMe) devices to parallel training tasks. Our experimental results show that DLFS can dramatically improve the throughput of training for deep neural networks on NVMe over Fabrics, compared with the kernel-based Ext4 file system. Furthermore, DLFS achieves efficient user-level storage disaggregation with very little CPU utilization.
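
To make the idea of an in-memory tree-based sample directory concrete, the following is a minimal Python sketch and not the DLFS implementation. It assumes that samples are packed back to back on a device and that each leaf of the tree records a sample's offset and length; the names `Extent`, `Node`, `SampleDirectory`, `pack`, `read_sample`, and `shuffled_epoch` are hypothetical, and a plain `bytes` buffer stands in for an NVMe device reached through SPDK.

```python
import random
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Extent:
    """Location of one sample on the simulated device (hypothetical layout)."""
    offset: int
    length: int


@dataclass
class Node:
    """One node of the directory tree; only leaves carry an Extent."""
    children: Dict[str, "Node"] = field(default_factory=dict)
    extent: Optional[Extent] = None


class SampleDirectory:
    """In-memory tree keyed by path components, e.g. 'train/cat/0.bin'."""

    def __init__(self) -> None:
        self.root = Node()

    def insert(self, path: str, extent: Extent) -> None:
        node = self.root
        for part in path.split("/"):
            node = node.children.setdefault(part, Node())
        node.extent = extent

    def lookup(self, path: str) -> Extent:
        node = self.root
        for part in path.split("/"):
            node = node.children[part]  # raises KeyError for unknown paths
        if node.extent is None:
            raise KeyError(path)
        return node.extent

    def paths(self) -> List[str]:
        """Return all sample paths in sorted order."""
        found: List[str] = []
        stack: List[Tuple[str, Node]] = [("", self.root)]
        while stack:
            prefix, node = stack.pop()
            if node.extent is not None:
                found.append(prefix)
            for name, child in node.children.items():
                stack.append((f"{prefix}/{name}" if prefix else name, child))
        return sorted(found)


def pack(samples: Dict[str, bytes]) -> Tuple[SampleDirectory, bytes]:
    """Pack samples back to back in one buffer and index them in the tree."""
    directory = SampleDirectory()
    blob = bytearray()
    for path, data in samples.items():
        directory.insert(path, Extent(len(blob), len(data)))
        blob += data
    return directory, bytes(blob)


def read_sample(directory: SampleDirectory, device: bytes, path: str) -> bytes:
    """Resolve a sample through the directory, then read its byte range."""
    ext = directory.lookup(path)
    return device[ext.offset:ext.offset + ext.length]


def shuffled_epoch(directory: SampleDirectory, seed: int) -> List[str]:
    """One epoch's access order: every sample once, in random order."""
    order = directory.paths()
    random.Random(seed).shuffle(order)
    return order
```

In this sketch, a training task would call `shuffled_epoch` once per epoch and `read_sample` for each path, so every small random read costs one in-memory tree walk plus one byte-range read rather than a kernel file-system lookup.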

