Title: CheckFreq: Frequent, Fine-Grained DNN Checkpointing
Training Deep Neural Networks (DNNs) is a resource-hungry and time-consuming task. During training, the model performs computation at the GPU to learn weights, repeatedly, over several epochs. The learned weights reside in GPU memory, and are occasionally checkpointed (written to persistent storage) for fault-tolerance. Traditionally, model parameters are checkpointed at epoch boundaries; for modern deep networks, an epoch runs for several hours. An interruption to the training job due to preemption, node failure, or process failure therefore results in the loss of several hours' worth of GPU work on recovery. We present CheckFreq, an automatic, fine-grained checkpointing framework that (1) algorithmically determines the checkpointing frequency at the granularity of iterations using systematic online profiling, (2) dynamically tunes the checkpointing frequency at runtime to bound the checkpointing overhead using adaptive rate tuning, (3) maintains the training data invariant of using each item in the dataset exactly once per epoch by checkpointing data loader state using a lightweight resumable iterator, and (4) carefully pipelines checkpointing with computation to reduce the checkpoint cost by introducing two-phase checkpointing.  Our experiments on a variety of models, storage backends, and GPU generations show that CheckFreq can reduce the recovery time from hours to seconds while bounding the runtime overhead within 3.5%.
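The two-phase design splits a checkpoint into a brief in-memory snapshot and a background persist. Below is a minimal sketch of that idea in PyTorch-style Python; the helper names, the fixed checkpoint interval k, and the file path are illustrative assumptions, not CheckFreq's actual API.

```python
import os
import threading
import torch

def to_cpu(obj):
    # Recursively copy any tensors in a (possibly nested) state dict to CPU
    # memory, so the persist phase never touches GPU state.
    if torch.is_tensor(obj):
        return obj.detach().cpu().clone()
    if isinstance(obj, dict):
        return {k: to_cpu(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return type(obj)(to_cpu(v) for v in obj)
    return obj

def snapshot(model, optimizer, iteration):
    # Phase 1 (snapshot): a short stall while GPU state is copied into CPU
    # memory; training resumes before anything is written to disk.
    return {"iteration": iteration,
            "model": to_cpu(model.state_dict()),
            "optimizer": to_cpu(optimizer.state_dict())}

def persist_async(cpu_state, path):
    # Phase 2 (persist): write the CPU-resident snapshot in the background,
    # overlapping with subsequent training iterations. Write-then-rename keeps
    # the previous checkpoint intact if the job dies mid-write.
    def _write():
        torch.save(cpu_state, path + ".tmp")
        os.replace(path + ".tmp", path)
    writer = threading.Thread(target=_write, daemon=True)
    writer.start()
    return writer

# Illustrative use inside a training loop, checkpointing every k iterations
# (CheckFreq derives k by profiling; here it is just a constant):
#
#   if it % k == 0:
#       persist_async(snapshot(model, optimizer, it), "checkpoint.pt")
```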
Award ID(s):
1751277
NSF-PAR ID:
10286595
Editor(s):
Aguilera, Marcos; Yadgar, Gala
Journal Name:
Proceedings of the USENIX Conference on File and Storage Technologies (FAST 2021)
Sponsoring Org:
National Science Foundation
More Like this
  1. We consider the problem of orchestrating the execution of workflow applications structured as Directed Acyclic Graphs (DAGs) on parallel computing platforms that are subject to fail-stop failures. The objective is to minimize expected overall execution time, or makespan. A solution to this problem consists of a schedule of the workflow tasks on the available processors and of a decision of which application data to checkpoint to stable storage, so as to mitigate the impact of processor failures. For general DAGs this problem is hopelessly intractable. In fact, given a solution, computing its expected makespan is still a difficult problem. To address this challenge, we consider a restricted class of graphs, Minimal Series-Parallel Graphs (M-SPGs). It turns out that many real-world workflow applications are naturally structured as M-SPGs. For this class of graphs, we propose a recursive list-scheduling algorithm that exploits the M-SPG structure to assign sub-graphs to individual processors, and uses dynamic programming to decide which tasks in these sub-graphs should be checkpointed. Furthermore, it is possible to efficiently compute the expected makespan for the solution produced by this algorithm, using a first-order approximation of task weights and existing evaluation algorithms for 2-state probabilistic DAGs. We assess the performance of our algorithm for production workflow configurations, comparing it to (i) an approach in which all application data is checkpointed, which corresponds to the standard way in which most production workflows are executed today; and (ii) an approach in which no application data is checkpointed. Our results demonstrate that our algorithm strikes a good compromise between these two approaches, leading to lower checkpointing overhead than the former and to better resilience to failures than the latter.
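    The paper's dynamic program operates on M-SPGs; as a much simpler illustration of the checkpoint-or-not decision, the sketch below places checkpoints on a plain chain of tasks under exponentially distributed fail-stop failures, using the classic expected-time formula for a work segment followed by a checkpoint. The failure model, parameter values, and function names are assumptions for illustration, not the paper's evaluation method.

```python
import math
from itertools import accumulate

def expected_segment_time(work, ckpt_cost, recovery, rate):
    # Expected wall-clock time to finish `work` seconds of computation plus a
    # checkpoint of `ckpt_cost` seconds, with failures arriving at `rate` per
    # second and a restart costing `recovery` seconds (classic exponential
    # failure model; an assumption of this sketch).
    return math.exp(rate * recovery) * (math.exp(rate * (work + ckpt_cost)) - 1.0) / rate

def place_checkpoints(durations, ckpt_cost, recovery, rate):
    # dp[j]: best expected time to complete tasks 1..j with a checkpoint
    # taken right after task j (a checkpoint is forced after the last task).
    n = len(durations)
    prefix = [0.0] + list(accumulate(durations))
    dp = [math.inf] * (n + 1)
    prev = [0] * (n + 1)
    dp[0] = 0.0
    for j in range(1, n + 1):
        for i in range(j):
            work = prefix[j] - prefix[i]
            cand = dp[i] + expected_segment_time(work, ckpt_cost, recovery, rate)
            if cand < dp[j]:
                dp[j], prev[j] = cand, i
    cuts, j = [], n
    while j > 0:  # recover which tasks are followed by a checkpoint
        cuts.append(j)
        j = prev[j]
    return dp[n], sorted(cuts)

# Illustrative numbers: eight 1-hour tasks, 5-minute checkpoints, 10-minute
# recovery, and on average one failure per day.
print(place_checkpoints([3600.0] * 8, 300.0, 600.0, 1.0 / 86400.0))
```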
  2. Deep neural networks (DNNs) are increasingly used for real-time inference, which requires low latency, but they also demand significant computational power as they continue to grow in complexity. Edge clouds promise lower latency thanks to their proximity to end-users and to powerful accelerators such as GPUs that provide the computational power DNNs need. But it is also important to ensure that edge-cloud resources are utilized well. For this, multiplexing several DNN models through spatial sharing of the GPU can substantially improve edge-cloud resource usage. Typical GPU runtime environments involve significant interactions with the CPU, e.g., to transfer data to the GPU and to synchronize the CPU and GPU on inference task completions, and these interactions add overhead. We present a DNN inference framework with a set of software primitives that reduce the overhead of DNN inference, increase GPU utilization, and improve performance, with lower latency and higher throughput. Our first primitive uses the GPU DMA effectively, reducing the CPU cycles spent transferring data to the GPU. A second primitive uses asynchronous ‘events’ for faster task-completion notification. GPU runtimes typically preclude fine-grained user control over GPU resources, causing long GPU downtimes when adjusting resources. Our third primitive supports overlapping of model loading and execution, thus allowing GPU resource re-allocation with very little GPU idle time. Our other primitives increase inference throughput by improving scheduling and processing more requests. Overall, our primitives decrease inference latency by more than 35% and increase DNN throughput by 2-3×.
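    As a rough illustration of the first two ideas (DMA-friendly transfers and event-based completion notification), the sketch below overlaps copying the next batch to the GPU with inference on the current one, using pinned host memory, a separate CUDA stream, and a CUDA event. This is a generic PyTorch pattern, not the paper's actual primitives; the model, batch shapes, and loop length are made up.

```python
import torch

assert torch.cuda.is_available()
device = torch.device("cuda")
model = torch.nn.Linear(1024, 10).to(device).eval()
copy_stream = torch.cuda.Stream()

def make_batch():
    # Pinned (page-locked) host memory lets the GPU copy engine DMA the
    # batch asynchronously, sparing CPU cycles.
    return torch.randn(64, 1024, pin_memory=True)

with torch.no_grad():
    with torch.cuda.stream(copy_stream):
        next_dev = make_batch().to(device, non_blocking=True)

    for _ in range(10):
        # Wait only for the copy stream, not the whole device.
        torch.cuda.current_stream().wait_stream(copy_stream)
        current = next_dev
        current.record_stream(torch.cuda.current_stream())

        # Kick off the next transfer while the GPU computes on `current`.
        with torch.cuda.stream(copy_stream):
            next_dev = make_batch().to(device, non_blocking=True)

        out = model(current)

        # Event-based completion notification: record an event after the
        # inference kernels and wait on (or poll) just that event, instead
        # of a full torch.cuda.synchronize().
        done = torch.cuda.Event()
        done.record()
        done.synchronize()
```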
  3. Unified Virtual Memory (UVM) was recently introduced with CUDA version 8 and the Pascal GPU. The older CUDA programming style is akin to older large-memory UNIX applications, which used to directly load and unload memory segments. Newer CUDA programs have started taking advantage of UVM for the same reasons of superior programmability that led UNIX applications long ago to switch to assuming the presence of virtual memory. Therefore, checkpointing of UVM has become increasingly important, especially as NVIDIA CUDA continues to gain wider popularity: 87 of the top 500 supercomputers in the latest listings use NVIDIA GPUs, with a current trend of ten additional NVIDIA-based supercomputers each year. A new scalable checkpointing mechanism, CRUM (Checkpoint-Restart for Unified Memory), is demonstrated for hybrid CUDA/MPI computations across multiple compute nodes. The support for UVM is particularly attractive for programs requiring more memory than resides on the GPU, since the alternative to UVM is for the application to directly copy memory between device and host. Furthermore, CRUM supports fast, forked checkpointing, which mostly overlaps the CUDA computation with storage of the checkpoint image in stable storage. The runtime overhead of using CRUM is 6% on average, and the time for forked checkpointing is seen to be up to 40 times less than that of traditional, synchronous checkpointing.
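    The core of forked checkpointing, snapshotting the process with fork() so the child writes the image while the parent keeps computing, can be sketched in plain Python as below. This shows only the operating-system half of the idea (it is POSIX-only and ignores the CUDA-context complications that CRUM itself must work around); the state contents and file names are illustrative.

```python
import os
import pickle

def forked_checkpoint(state, path):
    # Fork the process: the child inherits a copy-on-write snapshot of the
    # parent's memory, so the data it writes is consistent as of the fork,
    # while the parent continues computing immediately.
    pid = os.fork()
    if pid == 0:
        # Child: persist the snapshot atomically, then exit without running
        # any of the parent's cleanup code.
        try:
            with open(path + ".tmp", "wb") as f:
                pickle.dump(state, f)
            os.replace(path + ".tmp", path)
        finally:
            os._exit(0)
    return pid  # Parent: reap the child later with os.waitpid().

# Illustrative usage: checkpoint, keep computing, then reap the child.
state = {"step": 1000, "weights": [0.1, 0.2, 0.3]}
child = forked_checkpoint(state, "model.ckpt")
# ... continue computation here, overlapping with the checkpoint write ...
os.waitpid(child, 0)
```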
  4. Optimizing the performance of Distributed Deep Neural Network (DDNN) training has recently become increasingly compelling, as DNN models grow complex and training datasets grow large. While existing work on communication scheduling mostly focuses on overlapping computation and communication to improve DDNN training performance, GPU and network resources remain under-utilized in DDNN training clusters. To tackle this issue, in this paper we design and implement a predictable communication scheduling strategy named Prophet that schedules gradient transfers in an adequate order, with the aim of maximizing GPU and network resource utilization. Leveraging our observed stepwise pattern of gradient-transfer start times, Prophet first uses the monitored network bandwidth and the profiled time intervals among gradients to predict the appropriate number of gradients that can be grouped into blocks. Then, these gradient blocks are transferred one by one to guarantee high utilization of GPU and network resources while ensuring the priority of gradient transfers (i.e., low-priority gradients cannot preempt high-priority gradients in the network transfer). Prophet makes forward propagation start as early as possible so as to greedily reduce the waiting (idle) time of GPU resources during the DDNN training process. Prototype experiments with representative DNN models trained on Amazon EC2 demonstrate that Prophet can improve DDNN training performance by up to 40% compared with state-of-the-art priority-based communication scheduling strategies, with negligible runtime performance overhead.
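    A toy version of the grouping step, packing consecutive gradients into blocks whose predicted transfer time fits the profiled gap before the next gradient becomes ready, might look like the sketch below. It is a simplified heuristic in the spirit of the description above, not Prophet's actual prediction algorithm; sizes, ready times, and bandwidth are made-up numbers.

```python
def group_gradients(grad_sizes_bytes, ready_times_s, bandwidth_bytes_per_s):
    # Greedily pack consecutive gradients into transfer blocks. If the current
    # block would finish transferring before the next gradient is ready, close
    # it (the network would otherwise go idle) and start a new block;
    # otherwise keep appending so the block keeps the link busy.
    blocks, current, send_start = [], [], None
    for size, ready in zip(grad_sizes_bytes, ready_times_s):
        if not current:
            current, send_start = [size], ready
            continue
        transfer = sum(current) / bandwidth_bytes_per_s
        if send_start + transfer <= ready:
            blocks.append(current)
            current, send_start = [size], ready
        else:
            current.append(size)
    if current:
        blocks.append(current)
    return blocks

# Example with illustrative numbers: five gradients on a 1 GB/s network.
print(group_gradients([4e6, 4e6, 16e6, 16e6, 64e6],
                      [0.00, 0.01, 0.02, 0.05, 0.09], 1e9))
```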
  5. Model compression is an important technique to facilitate efficient embedded and hardware implementations of deep neural networks (DNNs), and a number of prior works are dedicated to model compression techniques. The target is to simultaneously reduce the model storage size and accelerate the computation, with minor effect on accuracy. Two important categories of DNN model compression techniques are weight pruning and weight quantization. The former leverages the redundancy in the number of weights, whereas the latter leverages the redundancy in the bit representation of weights. These two sources of redundancy can be combined, leading to a higher degree of DNN model compression. However, a systematic framework for joint weight pruning and quantization of DNNs is lacking, which limits the achievable model compression ratio. Moreover, computation reduction, energy efficiency improvement, and hardware performance overhead need to be accounted for beyond model size reduction alone; in particular, the hardware performance overhead resulting from weight pruning needs to be taken into consideration. To address these limitations, we present ADMM-NN, the first algorithm-hardware co-optimization framework for DNNs using the Alternating Direction Method of Multipliers (ADMM), a powerful technique for solving non-convex optimization problems with possibly combinatorial constraints. The first part of ADMM-NN is a systematic, joint framework for DNN weight pruning and quantization using ADMM. It can be understood as a smart regularization technique whose regularization target is dynamically updated in each ADMM iteration, resulting in higher model compression performance than the state-of-the-art. The second part is a set of hardware-aware DNN optimizations to facilitate hardware-level implementations. We perform ADMM-based weight pruning and quantization considering (i) the computation reduction and energy efficiency improvement, and (ii) the hardware performance overhead due to irregular sparsity. The first consideration prioritizes compressing convolutional layers over fully-connected layers, while the second requires the concept of a break-even pruning ratio, defined as the minimum pruning ratio of a specific layer that results in no hardware performance degradation. Without accuracy loss, ADMM-NN achieves 85× and 24× pruning on the LeNet-5 and AlexNet models, respectively, significantly higher than the state-of-the-art. The improvements become more significant when focusing on computation reduction. Combining weight pruning and quantization, we achieve 1,910× and 231× reductions in overall model size on these two benchmarks when focusing on data storage. Highly promising results are also observed on other representative DNNs such as VGGNet and ResNet-50. We release code and models at https://github.com/yeshaokai/admm-nn.
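    As an illustration of the ADMM pruning component (not the paper's released code), one iteration alternates a gradient step on the loss plus the augmented-Lagrangian quadratic term, a Euclidean projection onto the sparsity constraint (keep the largest-magnitude weights), and a dual update. The toy objective, layer shape, and hyperparameters below are assumptions for illustration.

```python
import numpy as np

def project_sparse(mat, keep_ratio):
    # Euclidean projection onto the sparsity constraint set: keep the
    # largest-magnitude entries, zero out the rest.
    flat = np.abs(mat).ravel()
    k = max(1, int(keep_ratio * flat.size))
    thresh = np.partition(flat, -k)[-k]
    return np.where(np.abs(mat) >= thresh, mat, 0.0)

def admm_prune_step(W, Z, U, loss_grad, lr=0.01, rho=1e-3, keep_ratio=0.1):
    # W-update: gradient step on loss + (rho/2) * ||W - Z + U||^2
    # Z-update: projection of W + U onto the sparsity set
    # U-update: dual ascent on the splitting constraint W = Z
    W = W - lr * (loss_grad(W) + rho * (W - Z + U))
    Z = project_sparse(W + U, keep_ratio)
    U = U + W - Z
    return W, Z, U

# Toy usage: drive a random weight matrix toward a target while pruning.
rng = np.random.default_rng(0)
target = rng.standard_normal((64, 64))
W = rng.standard_normal((64, 64))
Z, U = project_sparse(W, 0.1), np.zeros_like(W)
for _ in range(200):
    W, Z, U = admm_prune_step(W, Z, U, lambda w: w - target)
W_final = project_sparse(W, 0.1)  # hard-prune at the end; retrain in practice
print("nonzero fraction:", np.count_nonzero(W_final) / W_final.size)
```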