Title: On Efficient Constructions of Checkpoints
Efficient construction of checkpoints/snapshots is a critical tool for training and diagnosing deep learning models. In this paper, we propose a lossy compression scheme for checkpoint constructions (called LC-Checkpoint). LC-Checkpoint simultaneously maximizes the compression rate and optimizes the recovery speed, under the assumption that SGD is used to train the model. LC-Checkpoint uses quantization and priority promotion to store only the information most crucial for SGD to recover, and then uses Huffman coding to leverage the non-uniform distribution of the gradient scales. Our extensive experiments show that LC-Checkpoint achieves a compression rate of up to 28× and a recovery speedup of up to 5.77× over a state-of-the-art algorithm (SCAR).
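The abstract gives the ingredients but no implementation; the sketch below illustrates the general recipe (checkpoint the delta between successive models, bucket entries by exponent, keep only the highest-magnitude buckets, and entropy-code the bucket indices). The function names and the specific bucketing and promotion rule are assumptions for illustration, not the authors' LC-Checkpoint code.

```python
import heapq
from collections import Counter

import numpy as np


def exponent_bucket(x):
    """Map each entry to the exponent of its magnitude; zeros get a sentinel bucket."""
    e = np.full(x.shape, np.iinfo(np.int32).min, dtype=np.int64)
    nz = x != 0
    e[nz] = np.floor(np.log2(np.abs(x[nz]))).astype(np.int64)
    return e


def huffman_bits(symbols):
    """Total bits needed to Huffman-code the symbol stream (code lengths only)."""
    freq = Counter(symbols)
    if len(freq) == 1:                       # degenerate single-symbol case
        return len(symbols)
    heap = [[f, [s, 0]] for s, f in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        lo, hi = heapq.heappop(heap), heapq.heappop(heap)
        for pair in lo[1:] + hi[1:]:
            pair[1] += 1                     # one level deeper in the code tree
        heapq.heappush(heap, [lo[0] + hi[0]] + lo[1:] + hi[1:])
    lengths = dict(heap[0][1:])
    return sum(lengths[s] * f for s, f in freq.items())


def compress_delta(delta, keep_buckets=4):
    """Lossy-compress a checkpoint delta (current minus previous weights)."""
    buckets = exponent_bucket(delta)
    # 'Priority promotion' (simplified): keep the largest-magnitude exponent
    # buckets exactly, collapse all smaller entries into the zero bucket.
    kept = np.unique(buckets)[-keep_buckets:]
    quantized = np.where(np.isin(buckets, kept), buckets, np.iinfo(np.int32).min)
    return quantized, huffman_bits(quantized.tolist())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    delta = rng.standard_normal(100_000) * rng.exponential(0.01, 100_000)
    _, bits = compress_delta(delta)
    print(f"compression rate vs. float32: {delta.size * 32 / bits:.1f}x")
```

Decompression would map each kept bucket index back to a representative value (for example, the bucket's mean) and apply the reconstructed delta to the previous checkpoint; the sketch only measures the encoded size.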
Award ID(s): 1835821
NSF-PAR ID: 10212765
Author(s) / Creator(s):
Date Published:
Journal Name: Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this
  1. Optimal function in the brain, especially in the hippocampus (an area involved in learning and memory), requires tight regulation of intracellular pH (pHi) within neurons and neuroglia. The Na-H exchangers (NHEs) are the major family of acid/base proteins involved in regulating pHi in the absence of CO2/HCO3−. In the present study, we used the pH-sensitive dye BCECF to examine the regulation of steady-state pHi and the recovery of pHi from NH4+-induced intracellular acid loads in HC neurons and astrocytes, co-cultured from embryonic (E18-20) Sprague Dawley rats and studied in CO2/HCO3−-free HEPES-buffered ("HEPES") solutions. After at least 14 days in a CO2/HCO3− incubator, cells were removed, loaded with BCECF, and placed in a recording chamber with flowing HEPES. At the beginning of each experiment, we measured pHi (checkpoint A) and, after allowing pHi to stabilize for 5 minutes (checkpoint C), reported the mean "initial pHi"/SEM for neurons as 7.351/0.0597, N=37 (astrocytes: 7.189/0.0118, N=25); this is the value at checkpoint C, (pHi)C. After using the paired ("twin") NH4+-pulse protocol to acid-load cells, we find that, after the pHi recovery from the first acid load, the average neuronal steady-state pHi (now at checkpoint E; (pHi)E) is 6.953/0.0601 (astrocytes: 7.037/0.0081). After the second NH4+ pulse, the neuronal steady-state pHi (now at checkpoint F; (pHi)F) is 6.937/0.010 (astrocytes: 7.020/0.0062). The recovery from acidosis is fit with a double exponential (DExp), which we replot as dpHi/dt vs. pHi. With this traditional approach, the dpHi/dt fit becomes slightly non-linear as it approaches the asymptotic pHi. To exploit the mainly linear portion of the dpHi/dt vs. pHi plot from the DExp fit, we fit these dpHi/dt vs. pHi points with a quasi-single exponential (SExp) to produce a quasi-single-exponential rate constant (kqSExp). This analysis, when transformed back to the pHi vs. time domain, generally produces a very good fit to the original pHi vs. time data. The mean kqSExp1 in neurons is 0.0054/0.0008 (astrocytes: 0.0107/0.0002), whereas the mean kqSExp2 in neurons is 0.0055/0.0008 (astrocytes: 0.0010/0.0003). We summarize the twin pHi recoveries from individual experiments by displaying, as thumbnails, the quasi-single-exponential dpHi/dt line segments that represent the pHi recoveries from the first and second NH3/NH4+ pulses. These new analytical approaches may ultimately provide mechanistic insight into cell-to-cell heterogeneity of pHi regulation in the nervous system.

     
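The fitting procedure described above (a double-exponential fit to the pHi recovery, replotted as dpHi/dt vs. pHi and reduced to a quasi-single-exponential rate constant) can be illustrated with generic scientific-Python tools. The sketch below runs on synthetic data with assumed parameter values; it is not the authors' analysis code.

```python
import numpy as np
from scipy.optimize import curve_fit


def double_exp(t, a1, k1, a2, k2, ph_inf):
    # pHi recovery modeled as the sum of two exponentials approaching ph_inf.
    return ph_inf - a1 * np.exp(-k1 * t) - a2 * np.exp(-k2 * t)


# Synthetic recovery from an acid load (illustrative values only).
t = np.linspace(0, 600, 300)                              # seconds
phi = double_exp(t, 0.25, 0.02, 0.15, 0.004, 7.0)
phi = phi + np.random.default_rng(1).normal(0, 0.005, t.size)

popt, _ = curve_fit(double_exp, t, phi, p0=[0.2, 0.01, 0.1, 0.001, 7.0])
ph_inf = popt[-1]

# Re-express the DExp fit as dpHi/dt vs. pHi and fit the (mostly linear)
# portion with a line; the negated slope is a quasi-single-exponential rate.
fit = double_exp(t, *popt)
dphi_dt = np.gradient(fit, t)
mask = fit < ph_inf - 0.02                                # avoid the asymptote
k_qsexp = -np.polyfit(fit[mask], dphi_dt[mask], 1)[0]
print(f"quasi-single-exponential rate constant: {k_qsexp:.4f} 1/s")
```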
  2. In this paper, we propose and analyze SPARQ-SGD, an event-triggered and compressed algorithm for decentralized training of large-scale machine learning models over a graph. Each node can locally compute a condition (event) that triggers a communication in which quantized and sparsified local model parameters are sent. In SPARQ-SGD, each node first takes a fixed number of local gradient steps and then checks whether the model parameters have changed significantly compared to its last update; it communicates further compressed model parameters only when there is a significant change, as specified by a (design) criterion. We prove that SPARQ-SGD converges as O(1/nT) and O(1/√nT) in the strongly convex and non-convex settings, respectively, demonstrating that aggressive compression, including event-triggered communication, model sparsification, and quantization, does not affect the overall convergence rate compared to uncompressed decentralized training, thereby theoretically yielding communication efficiency for "free". We evaluate SPARQ-SGD over real datasets to demonstrate significant savings in communication bits over the state of the art.
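A minimal sketch of the event-triggered, compressed communication rule described above follows. The function names (`compress`, `should_communicate`, `local_round`), the top-k plus uniform quantization compressor, and the norm-based trigger threshold are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def compress(x, k_frac=0.01, levels=16):
    """Sparsify to the top-k entries by magnitude, then uniformly quantize them."""
    k = max(1, int(k_frac * x.size))
    idx = np.argpartition(np.abs(x), -k)[-k:]
    vals = x[idx]
    scale = float(np.abs(vals).max()) or 1.0
    quantized = np.round(vals / scale * (levels - 1)) * scale / (levels - 1)
    out = np.zeros_like(x)
    out[idx] = quantized
    return out


def should_communicate(x, x_last_sent, threshold):
    # Event trigger: communicate only if the local model has moved far enough
    # since the last transmission.
    return np.linalg.norm(x - x_last_sent) > threshold


def local_round(x, x_last_sent, grad, lr=0.1, local_steps=5, threshold=1e-2):
    """One round at a single node: local SGD steps, then a triggered, compressed send."""
    for _ in range(local_steps):
        x = x - lr * grad(x)
    if should_communicate(x, x_last_sent, threshold):
        return x, compress(x - x_last_sent), True   # send the compressed change
    return x, None, False                           # event not triggered: stay silent
```

A full decentralized run would additionally mix the received compressed updates with neighbors' models according to the graph's mixing matrix, which this sketch omits.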
  3. This paper addresses the real-time encoding-decoding problem for high-frame-rate video compressive sensing (CS). Unlike prior works that perform reconstruction using iterative optimization-based approaches, we propose a non-iterative model, named "CSVideoNet", which directly learns the inverse mapping of CS and reconstructs the original input in a single forward propagation. To overcome the limitations of existing CS cameras, we propose a multi-rate CNN and a synthesizing RNN to improve the trade-off between compression ratio (CR) and the spatial-temporal resolution of the reconstructed videos. The experimental results demonstrate that CSVideoNet significantly outperforms state-of-the-art approaches. Without any pre/post-processing, we achieve a 25 dB peak signal-to-noise ratio (PSNR) recovery quality at a 100x CR, with a frame rate of 125 fps on a Titan X GPU. Due to the feedforward and high-data-concurrency nature of CSVideoNet, it can take advantage of GPU acceleration to achieve a three-orders-of-magnitude speedup over conventional iterative approaches. We share the source code at https://github.com/PSCLab-ASU/CSVideoNet.
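The architecture is only described at a high level above; a toy PyTorch sketch of that shape (a per-frame CNN mapping measurements to a coarse image, followed by an RNN that synthesizes temporal structure) is given below. The class name, layer sizes, and measurement count are illustrative assumptions; the actual model lives in the linked repository.

```python
import torch
import torch.nn as nn


class ToyCSVideoDecoder(nn.Module):
    """Illustrative non-iterative CS video decoder: per-frame CNN + temporal RNN."""

    def __init__(self, n_measurements, frame_hw=32, hidden=512):
        super().__init__()
        self.frame_hw = frame_hw
        # Map compressed measurements of each frame to a coarse image.
        self.fc = nn.Linear(n_measurements, frame_hw * frame_hw)
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 3, padding=1),
        )
        # The RNN fuses information across frames before the final reconstruction.
        self.rnn = nn.LSTM(frame_hw * frame_hw, hidden, batch_first=True)
        self.out = nn.Linear(hidden, frame_hw * frame_hw)

    def forward(self, y):                      # y: (batch, frames, n_measurements)
        b, f, _ = y.shape
        x = self.fc(y).view(b * f, 1, self.frame_hw, self.frame_hw)
        x = self.cnn(x).view(b, f, -1)
        h, _ = self.rnn(x)
        return self.out(h).view(b, f, self.frame_hw, self.frame_hw)


# One forward pass reconstructs all frames, with no iterative optimization.
video = ToyCSVideoDecoder(n_measurements=102)(torch.randn(2, 16, 102))
```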
  4. Pregel-like systems are popular for iterative graph processing thanks to their user-friendly vertex-centric programming model. However, existing Pregel-like systems adopt only a naïve checkpointing approach for fault tolerance, which saves a large amount of data about the state of computation and significantly degrades failure-free execution performance. Advanced fault tolerance/recovery techniques are left unexplored in the context of Pregel-like systems. This paper proposes a non-invasive lightweight checkpointing (LWCP) scheme that minimizes the data saved to each checkpoint; the additional data required for recovery are generated online from the saved data. This improvement results in a 10x speedup in checkpointing, and integrating it with a recently proposed log-based recovery approach can further speed up recovery when failures occur. Extensive experiments verified that our proposed LWCP techniques significantly improve the performance of both checkpointing and recovery in a Pregel-like system.
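The key idea (persist only minimal state and regenerate the rest online during recovery) can be sketched as follows. The class, the method names, and the choice of what counts as "minimal state" (vertex values plus the superstep number) are assumptions for illustration, not the paper's system.

```python
import pickle


class LightweightCheckpoint:
    """Illustrative minimal checkpoint for a Pregel-like engine.

    Only vertex values and the superstep number are persisted; derived state
    such as outgoing messages is regenerated on recovery instead of stored.
    """

    def save(self, path, superstep, vertex_values):
        # Write only the minimal state needed to restart the computation.
        with open(path, "wb") as f:
            pickle.dump({"superstep": superstep, "values": vertex_values}, f)

    def recover(self, path, compute):
        with open(path, "rb") as f:
            state = pickle.load(f)
        # Regenerate outgoing messages online from the saved vertex values by
        # re-running the user compute function for the checkpointed superstep.
        messages = {v: compute(v, val) for v, val in state["values"].items()}
        return state["superstep"], state["values"], messages
```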
  5. Exascale computing must simultaneously address both energy efficiency and resilience as power limits impact scalability and faults are more common. Unfortunately, energy efficiency and resilience have been traditionally studied in isolation and optimizing one typically detrimentally impacts the other. To deliver the promised performance within the given power budget, exascale computing mandates a deep understanding of the interplay among energy efficiency, resilience, and scalability. In this work, we propose novel methods to analyze and optimize costs of resilience techniques including checkpoint-restart and forward recovery for large sparse linear system solvers. In particular, we present experimental and analytical methods to analyze and quantify the time and energy costs of recovery schemes on computer clusters. We further develop and prototype performance optimization and power management strategies to improve energy efficiency. Experimental results show that recovery schemes incur different time and energy overheads and optimization techniques significantly reduce such overheads. This work suggests that resilience techniques should be adaptively adjusted to a given fault rate, system size, and power budget. 
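As an illustrative companion to this kind of cost analysis, the sketch below evaluates the classical Young/Daly checkpoint interval together with a first-order time and energy overhead model. This is a generic textbook model under assumed parameter values, not the cost model developed in the paper.

```python
import math


def young_daly_interval(checkpoint_cost_s, mtbf_s):
    # First-order optimal checkpoint interval: sqrt(2 * C * MTBF).
    return math.sqrt(2 * checkpoint_cost_s * mtbf_s)


def overheads(work_s, interval_s, checkpoint_cost_s, mtbf_s, power_w):
    # Expected time overhead from writing checkpoints plus re-executed work
    # after failures (roughly half an interval lost per failure, on average).
    n_checkpoints = work_s / interval_s
    n_failures = work_s / mtbf_s
    extra_time_s = n_checkpoints * checkpoint_cost_s + n_failures * interval_s / 2
    extra_energy_j = extra_time_s * power_w
    return extra_time_s, extra_energy_j


if __name__ == "__main__":
    tau = young_daly_interval(checkpoint_cost_s=60, mtbf_s=6 * 3600)
    t, e = overheads(work_s=24 * 3600, interval_s=tau,
                     checkpoint_cost_s=60, mtbf_s=6 * 3600, power_w=2000)
    print(f"interval ≈ {tau / 60:.1f} min, "
          f"overhead ≈ {t / 3600:.2f} h, {e / 3.6e6:.1f} kWh")
```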