Title: Hindsight logging for model training
In modern Machine Learning, model training is an iterative, experimental process that can consume enormous computation resources and developer time. To aid in that process, experienced model developers log and visualize program variables during training runs. Exhaustive logging of all variables is infeasible, so developers are left to choose between slowing down training via extensive conservative logging, or letting training run fast via minimalist optimistic logging that may omit key information. As a compromise, optimistic logging can be accompanied by program checkpoints; this allows developers to add log statements post-hoc, and "replay" desired log statements from checkpoint, a process we refer to as hindsight logging. Unfortunately, hindsight logging raises tricky problems in data management and software engineering. Done poorly, hindsight logging can waste resources and generate technical debt embodied in multiple variants of training code. In this paper, we present methodologies for efficient and effective logging practices for model training, with a focus on techniques for hindsight logging. Our goal is for experienced model developers to learn and adopt these practices. To make this easier, we provide an open-source suite of tools for Fast Low-Overhead Recovery (flor) that embodies our design across three tasks: (i) efficient background logging in Python, (ii) adaptive periodic checkpointing, and (iii) an instrumentation library that codifies hindsight logging for efficient and automatic record-replay of model training. Model developers can use each flor tool separately as they see fit, or they can use flor in hands-free mode, entrusting it to instrument their code end-to-end for efficient record-replay. Our solutions leverage techniques from physiological transaction logs and recovery in database systems. Evaluations on modern ML benchmarks demonstrate that flor can produce fast checkpointing with small user-specifiable overheads (e.g., 7%), and still provide hindsight log replay times orders of magnitude faster than restarting training from scratch.
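To make the record-replay workflow concrete, the following is a minimal, self-contained Python sketch of the hindsight-logging idea on a toy training loop. It is not flor's API: every name here (train_step, record, replay, CKPT_EVERY) is illustrative, the "model" is a small dict rather than GPU state, and a fixed checkpoint period stands in for adaptive periodic checkpointing. The point is only the workflow: checkpoint periodically while training, then run a post-hoc log statement by replaying just the suffix of iterations after the nearest earlier checkpoint.

```python
# Illustrative sketch only (not flor): periodic checkpointing during training,
# then replay of a post-hoc log statement from the nearest earlier checkpoint.
import pickle
import random
from pathlib import Path

CKPT_DIR = Path("ckpts")
CKPT_DIR.mkdir(exist_ok=True)
CKPT_EVERY = 100  # fixed checkpoint period; a real system would tune this adaptively


def train_step(state, step):
    """Toy deterministic stand-in for one optimizer step."""
    rng = random.Random(step)  # seeded per step so record and replay agree
    state["loss"] = state["loss"] * 0.999 + rng.uniform(-0.01, 0.01)
    return state


def record(num_steps):
    """'Record' side: run training with minimalist logging plus checkpoints."""
    state = {"loss": 1.0}
    for step in range(num_steps):
        state = train_step(state, step)
        if step % CKPT_EVERY == 0:
            with open(CKPT_DIR / f"{step}.pkl", "wb") as f:
                pickle.dump(state, f)
    return state


def replay(target_step, hindsight_log):
    """'Replay' side: resume from the latest checkpoint at or before
    target_step and re-execute only the remaining iterations, running a
    log statement that was added after training finished."""
    start = (target_step // CKPT_EVERY) * CKPT_EVERY
    with open(CKPT_DIR / f"{start}.pkl", "rb") as f:
        state = pickle.load(f)
    for step in range(start + 1, target_step + 1):
        state = train_step(state, step)
        hindsight_log(step, state)
    return state


if __name__ == "__main__":
    record(1000)
    # A log statement we wish we had written before training ran:
    replay(457, lambda step, s: print(f"step {step}: loss={s['loss']:.4f}"))
```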
Award ID(s): 1730628
PAR ID: 10219496
Author(s) / Creator(s): ; ; ; ; ; ; ;
Date Published:
Journal Name: Proceedings of the VLDB Endowment
Volume: 14
Issue: 4
ISSN: 2150-8097
Page Range / eLocation ID: 682 to 693
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this
  1. Logging is an important programming practice. Due to the highly transactional nature of modern software applications, massive amounts of logs are generated every day, which may overwhelm developers. Logging information overload can be dangerous to software applications. Using log levels, developers can print useful information while hiding verbose logs during software runtime. As software evolves, the log levels of logging statements associated with the surrounding software feature implementation may also need to be altered. Maintaining log levels necessitates a significant amount of manual effort. In this paper, we demonstrate an automated approach that can rejuvenate feature log levels by matching the interest level of developers in the surrounding features. The approach is implemented as an open-source Eclipse plugin, using two external plugins (JGit and Mylyn). It was tested on 18 open-source Java projects consisting of ~3 million lines of code and ~4K log statements. Our tool successfully analyzes 99.22% of logging statements, increases log level distributions by ~20%, and increases the focus of logs in bug fix contexts ~83% of the time. For further details, interested readers can watch our demonstration video (https://www.youtube.com/watch?v=qIULoAXoDv4).
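The heuristic described in item 1 can be sketched roughly as follows. This is a purely illustrative Python snippet, not the tool itself (which is a Java/Eclipse plugin built on JGit and Mylyn); the level ladder, the interest score, and the thresholds are all assumptions made for the example.

```python
# Rough illustration of "rejuvenating" log levels to match developer interest.
# All names, scores, and thresholds are hypothetical.
import logging

LEVELS = [logging.DEBUG, logging.INFO, logging.WARNING]  # coarse level ladder


def rejuvenate(current_level, interest, hot=0.66, cold=0.33):
    """Promote logs in features developers are actively working on (so they
    stay visible at the default verbosity) and demote logs in features that
    have gone cold. `interest` is a 0..1 score, e.g. derived from recent
    commits or task contexts touching the enclosing method."""
    i = LEVELS.index(current_level)
    if interest >= hot and i < len(LEVELS) - 1:
        return LEVELS[i + 1]
    if interest <= cold and i > 0:
        return LEVELS[i - 1]
    return current_level


# Hypothetical logging statements with an interest score for their feature:
for stmt, score in {"parser.trace": 0.9, "cache.evict": 0.1, "auth.audit": 0.5}.items():
    new_level = rejuvenate(logging.INFO, score)
    print(f"{stmt}: INFO -> {logging.getLevelName(new_level)}")
```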
  2. Aguilera, Marcos; Yadgar, Gala (Ed.)
    Training Deep Neural Networks (DNNs) is a resource-hungry and time-consuming task. During training, the model repeatedly performs computation on the GPU to learn weights over several epochs. The learned weights reside in GPU memory and are occasionally checkpointed (written to persistent storage) for fault tolerance. Traditionally, model parameters are checkpointed at epoch boundaries; for modern deep networks, an epoch runs for several hours. An interruption to the training job due to preemption, node failure, or process failure therefore results in the loss of several hours' worth of GPU work on recovery. We present CheckFreq, an automatic, fine-grained checkpointing framework that (1) algorithmically determines the checkpointing frequency at the granularity of iterations using systematic online profiling, (2) dynamically tunes checkpointing frequency at runtime to bound the checkpointing overhead using adaptive rate tuning, (3) maintains the training data invariant of using each item in the dataset exactly once per epoch by checkpointing data loader state using a lightweight resumable iterator, and (4) carefully pipelines checkpointing with computation to reduce the checkpoint cost by introducing two-phase checkpointing. Our experiments on a variety of models, storage backends, and GPU generations show that CheckFreq can reduce the recovery time from hours to seconds while bounding the runtime overhead within 3.5%.
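The "two-phase checkpointing" in item 2 can be illustrated with a short sketch: phase one takes a quick in-memory snapshot so the training loop can continue immediately, and phase two persists that snapshot in the background. This is a hypothetical Python sketch, not CheckFreq's code: the snapshot is a deepcopy of a plain dict rather than a GPU-to-CPU copy, and the fixed period of 100 iterations stands in for the frequency that CheckFreq profiles and tunes online.

```python
# Hypothetical two-phase checkpointing sketch: snapshot in memory (phase 1),
# persist asynchronously off the training critical path (phase 2).
import copy
import pickle
import queue
import threading

persist_q = queue.Queue()


def persister():
    """Phase 2: drain in-memory snapshots to stable storage in the background."""
    while True:
        step, snapshot = persist_q.get()
        with open(f"ckpt_{step}.pkl", "wb") as f:
            pickle.dump(snapshot, f)
        persist_q.task_done()


def checkpoint(step, model_state):
    """Phase 1: cheap snapshot (a GPU-to-CPU copy in a real system), then
    return control to the training loop immediately."""
    persist_q.put((step, copy.deepcopy(model_state)))


threading.Thread(target=persister, daemon=True).start()

model = {"w": [0.0] * 1024}
for step in range(1, 501):
    model["w"] = [w + 0.001 for w in model["w"]]  # stand-in for a training iteration
    if step % 100 == 0:  # CheckFreq would instead pick this frequency via profiling
        checkpoint(step, model)
persist_q.join()  # wait for pending background writes before exiting
```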
    When a security vulnerability or other critical bug is not detected by the developers' test suite and is discovered post-deployment, developers must quickly devise a new test that reproduces the buggy behavior. Then the developers need to test whether their candidate patch indeed fixes the bug, without breaking other functionality, while racing to deploy before attackers pounce on exposed user installations. This can be challenging when factors in a specific user environment triggered the bug. If enabled, however, record-replay technology faithfully replays the execution in the developer environment as if the program were executing in that user environment under the same conditions as when the bug manifested. This includes intermediate program states dependent on system calls, memory layout, etc., as well as any externally visible behavior. Many modern record-replay tools integrate interactive debuggers to help locate the root cause, but they don't help developers test whether their patch indeed eliminates the bug under those same conditions. In particular, modern record-replay tools that reproduce intermediate program state cannot replay recordings made with one version of a program using a different version of the program where the differences affect program state. This work builds on record-replay and binary rewriting to automatically generate and run targeted tests for candidate patches significantly faster and more efficiently than traditional test suite generation techniques like symbolic execution. These tests reflect the arbitrary (ad hoc) user and system circumstances that uncovered the bug, enabling developers to check whether a patch indeed fixes that bug. The tests essentially replay recordings made with one version of a program using a different version of the program, even when the differences impact program state, by manipulating both the binary executable and the recorded log to produce an execution consistent with what would have happened had the patched version executed in the user environment under the same conditions where the bug manifested with the original version. Our approach also enables users to make new recordings of their own workloads with the original version of the program, and to automatically generate and run the corresponding ad hoc tests on the patched version, to validate that the patch does not break functionality they rely on.
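As a toy illustration of the workflow in item 3 (and nothing like the binary-rewriting system it describes), the sketch below records environment-dependent inputs that trigger a bug and then replays the same recording against a candidate patch, so the fix is exercised under the conditions that originally exposed the bug. All functions and file names are hypothetical.

```python
# Toy record-replay for patch testing: capture the inputs that triggered a bug,
# then replay the identical recording against the patched code.
import json


def record(log="recording.json"):
    # In a real system this would be captured system-call results, memory
    # layout, etc.; here we just pretend these values arrived from the user's
    # environment during the failing run.
    observed_inputs = ["ok", "retry", ""]  # the empty string triggers the bug
    with open(log, "w") as f:
        json.dump(observed_inputs, f)


def handler_v1(value):  # deployed version: crashes on empty input
    return value.upper()[0]


def handler_v2(value):  # candidate patch
    return value.upper()[0] if value else "?"


def replay(handler, log="recording.json"):
    with open(log) as f:
        for value in json.load(f):  # the exact inputs the user hit
            handler(value)


record()
try:
    replay(handler_v1)
except IndexError as e:
    print("bug reproduced under replay:", repr(e))  # the ad hoc test fails on v1
replay(handler_v2)  # ...and the same recording passes on the patched version
print("patched version survives the same recording")
```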
    Modern kernels are large, complex, and plagued with bugs. Unfortunately, their large size and complexity make kernel failures very challenging for developers to diagnose, since failures encountered in deployment are often notoriously difficult to reproduce. Although record-replay techniques provide the powerful ability to accurately record a failed execution and deterministically replay it, enabling advanced manual and automated analysis techniques, they are inefficient and do not scale with modern I/O-intensive, concurrent workloads. This paper introduces KRR, a kernel record-replay framework that provides a highly efficient execution-recording mechanism by narrowing the scope of the record and replay boundary to the kernel. Unlike previous whole-stack record-replay approaches, KRR adopts a split-recorder design that employs the guest and the host to jointly record the kernel execution. Our evaluation demonstrates that KRR scales efficiently up to 8 cores across a range of different workloads, including kernel compilation, RocksDB, and Nginx. When recording 8-core VMs that run RocksDB and kernel compilation, KRR incurs only a 1.52× to 2.79× slowdown compared to native execution, while traditional whole-VM record-replay suffers an 8.97× to 29.94× slowdown. We validate that KRR is practical and has a broad recording scope by reproducing 17 bugs across different Linux versions, including 6 non-deterministic bugs and 5 high-risk CVEs; KRR was able to record and reproduce all but one non-deterministic bug.
  5. Logging is a vital part of the software development process. Developers use program logging to monitor program execution and to identify errors and anomalies. These errors may also cause uncaught exceptions and generate stack traces that help identify the point of error. Both of these sources contain information that can be matched to points in the source code, but manual log analysis is challenging for large systems that create large volumes of logs and have large codebases. In this paper, we contribute a systematic mapping study to determine the state-of-the-art tools and methods used to perform automatic log analysis and stack trace analysis and to match the extracted information back to the program's source code. We analyzed 16 publications that address this issue, summarizing their strategies and goals, and we identified open research directions from this body of work.
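As a small hypothetical example of the task the tools surveyed in item 5 automate, the sketch below parses a Python-style stack trace and maps each frame back to a file, line, and function inside a given source tree. The regular expression, paths, and trace are illustrative and not drawn from any of the surveyed publications.

```python
# Hypothetical example: map stack-trace frames back to locations in a codebase.
import re
from pathlib import Path

FRAME_RE = re.compile(r'File "(?P<file>.+)", line (?P<line>\d+), in (?P<func>\S+)')


def frames_in_codebase(trace_text, repo_root):
    """Yield (relative file, line, function) for frames inside repo_root,
    filtering out frames from the standard library or third-party code."""
    root = Path(repo_root).resolve()
    for m in FRAME_RE.finditer(trace_text):
        path = Path(m["file"]).resolve()
        if root in path.parents:
            yield str(path.relative_to(root)), int(m["line"]), m["func"]


example_trace = """Traceback (most recent call last):
  File "/srv/app/handlers/upload.py", line 42, in handle
    save(blob)
  File "/usr/lib/python3.11/shutil.py", line 431, in copyfileobj
    buf = fsrc.read(length)
"""

for fname, line, func in frames_in_codebase(example_trace, "/srv/app"):
    print(f"{fname}:{line} ({func})")  # -> handlers/upload.py:42 (handle)
```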