
Title: A Tool for Rejuvenating Feature Logging Levels via Git Histories and Degree of Interest
Logging is a significant programming practice. Due to the highly transactional nature of modern software applications, massive amounts of logs are generated every day, which may overwhelm developers. Logging information overload can be dangerous to software applications. Using log levels, developers can print useful information while hiding verbose logs during software runtime. As software evolves, the log levels of logging statements associated with the surrounding feature implementations may also need to be altered. Maintaining log levels necessitates a significant amount of manual effort. In this paper, we demonstrate an automated approach that can rejuvenate feature log levels by matching the interest level of developers in the surrounding features. The approach is implemented as an open-source Eclipse plug-in that uses two external plug-ins (JGit and Mylyn). It was tested on 18 open-source Java projects consisting of ~3 million lines of code and ~4K log statements. Our tool successfully analyzes 99.22% of logging statements, increases log level distributions by ~20%, and increases the focus of logs in bug-fix contexts ~83% of the time. For further details, interested readers can watch our demonstration video (https://www.youtube.com/watch?v=qIULoAXoDv4).
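As a rough illustration of the idea, the Java sketch below maps a Mylyn-style degree-of-interest (DOI) score for the enclosing feature onto a java.util.logging level. The class name, thresholds, and normalization are hypothetical and are not the plug-in's actual API; they only show how developer interest could drive a log level choice.

```java
// Hypothetical sketch (not the plug-in's actual API): map a normalized
// degree-of-interest (DOI) score derived from git history and Mylyn
// interaction data onto a java.util.logging level.
import java.util.logging.Level;

public final class LogLevelRejuvenator {

    /**
     * Picks a log level proportional to how interested developers have been
     * in the surrounding feature; {@code doi} is assumed to be in [0, 1].
     */
    public static Level levelFor(double doi) {
        if (doi >= 0.75) return Level.INFO;   // hot feature: keep its logs visible
        if (doi >= 0.50) return Level.CONFIG;
        if (doi >= 0.25) return Level.FINE;
        return Level.FINER;                   // cold feature: demote to verbose-only
    }

    public static void main(String[] args) {
        // A logging statement in a feature that developers rarely touch
        // would be rejuvenated downward, e.g. from INFO to FINER.
        System.out.println(levelFor(0.1)); // prints FINER
    }
}
```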
Award ID(s):
2200343
PAR ID:
10417309
Author(s) / Creator(s):
Date Published:
Journal Name:
International Conference on Software Engineering: Companion Proceedings
Page Range / eLocation ID:
21 to 25
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1.
    In modern Machine Learning, model training is an iterative, experimental process that can consume enormous computational resources and developer time. To aid in that process, experienced model developers log and visualize program variables during training runs. Exhaustive logging of all variables is infeasible, so developers are left to choose between slowing down training via extensive conservative logging, or letting training run fast via minimalist optimistic logging that may omit key information. As a compromise, optimistic logging can be accompanied by program checkpoints; this allows developers to add log statements post-hoc and "replay" desired log statements from a checkpoint, a process we refer to as hindsight logging. Unfortunately, hindsight logging raises tricky problems in data management and software engineering. Done poorly, hindsight logging can waste resources and generate technical debt embodied in multiple variants of training code. In this paper, we present methodologies for efficient and effective logging practices for model training, with a focus on techniques for hindsight logging. Our goal is for experienced model developers to learn and adopt these practices. To make this easier, we provide an open-source suite of tools for Fast Low-Overhead Recovery (flor) that embodies our design across three tasks: (i) efficient background logging in Python, (ii) adaptive periodic checkpointing, and (iii) an instrumentation library that codifies hindsight logging for efficient and automatic record-replay of model training. Model developers can use each flor tool separately as they see fit, or they can use flor in hands-free mode, entrusting it to instrument their code end-to-end for efficient record-replay. Our solutions leverage techniques from physiological transaction logs and recovery in database systems. Evaluations on modern ML benchmarks demonstrate that flor can produce fast checkpointing with small user-specifiable overheads (e.g. 7%), and still provide hindsight log replay times orders of magnitude faster than restarting training from scratch.
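The sketch below is a minimal, generic illustration of the hindsight-logging idea, not flor's API: training runs fast with sparse logging plus periodic checkpoints, and a log statement added after the fact is recovered by replaying only from the nearest checkpoint rather than restarting from scratch. All names and the toy training step are assumptions for illustration.

```java
// Generic hindsight-logging illustration (not flor's API): checkpoint
// periodically during a fast run, then replay a short segment to recover
// a value that was never logged the first time.
import java.util.ArrayList;
import java.util.List;

public final class HindsightLoggingSketch {

    record Checkpoint(int step, double modelState) {}

    public static void main(String[] args) {
        List<Checkpoint> checkpoints = new ArrayList<>();
        double state = 0.0;

        // Original run: optimistic logging (nothing logged), checkpoint every 100 steps.
        for (int step = 0; step < 1_000; step++) {
            state = trainStep(state, step);
            if (step % 100 == 0) checkpoints.add(new Checkpoint(step, state));
        }

        // Hindsight: we now wish we had logged the state at step 437.
        // Replay from the checkpoint at step 400 instead of rerunning all 1,000 steps.
        Checkpoint nearest = checkpoints.get(437 / 100);
        double replayState = nearest.modelState();
        for (int step = nearest.step() + 1; step <= 437; step++) {
            replayState = trainStep(replayState, step);
            if (step == 437) System.out.println("hindsight log, step 437: " + replayState);
        }
    }

    // Stand-in for one training iteration.
    private static double trainStep(double state, int step) {
        return state + 1.0 / (step + 1);
    }
}
```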
  2.
    Logging is a vital part of the software development process. Developers use program logging to monitor program execution and identify errors and anomalies. These errors may also cause uncaught exceptions and generate stack traces that help identify the point of error. Both of these sources contain information that can be matched to points in the source code, but manual log analysis is challenging for large systems that create large volumes of logs and have large codebases. In this paper, we contribute a systematic mapping study to determine the state-of-the-art tools and methods used to perform automatic log analysis and stack trace analysis and match the extracted information back to the program's source code. We analyzed 16 publications that address this issue, summarizing their strategies and goals, and we identified open research directions from this body of work. 
  3. Workflow reconstruction from logs is crucial for troubleshooting distributed systems, but it is challenging to extract enough information from logs while keeping a concise view, which makes manual log analysis impractical. Moreover, currently popular tools rely on identifier-based log parsing, leaving a large amount of workflow information unexploited. In this paper, we propose NLog, a log extraction approach that uses natural language processing to obtain the key information from log messages and identify the same object across logs generated by different statements, without any domain knowledge. We propose keyed messages, a new log storage structure, to store the parsed logs. We implement NLog and apply it to the distributed data analytics frameworks Spark and MapReduce. Evaluation results show that NLog can accurately identify the objects in log messages even without explicit identifiers. By using keyed messages, users can have a concise as well as flexible view of the workflows.
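The hedged sketch below shows the keyed-message idea in simplified form: each parsed log line is stored under the object it talks about, so a per-object workflow view can be assembled later. A regex stands in for NLog's NLP-based extraction, and the log lines and key patterns are illustrative only.

```java
// Simplified "keyed message" illustration: index log lines by the objects
// they mention (here found with a regex stand-in for NLP extraction).
import java.util.*;
import java.util.regex.*;

public final class KeyedMessageSketch {

    public static void main(String[] args) {
        List<String> rawLogs = List.of(
                "Starting task task_0 on executor 2",
                "task_0 finished in 812 ms",
                "Submitting stage 3 with 8 tasks");

        // key (object) -> ordered messages about that object
        Map<String, List<String>> keyedMessages = new LinkedHashMap<>();
        Pattern objectLike = Pattern.compile("\\b(task_\\d+|stage \\d+|executor \\d+)\\b");

        for (String line : rawLogs) {
            Matcher m = objectLike.matcher(line);
            while (m.find()) {
                keyedMessages.computeIfAbsent(m.group(1), k -> new ArrayList<>()).add(line);
            }
        }
        keyedMessages.forEach((key, msgs) -> System.out.println(key + " -> " + msgs));
    }
}
```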
  4. Information flow control is a canonical approach to access control in systems, allowing administrators to assure confidentiality and integrity by restricting the flow of data. Decentralized Information Flow Control (DIFC) harnesses application-layer semantics to allow more precise and accurate mediation of data. Unfortunately, past approaches to DIFC have depended on dedicated instrumentation efforts or developer buy-in. Thus, while DIFC has existed for decades, it has seen little-to-no adoption in commodity systems; the requirement for complete redesign or retrofitting of programs has proven too high a barrier. In this work, we make the surprising observation that developers have already unwittingly performed the instrumentation efforts required for DIFC — application event logging, a software development best practice used for telemetry and debugging, often contains the information needed to identify the application-layer events that DIFC mediates. We present T-difc, a kernel-layer reference monitor framework that leverages the insights of application event logs to perform precise decentralized flow control. T-difc identifies and extracts these application events as they are created by monitoring application I/O to log files, then references an administrator-specified security policy to assign data labels and mediate the flow of data through the system. To our knowledge, T-difc is the first approach to DIFC that does not require developer support or custom instrumentation. In a survey of 15 popular open-source applications, we demonstrate that T-difc works seamlessly on a variety of programs while imposing negligible runtime overhead on realistic policies and workloads. Thus, T-difc demonstrates a transparent and non-invasive path forward for the dissemination of decentralized information flow controls.
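As a loose illustration of the policy step described above (not T-difc's implementation, which is a kernel-layer reference monitor), the sketch below matches observed application log lines against an administrator-specified policy and returns the label that would be attached to subsequently flowing data. The rules, patterns, and labels are invented for the example.

```java
// Illustrative only: derive a data label from an application log event
// using an administrator-specified rule list.
import java.util.List;
import java.util.Optional;
import java.util.regex.Pattern;

public final class LogDrivenLabelSketch {

    record Rule(Pattern logPattern, String label) {}

    static final List<Rule> POLICY = List.of(
            new Rule(Pattern.compile("session opened for user \\w+"), "user-session"),
            new Rule(Pattern.compile("processing payment .*"), "payment-data"));

    /** Returns the label to attach to the process's subsequent output, if any rule matches. */
    static Optional<String> labelFor(String observedLogLine) {
        return POLICY.stream()
                .filter(r -> r.logPattern().matcher(observedLogLine).find())
                .map(Rule::label)
                .findFirst();
    }

    public static void main(String[] args) {
        System.out.println(labelFor("processing payment for order 91")); // Optional[payment-data]
        System.out.println(labelFor("heartbeat ok"));                    // Optional.empty
    }
}
```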
  5. Developers use logs to diagnose performance problems in distributed applications, but it is difficult to know a priori where logs are needed and what information they must contain to help diagnose problems that may occur in the future. We summarize our work on the Variance-driven Automated Instrumentation Framework (VAIF), which runs alongside distributed applications. In response to newly observed performance problems, VAIF automatically searches the space of possible instrumentation choices to enable the logs needed to diagnose them. To do so, VAIF combines distributed tracing (an enhanced form of logging) with insights about how response-time variance can be decomposed on the critical-path portions of requests' traces.
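The rough sketch below conveys the variance-driven intuition in miniature: request groups whose response times vary widely are the ones that get additional instrumentation enabled. The grouping, threshold, and latency data are assumptions for illustration and are not VAIF's API.

```java
// Illustrative variance-driven selection: enable verbose tracing only for
// request groups whose response-time variance exceeds a threshold.
import java.util.List;
import java.util.Map;

public final class VarianceDrivenInstrumentationSketch {

    static double variance(List<Double> latenciesMs) {
        double mean = latenciesMs.stream().mapToDouble(Double::doubleValue).average().orElse(0);
        return latenciesMs.stream().mapToDouble(x -> (x - mean) * (x - mean)).average().orElse(0);
    }

    public static void main(String[] args) {
        Map<String, List<Double>> latenciesByRequestGroup = Map.of(
                "GET /read", List.of(10.0, 11.0, 9.5, 10.2),
                "PUT /write", List.of(12.0, 250.0, 15.0, 310.0));

        double threshold = 100.0; // ms^2, illustrative
        latenciesByRequestGroup.forEach((group, latencies) -> {
            if (variance(latencies) > threshold) {
                System.out.println("enable verbose tracing on critical path of " + group);
            }
        });
    }
}
```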