

Title: Multi-source Log Clustering in Distributed Systems
Distributed systems are seeing wider use as software grows more complex and cloud platforms gain popularity. Yet anomaly detection and other log analysis procedures for distributed systems have received little research attention. To this end, we propose a simple and generic method of clustering log statements from separate log files to support subsequent log analysis. We identify the variable components of log statements and find matches of these variables between the sources. After scoring the variables, we select the one with the highest score as the clustering basis. We performed a case study of our method on two open-source projects, in which the method proved successful, and we have released an open-source implementation, log-matcher.
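
As a rough illustration of the pipeline above, the following Python sketch implements one plausible reading of it: extract candidate variable tokens from each line, score each variable value by how many distinct sources it appears in, and cluster statements around the highest-scoring one. The token pattern, the scoring heuristic, and all names are illustrative assumptions of ours, not the published log-matcher implementation.

    # Minimal sketch of the clustering idea; the token pattern and the
    # cross-source scoring heuristic are illustrative assumptions, not
    # the published log-matcher code.
    import re
    from collections import defaultdict

    # Naive variable extraction: treat hex strings, IPv4-like tokens, and
    # plain integers as the variable components of a log statement.
    VARIABLE_PATTERN = re.compile(r"0x[0-9a-f]+|\d+\.\d+\.\d+\.\d+|\d+")

    def extract_variables(line):
        """Return the variable components found in one log line."""
        return VARIABLE_PATTERN.findall(line)

    def score_variables(sources):
        """Score each variable value by the number of distinct sources
        (log files) it appears in; sources maps name -> list of lines."""
        seen_in = defaultdict(set)
        for name, lines in sources.items():
            for line in lines:
                for var in extract_variables(line):
                    seen_in[var].add(name)
        return {var: len(srcs) for var, srcs in seen_in.items()}

    def cluster_by_best_variable(sources):
        """Group lines from all sources around the highest-scoring variable."""
        scores = score_variables(sources)
        best = max(scores, key=scores.get)
        cluster = [(name, line)
                   for name, lines in sources.items()
                   for line in lines
                   if best in extract_variables(line)]
        return best, cluster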
Award ID(s):
1854049
NSF-PAR ID:
10310334
Author(s) / Creator(s):
Date Published:
Journal Name:
Information Science and Applications. Lecture Notes in Electrical Engineering
Volume:
739
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1.
    In modern Machine Learning, model training is an iterative, experimental process that can consume enormous computation resources and developer time. To aid in that process, experienced model developers log and visualize program variables during training runs. Exhaustive logging of all variables is infeasible, so developers are left to choose between slowing down training via extensive conservative logging, or letting training run fast via minimalist optimistic logging that may omit key information. As a compromise, optimistic logging can be accompanied by program checkpoints; this allows developers to add log statements post hoc and "replay" desired log statements from a checkpoint, a process we refer to as hindsight logging. Unfortunately, hindsight logging raises tricky problems in data management and software engineering. Done poorly, hindsight logging can waste resources and generate technical debt embodied in multiple variants of training code. In this paper, we present methodologies for efficient and effective logging practices for model training, with a focus on techniques for hindsight logging. Our goal is for experienced model developers to learn and adopt these practices. To make this easier, we provide an open-source suite of tools for Fast Low-Overhead Recovery (flor) that embodies our design across three tasks: (i) efficient background logging in Python, (ii) adaptive periodic checkpointing, and (iii) an instrumentation library that codifies hindsight logging for efficient and automatic record-replay of model training. Model developers can use each flor tool separately as they see fit, or they can use flor in hands-free mode, entrusting it to instrument their code end-to-end for efficient record-replay. Our solutions leverage techniques from physiological transaction logs and recovery in database systems. Evaluations on modern ML benchmarks demonstrate that flor can produce fast checkpointing with small, user-specifiable overheads (e.g. 7%) while still providing hindsight log replay times orders of magnitude faster than restarting training from scratch.
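
    To make the record-replay idea concrete, here is a toy sketch of checkpointed training and post-hoc replay in plain Python. It is not flor's API: the fixed checkpoint interval, the file names, and the stand-in training step are all illustrative assumptions (flor's checkpointing is adaptive and its logging runs in the background).

        import pickle

        def train(epochs, checkpoint_every=10, replay_from=None, hindsight_log=None):
            """Toy training loop: optionally resume from a checkpointed state and
            replay a log statement that was added after the fact (hindsight logging)."""
            state = replay_from if replay_from is not None else {"epoch": 0, "loss": None}
            for epoch in range(state["epoch"], epochs):
                state = {"epoch": epoch + 1, "loss": 1.0 / (epoch + 1)}  # stand-in step
                if hindsight_log is not None:   # post-hoc log, replayed from a checkpoint
                    hindsight_log(state)
                if (epoch + 1) % checkpoint_every == 0:  # fixed here; adaptive in flor
                    with open(f"ckpt_{epoch + 1}.pkl", "wb") as f:
                        pickle.dump(state, f)
            return state

        # Fast optimistic run first, then replay only epochs 41-50 with extra logging:
        train(50)
        with open("ckpt_40.pkl", "rb") as f:
            train(50, replay_from=pickle.load(f), hindsight_log=print)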
  2. Logging is a significant programming practice. Due to the highly transactional nature of modern software applications, massive amounts of logs are generated every day, which may overwhelm developers. Logging information overload can be dangerous to software applications. Using log levels, developers can print useful information while hiding verbose logs during software runtime. As software evolves, the log levels of logging statements associated with the surrounding software feature implementation may also need to be altered. Maintaining log levels necessitates a significant amount of manual effort. In this paper, we demonstrate an automated approach that can rejuvenate feature log levels by matching the interest level of developers in the surrounding features. The approach is implemented as an open-source Eclipse plug-in, using two external plug-ins (JGit and Mylyn). It was tested on 18 open-source Java projects consisting of ~3 million lines of code and ~4K log statements. Our tool successfully analyzes 99.22% of logging statements, increases log level distributions by ~20%, and increases the focus of logs in bug fix contexts ~83% of the time. For further details, interested readers can watch our demonstration video (https://www.youtube.com/watch?v=qIULoAXoDv4).
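
    The paper's tool is a Java Eclipse plug-in built on JGit and Mylyn; purely to illustrate the core idea of tying log levels to developer interest, the small Python sketch below maps a normalized degree-of-interest (DOI) score for the surrounding feature to a log level. The thresholds and the DOI input are assumptions of ours, not the plug-in's algorithm.

        import logging

        logging.addLevelName(logging.DEBUG - 5, "TRACE")  # custom level for cold features
        logging.basicConfig(level=logging.DEBUG - 5)

        def rejuvenated_level(doi):
            """Map a degree-of-interest score in [0, 1] to a log level:
            hot features keep visible logs, cold features are demoted.
            Thresholds here are illustrative, not the plug-in's."""
            if doi >= 0.75:
                return logging.INFO        # actively developed feature
            if doi >= 0.40:
                return logging.DEBUG       # lukewarm: verbose-only
            return logging.DEBUG - 5       # cold: below DEBUG, hidden by default

        logger = logging.getLogger("feature.cache")
        logger.log(rejuvenated_level(0.9), "cache warmed with %d entries", 128)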
  3. SUMMARY

    We introduce a new finite-element (FE) based computational framework to solve forward and inverse elastic deformation problems for earthquake faulting via the adjoint method. Based on two advanced computational libraries, FEniCS and hIPPYlib for the forward and inverse problems, respectively, this framework is flexible, transparent and easily extensible. We represent a fault discontinuity through a mixed FE elasticity formulation, which approximates the stress with higher order accuracy and exposes the prescribed slip explicitly in the variational form without using conventional split node and decomposition discrete approaches. This also allows the first order optimality condition, that is the vanishing of the gradient, to be expressed in continuous form, which leads to consistent discretizations of all field variables, including the slip. We show comparisons with the standard, pure displacement formulation and a model containing an in-plane mode II crack, whose slip is prescribed via the split node technique. We demonstrate the potential of this new computational framework by performing a linear coseismic slip inversion through adjoint-based optimization methods, without requiring computation of elastic Green’s functions. Specifically, we consider a penalized least squares formulation, which in a Bayesian setting—under the assumption of Gaussian noise and prior—reflects the negative log of the posterior distribution. The comparison of the inversion results with a standard, linear inverse theory approach based on Okada’s solutions shows analogous results. Preliminary uncertainties are estimated via eigenvalue analysis of the Hessian of the penalized least squares objective function. Our implementation is fully open-source and Jupyter notebooks to reproduce our results are provided. The extension to a fully Bayesian framework for detailed uncertainty quantification and non-linear inversions, including for heterogeneous media earthquake problems, will be analysed in a forthcoming paper.
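
    For readers unfamiliar with the Bayesian reading of penalized least squares, the objective takes the generic form below; the notation (forward operator F, slip s, data d, and the covariance operators) is ours, written in LaTeX for clarity, not the paper's.

        % Generic penalized least-squares objective; under Gaussian noise and a
        % Gaussian prior, J(s) is the negative log posterior up to a constant.
        % The symbols F, s, d, s_0, \Gamma are illustrative, not the paper's.
        J(s) = \frac{1}{2}\,\big\| F(s) - d \big\|^{2}_{\Gamma_{\mathrm{noise}}^{-1}}
             + \frac{1}{2}\,\big\| s - s_{0} \big\|^{2}_{\Gamma_{\mathrm{prior}}^{-1}}

    Minimizing J(s) then coincides with maximizing the posterior, and the eigenvalue analysis of its Hessian mentioned above is what yields the preliminary uncertainty estimates.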

     
  4. Workflow reconstruction from logs is crucial for troubleshooting distributed systems. It is challenging, however, to extract enough information from logs while keeping a concise view, which makes manual log analysis hard to practice. Moreover, currently popular tools rely on identifier-based log parsing, leaving a large amount of workflow information unexploited. In this paper, we propose a log extraction approach, NLog, which utilizes natural language processing to obtain the key information from log messages and to identify the same object in logs generated by different statements, without any domain knowledge. We propose keyed messages, a new log storage structure, to store the parsed logs. We implement NLog and apply it to the distributed data analytics frameworks Spark and MapReduce. Evaluation results show that NLog can accurately identify the objects in log messages even without explicit identifiers. Using keyed messages, users get a concise yet flexible view of the workflows.
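
    As a rough sketch of what a keyed message store might look like, the Python snippet below buckets parsed lines under a key naming the object they describe, so that lines emitted by different statements about the same object land in one workflow view. The field layout and the toy extraction rule are our assumptions; NLog's actual NLP-based extraction and storage format are described in the paper.

        # Sketch of the "keyed message" idea as we read it; the key format and
        # the extraction rule below are assumptions, not NLog's implementation.
        import re
        from collections import defaultdict

        keyed_messages = defaultdict(list)

        def ingest(line):
            # Stand-in for NLog's NLP step: grab a "noun + id" pair as the object.
            m = re.search(r"\b(task|block|container)[ _]?(\w+)", line, re.IGNORECASE)
            key = f"{m.group(1).lower()}:{m.group(2)}" if m else "unkeyed"
            keyed_messages[key].append(line)

        ingest("Launching task 12 on executor 3")
        ingest("task 12 finished in 214 ms")
        # keyed_messages["task:12"] now groups both lines, emitted by two
        # different log statements, into one per-object workflow view.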
  5. Abstract

    The coronavirus pandemic has already caused severe problems for humanity and the economy. The exact impact of the COVID-19 pandemic is still unknown, and economists and financial advisers are exploring all possible scenarios to mitigate the risks arising from it. An intriguing question is whether, and to what extent, this pandemic and its impacts resemble other catastrophic events of the past, such as the 2009 Great Recession. This paper addresses that question by analyzing official public announcements and statements issued by federal authorities such as the Federal Reserve. More specifically, we measure the similarity of consecutive statements issued by the Federal Reserve during the Great Recession and the COVID-19 pandemic using natural language processing techniques. Furthermore, we explore the use of document embedding representations of the statements in a more complex task: clustering. Our analysis shows that, using an advanced NLP document embedding technique such as Doc2Vec, we can detect a difference of 10.8% in the similarities of Federal Open Market Committee (FOMC) statements issued during the Great Recession (2007–2009) and the COVID-19 pandemic. Finally, the results of our clustering exercise show that the document embedding representations of the statements are suitable for more complex tasks, providing a basis for future applications of state-of-the-art natural language processing techniques using the FOMC post-meeting statements as the dataset.
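
    A minimal version of this pipeline can be sketched with gensim's Doc2Vec and scikit-learn, as below; the two placeholder texts, the vector size, and the cluster count are illustrative stand-ins, not the paper's FOMC dataset or settings.

        # Embed documents with Doc2Vec, then cluster the embeddings.
        from gensim.models.doc2vec import Doc2Vec, TaggedDocument
        from sklearn.cluster import KMeans

        statements = [
            "the committee decided to lower the target range for the federal funds rate",
            "economic activity contracted sharply amid the public health crisis",
        ]  # placeholder texts; the real inputs are FOMC post-meeting statements
        corpus = [TaggedDocument(words=s.split(), tags=[i])
                  for i, s in enumerate(statements)]

        model = Doc2Vec(corpus, vector_size=50, min_count=1, epochs=40)
        vectors = [model.infer_vector(s.split()) for s in statements]

        # Cluster the document embeddings, as in the paper's clustering exercise.
        labels = KMeans(n_clusters=2, n_init=10).fit_predict(vectors)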

     