Title: BayesPerf: minimizing performance monitoring errors using Bayesian statistics
Hardware performance counters (HPCs) that measure low-level architectural and microarchitectural events provide dynamic contextual information about the state of the system. However, HPC measurements are error-prone due to non-determinism (e.g., undercounting due to event multiplexing, or OS interrupt-handling behaviors). In this paper, we present BayesPerf, a system for quantifying uncertainty in HPC measurements by using a domain-driven Bayesian model that captures microarchitectural relationships between HPCs to jointly infer their values as probability distributions. We provide the design and implementation of an accelerator that allows for low-latency and low-power inference of the BayesPerf model for x86 and ppc64 CPUs. BayesPerf reduces the average error in HPC measurements from 40.1% to 7.6% when events are being multiplexed. The value of BayesPerf in real-time decision-making is illustrated with a simple example of scheduling PCIe transfers.
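The joint-inference idea can be illustrated with a toy linear-Gaussian model: treat the extrapolated, multiplexed readings as noisy observations, encode one microarchitectural relationship between events as an extra pseudo-observation, and compute the posterior over the true counts in closed form. The event names, the consistency relation, and every noise level below are illustrative assumptions, not the BayesPerf model itself.

```python
# Toy linear-Gaussian sketch of joint inference over HPC readings (not the
# BayesPerf model). Event names, the consistency relation, and noise levels
# are illustrative assumptions.
import numpy as np

# Extrapolated readings under multiplexing (hypothetical): instructions,
# loads, stores, branches -- each with a large error bar.
y = np.array([1.00e9, 0.42e9, 0.31e9, 0.22e9])
obs_var = (0.4 * y) ** 2

# Assumed relation: instructions ~ loads + stores + branches + slack.
# Encode it as one extra pseudo-observation with a small variance.
A = np.vstack([np.eye(4), [1.0, -1.0, -1.0, -1.0]])
b = np.concatenate([y, [0.05e9]])                  # slack for "other" instructions
noise_var = np.concatenate([obs_var, [(0.02e9) ** 2]])

# Gaussian prior centered on the raw readings, then a standard
# linear-Gaussian posterior update in closed form.
prior_mean, prior_var = y.copy(), (0.5 * y) ** 2
precision = np.diag(1.0 / prior_var) + A.T @ np.diag(1.0 / noise_var) @ A
post_cov = np.linalg.inv(precision)
post_mean = post_cov @ (prior_mean / prior_var
                        + A.T @ np.diag(1.0 / noise_var) @ b)

print("posterior mean:", post_mean)
print("posterior std :", np.sqrt(np.diag(post_cov)))
```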
Award ID(s):
2029049 1337732 1624790
PAR ID:
10292981
Author(s) / Creator(s):
; ; ;
Date Published:
Journal Name:
The 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '21)
Page Range / eLocation ID:
832 to 844
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Controllers of security-critical cyber-physical systems, like the power grid, are a very important class of computer systems. Attacks against the control code of a power-grid system, especially zero-day attacks, can be catastrophic. Earlier detection of such anomalies can prevent further damage. However, detecting zero-day attacks is extremely challenging because they have no known code and exhibit unknown behavior. Furthermore, if data collected from the controller is transferred to a server through networks for analysis and detection of anomalous behavior, this creates a very large attack surface and also delays detection. To address this problem, we propose the Reconstruction Error Distribution (RED) of Hardware Performance Counters (HPCs), and a data-driven defense system based on it. Specifically, we first train a temporal deep learning model, using only normal HPC readings from legitimate processes that run daily in these power-grid systems, to model the normal behavior of the power-grid controller. Then, we run this model using real-time data from commonly available HPCs. We use the proposed RED to enhance the temporal deep learning detection of anomalous behavior by estimating distribution deviations from the normal behavior with an effective statistical test. Experimental results on a real power-grid controller show that we can detect anomalous behavior with high accuracy (>99.9%), nearly zero false positives, and short (<360 ms) latency.
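A minimal sketch of the reconstruction-error-distribution idea, not the paper's pipeline: compute per-window reconstruction errors under a model of normal behavior, then flag live windows whose error distribution deviates from the baseline. The `reconstruct` placeholder stands in for the temporal deep learning model, and the two-sample Kolmogorov-Smirnov test is used here only as one plausible choice of statistical test.

```python
# Minimal sketch of reconstruction-error-distribution anomaly detection.
# `reconstruct` is a hypothetical placeholder for the trained temporal model.
import numpy as np
from scipy import stats

def reconstruction_errors(model_reconstruct, windows):
    """Per-window mean squared reconstruction error."""
    return np.array([np.mean((w - model_reconstruct(w)) ** 2) for w in windows])

def is_anomalous(normal_errors, live_errors, alpha=0.001):
    """Flag a batch of live windows whose error distribution deviates from normal."""
    _, p_value = stats.ks_2samp(normal_errors, live_errors)
    return p_value < alpha

# Toy usage with a trivial "model" that reconstructs every window as its mean.
rng = np.random.default_rng(0)
reconstruct = lambda w: np.full_like(w, w.mean())
normal_windows = [rng.normal(0, 1, 64) for _ in range(200)]   # normal HPC windows
live_windows = [rng.normal(0, 3, 64) for _ in range(50)]      # shifted behavior
baseline = reconstruction_errors(reconstruct, normal_windows)
print(is_anomalous(baseline, reconstruction_errors(reconstruct, live_windows)))
```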
  2. Power modeling is an essential building block for computer systems in support of energy optimization, energy profiling, and energy-aware application development. We introduce VESTA, a novel approach to modeling the power consumption of applications built on one key insight: language runtime events are often correlated with a sustained level of power consumption. Compared with the established approach of power modeling based on hardware performance counters (HPCs), VESTA has the benefit of solely requiring application-scoped information and enabling a higher level of explainability, while achieving comparable or even higher precision. Through experiments on 37 real-world applications on the Java Virtual Machine (JVM), we find that the power model built by VESTA is capable of predicting energy consumption with a mean absolute percentage error of 1.56%, while the monitoring of language runtime events incurs small performance and energy overhead.
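As a rough illustration of the modeling style only (not VESTA's implementation), one can regress per-interval power measurements on counts of runtime events; the event names and all numbers below are invented for the sketch.

```python
# Toy regression sketch of event-based power modeling (not VESTA's actual
# model): regress measured power in each interval on counts of runtime events.
import numpy as np

events = ["gc_pause", "jit_compile", "threads_runnable"]
# Rows: observation intervals; columns: per-interval event counts (hypothetical).
X = np.array([[2.0, 1.0,  8.0],
              [0.0, 0.0,  4.0],
              [5.0, 2.0, 12.0],
              [1.0, 0.0,  6.0],
              [3.0, 1.0, 10.0],
              [0.0, 1.0,  5.0]])
power_watts = np.array([38.0, 21.5, 55.2, 29.8, 44.1, 24.3])  # measured per interval

# Fit a linear model with an intercept (idle power) by ordinary least squares.
X1 = np.hstack([np.ones((len(X), 1)), X])
coef, *_ = np.linalg.lstsq(X1, power_watts, rcond=None)

pred = X1 @ coef
mape = np.mean(np.abs((pred - power_watts) / power_watts)) * 100
print(dict(zip(["idle_watts"] + events, coef.round(2))), f"MAPE={mape:.2f}%")
```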
  3. Parallel filesystems (PFSs) are one of the most critical high-availability components of High Performance Computing (HPC) systems. Most HPC workloads depend on the availability of a POSIX-compliant parallel filesystem that provides a globally consistent view of data to all compute nodes of an HPC system. Because of this central role, failure or performance degradation events in the PFS can impact every user of an HPC resource. There is typically insufficient information available to users, and even to many HPC staff, to identify the causes of these PFS events, impeding the implementation of timely and targeted remedies to PFS issues. The relevant information is distributed across PFS servers; however, access to these servers is highly restricted due to the sensitive role they play in the operations of an HPC system. Additionally, the information is challenging to aggregate and interpret, relegating diagnosis and treatment of PFS issues to a select few experts with privileged system access. To democratize this information, we are developing an open-source and user-facing Parallel FileSystem TRacing and Analysis SErvice (PFSTRASE) that analyzes the requisite data to establish causal relationships between PFS activity and events detrimental to stability and performance. We are implementing the service for the open-source Lustre filesystem, which is the most commonly used PFS at large-scale HPC sites. Server loads for specific PFS I/O operations (IOPs) will be measured and aggregated by the service to automatically estimate an effective load generated by every client, job, and user. The infrastructure provides a real-time, user-accessible text-based interface and a publicly accessible web interface displaying both real-time and historical data.
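The load-attribution step can be sketched very roughly as follows: given server-side records of which client issued which I/O operations, weight each operation by an assumed cost and sum per client. The operation costs, client names, and record format are hypothetical, not PFSTRASE's actual data model.

```python
# Toy sketch of per-client load attribution (not PFSTRASE's implementation).
from collections import defaultdict

# Hypothetical per-IOP cost weights (relative server load units).
OP_COST = {"open": 1.0, "read": 0.2, "write": 0.5, "setattr": 2.0}

# Hypothetical records harvested from PFS servers: (client, op, count).
records = [
    ("nid001", "read", 120_000), ("nid001", "open", 4_000),
    ("nid002", "write", 80_000), ("nid002", "setattr", 1_500),
]

def effective_load(records):
    """Sum cost-weighted operation counts per client."""
    load = defaultdict(float)
    for client, op, count in records:
        load[client] += OP_COST[op] * count
    return dict(load)

print(effective_load(records))
```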
  4. Permafrost thaw has been observed at several locations across the Arctic tundra in recent decades; however, the pan-Arctic extent and spatiotemporal dynamics of thaw remain poorly explained. Thaw-induced differential ground subsidence and dramatic microtopographic transitions, such as the transformation of low-centered ice-wedge polygons (IWPs) into high-centered IWPs, can be characterized using very high spatial resolution (VHSR) commercial satellite imagery. Arctic researchers demand an accurate estimate of the distribution of IWPs and their status across the tundra domain. The entire Arctic has been imaged at 0.5 m resolution by commercial satellite sensors; however, mapping efforts are still limited to small scales and confined to manual or semi-automated methods. Knowledge discovery through artificial intelligence (AI), big imagery, and high performance computing (HPC) resources is just starting to be realized in Arctic science. Large-scale deployment of VHSR imagery resources requires sophisticated computational approaches to automated image interpretation coupled with efficient use of HPC resources. We are developing an automated Mapping Application for Permafrost Land Environment (MAPLE) by combining big imagery, AI, and HPC resources. MAPLE uses deep learning (DL) convolutional neural network (CNN) algorithms on HPCs to automatically map IWPs from VHSR commercial satellite imagery across large geographic domains. We trained and tasked a DL CNN semantic object instance segmentation algorithm to automatically classify IWPs from VHSR satellite imagery. Overall, our findings demonstrate the robust performance of the IWP mapping algorithm in diverse tundra landscapes and lay a firm foundation for its operational-level application in repeated documentation of circumpolar permafrost disturbances.
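A minimal sketch of the tile-and-stitch inference workflow that mapping large scenes typically requires (not MAPLE's code): split a very-high-resolution scene into tiles, run a segmentation model on each tile, and stitch the per-tile masks back together. The `segment_tile` placeholder stands in for the trained DL CNN, and the tile size and band count are assumptions.

```python
# Toy tile-and-stitch segmentation workflow over a large scene.
# `segment_tile` is a hypothetical placeholder for the trained CNN.
import numpy as np

TILE = 256  # tile edge length in pixels (assumed)

def segment_tile(tile: np.ndarray) -> np.ndarray:
    """Placeholder: a real implementation would run a trained CNN here."""
    return (tile.mean(axis=-1) > 0.5).astype(np.uint8)   # dummy threshold "mask"

def map_scene(scene: np.ndarray) -> np.ndarray:
    """Run per-tile inference over an (H, W, bands) scene and stitch the masks."""
    h, w, _ = scene.shape
    mask = np.zeros((h, w), dtype=np.uint8)
    for y in range(0, h, TILE):
        for x in range(0, w, TILE):
            mask[y:y + TILE, x:x + TILE] = segment_tile(scene[y:y + TILE, x:x + TILE])
    return mask

scene = np.random.rand(1024, 1024, 4)    # fake 4-band VHSR chip
print(map_scene(scene).sum(), "pixels flagged as IWP")
```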
  5. The Hogan Personality Inventory (HPI) and Hogan Development Survey (HDS) are among the most widely used and extensively validated personality inventories for organizational applications; however, they are rarely used in basic research. We describe the Hogan Personality Content Single-Items (HPCS) inventory, an inventory designed to measure the 74 content subscales of the HPI and HDS via a single item each. We provide evidence of the reliability and validity of the HPCS, including item-level retest reliability estimates, both self-other and other-other (or observer) agreement, convergent correlations with the corresponding scales from the full HPI/HDS instruments, and analyses of how similarly the HPCS and the full HPI/HDS instruments relate to other variables. We discuss situations where administering the HPCS may have advantages or disadvantages relative to the full HPI and HDS. We also discuss how the current findings contribute to an emerging picture of best practices for the development and use of inventories consisting of single-item scales.