skip to main content


Title: In Datacenter Performance, The Only Constant Is Change
All computing infrastructure suffers from performance variability, be it bare-metal or virtualized. This phenomenon originates from many sources: some transient, such as noisy neighbors, and others more permanent but sudden, such as changes or wear in hardware, changes in the underlying hypervisor stack, or even undocumented interactions between the policies of the computing resource provider and the active workloads. Thus, performance measurements obtained on clouds, HPC facilities, and, more generally, datacenter environments are almost guaranteed to exhibit performance regimes that evolve over time, which leads to undesirable nonstationarities in application performance. In this paper, we present our analysis of performance of the bare-metal hardware available on the CloudLab testbed where we focus on quantifying the evolving performance regimes using changepoint detection. We describe our findings, backed by a dataset with nearly 6.9M benchmark results collected from over 1600 machines over a period of 2 years and 9 months. These findings yield a comprehensive characterization of real-world performance variability patterns in one computing facility, a methodology for studying such patterns on other infrastructures, and contribute to a better understanding of performance variability in general.  more » « less
Award ID(s):
1743363
NSF-PAR ID:
10197400
Author(s) / Creator(s):
; ; ;
Date Published:
Journal Name:
Proceedings of the Twentieth IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGrid
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Empirical performance measurements of computer systems almost always exhibit variability and anomalies. Run-to-run and server-to-server variations are common for CPU, memory, disk, and network performance characteristics. In our previous work, we focused on taming performance variability for memory, disk, and network and established an interactive analysis service at: https://confirm.fyi/ to help users of the CloudLab testbed better plan and conduct their experiments. In this paper, we describe our analysis of CPU variability based on over 1.3M performance measurements from nearly 1,800 servers and present our initial findings. The focus of this work is on capturing hardware variability, which can make repeatable experiments more difficult and can impact conclusions; it it this important for systems researchers to understand. (We note that, though we do not study it in this work, in the cloud, multi-tenancy and resource sharing an exacerbate the problem.) Variability also inevitably impacts performance and operation of middleware and high-level applications, contributing to the straggler problems in many domains, including HPC, Big Data, and Machine Learning, and on many types of cyberinfrastructures. We analyze the data from the CloudLab servers allocated in an exclusive fashion, with no virtualization. While our analysis focuses on the testbed that aims to promote reproducible research, we believe our approach and the findings can be of value to people who manage, analyze, and utilize shared computing resources in supercomputers, clouds, and datacenters. 
    more » « less
  2. In this paper we derive a new capability for robots to measure relative direction, or Angle-of-Arrival (AOA), to other robots, while operating in non-line-of-sight and unmapped environments, without requiring external infrastructure. We do so by capturing all of the paths that a WiFi signal traverses as it travels from a transmitting to a receiving robot in the team, which we term as an AOA profile. The key intuition behind our approach is to emulate antenna arrays in the air as a robot moves freely in 2D or 3D space. The small differences in the phase and amplitude of WiFi signals are thus processed with knowledge of a robots’ local displacements (often provided via inertial sensors) to obtain the profile, via a method akin to Synthetic Aperture Radar (SAR). The main contribution of this work is the development of i) a framework to accommodate arbitrary 2D and 3D trajectories, as well as continuous mobility of both transmitting and receiving robots, while computing AOA profiles between them and ii) an accompanying analysis that provides a lower bound on variance of AOA estimation as a function of robot trajectory geometry that is based on the Cramer Rao Bound and antenna array theory. This is a critical distinction with previous work on SAR that restricts robot mobility to prescribed motion patterns, does not generalize to the full 3D space, and/or requires transmitting robots to be static during data acquisition periods. In fact, we find that allowing robots to use their full mobility in 3D space while performing SAR, results in more accurate AOA profiles and thus better AOA estimation. We formally characterize this observation as the informativeness of the trajectory; a computable quantity for which we derive a closed form. All theoretical developments are substantiated by extensive simulation and hardware experiments on air/ground robot platforms. Our experimental results bolster our theoretical findings, demonstrating that 3D trajectories provide enhanced and consistent accuracy, with AOA error of less than 10 deg for 95% of trials. We also show that our formulation can be used with an off-the-shelf trajectory estimation sensor (Intel RealSense T265 tracking camera), for estimating the robots’ local displacements, and we provide theoretical as well as empirical results that show the impact of typical trajectory estimation errors on the measured AOA. Finally, we demonstrate the performance of our system on a multi-robot task where a heterogeneous air/ground pair of robots continuously measure AOA profiles over a WiFi link to achieve dynamic rendezvous in an unmapped, 300 square meter environment with occlusions. 
    more » « less
  3. In this paper, we develop the analytical framework for a novel Wireless signal-based Sensing capability for Robotics (WSR) by leveraging a robots’ mobility in 3D space. It allows robots to primarily measure relative direction, or Angle-of-Arrival (AOA), to other robots, while operating in non-line-of-sight unmapped environments and without requiring external infrastructure. We do so by capturing all of the paths that a wireless signal traverses as it travels from a transmitting to a receiving robot in the team, which we term as an AOA profile. The key intuition behind our approach is to enable a robot to emulate antenna arrays as it moves freely in 2D and 3D space. The small differences in the phase of the wireless signals are thus processed with knowledge of robots’ local displacement to obtain the profile, via a method akin to Synthetic Aperture Radar (SAR). The main contribution of this work is the development of (i) a framework to accommodate arbitrary 2D and 3D motion, as well as continuous mobility of both signal transmitting and receiving robots, while computing AOA profiles between them and (ii) a Cramer–Rao Bound analysis, based on antenna array theory, that provides a lower bound on the variance in AOA estimation as a function of the geometry of robot motion. This is a critical distinction with previous work on SAR-based methods that restrict robot mobility to prescribed motion patterns, do not generalize to the full 3D space, and require transmitting robots to be stationary during data acquisition periods. We show that allowing robots to use their full mobility in 3D space while performing SAR results in more accurate AOA profiles and thus better AOA estimation. We formally characterize this observation as the informativeness of the robots’ motion, a computable quantity for which we derive a closed form. All analytical developments are substantiated by extensive simulation and hardware experiments on air/ground robot platforms using 5 GHz WiFi. Our experimental results bolster our analytical findings, demonstrating that 3D motion provides enhanced and consistent accuracy, with a total AOA error of less than 10for 95% of trials. We also analytically characterize the impact of displacement estimation errors on the measured AOA and validate this theory empirically using robot displacements obtained using an off-the-shelf Intel Tracking Camera T265. Finally, we demonstrate the performance of our system on a multi-robot task where a heterogeneous air/ground pair of robots continuously measure AOA profiles over a WiFi link to achieve dynamic rendezvous in an unmapped, 300 m2environment with occlusions.

     
    more » « less
  4. Abstract

    Aeolian sediment transport occurs as a function of, and with feedback to ecosystem changes and disturbances. Many desert grasslands are undergoing rapid changes in vegetation, including the encroachment of woody plants, which alters fire regimes and in turn can change the spatial and temporal patterns of aeolian sediment transport. We investigated aeolian sediment transport and spatial distribution of sediment in the surface soil for 7 years following a prescribed fire using a multiple rare earth element (REE) tracer‐based approach in a shrub‐encroached desert grassland in the northern Chihuahuan desert. Results indicate that even though the aeolian horizontal sediment mass flux increased approximately three‐fold in the first windy season in the burned areas compared to control areas, there were no significant differences after three windy seasons. The soil surface of bare microsites was the major contributor of aeolian sediments in unburned areas (87%), while the shrub microsites contributed the least (<2%) during the observation period. However, after the prescribed fire, the contribution of aeolian sediments from shrub microsites increased considerably (∼40%), indicating post‐fire microsite‐scale sediment redistribution. The findings of this study, which is the first to use multiple REE tracers for multi‐year analysis of the spatial and temporal dynamics of aeolian sediment transport, illustrate how disturbance by prescribed fire can influence aeolian processes and alters dryland soil geomorphology in which distinct soils develop over time at very fine spatial scales of individual plants.

     
    more » « less
  5. In-memory key-value stores that use kernel-bypass networking serve millions of operations per second per machine with microseconds of latency. They are fast in part because they are simple, but their simple interfaces force applications to move data across the network. This is inefficient for operations that aggregate over large amounts of data, and it causes delays when traversing complex data structures. Ideally, applications could push small functions to storage to avoid round trips and data movement; however, pushing code to these fast systems is challenging. Any extra complexity for interpreting or isolating code cuts into their latency and throughput benefits. We present Splinter, a low-latency key-value store that clients extend by pushing code to it. Splinter is designed for modern multi-tenant data centers; it allows mutually distrusting tenants to write their own fine-grained extensions and push them to the store at runtime. The core of Splinter’s design relies on type- and memory-safe extension code to avoid conventional hardware isolation costs. This still allows for bare-metal execution, avoids data copying across trust boundaries, and makes granular storage functions that perform less than a microsecond of compute practical. Our measurements show that Splinter can process 3.5 million remote extension invocations per second with a median round-trip latency of less than 9 μs at densities of more than 1,000 tenants per server. We provide an implementation of Facebook’s TAO as an 800 line extension that, when pushed to a Splinter server, improves performance by 400 Kop/s to perform 3.2 Mop/s over online graph data with 30 μs remote access times. 
    more » « less