Title: A principled approach for selecting block I/O traces
We present IOTap, a tool that analyzes and profiles block I/O traces. IOTap computes the (dis)similarities among a set of workloads and provides a guideline for selecting a representative subset of traces for benchmarking. This avoids experimentally running all workloads or, worse, arbitrarily selecting a subset that skews the results. We demonstrate the usefulness of IOTap by comparing its results with experiments on real SSDs, achieving a high correlation of 0.92 for an NVMe SSD.
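
To make the selection idea concrete, here is a minimal sketch of similarity-driven trace selection: each trace is summarized as a feature vector, pairwise dissimilarities are computed, and the most mutually dissimilar traces are chosen for benchmarking. The feature names, the Euclidean distance, and the greedy k-center selection are illustrative assumptions, not IOTap's published method.

import numpy as np

def dissimilarity_matrix(features):
    """Pairwise Euclidean distances between per-trace feature vectors."""
    X = np.asarray(features, dtype=float)
    # Normalize each feature to [0, 1] so no single metric dominates.
    X = (X - X.min(axis=0)) / (np.ptp(X, axis=0) + 1e-12)
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def select_representatives(dist, k):
    """Greedy k-center: pick k traces that spread across the workload space."""
    chosen = [int(dist.sum(axis=1).argmin())]   # start with the most central trace
    while len(chosen) < k:
        coverage = dist[:, chosen].min(axis=1)  # distance to the nearest pick
        chosen.append(int(coverage.argmax()))   # add the worst-covered trace
    return chosen

# Hypothetical summaries: (read ratio, mean request size KiB, mean queue depth).
traces = [[0.90, 4, 1], [0.10, 128, 32], [0.50, 64, 8],
          [0.88, 8, 2], [0.15, 120, 30]]
picks = select_representatives(dissimilarity_matrix(traces), k=2)
print("traces to benchmark:", picks)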
Award ID(s):
1822165 2008453
PAR ID:
10351798
Publisher / Repository:
ACM Digital Library, HotStorage
Journal Name:
14th ACM Workshop on Hot Topics in Storage and File Systems
Page Range / eLocation ID:
52 to 58
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1.
    In this paper, we present a new benchmark to validate the suitability of database systems for interactive visualization workloads. While there exist proposals for evaluating database systems on interactive data exploration workloads, none rely on real user traces for database benchmarking. To this end, our long term goal is to collect user traces that represent workloads with different exploration characteristics. In this paper, we present an initial benchmark that focuses on "crossfilter"-style applications, which are a popular interaction type for data exploration and a particularly demanding scenario for testing database system performance. We make our benchmark materials, including input datasets, interaction sequences, corresponding SQL queries, and analysis code, freely available as a community resource, to foster further research in this area: https://osf.io/9xerb/?view_only=81de1a3f99d04529b6b173a3bd5b4d23. 
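
As a concrete illustration of how interaction traces become database queries, the sketch below builds the SQL that a single crossfilter brush would trigger; the table and column names are hypothetical, not the benchmark's actual schema.

def crossfilter_query(table, target_dim, bin_width, brushes):
    """Histogram `target_dim`, applying the range predicates (brushes)
    currently active on the other dimensions."""
    where = " AND ".join(
        f"{dim} BETWEEN {lo} AND {hi}" for dim, (lo, hi) in brushes.items()
    ) or "TRUE"
    return (f"SELECT FLOOR({target_dim} / {bin_width}) AS bin, COUNT(*) AS cnt "
            f"FROM {table} WHERE {where} GROUP BY bin ORDER BY bin;")

# Every mouse movement re-issues queries like this for each linked histogram,
# which is what makes crossfilter workloads so demanding for the DBMS.
print(crossfilter_query("flights", "dep_time", bin_width=60,
                        brushes={"delay": (0, 120), "distance": (500, 1500)}))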
  2. Understanding the different workload-dependent factors that impact the latency or reliability of a storage system is essential for SLA satisfaction and fair resource provisioning. However, due to the volatility of system behavior under multiple workloads, determining even the number of concurrent types of workload functions, a necessary precursor to workload separation, is an unsolved problem in the general case. We introduce CENSUS, a novel classification framework that combines time-series analysis with gradient boosting to identify the number of functional workloads in a shared storage system by projecting workload traces into a high-dimensional feature representation space. We show that CENSUS can distinguish the number of interleaved workloads in a real-world trace segment with up to 95% accuracy, leading to a reduction of the mean square error to as little as 5% compared to the
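
The following sketch illustrates the classification idea in miniature: trace windows are projected into a feature space, and a gradient-boosting model predicts the workload count. The features, the synthetic training data, and the scikit-learn usage are assumptions made for illustration, not the CENSUS implementation.

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def window_features(iat, sizes):
    """Project one trace window into a small feature vector (a stand-in for
    the paper's high-dimensional representation)."""
    return [np.mean(iat), np.std(iat), np.percentile(iat, 99),
            np.mean(sizes), np.std(sizes),
            np.corrcoef(iat[:-1], iat[1:])[0, 1]]  # lag-1 autocorrelation

rng = np.random.default_rng(0)
X, y = [], []
for n_workloads in (1, 2, 3):                # labeled training windows
    for _ in range(200):
        # Superpose n request streams: synthetic stand-in for labeled traces.
        iat = rng.exponential(1.0 / n_workloads, 500)  # inter-arrival times
        sizes = rng.choice([4, 64, 128], 500)          # request sizes (KiB)
        X.append(window_features(iat, sizes))
        y.append(n_workloads)

clf = GradientBoostingClassifier().fit(X, y)
print("predicted number of workloads:", clf.predict([X[0]])[0])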
  3. Caching techniques are widely used in today’s computing infrastructure, from virtual memory management to server caches and memory caches. This paper builds on two observations. First, the space utilization of a cache can be improved by varying the cache size based on dynamic application demand. Second, it is easier to predict application behavior statistically than precisely. This paper presents a new variable-size cache that uses statistical knowledge of program behavior to maximize cache performance. We measure performance using data access traces from real-world workloads, including Memcached traces from Facebook and storage traces from Microsoft Research. In an offline setting, the new cache is demonstrated to outperform even OPT, the optimal fixed-size cache, which makes use of precise knowledge of program behavior.
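
To illustrate what "statistical knowledge of program behavior" can look like, the sketch below derives an LRU miss-ratio curve from reuse distances and picks the smallest cache size meeting a target miss ratio; this is a textbook technique chosen for illustration, not the paper's variable-size cache algorithm.

from collections import Counter

def reuse_distances(trace):
    """For each access, the number of distinct items touched since that
    item's previous access (infinite for a first access)."""
    last_seen, dists = {}, []
    for i, key in enumerate(trace):
        if key in last_seen:
            dists.append(len(set(trace[last_seen[key] + 1:i])))
        else:
            dists.append(float("inf"))  # cold miss
        last_seen[key] = i
    return dists

def miss_ratio_curve(dists, max_size):
    """LRU hits at cache size s are exactly the accesses with distance < s."""
    hist, total = Counter(dists), len(dists)
    return [1 - sum(c for d, c in hist.items() if d < s) / total
            for s in range(1, max_size + 1)]

trace = list("abcabcabdabe")                 # toy access stream
mrc = miss_ratio_curve(reuse_distances(trace), max_size=5)
size = next(s for s, m in enumerate(mrc, start=1) if m <= 0.5)
print("miss-ratio curve:", [round(m, 2) for m in mrc])
print("smallest size with miss ratio <= 50%:", size)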
  4. Data-intensive analytical applications need to support both efficient reads and writes. However, a data layout that suits an update-heavy workload is usually ill-suited for a read-mostly one, and vice versa. Modern analytical data systems rely on columnar layouts and employ delta stores to inject new data and updates. We show that for hybrid workloads we can achieve close to an order of magnitude better performance by tailoring the column layout design to the data and query workload. Our approach navigates the design space of the physical layout: it organizes each column’s data by determining the number of partitions, their corresponding sizes and ranges, and the amount of buffer space and how it is allocated. We frame these design decisions as an optimization problem that, given workload knowledge and performance requirements, provides an optimal physical layout for the workload at hand. To evaluate this work, we build an in-memory storage engine, Casper, and show that it outperforms state-of-the-art data layouts of analytical systems for hybrid workloads. Casper delivers up to 2.32x higher throughput for update-intensive workloads and up to 2.14x higher throughput for hybrid workloads. We further show how to make data layout decisions robust to workload variation by carefully selecting the input of the optimization.
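
The sketch below shows the flavor of framing layout design as optimization: a toy cost model scores a partitioned column under a given read/update mix, and a search picks the partition count minimizing expected cost. The cost formulas and constants are invented for illustration and are not Casper's model.

def expected_cost(n_rows, n_parts, read_frac, selectivity, probe_cost=0.05):
    """Toy per-operation cost for a range-partitioned column."""
    part_size = n_rows / n_parts
    read_cost = max(selectivity * n_rows, part_size)  # rows a range scan touches
    update_cost = part_size / 2                       # shift half a partition
    return (read_frac * read_cost
            + (1 - read_frac) * update_cost
            + probe_cost * n_parts)                   # per-partition metadata

def best_layout(n_rows, read_frac, selectivity, max_parts=4096):
    return min(range(1, max_parts + 1),
               key=lambda p: expected_cost(n_rows, p, read_frac, selectivity))

# The optimal partition count shifts as the workload moves from read-mostly
# to update-heavy, the core observation behind workload-tailored layouts.
for read_frac in (0.95, 0.50, 0.05):
    p = best_layout(n_rows=1_000_000, read_frac=read_frac, selectivity=0.001)
    print(f"read fraction {read_frac}: {p} partitions")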
  5. Performance variability has been acknowledged as a problem for over a decade by cloud practitioners and performance engineers. Yet, our survey of top systems conferences reveals that the research community regularly disregards variability when running experiments in the cloud. Focusing on networks, we assess the impact of variability on cloud-based big-data workloads by gathering traces from mainstream commercial clouds and private research clouds. Our data collection consists of millions of datapoints gathered while transferring over 9 petabytes of data. We characterize the network variability present in our data and show that, even though commercial cloud providers implement mechanisms for quality-of-service enforcement, variability still occurs, and is even exacerbated by such mechanisms and service provider policies. We show how big-data workloads suffer from significant slowdowns and lack predictability and replicability, even when state-of-the-art experimentation techniques are used. We provide guidelines for practitioners to reduce the volatility of big-data performance, making experiments more repeatable.
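
As a small illustration of how such variability can be quantified, the sketch below computes the coefficient of variation and a bootstrap confidence interval for the median over repeated throughput samples; the data is synthetic and the method is a generic statistical technique, not the paper's methodology.

import random
import statistics

random.seed(7)
# Synthetic stand-in: a mostly stable link with occasional throttled transfers.
samples = [random.gauss(940, 30) if random.random() > 0.1
           else random.uniform(200, 500) for _ in range(100)]  # Mbit/s

mean, stdev = statistics.mean(samples), statistics.stdev(samples)
print(f"coefficient of variation = {stdev / mean:.1%}"
      "  (high CV -> report medians, not means)")

# Bootstrap CI for the median: resample with replacement, collect medians.
medians = sorted(statistics.median(random.choices(samples, k=len(samples)))
                 for _ in range(1000))
print(f"median = {statistics.median(samples):.0f} Mbit/s, "
      f"95% CI = [{medians[25]:.0f}, {medians[974]:.0f}]")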