Title: Lifespan and Failures of SSDs and HDDs: Similarities, Differences, and Prediction Models
Data center downtime typically centers around IT equipment failure. Storage devices are the most frequently failing components in data centers. We present a comparative study of hard disk drives (HDDs) and solid state drives (SSDs) that constitute the typical storage in data centers. Using six-year field data of 100,000 HDDs of different models from the same manufacturer from the Backblaze dataset and six-year field data of 30,000 SSDs of three models from a Google data center, we characterize the workload conditions that lead to failures. We illustrate that their root failure causes differ from common expectations and that they remain difficult to discern. For the case of HDDs we observe that young and old drives do not present many differences in their failures. Instead, failures may be distinguished by discriminating drives based on the time spent on head positioning. For SSDs, we observe high levels of infant mortality and characterize the differences between infant and non-infant failures. We develop several machine learning failure prediction models that are shown to be surprisingly accurate, achieving high recall and low false positive rates. These models are used beyond simple prediction, as they help us untangle the complex interaction of workload characteristics that lead to failures and identify failure root causes from monitored symptoms.
Award ID(s):
1838022
PAR ID:
10486306
Publisher / Repository:
IEEE Xplore
Journal Name:
IEEE Transactions on Dependable and Secure Computing
Volume:
20
Issue:
1
ISSN:
1545-5971
Page Range / eLocation ID:
256 to 272
Subject(s) / Keyword(s):
Supervised learning, classification, data centers, storage devices, SSD, HDD
Sponsoring Org:
National Science Foundation
More Like this
  1. In the US alone, data centers consumed around 200 TWh of electricity (about $20 billion) in 2016, and this amount doubles every five years. Data storage alone is estimated to be responsible for about 25% to 35% of data-center power consumption. Servers in data centers generally include multiple HDDs or SSDs, commonly arranged in a RAID level for better performance, reliability, and availability. In this study, we evaluate the impact of HDD- and SSD-based Linux (md) software RAIDs on the energy consumption of popular servers. We used the Filebench workload generator to emulate three common server workloads: web, file, and mail, and measured the energy consumption of the system using the HOBO power meter. We observed some similarities and some differences in the energy consumption characteristics of HDD and SSD RAIDs, and provided our insights for better energy efficiency. We hope that our observations will shed light on new energy-efficient RAID designs tailored to the specific energy consumption characteristics of HDD and SSD RAIDs.
  2. Escalating application demand and the end of Dennard scaling have put energy management at the center of cloud operations. Because of the huge cost and long lead time of provisioning new data centers, operators want to squeeze as much use out of existing data centers as possible, often limited by power provisioning fixed at the time of construction. Workload demand spikes and the inherent variability of renewable energy, as well as increased power unreliability from extreme weather events and natural disasters, make the data center power management problem even more challenging. We believe it is time to build a power control plane to provide fine-grained observability and control over data center power to operators. Our goal is to help make data centers substantially more elastic with respect to dynamic changes in energy sources and application needs, while still providing good performance to applications. There are many use cases for cloud power control, including increased power oversubscription and use of green energy, resilience to power failures, large-scale power demand response, and improved energy efficiency. 
  3. Large-scale high-performance computing systems frequently experience a wide range of failure modes, such as reliability failures (e.g., hang or crash) and resource overload-related failures (e.g., congestion collapse), impacting systems and applications. Despite the adverse effects of these failures, current systems do not provide methodologies for proactively detecting, localizing, and diagnosing failures. We present Kaleidoscope, a near real-time failure detection and diagnosis framework consisting of hierarchical domain-guided machine learning models that identify the failing components and the corresponding failure mode, and point to the most likely cause indicative of the failure in near real-time (within one minute of failure occurrence). Kaleidoscope has been deployed on the Blue Waters supercomputer and evaluated with more than two years of production telemetry data. Our evaluation shows that Kaleidoscope successfully localized 99.3% and pinpointed the root causes of 95.8% of 843 real-world production issues, with less than 0.01% runtime overhead.
  4. Field programmable gate arrays (FPGAs) are used in large numbers in data centers around the world. They are used for cloud computing and computer networking. The most common type of FPGA used in data centers is the re-programmable SRAM-based FPGA. These devices offer potential performance and power consumption savings. A single device also carries a small susceptibility to radiation-induced soft errors, which can lead to unexpected behavior. This article examines the impact of terrestrial radiation on FPGAs in data centers. Results from artificial fault injection and accelerated radiation testing on several data-center-like FPGA applications are compared. A new fault injection scheme provides results that are more similar to radiation testing. Silent data corruption (SDC) is the most commonly observed failure mode, followed by FPGA unavailable and host unresponsive. A hypothetical deployment of 100,000 FPGAs in Denver, Colorado, will experience upsets in configuration memory every half-hour on average and SDC failures every 0.5–11 days on average.
  5. This paper introduces sharable backup as a novel solution to failure recovery in data center networks. It allows the entire network to share a small pool of backup devices. This proposal is grounded in three key observations. First, the traditional rerouting-based failure recovery is ineffective, because bandwidth loss from failures degrades application performance drastically. Therefore, failed devices should be replaced to restore bandwidth. Second, failures in data centers are rare but destructive [11], so it is desirable to seek cost-effective backup options. Third, the emergence of configurable data center network architectures promises feasibility of bringing backup devices online dynamically. We design the ShareBackup prototype architecture to realize this idea. Compared to rerouting-based solutions, ShareBackup provides more bandwidth with short path length at low cost. 