Title: Who's Afraid of Uncorrectable Bit Errors? Online Recovery of Flash Errors with Distributed Redundancy
Due to its high performance and decreasing cost per bit, flash storage is the main storage medium in datacenters for hot data. However, flash endurance is a perpetual problem, and due to technology trends, subsequent generations of flash devices exhibit progressively shorter lifetimes before they experience uncorrectable bit errors. In this paper, we propose addressing the flash lifetime problem by allowing devices to expose higher bit error rates. We present DIRECT, a set of techniques that harnesses distributed-level redundancy to enable the adoption of new generations of denser and less reliable flash storage technologies. DIRECT does so by using an end-to-end approach to increase the reliability of distributed storage systems. We implemented DIRECT on two real-world storage systems: ZippyDB, a distributed key-value store in production at Facebook and backed by RocksDB, and HDFS, a distributed file system. When tested on production traces at Facebook, DIRECT reduces application-visible error rates in ZippyDB by more than 100x and recovery time by more than 10,000x. DIRECT also allows HDFS to tolerate a 10,000–100,000x higher bit error rate without experiencing application-visible errors. By significantly increasing the availability and durability of distributed storage systems in the face of bit errors, DIRECT helps extend flash lifetimes.
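The abstract does not spell out DIRECT's recovery mechanics, so the sketch below is only a minimal illustration of the general end-to-end idea rather than the paper's implementation: each read is verified against an application-level checksum, and a chunk that fails verification is re-fetched from a remote replica and healed in place instead of failing the whole block. The chunk size, CRC32 choice, and the `read_chunk`/`write_chunk` interface are all hypothetical.

```python
import zlib

CHUNK = 4096  # hypothetical verification granularity in bytes


def read_with_recovery(local, replicas, block_id, checksums):
    """Read a block chunk by chunk, repairing corrupt chunks from replicas.

    `local` and each entry of `replicas` are assumed to expose
    read_chunk(block_id, offset, length) -> bytes and
    write_chunk(block_id, offset, data); `checksums` holds the expected
    CRC32 of every chunk, kept out of band by the application.
    """
    data = bytearray()
    for idx, expected in enumerate(checksums):
        offset = idx * CHUNK
        chunk = local.read_chunk(block_id, offset, CHUNK)
        if zlib.crc32(chunk) != expected:
            # An uncorrectable bit error surfaced: recover only the damaged
            # chunk from another replica rather than discarding the block.
            for replica in replicas:
                candidate = replica.read_chunk(block_id, offset, CHUNK)
                if zlib.crc32(candidate) == expected:
                    local.write_chunk(block_id, offset, candidate)  # heal in place
                    chunk = candidate
                    break
            else:
                raise IOError(f"chunk {idx} of block {block_id} is unrecoverable")
        data += chunk
    return bytes(data)
```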
Award ID(s):
1763546
NSF-PAR ID:
10127754
Author(s) / Creator(s):
Date Published:
Journal Name:
USENIX Annual Technical Conference
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Obeid, I. ; Selesnik, I. ; Picone, J. (Ed.)
    The Neuronix high-performance computing cluster allows us to conduct extensive machine learning experiments on big data [1]. This heterogeneous cluster uses innovative scheduling technology, Slurm [2], that manages a network of CPUs and graphics processing units (GPUs). The GPU farm consists of a variety of processors ranging from low-end consumer grade devices such as the Nvidia GTX 970 to higher-end devices such as the GeForce RTX 2080. These GPUs are essential to our research since they allow extremely compute-intensive deep learning tasks to be executed on massive data resources such as the TUH EEG Corpus [2]. We use TensorFlow [3] as the core machine learning library for our deep learning systems, and routinely employ multiple GPUs to accelerate the training process. Reproducible results are essential to machine learning research. Reproducibility in this context means the ability to replicate an existing experiment – performance metrics such as error rates should be identical and floating-point calculations should match closely. Three examples of ways we typically expect an experiment to be replicable are: (1) The same job run on the same processor should produce the same results each time it is run. (2) A job run on a CPU and GPU should produce identical results. (3) A job should produce comparable results if the data is presented in a different order. System optimization requires an ability to directly compare error rates for algorithms evaluated under comparable operating conditions. However, it is a difficult task to exactly reproduce the results for large, complex deep learning systems that often require more than a trillion calculations per experiment [5]. This is a fairly well-known issue and one we will explore in this poster. Researchers must be able to replicate results on a specific data set to establish the integrity of an implementation. They can then use that implementation as a baseline for comparison purposes. A lack of reproducibility makes it very difficult to debug algorithms and validate changes to the system. Equally important, since many results in deep learning research are dependent on the order in which the system is exposed to the data, the specific processors used, and even the order in which those processors are accessed, it becomes a challenging problem to compare two algorithms since each system must be individually optimized for a specific data set or processor. This is extremely time-consuming for algorithm research in which a single run often taxes a computing environment to its limits. Well-known techniques such as cross-validation [5,6] can be used to mitigate these effects, but this is also computationally expensive. These issues are further compounded by the fact that most deep learning algorithms are susceptible to the way computational noise propagates through the system. GPUs are particularly notorious for this because, in a clustered environment, it becomes more difficult to control which processors are used at various points in time. Another equally frustrating issue is that upgrades to the deep learning package, such as the transition from TensorFlow v1.9 to v1.13, can also result in large fluctuations in error rates when re-running the same experiment. Since TensorFlow is constantly updating functions to support GPU use, maintaining an historical archive of experimental results that can be used to calibrate algorithm research is quite a challenge. This makes it very difficult to optimize the system or select the best configurations. 
The overall impact of all of these issues described above is significant, as error rates can fluctuate by as much as 25% due to these types of computational issues. Cross-validation is one technique used to mitigate this, but it is expensive since it requires multiple runs over the data, which further taxes a computing infrastructure already running at max capacity. GPUs are preferred when training a large network since these systems train at least two orders of magnitude faster than CPUs [7]. Large-scale experiments are simply not feasible without using GPUs. However, there is a tradeoff to gain this performance. Since all our GPUs use the NVIDIA CUDA® Deep Neural Network library (cuDNN) [8], a GPU-accelerated library of primitives for deep neural networks, this library adds an element of randomness into the experiment. When a GPU is used to train a network in TensorFlow, it automatically searches for a cuDNN implementation. NVIDIA’s cuDNN implementation provides algorithms that increase performance and help the model train more quickly, but they are non-deterministic algorithms [9,10]. Since our networks have many complex layers, there is no easy way to avoid this randomness. Instead of comparing each epoch, we compare the average performance of the experiment because it gives us an indication of how our model is performing per experiment and whether the changes we make are effective. In this poster, we will discuss a variety of issues related to reproducibility and introduce ways we mitigate these effects. For example, TensorFlow uses a random number generator (RNG) which is not seeded by default. TensorFlow determines the initialization point and how certain functions execute using the RNG. The solution for this is seeding all the necessary components before training the model. This forces TensorFlow to use the same initialization point and sets how certain layers work (e.g., dropout layers). However, seeding all the RNGs will not guarantee a controlled experiment. Other variables can affect the outcome of the experiment, such as training using GPUs, allowing multi-threading on CPUs, using certain layers, etc. To mitigate our problems with reproducibility, we first make sure that the data is processed in the same order during training. Therefore, we save the data order from the last experiment and make sure the newer experiment follows the same order. If we allow the data to be shuffled, it can affect the performance due to how the model was exposed to the data. We also specify the float data type to be 32-bit since Python defaults to 64-bit. We try to avoid using 64-bit precision because the numbers produced by a GPU can vary significantly depending on the GPU architecture [11-13]. Controlling precision somewhat reduces differences due to computational noise even though technically it increases the amount of computational noise. We are currently developing more advanced techniques for preserving the efficiency of our training process while also maintaining the ability to reproduce models. In our poster presentation we will demonstrate these issues using some novel visualization tools, present several examples of the extent to which these issues influence research results on electroencephalography (EEG) and digital pathology experiments, and introduce new ways to manage such computational issues.
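A minimal sketch of the mitigations described above, assuming TensorFlow 2.x; the seed value, file name, and toy stand-in arrays are hypothetical, and a real run would load the TUH EEG features instead.

```python
import os
import random

import numpy as np
import tensorflow as tf

SEED = 1337  # hypothetical seed value

# Seed every source of randomness the training pipeline touches.
os.environ["PYTHONHASHSEED"] = str(SEED)  # fully effective only if set before interpreter start
random.seed(SEED)
np.random.seed(SEED)
tf.random.set_seed(SEED)  # TF 2.x API; TF 1.x used tf.set_random_seed

# Keep computation in 32-bit floats rather than Python's 64-bit default.
tf.keras.backend.set_floatx("float32")

# Toy stand-in data for illustration only.
features = np.random.randn(1000, 64).astype(np.float32)
labels = np.random.randint(0, 2, size=1000)

# Record the presentation order once, then replay it so every rerun sees
# the examples in exactly the same sequence (no implicit reshuffling).
order_file = "train_order.npy"
if not os.path.exists(order_file):
    np.save(order_file, np.random.permutation(len(features)))
order = np.load(order_file)

dataset = tf.data.Dataset.from_tensor_slices(
    (features[order], labels[order])
).batch(32)
```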
  2. Flash memory devices are winning the competition for storage density against magnetic recording devices. This outcome results from advances in physics that allow storage of more than one bit per cell, coupled with advances in signal processing that reduce the effect of physical instabilities. Constrained codes are used in storage to avoid problematic patterns. Recently, we introduced binary symmetric lexicographically-ordered constrained codes (LOCO codes) for data storage and transmission. This paper introduces simple constrained codes that support non-binary physical gates in multi-, triple-, quad-, and the currently-in-development penta-level cell (M/T/Q/P-LC) Flash memories. The new codes can be easily modified if problematic patterns change with time. These codes are designed to mitigate inter-cell interference, which is a critical source of error in Flash devices. The new codes are called q-ary asymmetric LOCO codes (QA-LOCO codes), and the construction subsumes codes previously designed for single-level cell (SLC) Flash devices (A-LOCO codes). QA-LOCO codes work for a Flash device with any number, q, of levels per cell. For q ≥ 4, we show that QA-LOCO codes can achieve rates greater than 0.95 log₂ q information bits per coded symbol. Capacity-achieving rates, affordable encoding-decoding complexity, and ease of reconfigurability support the growing improvement of M/T/Q/P-LC Flash memory devices, as well as lifecycle management as the characteristics of these devices change with time.
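As a quick back-of-the-envelope check (not taken from the paper), an unconstrained q-ary symbol carries log₂ q bits, so the quoted QA-LOCO rates correspond to at most a 5% rate loss at every cell density:

```python
import math

# Rate implied by the abstract's 0.95 * log2(q) bits-per-symbol figure.
for q, name in [(4, "MLC"), (8, "TLC"), (16, "QLC"), (32, "PLC")]:
    uncoded = math.log2(q)      # bits an unconstrained q-ary symbol carries
    qa_loco = 0.95 * uncoded    # lower bound quoted for QA-LOCO codes
    print(f"{name} (q={q}): > {qa_loco:.2f} of {uncoded:.2f} bits/symbol")
```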
  3. Modern solid-state disks achieve high data transfer rates due to their massive internal parallelism. However, out-of-place updates for flash memory incur garbage collection costs when valid data needs to be copied during space reclamation. The root cause of this extra cost is that solid-state disks are not always able to accurately determine data lifetime and group together data that expires before the space needs to be reclaimed. Real-time systems found in autonomous vehicles, industrial control systems, and assembly-line robots store data from hundreds of sensors and often have predictable data lifetimes. These systems require guaranteed high storage bandwidth for read and write operations by mission-critical real-time tasks. In this article, we depart from the traditional block device interface to guarantee the high throughput needed to process large volumes of data. Using data lifetime information from the application layer, our proposed real-time design, called Telomere, is able to intelligently lay out data in NAND flash memory and eliminate valid page copies during garbage collection. Telomere’s real-time admission control is able to guarantee tasks their required read and write operations within their periods. Under randomly generated tasksets containing 500 tasks, Telomere achieves 30% higher throughput with a 5% storage cost compared to pre-existing techniques.
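As a simplified illustration of the lifetime-grouping idea (not Telomere's actual layout or admission-control algorithm), the sketch below buckets writes by their declared expiry time so that every page in a physical block expires together and space can be reclaimed without copying live data; the flash geometry, bucket width, and `flash_write` callback are hypothetical.

```python
from collections import defaultdict

PAGES_PER_BLOCK = 256          # hypothetical flash geometry
LIFETIME_BUCKET_SECONDS = 60   # hypothetical grouping granularity


class LifetimeGroupedLog:
    """Buffers writes so that pages with similar expiry share a flash block.

    Because every page in a block expires in the same window, the whole block
    becomes invalid at once and can be erased without copying valid pages.
    """

    def __init__(self, flash_write):
        # flash_write(block_key, pages) is an assumed callback that programs
        # one physical flash block with the given list of pages.
        self.flash_write = flash_write
        self.open_blocks = defaultdict(list)

    def write(self, payload, expires_at):
        bucket = int(expires_at // LIFETIME_BUCKET_SECONDS)
        pages = self.open_blocks[bucket]
        pages.append(payload)
        if len(pages) == PAGES_PER_BLOCK:
            self.flash_write(bucket, pages)   # flush a full same-lifetime block
            self.open_blocks[bucket] = []
```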
  4. Agashe, Deepa (Ed.)
    Because errors at the DNA level power pathogen evolution, a systematic understanding of the rate and molecular spectra of mutations could guide the avoidance and treatment of infectious diseases. We thus accumulated tens of thousands of spontaneous mutations in 768 repeatedly bottlenecked lineages of 18 strains from various geographical sites, temporal spread, and genetic backgrounds. Spanning over ∼1.36 million generations, the resultant data yield an average mutation rate of ∼0.0005 per genome per generation, with significant within-species variation. This is one of the lowest bacterial mutation rates reported, giving direct support for high genome stability in this pathogen resulting from high DNA-mismatch-repair efficiency and replication-machinery fidelity. Pathogenicity genes do not exhibit an accelerated mutation rate, and thus, elevated mutation rates may not be the major determinant for the diversification of toxin and secretion systems. Intriguingly, a low error rate at the transcript level is not observed, suggesting distinct fidelity of the replication and transcription machinery. This study urges more attention to the most basic evolutionary processes of even the best-known human pathogens and deepens the understanding of their genome evolution.
  5. In data storage and data transmission, certain patterns are more likely to be subject to error when written (transmitted) onto the media. In magnetic recording systems with binary data and bipolar non-return-to-zero signaling, patterns that have insufficient separation between consecutive transitions exacerbate inter-symbol interference. Constrained codes are used to eliminate such error-prone patterns. A recent example is a new family of capacity-achieving constrained codes, named lexicographically-ordered constrained codes (LOCO codes). LOCO codes are symmetric, that is, the set of forbidden patterns is closed under taking pattern complements. LOCO codes are suboptimal in terms of rate when used in Flash devices where block erasure is employed, since the complement of an error-prone pattern is not detrimental in these devices. This paper introduces asymmetric LOCO codes (A-LOCO codes), which are lexicographically-ordered constrained codes that forbid only those patterns that are detrimental for Flash performance. A-LOCO codes are also capacity-achieving, and at finite lengths, they offer higher rates than the available state-of-the-art constrained codes designed for the same goal. The mapping-demapping between the index and the codeword in A-LOCO codes allows low-complexity encoding and decoding algorithms that are simpler than their LOCO counterparts.