Title: Mapping Datasets to Object Storage System
Access libraries such as ROOT[1] and HDF5[2] allow users to interact with datasets using high-level abstractions, like coordinate systems and associated slicing operations. Unfortunately, the implementations of access libraries are based on outdated assumptions about storage system interfaces and are generally unable to fully benefit from modern fast storage devices. For example, access libraries often implement buffering and data layouts that assume that large, single-threaded sequential access patterns cause less overall latency than small parallel random accesses: while this is true for spinning media, it is not true for flash media. The situation is getting worse with rapidly evolving storage devices such as non-volatile memory and ever larger datasets. This project explores distributed dataset mapping infrastructures that can integrate and scale out existing access libraries using Ceph’s extensible object model, avoiding re-implementation or even modification of these access libraries as much as possible. These programmable storage extensions coupled with our distributed dataset mapping techniques enable: 1) access library operations to be offloaded to storage system servers, 2) the independent evolution of access libraries and storage systems, and 3) full use of the existing load balancing, elasticity, and failure management of distributed storage systems like Ceph. They also create more opportunities to conduct server-local optimizations specific to storage servers. For example, storage servers might include local key/value stores combined with chunk stores that require different optimizations than a local file system. As storage servers evolve to support new storage devices like non-volatile memory, these server-local optimizations can be implemented while minimizing disruptions to applications.
We will report progress on the means by which distributed dataset mapping can be abstracted over particular access libraries, including access libraries for ROOT data, and how we address some of the challenges revolving around data partitioning and composability of access operations.
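As a rough illustration of the data partitioning problem mentioned above, the sketch below maps a slicing operation on a chunked dataset to the names of the objects that would hold the affected chunks. It is a minimal sketch under stated assumptions, not the project's actual mapping: the function name `chunk_object_names`, the fixed chunk shape, and the `<dataset>.chunk.<i>.<j>` naming scheme are all hypothetical, and no Ceph or access-library calls are made.

```python
from itertools import product


def chunk_object_names(dataset, chunk_shape, selection):
    """Return hypothetical object names for every chunk a selection touches.

    dataset     -- dataset name used as the object-name prefix (assumed scheme)
    chunk_shape -- chunk extent per dimension, e.g. (100, 50)
    selection   -- one (start, stop) pair per dimension, stop exclusive
    """
    per_dim = []
    for (start, stop), size in zip(selection, chunk_shape):
        if stop <= start:
            return []  # an empty selection touches no chunks
        first = start // size        # index of the first chunk overlapped
        last = (stop - 1) // size    # index of the last chunk overlapped
        per_dim.append(range(first, last + 1))
    # The Cartesian product of per-dimension chunk indices gives every chunk
    # overlapped by the selection.
    return [
        f"{dataset}.chunk." + ".".join(str(i) for i in idx)
        for idx in product(*per_dim)
    ]


names = chunk_object_names("events", (100, 50), [(150, 250), (0, 50)])
print(names)
```

Once a selection has been resolved to a set of object names in this way, each object could in principle be read in parallel or have part of the access operation applied where it is stored, which is the kind of offloading the abstract describes.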
Award ID(s):
1764102 1705021
Publication Date:
Journal Name:
EPJ Web of Conferences
Sponsoring Org:
National Science Foundation
More Like this
  1. Many applications are increasingly becoming I/O-bound. To improve scalability, analytical models of parallel I/O performance are often consulted to determine possible I/O optimizations. However, I/O performance modeling has predominantly focused on applications that directly issue I/O requests to a parallel file system or a local storage device. These I/O models are not directly usable by applications that access data through standardized I/O libraries, such as HDF5, FITS, and NetCDF, because a single I/O request to an object can trigger a cascade of I/O operations to different storage blocks. The I/O performance characteristics of applications that rely on these libraries are a complex function of the underlying data storage model, user-configurable parameters, and object-level access patterns. As a consequence, I/O optimization is predominantly an ad hoc process performed by application developers, who are often domain scientists with limited desire to delve into the nuances of the storage hierarchy of modern computers. This paper presents an analytical cost model to predict the end-to-end execution time of applications that perform I/O through established array management libraries. The paper focuses on the HDF5 and Zarr array libraries as examples of I/O libraries with radically different storage models: HDF5 stores every object in one file, while Zarr creates multiple files to store different objects. We find that accessing array objects via these I/O libraries introduces new overheads and optimizations. Specifically, in addition to I/O time, it is crucial to model the cost of transforming data to a particular storage layout (memory copy cost), as well as the benefit of accessing a software cache. We evaluate the model on real applications that process observations (neuroscience) and simulation results (plasma physics).
The evaluation on three HPC clusters reveals that I/O accounts for as little as 10% of the execution time in some cases, and hence models that only focus on I/O performance cannot accurately capture the performance of applications that use standard array storage libraries. In parallel experiments, our model correctly predicts the fastest storage library between HDF5 and Zarr 94% of the time, in contrast with 70% of the time for a cutting-edge I/O model.
  2. The Skyhook Data Management project at the Center for Research in Open Source Software at UC Santa Cruz implements customized extensions through Ceph's object class interface that enable offloading database operations to the storage system. In our previous Vault '19 talk, we showed how SkyhookDM can transparently scale out databases. The SkyhookDM Ceph extensions are an example of our 'programmable storage' research efforts at UCSC, and can be accessed through commonly available external/foreign table database interfaces. Utilizing fast in-memory serialization libraries such as Google Flatbuffers and Apache Arrow, SkyhookDM currently implements common database functions such as SELECT, PROJECT, AGGREGATE, and indexing inside Ceph, along with lower-level data manipulations such as transforming data from row to column formats on RADOS servers. In this talk, we will present three of our latest developments on the SkyhookDM project since Vault '19. First, SkyhookDM can also be used to offload operations of access libraries that support plugins for backends, such as HDF5 and its Virtual Object Layer. Second, in addition to the row-oriented data format using Google's Flatbuffers, we have added support for column-oriented data formats using the Apache Arrow library within our Ceph extensions. Third, we added dynamic switching between row and column data formats within Ceph objects, a first step towards physical design management in storage systems, similar to physical design tuning in database systems.
  3. Computer systems utilizing byte-addressable Non-Volatile Memory (NVM) as memory/storage can provide low-latency data persistence. The widely used key-value stores based on the Log-Structured Merge Tree (LSM-Tree) are still beneficial for NVM systems in terms of space and write efficiency. However, the significant write amplification introduced by the leveled compaction of LSM-Tree degrades the write performance of the key-value store and shortens the lifetime of the NVM devices. Existing studies propose new compaction methods to reduce write amplification. Unfortunately, they result in a relatively large read amplification. In this article, we propose NVLSM, a key-value store for NVM systems using LSM-Tree with a new accumulative compaction. By fully utilizing the byte-addressability of NVM, accumulative compaction uses pointers to accumulate data into multiple floors in a logically sorted run to reduce the number of compactions required. We also propose a cascading search scheme for reads among the multiple floors to reduce read amplification. As a result, NVLSM reduces write amplification with only small increases in read amplification. We compare NVLSM with key-value stores using LSM-Tree with two other compaction methods: leveled compaction and fragmented compaction. Our evaluations show that NVLSM reduces write amplification by up to 67% compared with LSM-Tree using leveled compaction without significantly increasing the read amplification. In write-intensive workloads, NVLSM reduces the average latency by 15.73%–41.2% compared to other key-value stores.
  4. Replication is essential for fault-tolerance. However, in in-memory systems, it is a source of high overhead. Remote direct memory access (RDMA) is attractive for creating redundant copies of data, since it is low-latency and has no CPU overhead at the target. However, existing approaches still result in redundant data copying and active receivers. To ensure atomic data transfers, receivers check and apply only fully received messages. Tailwind is a zero-copy recovery-log replication protocol for scale-out in-memory databases. Tailwind is the first replication protocol that eliminates all CPU-driven data copying and fully bypasses target server CPUs, thus leaving backups idle. Tailwind ensures all writes are atomic by leveraging a protocol that detects incomplete RDMA transfers. Tailwind substantially improves replication throughput and response latency compared with conventional RPC-based replication. In symmetric systems where servers both serve requests and act as replicas, Tailwind also improves normal-case throughput by freeing server CPU resources for request processing. We implemented and evaluated Tailwind on RAMCloud, a low-latency in-memory storage system. Experiments show Tailwind improves RAMCloud's normal-case request processing throughput by 1.7x. It also cuts median and 99th-percentile write latencies by 2x and 3x, respectively.
  5. The traditional von Neumann architecture limits the increase in computing efficiency and results in massive power consumption in modern computers due to the separation of storage and processing units. The novel neuromorphic computation system, an in-memory computing architecture with low power consumption, aims to break this bottleneck and meet the needs of the next generation of artificial intelligence (AI) systems. Thus, it is urgent to find a memory technology to implement the neuromorphic computing nanosystem. Nowadays, silicon-based flash memory dominates the non-volatile memory market; however, it faces challenges in meeting the requirements of future data storage device development due to drawbacks such as scaling issues, relatively slow operation speed, and high voltage for program/erase operations. The emerging resistive random-access memory (RRAM) has prompted extensive research owing to its simple two-terminal structure, comprising a top electrode (TE) layer, a bottom electrode (BE) layer, and an intermediate resistive switching (RS) layer. It can utilize a temporary and reversible dielectric breakdown to cause the RS phenomenon between the high resistance state (HRS) and the low resistance state (LRS). RRAM is expected to outperform conventional memory devices, notably through its low-voltage operation, short programming time, great cyclic stability, and good scalability. Among the materials for the RS layer, indium gallium zinc oxide (IGZO) has shown attractive prospects owing to its abundance, the high diffusivity of its oxygen atoms, and its transparency. Additionally, its electrical properties can be easily modulated by controlling the stoichiometric ratio of indium and gallium as well as the oxygen potential in the sputter gas. Moreover, since IGZO can be applied to both the thin-film transistor (TFT) channel and the RS layer, it has great potential for fully integrated transparent electronics applications.
In this work, we proposed amorphous transparent IGZO-based RRAMs and investigated the switching behaviors of memory cells prepared with different top electrodes. First, ITO was chosen to serve as both TE and BE to achieve high transmittance. A multi-target magnetron sputtering system was employed to deposit all three layers (TE, RS, and BE) on a glass substrate. I-V characteristics were evaluated with a semiconductor parameter analyzer, and the bipolar RS feature of our RRAM devices was demonstrated by typical butterfly curves. The optical transmission analysis was carried out with a UV-Vis spectrometer, and the average transmittance of entire devices was around 80% in the visible-light wavelength range, implying high transparency. We adjusted the oxygen partial pressure during the sputtering of IGZO to optimize its properties, because the oxygen vacancy concentration governs the RS performance. Electrode selection is crucial and can impact the performance of the whole device. Thus, a Cu TE was chosen for our second type of device because the diffusion of Cu ions can be beneficial for the formation of the conductive filament (CF). A ~5 nm SiO2 barrier layer was employed between the TE and RS layers to confine the diffusion of Cu into the RS layer. At the same time, this inserted SiO2 layer can provide an additional interfacial series resistance in the device to lower the off current and consequently improve the on/off ratio and overall performance. Finally, Ti, a metal with high oxygen affinity, was selected as the TE for our third type of device because the concentration of oxygen atoms can be shifted towards the Ti electrode, which provides oxygen-gettering activity near the Ti metal. This process may in turn lead to the formation of a sub-stoichiometric region in the neighboring oxide that is believed to be the origin of better performance. In conclusion, transparent amorphous IGZO-based RRAMs were established.
To tune the properties of the RS layer, the sputtering conditions of the RS layer were varied. To investigate the influence of TE selection on the switching performance of RRAMs, we integrated a set of TE materials and a barrier layer on IGZO-based RRAMs and compared the switching characteristics. Our encouraging results clearly demonstrate that IGZO is a promising material for RRAM applications and for breaking the bottleneck of current memory technologies.