The increasing reliance on robust data-driven decision-making across many domains has made it necessary for data management systems to manage many thousands to millions of versions of datasets, acquired or constructed at various stages of analysis pipelines over time. Delta encoding is an effective and widely used solution for compactly storing a large number of datasets: it exploits redundancies across them while keeping the average cost of reconstructing any dataset low. However, supporting any kind of rich retrieval or querying functionality, beyond single-dataset checkout, is challenging in such storage engines. In this paper, we initiate a systematic study of this problem and present DEX, a novel stand-alone delta-oriented execution engine whose goal is to take advantage of the already computed deltas between the datasets for efficient query processing. We study how to execute checkout, intersection, union, and t-threshold queries over record-based files, and we show that processing even these basic queries leads to many new and unexplored challenges and trade-offs. Starting from a query plan that confines query execution to a small set of deltas, we introduce new transformation rules, based on the algebraic properties of the deltas, that allow us to explore the search space of alternative plans. For checkout, we present a dynamic programming algorithm that efficiently selects the optimal query plan under our cost model; for the other queries, we design efficient heuristics that select effective plans which vastly outperform the baseline checkout-then-query approach. A key characteristic of our query execution methods is that the computational cost depends primarily on the size and number of deltas in the expression (typically small), and not on the input dataset versions (which can be very large). We have implemented a DEX prototype on top of git, a widely used version control system. We present an extensive experimental evaluation on synthetic data with diverse characteristics, which shows that our methods perform exceedingly well compared to the baseline.
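To make the idea of answering queries from deltas concrete, here is a minimal Python sketch. It is not DEX's implementation: it assumes each version is a set of records and each delta is a pair of added and removed record sets, and the names `Delta`, `diff`, `apply_delta`, `checkout`, `intersect_via_delta`, and `union_via_delta` are hypothetical. It illustrates only the kind of algebraic identity the abstract alludes to, not the cost model or the plan search.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Delta:
    """Hypothetical delta between two record sets: what to add and what to remove."""
    added: frozenset
    removed: frozenset


def diff(src: set, dst: set) -> Delta:
    """Compute the delta that turns src into dst."""
    return Delta(added=frozenset(dst - src), removed=frozenset(src - dst))


def apply_delta(version: set, delta: Delta) -> set:
    """Reconstruct the target version from a source version and a delta."""
    return (version - delta.removed) | delta.added


def checkout(base: set, deltas: list) -> set:
    """Check out a version by applying a chain of deltas to a materialized base."""
    version = set(base)
    for delta in deltas:
        version = apply_delta(version, delta)
    return version


def intersect_via_delta(src: set, delta: Delta) -> set:
    """src ∩ dst, where dst = apply_delta(src, delta), without materializing dst.

    Assumes a consistent delta (removed ⊆ src, added ∩ src = ∅), as produced by diff().
    """
    return src - delta.removed


def union_via_delta(src: set, delta: Delta) -> set:
    """src ∪ dst, where dst = apply_delta(src, delta), without materializing dst."""
    return src | delta.added


v1 = {"a", "b", "c"}
v2 = {"b", "c", "d"}
v3 = {"c", "d", "e"}
d12 = diff(v1, v2)
d23 = diff(v2, v3)
```

In this sketch the intersection and union of two adjacent versions need only one materialized version and the delta between them. The set operations here still scan `src`; a real engine would presumably avoid that, which this sketch does not attempt to show.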
RStore: A Distributed Multi-Version Document Store
We address the problem of compactly storing a large number of versions (snapshots) of a collection of keyed documents or records in a distributed environment, while efficiently answering a variety of retrieval queries over them, including retrieving full or partial versions and evolution histories for specific keys. We motivate the increasing need for such a system in a variety of application domains, carefully explore the design space for building it and the various storage-computation-retrieval trade-offs, and discuss how different storage layouts influence those trade-offs. We propose a novel system architecture that satisfies the key desiderata for such a system and offers simple tuning knobs that allow adapting to a specific data and query workload. Our system is intended to act as a layer on top of a distributed key-value store that houses the raw data as well as any indexes. We design novel offline storage layout algorithms for efficiently partitioning the data to minimize storage costs while keeping retrieval costs low. We also present an online algorithm to handle new versions being added to the system. Using extensive experiments on large datasets, we demonstrate that our system operates at the scale required in most practical scenarios and often outperforms standard baselines, including a delta-based storage engine, by orders of magnitude.
- Award ID(s): 1650755
- PAR ID: 10082797
- Date Published:
- Journal Name: 2018 IEEE 34th International Conference on Data Engineering (ICDE)
- Page Range / eLocation ID: 389 to 400
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
- Power is becoming a scarce resource for data centers, raising the need for power-adaptive system design, that is, the ability to dynamically change power consumption to match available power. Storage makes up an increasing fraction of total data center power consumption. As such, it holds great potential to contribute to data center power adaptivity. To this end, we conduct a measurement study of power control mechanisms on a variety of modern data center storage devices. By changing device power states and shaping IO, we achieve a power dynamic range of up to 59.4% of the device's maximum operating power. We also study power control trade-offs, including throughput and latency. Based on our observations, we construct storage device power-throughput models and discuss the implications for power-adaptive storage system design.
- The unique features of blockchains such as immutability, transparency, provenance, and authenticity have been used by many large-scale data management systems to deploy a wide range of distributed applications, including supply chain management, healthcare, and crowdworking in permissioned settings. Unlike permissionless settings, e.g., Bitcoin, where the network is public and anyone can participate without a specific identity, a permissioned blockchain system consists of a set of known, identified nodes that might not fully trust each other. While the characteristics of permissioned blockchains are appealing to a wide range of large-scale data management systems, these systems have to satisfy four main requirements: confidentiality, verifiability, performance, and scalability. Various approaches have been developed in industry and academia to satisfy these requirements with varying assumptions and costs. The focus of this tutorial is on presenting many of these techniques while highlighting the trade-offs among them. We demonstrate the practicality of such techniques in real life by presenting three different applications, i.e., supply chain management, large-scale databases, and multi-platform crowdworking environments, and show how those techniques can be utilized to meet the requirements of such applications.
- Green hydrogen, produced using renewables through electrolysis, can be used to reduce emissions in the hard-to-abate industrial sector. Efficient production and large-scale deployment require storage to mitigate electrolyzer degradation and ensure stable hydrogen supply. This paper explores the impacts and trade-offs of battery and hydrogen storage in off-grid wind-to-hydrogen systems, considering degradation of batteries and electrolyzers. Utilizing an optimization model, we examine system performance and costs over a wide range of storage capacities and wind profiles. Our results show that batteries smooth short-term fluctuations and minimize electrolyzer degradation but can experience significant degradation resulting from frequent charge/discharge cycles. Conversely, hydrogen storage provides long-term energy buffering, essential for sustained hydrogen production, but can increase electrolyzer cycling and degradation. Combining battery and hydrogen storage enhances system reliability, reduces component degradation, and reduces operational costs. This highlights the importance of strategic storage investments to improve the performance and costs of green hydrogen systems.
- Artificial Intelligence is poised to transform the design of complex, large-scale detectors like ePIC at the future Electron Ion Collider. Featuring a central detector with additional detecting systems in the far forward and far backward regions, the ePIC experiment incorporates numerous design parameters and objectives, including performance, physics reach, and cost, constrained by mechanical and geometric limits. This project aims to develop a scalable, distributed AI-assisted detector design for the EIC (AID(2)E), employing state-of-the-art multiobjective optimization to tackle complex designs. Supported by the ePIC software stack and using Geant4 simulations, our approach benefits from transparent parameterization and advanced AI features. The workflow leverages the PanDA and iDDS systems, used in major experiments such as ATLAS at CERN LHC, the Rubin Observatory, and sPHENIX at RHIC, to manage the compute-intensive demands of ePIC detector simulations. Tailored enhancements to the PanDA system focus on usability, scalability, automation, and monitoring. Ultimately, this project aims to establish a robust design capability, apply a distributed AI-assisted workflow to the ePIC detector, and extend its applications to the design of the second detector (Detector-2) in the EIC, as well as to calibration and alignment tasks. Additionally, we are developing advanced data science tools to efficiently navigate the complex, multidimensional trade-offs identified through this optimization process.

