Many scientific applications opt for particles instead of meshes as their basic primitives to model complex systems composed of billions of discrete entities. Such applications span a diverse array of scientific domains, including molecular dynamics, cosmology, computational fluid dynamics, and geology. The scale of particle data in these applications has grown substantially thanks to the ever-increasing computational power of high-performance computing (HPC) platforms. However, the actual gains from this growth are often undercut by obstacles in data management related to storage, transfer, and processing. Lossy compression has been widely recognized as a promising way to address these challenges, although most existing compressors are tailored for Cartesian grids and thus yield sub-optimal results on discrete particle data. In this paper, we introduce LCP, an innovative lossy compressor designed for particle datasets that offers superior compression quality and higher speed than existing solutions. Specifically, our contribution is threefold. (1) We propose LCP-S, an error-bound-aware, block-wise spatial compressor that efficiently reduces particle data size while satisfying pre-defined error criteria. It applies to particle data across domains without relying on the characteristics of any specific application. (2) We develop LCP, a hybrid compression solution for multi-frame particle data featuring dynamic method selection and parameter optimization. It aims to maximize compression effectiveness while preserving data quality as much as possible by exploiting both the spatial and temporal domains. (3) We evaluate our solution alongside eight state-of-the-art alternatives on eight real-world particle datasets from seven distinct domains. The results demonstrate that our solution achieves up to 104% higher compression ratios and up to 593% higher speed than the second-best option under the same error criteria.
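As a rough illustration of the error-bounded, block-wise quantization idea that such a spatial compressor builds on, the sketch below quantizes particle coordinates block by block so that every reconstructed value stays within a user-chosen absolute error bound. The block size, the error bound, and the uniform quantization scheme are assumptions for demonstration, not LCP's actual method; a real compressor would additionally entropy-code the integer quantization codes.

```python
import numpy as np

def quantize_block(coords, error_bound):
    """Map coordinates to integer codes; rounding to multiples of
    2*error_bound keeps every reconstructed value within the bound."""
    return np.round(coords / (2.0 * error_bound)).astype(np.int64)

def dequantize_block(codes, error_bound):
    """Reconstruct approximate coordinates from the integer codes."""
    return codes * (2.0 * error_bound)

# Toy check: 100k particles in 3D, processed in blocks of 4096 particles.
rng = np.random.default_rng(0)
particles = rng.uniform(0.0, 64.0, size=(100_000, 3))
eb = 1e-3        # absolute error bound (assumed)
block = 4096     # block size (assumed)

max_err = 0.0
for start in range(0, len(particles), block):
    chunk = particles[start:start + block]
    recon = dequantize_block(quantize_block(chunk, eb), eb)
    max_err = max(max_err, float(np.abs(recon - chunk).max()))

print(f"max reconstruction error = {max_err:.2e} (bound = {eb:.2e})")
```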
Multidimensional Array Data Management
Multidimensional arrays are a fundamental abstraction for representing data across scientific domains ranging from astronomy to genetics, medicine, business intelligence, and engineering. Arrays come in multiple shapes, from dense rasters to sparse data cubes and tensors, and have been studied extensively across many computing domains. In this survey, we provide a comprehensive guide to past, present, and future research in array data management from a database perspective. Unlike previous surveys that are limited to raster processing in the context of scientific data, we consider all types of arrays: rasters, data cubes, and tensors. We identify and analyze the most important research ideas on arrays proposed over time. We cover all data management aspects, from array algebras and query languages to storage strategies, execution techniques, and operator implementations. Moreover, we discuss which research ideas are adopted in real systems and how they are integrated into complete data processing pipelines. Finally, we compare arrays with the relational data model. The result is a thorough survey on array data management that should be consulted by anyone interested in this research topic, independent of experience level.
- Award ID(s): 2008815
- PAR ID: 10545560
- Publisher / Repository: FnT DB
- Date Published:
- Journal Name: Foundations and Trends® in Databases
- Volume: 12
- Issue: 2-3
- ISSN: 1931-7883
- Page Range / eLocation ID: 69 to 220
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
Scientific data analysis pipelines face scalability bottlenecks when processing massive datasets that consist of millions of small files. Such datasets commonly arise in domains as diverse as detecting supernovae and post-processing computational fluid dynamics simulations. Furthermore, applications often use inference frameworks such as TensorFlow and PyTorch whose naive I/O methods exacerbate I/O bottlenecks. One solution is to use scientific file formats, such as HDF5 and FITS, to organize small arrays in one big file. However, storing everything in one file does not fully leverage the heterogeneous data storage capabilities of modern clusters. This paper presents Henosis, a system that intercepts data accesses inside the HDF5 library and transparently redirects I/O to the in-memory Redis object store or the disk-based TileDB array store. During this process, Henosis consolidates small arrays into bigger chunks and intelligently places them in data stores. A critical research aspect of Henosis is that it formulates object consolidation and data placement as a single optimization problem. Henosis carefully constructs a graph to capture the I/O activity of a workload and produces an initial solution to the optimization problem using graph partitioning. Henosis then refines the solution using a hill-climbing algorithm which migrates arrays between data stores to minimize I/O cost. The evaluation on two real scientific data analysis pipelines shows that consolidation with Henosis makes I/O 300× faster than directly reading small arrays from TileDB and 3.5× faster than workload-oblivious consolidation methods. Moreover, jointly optimizing consolidation and placement in Henosis makes I/O 1.7× faster than strategies that perform consolidation and placement independently.
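To make the joint consolidation-and-placement optimization concrete, below is a minimal hill-climbing sketch in the spirit of what the abstract describes. It is not Henosis itself: the two-store cost model, the relative memory/disk costs, and the capacity constraint are assumptions for illustration only.

```python
import random

def io_cost(placement, sizes, accesses, mem_capacity):
    """Total I/O cost of a placement under an assumed two-store model:
    memory is fast but capacity-limited, disk is slow but unbounded."""
    mem_used = sum(sizes[a] for a, store in placement.items() if store == "mem")
    if mem_used > mem_capacity:
        return float("inf")                          # reject infeasible placements
    cost = 0.0
    for a, store in placement.items():
        per_unit = 1.0 if store == "mem" else 20.0   # assumed relative costs
        cost += accesses[a] * sizes[a] * per_unit
    return cost

def hill_climb(sizes, accesses, mem_capacity, iters=10_000, seed=0):
    """Greedy local search: repeatedly try migrating one array between
    stores and keep the move only if it lowers the total I/O cost."""
    rng = random.Random(seed)
    arrays = list(sizes)
    placement = {a: "disk" for a in arrays}          # start everything on disk
    best = io_cost(placement, sizes, accesses, mem_capacity)
    for _ in range(iters):
        a = rng.choice(arrays)
        trial = dict(placement)
        trial[a] = "mem" if placement[a] == "disk" else "disk"
        c = io_cost(trial, sizes, accesses, mem_capacity)
        if c < best:
            placement, best = trial, c
    return placement, best

# Toy workload: 50 arrays with random sizes and access counts.
gen = random.Random(42)
sizes    = {f"arr{i}": gen.randint(1, 100) for i in range(50)}
accesses = {f"arr{i}": gen.randint(1, 10) for i in range(50)}
placement, cost = hill_climb(sizes, accesses, mem_capacity=500)
in_mem = sum(1 for s in placement.values() if s == "mem")
print(f"final cost = {cost:.0f}, arrays kept in memory = {in_mem}")
```

Per the abstract, the real system additionally seeds the search with a graph-partitioning solution and couples placement with consolidation of small arrays into bigger chunks, which this toy omits.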
Context. Stars form preferentially in clusters embedded inside massive molecular clouds, many of which contain high-mass stars. Thus, a comprehensive understanding of star formation requires a robust and statistically well-constrained characterization of the formation and early evolution of these high-mass star clusters. To achieve this, we designed the ALMAGAL Large Program that observed 1017 high-mass star-forming regions distributed throughout the Galaxy, sampling different evolutionary stages and environmental conditions. Aims. In this work, we present the acquisition and processing of the ALMAGAL data. The main goal is to set up a robust pipeline that generates science-ready products, that is, continuum and spectral cubes for each ALMAGAL field, with a good and uniform quality across the whole sample. Methods. ALMAGAL observations were performed with the Atacama Large Millimeter/submillimeter Array (ALMA). Each field was observed with three different telescope arrays, making it sensitive to spatial scales ranging from ≈1000 au up to ≈0.1 pc. The spectral setup allows sensitive (≈0.1 mJy beam⁻¹) imaging of the continuum emission at 219 GHz (or 1.38 mm), and it covers multiple molecular spectral lines observed in four different spectral windows that span ≈4 GHz in frequency coverage. We have designed a Python-based processing workflow to calibrate and image these observational data. This ALMAGAL pipeline includes an improved continuum determination, suited for line-rich sources; an automatic self-calibration process that reduces phase-noise fluctuations and improves the dynamic range by up to a factor ≈5 in about 15% of the fields; and the combination of data from different telescope arrays to produce science-ready, fully combined images. Results. The final products are a set of uniformly generated continuum images and spectral cubes for each ALMAGAL field, including individual-array and combined-array products. The fully combined products have spatial resolutions in the range 800–2000 au, and mass sensitivities in the range 0.02–0.07 M⊙. We also present a first analysis of the spectral line information included in the ALMAGAL setup, and its potential for future scientific studies. As an example, specific spectral lines (e.g., SiO and CH3CN) at ≈1000 au scales resolve the presence of multiple outflows in clusters and will help us to search for disk candidates around massive protostars. Moreover, the broad frequency bands provide information on the chemical richness of the different cluster members, which can be used to study the chemical evolution during the formation process of star clusters.
Infrasound data from arrays can be used to detect, locate, and quantify a variety of natural and anthropogenic sources from local to remote distances. However, many array processing methods use a single broad frequency range to process the data, which can lead to signals of interest being missed due to the choice of frequency limits or simultaneous clutter sources. We introduce a new open-source Python code that processes infrasound array data in multiple sequential narrow frequency bands using the least-squares approach. We test our algorithm on a few examples of natural sources (volcanic eruptions, mass movements, and bolides) for a variety of array configurations. Our method reduces the need to choose frequency limits for processing, which may result in missed signals, and it is parallelized to decrease the computational burden. Improvements of our narrow-band least-squares algorithm over broad-band least-squares processing include the ability to distinguish between multiple simultaneous sources if distinct in their frequency content (e.g., microbarom or surf vs. volcanic eruption), the ability to track changes in frequency content of a signal through time, and a decreased need to fine-tune frequency limits for processing. We incorporate a measure of planarity of the wavefield across the array (sigma tau, στ) as well as the ability to utilize the robust least trimmed squares algorithm to improve signal processing and insight into array performance. Our implementation allows for more detailed characterization of infrasound signals recorded at arrays that can improve monitoring and enhance research capabilities.
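For readers unfamiliar with least-squares array processing, the sketch below illustrates the core idea on one narrow frequency band: bandpass the traces, estimate inter-element delays, and fit a plane wave to those delays by least squares. It is not the released package; the filter design, the cross-correlation delay estimator, and the back-azimuth sign convention are assumptions for illustration.

```python
import numpy as np
from scipy.signal import butter, correlate, filtfilt

def band_delays(traces, fs, fmin, fmax):
    """Inter-element delays (relative to element 0) in one narrow band,
    estimated here by bandpass filtering plus cross-correlation (a simple,
    assumed estimator; the released code has its own least-squares scheme)."""
    b, a = butter(2, [fmin, fmax], btype="band", fs=fs)
    filtered = [filtfilt(b, a, tr) for tr in traces]
    ref = filtered[0]
    delays = []
    for tr in filtered:
        cc = correlate(tr, ref, mode="full")
        delays.append((np.argmax(cc) - (len(ref) - 1)) / fs)
    return np.asarray(delays)

def plane_wave_fit(coords, delays):
    """Least-squares plane-wave fit: delays ≈ coords @ slowness.
    coords are element offsets (km, east/north) from element 0."""
    slowness, *_ = np.linalg.lstsq(coords, delays, rcond=None)
    s_mag = np.hypot(*slowness)
    trace_velocity = 1.0 / s_mag if s_mag > 0 else np.inf        # km/s
    az_prop = np.degrees(np.arctan2(slowness[0], slowness[1])) % 360.0
    back_azimuth = (az_prop + 180.0) % 360.0   # direction back toward the source
    return back_azimuth, trace_velocity

# Toy check: synthetic delays for a 0.34 km/s wave arriving from the northeast.
coords = np.array([[0.0, 0.0], [0.5, 0.0], [0.0, 0.5], [0.4, 0.4]])  # km (E, N)
prop = np.radians((45.0 + 180.0) % 360.0)                 # propagation azimuth
true_slowness = np.array([np.sin(prop), np.cos(prop)]) / 0.34        # s/km
baz, vel = plane_wave_fit(coords, coords @ true_slowness)
print(f"back-azimuth ≈ {baz:.1f} deg, trace velocity ≈ {vel:.2f} km/s")
```

In a sequential narrow-band scheme like the one described, such a fit is repeated for each frequency band and time window, giving band-dependent back-azimuth and velocity estimates that help separate simultaneous sources.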
ALMA-IMF is an Atacama Large Millimeter/submillimeter Array (ALMA) Large Program designed to measure the core mass function (CMF) of 15 protoclusters chosen to span their early evolutionary stages. It further aims to understand their kinematics, chemistry, and the impact of gas inflow, accretion, and dynamics on the CMF. We present here the first release of the ALMA-IMF line data cubes (DR1), produced from the combination of two ALMA 12 m-array configurations. The data include 12 spectral windows, with eight at 1.3 mm and four at 3 mm. The broad spectral coverage of ALMA-IMF (∼6.7 GHz bandwidth coverage per field) hosts a wealth of simple atomic, molecular, ionised, and complex organic molecular lines. We describe the line cube calibration done by ALMA and the subsequent calibration and imaging we performed. We discuss our choice of calibration parameters and optimisation of the cleaning parameters, and we demonstrate the utility and necessity of additional processing compared to the ALMA archive pipeline. As a demonstration of the scientific potential of these data, we present a first analysis of the DCN (3–2) line. We find that DCN (3–2) traces a diversity of morphologies and complex velocity structures, which tend to be more filamentary and widespread in evolved regions and are more compact in the young and intermediate-stage protoclusters. Furthermore, we used the DCN (3–2) emission as a tracer of the gas associated with 595 continuum cores across the 15 protoclusters, providing the first estimates of the core systemic velocities and linewidths within the sample. We find that DCN (3–2) is detected towards a higher percentage of cores in evolved regions than the young and intermediate-stage protoclusters and is likely a more complete tracer of the core population in more evolved protoclusters. The full ALMA 12 m-array cubes for the ALMA-IMF Large Program are provided with this DR1 release.