skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: An Object Storage for Distributed Acoustic Sensing
Abstract Large-scale processing and dissemination of distributed acoustic sensing (DAS) data are among the greatest computational challenges and opportunities of seismological research today. Current data formats and computing infrastructure are not well-adapted or user-friendly for large-scale processing. We propose an innovative, cloud-native solution for DAS seismology using the MinIO open-source object storage framework. We develop data schema for cloud-optimized data formats—Zarr and TileDB, which we deploy on a local object storage service compatible with the Amazon Web Services (AWS) storage system. We benchmark reading and writing performance for various data schema using canonical use cases in seismology. We test our framework on a local server and AWS. We find much-improved performance in compute time and memory throughout when using TileDB and Zarr compared to the conventional HDF5 data format. We demonstrate the platform with a computing heavy use case in seismology: ambient noise seismology of DAS data. We process one month of data, pairing all 2089 channels within 24 hr using AWS Batch autoscaling.  more » « less
Award ID(s):
2346079
PAR ID:
10541836
Author(s) / Creator(s):
; ; ; ; ;
Publisher / Repository:
GeoscienceWorld
Date Published:
Journal Name:
Seismological Research Letters
Volume:
95
Issue:
1
ISSN:
0895-0695
Page Range / eLocation ID:
499 to 511
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract This article introduces a general processing framework to effectively utilize waveform data stored on modern cloud platforms. The focus is hybrid processing schemes for which a local system drives processing. We show that downloading files and doing all processing locally is problematic even when the local system is a high-performance computing (HPC) cluster. Benchmark tests with parallel processing show that approach always creates a bottleneck as the volume of data being handled increases with more processes pulling data. We find a hybrid model for which processing to reduce the volume of data transferred from the cloud servers to the local system can dramatically improve processing time. Tests implemented with the Massively Parallel Analysis System for Seismology (MsPASS) utilizing Amazon Web Service’s (AWS) Lambda service yield throughput comparable to processing day files on a local HPC file system. Given the ongoing migration of seismology data to cloud storage, our results show doing some or all processing on the cloud will be essential for any processing involving large volumes of data. 
    more » « less
  2. SUMMARY Seismology has entered the petabyte era, driven by decades of continuous recordings of broad-band networks, the increase in nodal seismic experiments and the recent emergence of distributed acoustic sensing (DAS). This review explains how cloud platforms, by providing object storage, elastic compute and managed data bases, enable researchers to ‘bring the code to the data,’ thereby providing a scalable option to overcome traditional HPC solutions’ bandwidth and capacity limitations. After literature reviews of cloud concepts and their research applications in seismology, we illustrate the capacities of cloud-native workflows using two canonical end-to-end demonstrations: (1) ambient noise seismology that calculates cross-correlation functions at scale, and (2) earthquake detection and phase picking. Both workflows utilize Amazon Web Services, a commercial cloud platform for streaming I/O and provenance, demonstrating that cloud throughput can rival on-premises HPC at comparable costs, scanning 100 TBs to 1.3 PBs of seismic data in a few hours or days of processing. The review also discusses research and education initiatives, the reproducibility benefits of containers and cost pitfalls (e.g. egress, I/O fees) of energy-intensive seismological research computing. While designing cloud pipelines remains non-trivial, partnerships with research software engineers enable converting domain code into scalable, automated and environmentally conscious solutions for next-generation seismology. We also outline where cloud resources fall short of specialized HPC—most notably for tightly coupled petascale simulations and long-term, PB-scale archives—so that practitioners can make informed, cost-effective choices. 
    more » « less
  3. In the past decade, distributed acoustic sensing (DAS) has enabled many new monitoring applications in diverse fields including hydrocarbon exploration and extraction; induced, local, regional, and global seismology; infrastructure and urban monitoring; and several others. However, to date, the open-source software ecosystem for handling DAS data is relatively immature. Here we introduce DASCore, a Python library for analyzing, visualizing, and managing DAS data. DASCore implements an object-oriented interface for performing common data processing and transformations, reading and writing various DAS file types, creating simple visualizations, and managing file system-based DAS archives. DASCore also integrates with other Python-based tools which enable the processing of massive data sets in cloud environments. DASCore is the foundational package for the broader DAS data analysis ecosystem (DASDAE), and as such its main goal is to facilitate the development of other DAS libraries and applications. 
    more » « less
  4. Abstract With the rise of data volume and computing power, seismological research requires more advanced skills in data processing, numerical methods, and parallel computing. We present the experience of conducting training workshops in various forms of delivery to support the adoption of large-scale high-performance computing (HPC) and cloud computing, advancing seismological research. The seismological foci were on earthquake source parameter estimation in catalogs, forward and adjoint wavefield simulations in 2D and 3D at local, regional, and global scales, earthquake dynamics, ambient noise seismology, and machine learning. This contribution describes the series of workshops delivered as part of research projects, the learning outcomes for participants, and lessons learned by the instructors. Our curriculum was grounded on open and reproducible science, large-scale scientific computing and data mining, and computing infrastructure (access and usage) for HPC and the cloud. We also describe the types of teaching materials that have proven beneficial to the instruction and the sustainability of the program. We propose guidelines to deliver future workshops on these topics. 
    more » « less
  5. Internet-scale web applications are becoming increasingly storage-intensive and rely heavily on in-memory object caching to attain required I/O performance. We argue that the emerging serverless computing paradigm provides a well-suited, cost-effective platform for object caching. We present InfiniCache, a first-of-its-kind in-memory object caching system that is completely built and deployed atop ephemeral serverless functions. InfiniCache exploits and orchestrates serverless functions' memory resources to enable elastic pay-per-use caching. InfiniCache's design combines erasure coding, intelligent billed duration control, and an efficient data backup mechanism to maximize data availability and cost-effectiveness while balancing the risk of losing cached state and performance. We implement InfiniCache on AWS Lambda and show that it: (1) achieves 31 – 96× tenant-side cost savings compared to AWS ElastiCache for a large-object-only production workload, (2) can effectively provide 95.4% data availability for each one hour window, and (3) enables comparative performance seen in a typical in-memory cache. 
    more » « less