skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: MsPASS: A Data Management and Processing Framework for Seismology
This article introduces a new framework for seismic data processing and management we call the Massive Parallel Analysis System for Seismologists (MsPASS). The framework was designed to enable new scientific frontiers in seismology by providing a means to more effectively utilize massively parallel computers to handle the increasingly large data volume available today. MsPASS leverages several existing technologies: (1) scalable parallel processing frameworks, (2) NoSQL database management system, and (3) containers. The system leans heavily on the widely used ObsPy toolkit. It automates many database operations and provides a mechanism to automatically save the processing history for reproducibility. The synthesis of these components can provide flexibility to adapt to a wide range of data processing workflows. We demonstrate the system with a basic data processing workflow applied to USArray data. Through extensive documentation and examples, we aim to make this system a sustainable, open‐source framework for the community.  more » « less
Award ID(s):
1931352
PAR ID:
10391685
Author(s) / Creator(s):
; ; ;
Date Published:
Journal Name:
Seismological research letters
Volume:
93
Issue:
1
ISSN:
0895-0695
Page Range / eLocation ID:
426-434
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. null (Ed.)
    Nested queries are commonly used to express complex use-cases by connecting the output of a subquery as an input to the outer query block. However, their execution is highly time-consuming. Researchers have proposed various algorithms and techniques that unnest subqueries to improve performance. Since this is a customized approach that needs high algorithmic and engineering efforts, it is largely not an open feature in most existing database systems.Our approach is general-purpose and GPU-acceleration based, aiming for high performance at a minimum development cost. We look into the major differences between nested and unnested query structures to identify their merits and limits for GPU processing. Furthermore, we focus on the nested approach that is algorithmically simple and rich in parallels, in relatively low space complexity, and generic in program structure. We create a new code generation framework that best fits GPU for the nested method. We also make several critical system optimizations including massive parallel scanning with indexing, effective vectorization to optimize join operations, exploiting cache locality for loops and efficient GPU memory management. We have implemented the proposed solutions in NestGPU, a GPU-based column-store database system that is GPU device independent. We have extensively evaluated and tested the system to show the effectiveness of our proposed methods. 
    more » « less
  2. null (Ed.)
    Nested queries are commonly used to express complex use-cases by connecting the output of a subquery as an input to the outer query block. However, their execution is highly time consuming. Researchers have proposed various algorithms and techniques that unnest subqueries to improve performance. Since this is a customized approach that needs high algorithmic and engineering efforts, it is largely not an open feature in most existing database systems. Our approach is general-purpose and GPU-acceleration based, aiming for high performance at a minimum development cost. We look into the major differences between nested and unnested query structures to identify their merits and limits for GPU processing. Furthermore, we focus on the nested approach that is algorithmically simple and rich in parallels, in relatively low space complexity, and generic in program structure. We create a new code generation framework that best fits GPU for the nested method. We also make several critical system optimizations including massive parallel scanning with indexing, effective vectorization to optimize join operations, exploiting cache locality for loops and efficient GPU memory management. We have implemented the proposed solutions in NestGPU, a GPU-based column-store database system that is GPU device independent. We have extensively evaluated and tested the system to show the effectiveness of our proposed methods. 
    more » « less
  3. Compliance with data retention laws and legislation is an important aspect of data management. As new laws governing personal data management are introduced (e.g., California Consumer Privacy Act enacted in 2020) and a greater emphasis is placed on enforcing data privacy law compliance, data retention support must be an inherent part of data management systems. However, relational databases do not currently offer functionality to enforce retention compliance. In this paper, we propose a framework that integrates data retention support into any relational database. Using SQL-based mechanisms, our system supports an intuitive definition of data retention policies. We demonstrate that our approach meets the legal requirements of retention and can be implemented to transparently guarantee compliance. Our framework streamlines compliance support without requiring database schema changes, while incurring an average 6.7% overhead compared to the current state-of-the-art solution. 
    more » « less
  4. null (Ed.)
    Relational database management systems (RDBMS) have limited iterative processing support. Recursive queries were added to ANSI SQL, however, their semantics do not allow aggregation functions, which disqualifies their use for several applications, such as PageRank and shortest path computations. Recently, another SQL extension, iterative Common Table Expressions (CTEs), is proposed to enable users to perform general iterative computations on RDBMSs.In this work 1 , we demonstrate how iterative CTEs can be efficiently incorporated into a production RDBMS without major intrusion to the system. We have prototyped our approach on Futurewei's MPPDB, a shared nothing relational parallel database engine. The implementation is based on a functional rewrite that translates iterative CTEs to other existing SQL operators. Thus, query plans of iterative CTEs can be optimized and executed by the engine with minimal modification to the code base. We have also applied several optimizations specifically for iterative CTEs to i) minimize data movement, ii) reuse results that remain constant and iii) push down predicates to avoid unnecessary data processing. We verified our implementation through extensive experimental evaluation using real world datasets and queries. The results show the feasibility of the rewrite approach and the effectiveness of the optimizations, which improve performance by an order of magnitude in some cases. 
    more » « less
  5. Abstract This article introduces a general processing framework to effectively utilize waveform data stored on modern cloud platforms. The focus is hybrid processing schemes for which a local system drives processing. We show that downloading files and doing all processing locally is problematic even when the local system is a high-performance computing (HPC) cluster. Benchmark tests with parallel processing show that approach always creates a bottleneck as the volume of data being handled increases with more processes pulling data. We find a hybrid model for which processing to reduce the volume of data transferred from the cloud servers to the local system can dramatically improve processing time. Tests implemented with the Massively Parallel Analysis System for Seismology (MsPASS) utilizing Amazon Web Service’s (AWS) Lambda service yield throughput comparable to processing day files on a local HPC file system. Given the ongoing migration of seismology data to cloud storage, our results show doing some or all processing on the cloud will be essential for any processing involving large volumes of data. 
    more » « less