Abstract The open data movement has brought revolutionary changes to the field of mineralogy. With a growing number of datasets made available through community efforts, researchers are now able to explore new scientific topics such as mineral ecology, mineral evolution and new classification systems. Recent results have shown that open data, coupled with data science skills and expertise in mineralogy, can lead to impressive new scientific discoveries. Yet feedback from researchers also reflects the need for better FAIRness of open data, that is, data that are findable, accessible, interoperable and reusable by both humans and machines. In this paper, we present our recent work on building the open data service of Mindat, one of the largest mineral databases in the world. In past years, Mindat has supported numerous scientific studies, but a machine interface for data access has never been established. Through the OpenMindat project we have achieved solid progress on two activities: (1) cleansing data and improving data quality, and (2) building a data-sharing platform and establishing a machine interface for data query and access. We hope OpenMindat will help address the increasing need among researchers in mineralogy for an internationally recognized authoritative database that is fully compliant with the FAIR guiding principles and helps accelerate scientific discoveries.
PubDAS: A PUBlic Distributed Acoustic Sensing Datasets Repository for Geosciences
Abstract During the past few years, distributed acoustic sensing (DAS) has become an invaluable tool for recording high-fidelity seismic wavefields with high spatiotemporal resolution. However, the considerable amount of data generated during DAS experiments limits their distribution to the broader scientific community. Such a bottleneck inherently slows down the pursuit of new scientific discoveries in geosciences. Here, we introduce PubDAS, the first large-scale open-source repository where several DAS datasets from multiple experiments are publicly shared. PubDAS currently hosts eight datasets covering a variety of geological settings (e.g., urban centers, underground mines, and seafloor), spanning from several days to several years, offering both continuous and triggered active source recordings, and totaling up to ∼90 TB of data. This article describes these datasets, their metadata, and how to access and download them. Some of these datasets have only been shallowly explored, leaving the door open for new discoveries in Earth sciences and beyond.
- Award ID(s): 2022716
- PAR ID: 10437086
- Date Published:
- Journal Name: Seismological Research Letters
- Volume: 94
- Issue: 2A
- ISSN: 0895-0695
- Page Range / eLocation ID: 983 to 998
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
- In the past decade, distributed acoustic sensing (DAS) has enabled many new monitoring applications in diverse fields including hydrocarbon exploration and extraction; induced, local, regional, and global seismology; infrastructure and urban monitoring; and several others. However, to date, the open-source software ecosystem for handling DAS data is relatively immature. Here we introduce DASCore, a Python library for analyzing, visualizing, and managing DAS data. DASCore implements an object-oriented interface for performing common data processing and transformations, reading and writing various DAS file types, creating simple visualizations, and managing file system-based DAS archives. DASCore also integrates with other Python-based tools which enable the processing of massive data sets in cloud environments. DASCore is the foundational package for the broader DAS data analysis ecosystem (DASDAE), and as such its main goal is to facilitate the development of other DAS libraries and applications.
- Geolocalization of distributed acoustic sensing (DAS) array channels represents a crucial step whenever the technology is deployed in the field. Commonly, the geolocalization is performed using point-wise active-source experiments, known as tap tests, conducted in the vicinity of the recording fiber. However, these controlled-source experiments are time consuming and greatly diminish the ability to promptly deploy such systems, especially for large-scale DAS experiments. We present a geolocalization methodology for DAS instrumentation that relies on seismic signals generated by a geotracked vehicle. We demonstrate the efficacy of our workflow by geolocating the channels of two DAS systems recording data on dark fibers stretching approximately 100 km within the Long Valley caldera area in eastern California. Our procedure permits the prompt calibration of DAS channel locations for seismic-related applications such as seismic hazard assessment, urban-noise monitoring, wavespeed inversion, and earthquake engineering. We share the developed set of codes along with a tutorial guiding users through the entire mapping process.
- New technologies have led to vast troves of large and complex data sets across many scientific domains and industries. People routinely use machine learning techniques not only to process, visualize, and make predictions from these big data, but also to make data-driven discoveries. These discoveries are often made using interpretable machine learning, or machine learning models and techniques that yield human-understandable insights. In this article, we discuss and review the field of interpretable machine learning, focusing especially on the techniques, as they are often employed to generate new knowledge or make discoveries from large data sets. We outline the types of discoveries that can be made using interpretable machine learning in both supervised and unsupervised settings. Additionally, we focus on the grand challenge of how to validate these discoveries in a data-driven manner, which promotes trust in machine learning systems and reproducibility in science. We discuss validation both from a practical perspective, reviewing approaches based on data-splitting and stability, as well as from a theoretical perspective, reviewing statistical results on model selection consistency and uncertainty quantification via statistical inference. Finally, we conclude by highlighting open challenges in using interpretable machine learning techniques to make discoveries, including gaps between theory and practice for validating data-driven discoveries.
- With ever-increasing volumes of scientific floating-point data being produced by high-performance computing applications, significantly reducing scientific floating-point data size is critical, and error-controlled lossy compressors have been developed for years. None of the existing scientific floating-point lossy data compressors, however, support effective fixed-ratio lossy compression. Yet fixed-ratio lossy compression for scientific floating-point data not only compresses to the requested ratio but also respects a user-specified error bound with higher fidelity. In this paper, we present FRaZ: a generic fixed-ratio lossy compression framework respecting user-specified error constraints. The contribution is twofold. (1) We develop an efficient iterative approach to accurately determine the appropriate error settings for different lossy compressors based on target compression ratios. (2) We perform a thorough performance and accuracy evaluation for our proposed fixed-ratio compression framework with multiple state-of-the-art error-controlled lossy compressors, using several real-world scientific floating-point datasets from different domains. Experiments show that FRaZ effectively identifies the optimum error setting in the entire error setting space of any given lossy compressor. While fixed-ratio lossy compression is slower than fixed-error compression, it provides an important new lossy compression technique for users of very large scientific floating-point datasets.