Title: Materia: A Data Quality Control Embedded Domain Specific Language in Python
Current solutions for data quality control (QC) in the environmental sciences are locked within proprietary platforms or rely on specialized software. This can pose a problem for data users who want to integrate QC into their existing workflows. To address this limitation, we developed an embedded domain specific language (EDSL), Materia, that provides functions, data structures, and a fluent syntax for defining and executing quality control tests on data. Materia enables developers to more easily integrate QC into complex data pipelines and makes QC more accessible for students and citizen scientists. We evaluate Materia in two ways: productivity examples and a quantitative performance analysis. Our productivity examples show how Materia can simplify complex descriptions of tests in Pandas and mirror natural language descriptions of common QC tests. We also demonstrate that Materia achieves satisfactory performance, with over 200,000 floating-point values processed in under three seconds.
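To make the idea of a fluent syntax for QC tests concrete, here is a minimal sketch in plain Python. It is not Materia's API: the class `QCTest`, the methods `in_range`, `max_step`, and `run`, and the thresholds and sample readings are all hypothetical names and values chosen for illustration. It only shows how chained method calls can read close to a natural language description such as "air temperature must lie between -40 and 50 and must not change by more than 5 between readings".

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence

# Hypothetical sketch: none of these names come from Materia itself.
Check = Callable[[Sequence[float], int], bool]


@dataclass
class QCTest:
    """A named quality control test assembled through chained calls."""
    name: str
    checks: List[Check] = field(default_factory=list)

    def in_range(self, low: float, high: float) -> "QCTest":
        """Range test: the value must lie within [low, high]."""
        self.checks.append(lambda xs, i: low <= xs[i] <= high)
        return self  # returning self is what makes the syntax fluent

    def max_step(self, limit: float) -> "QCTest":
        """Step test: the change from the previous value must not exceed limit."""
        self.checks.append(lambda xs, i: i == 0 or abs(xs[i] - xs[i - 1]) <= limit)
        return self

    def run(self, values: Sequence[float]) -> List[bool]:
        """Return one flag per value; True means every check passed."""
        return [
            all(check(values, i) for check in self.checks)
            for i in range(len(values))
        ]


# Illustrative thresholds and readings (assumed, not taken from the paper).
air_temp = QCTest("air_temperature_c").in_range(-40.0, 50.0).max_step(5.0)
readings = [12.1, 12.4, 30.0, 12.6, 99.9]
flags = air_temp.run(readings)
print(flags)  # [True, True, False, False, False]
```

Note that this naive step test also flags the reading that follows a spike (12.6 after 30.0), because it compares each value only with its immediate predecessor. A real QC library would likely offer more careful spike detection.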
Award ID(s): 1656958
PAR ID: 10285530
Author(s) / Creator(s):
Date Published:
Journal Name: BIS 2020: Business Information Systems Workshops
Page Range / eLocation ID: 285-296
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this
  1. Large language models (LLMs), such as GPT-3 and GPT-4, have demonstrated exceptional performance in various natural language processing tasks and have shown the ability to solve certain reasoning problems. However, their reasoning capabilities are limited and relatively shallow, despite the application of various prompting techniques. In contrast, formal logic is adept at handling complex reasoning, but translating natural language descriptions into formal logic is a challenging task that non-experts struggle with. This paper proposes a neuro-symbolic method that combines the strengths of large language models and answer set programming. Specifically, we employ an LLM to transform natural language descriptions of logic puzzles into answer set programs. We carefully design prompts for an LLM to convert natural language descriptions into answer set programs in a step by step manner. Surprisingly, with just a few in-context learning examples, LLMs can generate reasonably complex answer set programs. The majority of errors made are relatively simple and can be easily corrected by humans, thus enabling LLMs to effectively assist in the creation of answer set programs. 
  2. The Southern Ocean Carbon and Climate Observations and Modeling (SOCCOM) project has deployed 194 profiling floats equipped with biogeochemical (BGC) sensors, making it one of the largest contributors to global BGC-Argo. Post-deployment quality control (QC) of float-based oxygen, nitrate, and pH data is a crucial step in the processing and dissemination of such data, as in situ chemical sensors remain in early stages of development. In situ calibration of chemical sensors on profiling floats using atmospheric reanalysis and empirical algorithms can bring accuracy to within 3 μmol O₂ kg⁻¹, 0.5 μmol NO₃⁻ kg⁻¹, and 0.007 pH units. Routine QC efforts utilizing these methods can be conducted manually through visual inspection of data to assess sensor drifts and offsets, but more automated processes are preferred to support the growing number of BGC floats and reduce subjectivity among delayed-mode operators. Here we present a methodology and accompanying software designed to easily visualize float data against select reference datasets and assess QC adjustments within a quantitative framework. The software is intended for global use and has been used successfully in the post-deployment calibration and QC of over 250 BGC floats, including all floats within the SOCCOM array. Results from validation of the proposed methodology are also presented which help to verify the quality of the data adjustments through time. 
  3. Millions of in situ ocean temperature profiles have been collected historically using various instrument types with varying sensor accuracy and then assembled into global databases. These are essential to our current understanding of the changing state of the oceans, sea level, Earth’s climate, marine ecosystems and fisheries, and for constraining model projections of future change that underpin mitigation and adaptation solutions. Profiles distributed shortly after collection are also widely used in operational applications such as real-time monitoring and forecasting of the ocean state and weather prediction. Before use in scientific or societal service applications, quality control (QC) procedures need to be applied to flag and ultimately remove erroneous data. Automatic QC (AQC) checks are vital to the timeliness of operational applications and for reducing the volume of dubious data which later require QC processing by a human for delayed mode applications. Despite the large suite of evolving AQC checks developed by institutions worldwide, the most effective set of AQC checks was not known. We have developed a framework to assess the performance of AQC checks, under the auspices of the International Quality Controlled Ocean Database (IQuOD) project. The IQuOD-AQC framework is an open-source collaborative software infrastructure built in Python (available from https://github.com/IQuOD ). Sixty AQC checks have been implemented in this framework. Their performance was benchmarked against three reference datasets which contained a spectrum of instrument types and error modes flagged in their profiles. One of these (a subset of the Quality-controlled Ocean Temperature Archive (QuOTA) dataset that had been manually inspected for quality issues by its creators) was also used to identify optimal sets of AQC checks. 
     Results suggest that the AQC checks are effective for most historical data, but less so in the case of data from Mechanical Bathythermographs (MBTs), and much less effective for Argo data. The optimal AQC sets will be applied to generate quality flags for the next release of the IQuOD dataset. This will further elevate the quality and historical value of millions of temperature profile data which have already been improved by IQuOD intelligent metadata and observational uncertainty information (https://doi.org/10.7289/v51r6nsf). 
  4. Language model (LM) prompting, a popular paradigm for solving NLP tasks, has been shown to be susceptible to miscalibration and brittleness to slight prompt variations, caused by its discriminative prompting approach, i.e., predicting the label given the input. To address these issues, we propose Gen-Z, a generative prompting framework for zero-shot text classification. Gen-Z is generative, as it measures the LM likelihood of input text, conditioned on natural language descriptions of labels. The framework is multivariate, as label descriptions allow us to seamlessly integrate additional contextual information about the labels to improve task performance. On various standard classification benchmarks, with six open-source LM families, we show that zero-shot classification with simple contextualization of the data source of the evaluation set consistently outperforms both zero-shot and few-shot baselines while improving robustness to prompt variations. Further, our approach enables personalizing classification in a zero-shot manner by incorporating author, subject, or reader information in the label descriptions. 
  5. The self-assembly of shape-anisotropic nanocrystals into large-scale structures is a versatile and scalable approach to creating multifunctional materials. The tetrahedral geometry is ubiquitous in natural and manmade materials, yet regular tetrahedra present a formidable challenge in understanding their self-assembly behavior as they do not tile space. Here, we report diverse supracrystals from gold nanotetrahedra including the quasicrystal (QC) and the dimer packing predicted more than a decade ago and hitherto unknown phases. We solve the complex three-dimensional (3D) structure of the QC by a combination of electron microscopy, tomography, and synchrotron X-ray scattering. Nanotetrahedron vertex sharpness, surface ligands, and assembly conditions work in concert to regulate supracrystal structure. We also discover that the surface curvature of supracrystals can induce structural changes of the QC tiling and eventually, for small supracrystals with high curvature, stabilize a hexagonal approximant. Our findings bridge the gap between computational design and experimental realization of soft matter assemblies and demonstrate the importance of accurate control over nanocrystal attributes and the assembly conditions to realize increasingly complex nanopolyhedron supracrystals. 