Materia: A Data Quality Control Embedded Domain Specific Language in Python

Scully-Allison, Connor

doi:10.1007/978-3-030-61146-0_23

Current solutions for data quality control (QC) in the environmental sciences are locked within proprietary platforms or rely on specialized software. This poses a problem for data users who want to integrate QC into their existing workflows. To address this limitation, we developed an embedded domain specific language (EDSL), Materia, that provides functions, data structures, and a fluent syntax for defining and executing quality control tests on data. Materia enables developers to more easily integrate QC into complex data pipelines and makes QC more accessible to students and citizen scientists. We evaluate Materia in two ways: productivity examples and a quantitative performance analysis. Our productivity examples show how Materia can simplify complex descriptions of tests in Pandas and mirror natural language descriptions of common QC tests. We also demonstrate that Materia achieves satisfactory performance, processing over 200,000 floating-point values in under three seconds.
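
To illustrate what a fluent syntax for QC tests can look like, the following is a minimal sketch in plain Python. It is not Materia's API: the class `QCTest`, its methods `at_least`, `at_most`, and `run`, and the sample temperature values are all hypothetical names and data chosen for this illustration. The sketch only shows the general pattern of chaining method calls so that a range test reads close to its natural language description.

```python
import math


class QCTest:
    """Hypothetical fluent builder for a QC test (not Materia's real API)."""

    def __init__(self, name):
        self.name = name
        self._checks = []  # predicates that every valid value must satisfy

    def at_least(self, lower):
        # Require value >= lower; return self so calls can be chained.
        self._checks.append(lambda v: v >= lower)
        return self

    def at_most(self, upper):
        # Require value <= upper; return self so calls can be chained.
        self._checks.append(lambda v: v <= upper)
        return self

    def run(self, values):
        # One boolean flag per value: True means the value passed all checks.
        # Missing values (None or NaN) are flagged as failures.
        flags = []
        for v in values:
            if v is None or math.isnan(v):
                flags.append(False)
            else:
                flags.append(all(check(v) for check in self._checks))
        return flags


# Illustrative data only: air temperature readings in degrees Celsius.
air_temp = [12.5, -80.0, 19.1, float("nan"), 71.3]

# Reads roughly as: "air temperature is at least -40 and at most 60".
range_test = QCTest("air temperature range").at_least(-40.0).at_most(60.0)
flags = range_test.run(air_temp)
print(flags)
```

Each chained call returns the same test object, which is what allows the definition to be written as a single readable expression. A real EDSL would likely operate on Pandas structures rather than Python lists; lists are used here only to keep the sketch dependency-free.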

Award ID(s): 1656958

PAR ID: 10285530

Author(s) / Creator(s): Scully-Allison, Connor

Date Published: 2020-11-12

Journal Name: BIS 2020: Business Information Systems Workshops

Page Range / eLocation ID: 285-296

Format(s): Medium: X

Sponsoring Org: National Science Foundation

Free Publicly Accessible Full Text
Accepted Manuscript 1.0
Conference Paper:
https://doi.org/10.1007/978-3-030-61146-0_23
