Abstract: Automated processing of environmental data is hindered by the wide array of unit representations provided in the metadata of digital datasets. For example, gm/m2, g/m2, gm-2, g/m^2, g.m-2 and gramPerMeterSquared are all representations of a single complex unit that might be human-readable but are not machine-interpretable. Connecting ad hoc units to a single unit concept in an ontology permits the identification of datasets sharing units and provides additional information regarding labels, definitions, dimensions and transformations provided in the ontology. Here we use successive string transformations to link ad hoc unit representations to units in the QUDT ontology (e.g., unit:GM-PER-M2). Although only 896 of 7,110 distinct units in a corpus of ecological metadata from DataONE, the Environmental Data Initiative and the U.S. National Ecological Observatory Network were matched, those units account for 324,811 of the 355,057 total unit uses (instances), so 91% of instances were successfully mapped to QUDT units. The resulting lookup table was used to enable a web service and R functions for adding annotation elements to Ecological Metadata Language documents.
Mapping EDI, NEON and DataONE units to the QUDT ontology, 2022
In the metadata of digital environmental datasets, automated processing is hindered by the wide variety of unit representations that may be human-readable but may not be unambiguous or machine-interpretable (e.g., grams per square meter, gm/m2, g/m2, gm-2, g/m^2, g.m-2, g m-2 and gramPerMeterSquared). Matching disparate representations of the same unit to a single unit concept from an ontology assists with interpretation and reuse by providing a link to a complete unit definition with label, description and dimensions. Datasets with shared units can be identified during searches and are more suitable for automated analysis and potential transformation. This dataset contains data and code associated with a project to map units in ecological metadata collected between 2013 and 2022 by DataONE, the Environmental Data Initiative and the U.S. National Ecological Observatory Network to the QUDT ontology using successive string transformations. Data entities include a) raw metadata as received (355,057 unit instances); b) integrated raw data; c) substitution tables for string transformations; d) the resulting lookup table for 896 distinct units matched to QUDT units; e) associated R code used for QUDT matching, plus a web service and R functions for adding annotation elements to Ecological Metadata Language metadata documents. Using these substitutions and code, 91% of unit instances in the raw metadata could be matched to QUDT. Data and results are discussed in Porter JH, M O’Brien, M Frants, S Earl, M Martin, C Laney. (in review) “Using a Units Ontology to Annotate Pre-Existing Metadata.” Submitted to Scientific Data.
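The idea of successive string transformations can be illustrated with a minimal sketch. The project's own code is written in R and ships with the dataset; the Python below is not that code. The substitution rules, the one-entry lookup table and the function names `normalize` and `to_qudt` are all hypothetical, chosen only to cover the example spellings listed above.

```python
import re

# Hypothetical substitution table, applied in order. These rules are
# illustrative only and are NOT the project's actual substitution tables.
SUBSTITUTIONS = [
    (r"\bgrams?\b", "g"),                                # gram, grams -> g
    (r"\b(?:square meters?|meters? squared)\b", "m2"),   # spelled-out area
    (r"\s*\bper\b\s*", "/"),                             # "per" -> "/"
    (r"\^", ""),                                         # g/m^2 -> g/m2
    (r"^gm(?=/)", "g"),                                  # "gm" as an abbreviation of gram
    (r"^g[. ]?m-2$", "g/m2"),                            # negative-exponent forms
]

# Hypothetical one-entry lookup from a normalized string to a QUDT unit.
LOOKUP = {"g/m2": "unit:GM-PER-M2"}


def normalize(raw: str) -> str:
    """Apply the substitutions one after another to a raw unit string."""
    # Split camelCase (gramPerMeterSquared -> gram Per Meter Squared), then
    # lower-case. Case folding is a simplification: it would conflate
    # prefixes such as m (milli) and M (mega) in a real unit vocabulary.
    s = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", raw.strip()).lower()
    for pattern, replacement in SUBSTITUTIONS:
        s = re.sub(pattern, replacement, s)
    return s


def to_qudt(raw: str):
    """Return the matched QUDT unit, or None when no match is found."""
    return LOOKUP.get(normalize(raw))


if __name__ == "__main__":
    for raw in ["grams per square meter", "gm/m2", "g/m2", "gm-2",
                "g/m^2", "g.m-2", "g m-2", "gramPerMeterSquared"]:
        print(f"{raw!r} -> {to_qudt(raw)}")
```

Keeping the rules in an ordered table, separate from the lookup, means that a new spelling usually needs only one more rule rather than one more lookup entry per variant.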
- Award ID(s):
- 2224545
- PAR ID:
- 10659617
- Publisher / Repository:
- Environmental Data Initiative
- Date Published:
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
Datasets are often derived by manipulating raw data with statistical software packages. The derivation of a dataset must be recorded in terms of both the raw input and the manipulations applied to it. Statistics packages typically provide limited help in documenting provenance for the resulting derived data. At best, the operations performed by the statistical package are described in a script. Disparate representations make these scripts hard for users to understand. To address these challenges, we created Continuous Capture of Metadata (C2Metadata), a system to capture data transformations in scripts for statistical packages and represent them as metadata in a standard format that is easy to understand. We do so by devising a Structured Data Transformation Algebra (SDTA), which uses a small set of algebraic operators to express a large fraction of the data manipulation performed in practice. We then implement SDTA, inspired by relational algebra, in a data transformation specification language we call SDTL. In this demonstration, we showcase C2Metadata’s capture of data transformations from a pool of sample transformation scripts in at least two languages, SPSS® and Stata® (SAS® and R are under development), for social science data in a large academic repository. We will allow the audience to explore C2Metadata using a web-based interface, visualize the intermediate steps, and trace the provenance and changes of data at different levels for a better understanding of the process.
-
Statistical analysis is a crucial component of many data science analytic pipelines, and preparing data for such analysis is a large part of the data ingestion step. This task is generally accomplished by writing transformation scripts in languages such as SPSS, Stata, SAS, R, Python (Pandas), etc. The disparate data models, language representations and transformation operations supported by these tools make it hard for end users to understand and document the transformations performed, and for developers to port transformation code across languages. Tackling these challenges, we present a formal paradigm for statistical data transformation called SDTA and embody it in a language called SDTL. Experiments with real statistical transformations on socio-economic data show that SDTL can successfully represent 86.1% and 91.6% respectively of 4,185 commands in SAS and 9,087 commands in SPSS obtained from a repository. We illustrate how SDTA/SDTL could assist with the documentation of statistical data transformation, an important aspect often neglected in metadata of datasets. We propose a system called C2Metadata that automatically captures the transformation and provenance information in SDTL as a part of the metadata. Moreover, given the conversion mechanism from a source statistical language to SDTA/SDTL, we show how a data transformation program could be converted to other functionally equivalent programs, permitting code reuse and result reproducibility. We also illustrate the possibility of using SDTA to optimize SDTL transformations using rule-based rewrites similar to SQL optimizations.
-
Statistical data manipulation is a crucial component of many data science analytic pipelines, particularly as part of data ingestion. This task is generally accomplished by writing transformation scripts in languages such as SPSS, Stata, SAS, R, Python (Pandas), etc. The disparate data models, language representations and transformation operations supported by these tools make it hard for end users to understand and document the transformations performed, and for developers to port transformation code across languages. Tackling these challenges, we present a formal paradigm for statistical data transformation. It consists of a data model, called Structured Data Transformation Data Model (SDTDM), inspired by the data models of multiple statistical transformation frameworks; an algebra, Structured Data Transformation Algebra (SDTA), with the ability to transform not only data within SDTDM but also metadata at multiple structural levels; and an equivalent descriptive counterpart, called Structured Data Transformation Language (SDTL), recently adopted by the DDI Alliance, which maintains international standards for metadata, as part of its suite of products. Experiments with real statistical transformations on socio-economic data show that SDTL can successfully represent 86.1% and 91.6% respectively of 4,185 commands in SAS and 9,087 commands in SPSS obtained from a repository. We illustrate with examples how SDTA/SDTL could assist with the documentation of statistical data transformation, an important aspect often neglected in metadata of datasets. We propose a system called C2Metadata that automatically captures the transformation and provenance information in SDTL as a part of the metadata. Moreover, given the conversion mechanism from a source statistical language to SDTA/SDTL, we show how transformation programs could be converted to functionally equivalent programs, in the same or a different language, permitting code reuse and result reproducibility. We also illustrate the possibility of using SDTA to optimize SDTL transformations using rule-based rewrites similar to SQL optimizations.
-
The duality principle for group representations developed in Dutkay et al. (J Funct Anal 257:1133–1143, 2009) and Han and Larson (Bull Lond Math Soc 40:685–695, 2008) shows that the well-known duality principle in Gabor analysis is not an isolated incident but a more general phenomenon residing in the context of group representation theory. There are two other well-known fundamental properties in Gabor analysis: the biorthogonality and the fundamental identity of Gabor analysis. The main purpose of this paper is to show that these two fundamental properties remain true for general projective unitary group representations. Moreover, we also present a general duality theorem which shows that multi-frame generators meet super-frame generators through a dual commutant pair of group representations. Applying it to the Gabor representations, we obtain that $$\{\pi_{\Lambda}(m, n)g_{1} \oplus \cdots \oplus \pi_{\Lambda}(m, n)g_{k}\}_{m, n \in \mathbb{Z}^{d}}$$ is a frame for $$L^{2}(\mathbb{R}^{d}) \oplus \cdots \oplus L^{2}(\mathbb{R}^{d})$$ if and only if $$\cup_{i=1}^{k}\{\pi_{\Lambda^{o}}(m, n)g_{i}\}_{m, n \in \mathbb{Z}^{d}}$$ is a Riesz sequence, and $$\cup_{i=1}^{k}\{\pi_{\Lambda}(m, n)g_{i}\}_{m, n \in \mathbb{Z}^{d}}$$ is a frame for $$L^{2}(\mathbb{R}^{d})$$ if and only if $$\{\pi_{\Lambda^{o}}(m, n)g_{1} \oplus \cdots \oplus \pi_{\Lambda^{o}}(m, n)g_{k}\}_{m, n \in \mathbb{Z}^{d}}$$ is a Riesz sequence, where $$\pi_{\Lambda}$$ and $$\pi_{\Lambda^{o}}$$ are a pair of Gabor representations restricted to a time–frequency lattice $$\Lambda$$ and its adjoint lattice $$\Lambda^{o}$$ in $$\mathbb{R}^{d} \times \mathbb{R}^{d}$$.
