In the metadata of digital environmental datasets, automated processing is hindered by the wide variety of unit representations that may be human-readable but may be ambiguous or not machine-interpretable (e.g., grams per square meter, gm/m2, g/m2, gm-2, g/m^2, g.m-2, g m-2 and gramPerMeterSquared). Matching disparate representations of the same unit to a single unit concept from an ontology assists with interpretation and reuse by providing a link to a complete unit definition with label, description and dimensions. Datasets with shared units can be identified during searches and are more suitable for automated analysis and potential transformation. This dataset contains data and code associated with a project to map units in ecological metadata collected between 2013 and 2022 by DataONE, the Environmental Data Initiative and the U.S. National Ecological Observatory Network to the QUDT ontology using successive string transformations. Data entities include a) raw metadata as received (355,057 unit instances); b) integrated raw data; c) substitution tables for string transformations; d) the resulting lookup table for 896 distinct units matched to QUDT units; e) associated R code used for QUDT matching, plus a web service and R functions for adding annotation elements to Ecological Metadata Language metadata documents. Using these substitutions and code, 91% of unit instances in the raw metadata could be matched to QUDT. Data and results are discussed in: Porter JH, M O’Brien, M Frants, S Earl, M Martin, C Laney. (in review) Using a Units Ontology to Annotate Pre-Existing Metadata. Submitted to Scientific Data.
Using a units ontology to annotate pre-existing metadata
Automated processing of environmental data is hindered by the wide array of unit representations provided in the metadata of digital datasets. For example, gm/m2, g/m2, gm-2, g/m^2, g.m-2 and gramPerMeterSquared are all representations of a single complex unit that might be human-readable but are not machine-interpretable. Connecting ad hoc units to a single unit concept in an ontology permits the identification of datasets sharing units and provides additional information regarding labels, definitions, dimensions and transformations provided in the ontology. Here we use successive string transformations to link ad hoc unit representations to units in the QUDT ontology (e.g., unit:GM-PER-M2). Although only 896 of 7,110 distinct units in a corpus of ecological metadata from DataONE, the Environmental Data Initiative and the U.S. National Ecological Observatory Network were matched, 324,811 of the 355,057 total unit uses (instances) were successfully mapped to QUDT units (91%). The resulting lookup table was used to enable a web service and R functions for adding annotation elements to Ecological Metadata Language documents.
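The idea of successive string transformations followed by a table lookup can be illustrated with a minimal Python sketch. This is not the authors' R code or their substitution tables: the rule list `RULES`, the one-entry `QUDT_LOOKUP` table, and the function names `normalize`, `match_unit` and `coverage` are all hypothetical, chosen only so that the example representations quoted above collapse to a single key. Only the target identifier `unit:GM-PER-M2` is taken from the abstract.

```python
import re
from collections import Counter

# Hypothetical ordered substitution rules (regex, replacement).
# They are applied in sequence, so order matters. These are illustrative
# assumptions, not the substitution tables published with the dataset.
RULES = [
    (r"\bsquare\s+met(?:er|re)s?\b|\bmet(?:er|re)s?\s+squared\b", "m2"),
    (r"\bgrams?\b", "g"),                    # spelled-out gram(s)
    (r"\bgm(?=/)", "g"),                     # "gm" used as gram before a slash
    (r"\bgm(?=-\d)", "g m"),                 # "gm-2" read as "g m-2"
    (r"\s+per\s+", "/"),                     # "g per m2" -> "g/m2"
    (r"\^", ""),                             # "m^2" -> "m2"
    (r"[.\s]+([a-z]+)-(\d)\b", r"/\1\2"),    # "g.m-2" or "g m-2" -> "g/m2"
]

# Hypothetical lookup table from normalized string to QUDT unit.
QUDT_LOOKUP = {"g/m2": "unit:GM-PER-M2"}


def normalize(raw: str) -> str:
    """Split camelCase, lowercase, then apply each rule in order."""
    # Naive camelCase split; it would also split symbols such as "mL",
    # so a real rule set would need to protect those.
    text = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", raw.strip()).lower()
    for pattern, replacement in RULES:
        text = re.sub(pattern, replacement, text)
    return text


def match_unit(raw: str):
    """Return the QUDT identifier for a raw unit string, or None."""
    return QUDT_LOOKUP.get(normalize(raw))


def coverage(counts: Counter):
    """Return (fraction of distinct units matched, fraction of instances matched)."""
    matched = [unit for unit in counts if match_unit(unit) is not None]
    distinct = len(matched) / len(counts)
    instances = sum(counts[unit] for unit in matched) / sum(counts.values())
    return distinct, instances


if __name__ == "__main__":
    for raw in ["gm/m2", "g/m2", "gm-2", "g/m^2", "g.m-2", "gramPerMeterSquared"]:
        print(raw, "->", match_unit(raw))
```

The `coverage` helper separates the two ratios reported in the abstract: a small share of distinct unit strings can still account for most unit instances when the matched strings are the frequently used ones.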
- PAR ID: 10586152
- Publisher / Repository: Nature Communications
- Date Published:
- Journal Name: Scientific Data
- Volume: 12
- Issue: 1
- ISSN: 2052-4463
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
- Aboveground plant biomass and belowground stem and root biomass were measured in moist acidic tussock tundra experimental sites established in 2006 by the Arctic Long-term Ecological Research site (ARC-LTER). Control plots and plots amended with three different levels of nitrogen (N) and phosphorus (P) were sampled: F10 (10 g/m2 N and 5 g/m2 P), F5 (5 g/m2 N and 2.5 g/m2 P) and F2 (2 g/m2 N and 1 g/m2 P).
- The quality of data is extremely important for data analytics. Data quality tests typically involve checking constraints specified by domain experts. Existing approaches detect trivial constraint violations and identify outliers without explaining the constraints that were violated. Moreover, domain experts may specify constraints in an ad hoc manner and miss important ones. We describe an automated data quality test approach, ADQuaTe2, which uses an autoencoder to (1) discover constraints that may have been missed by experts, (2) label as suspicious those records that violate the constraints, and (3) provide explanations about the violations. An interactive learning technique incorporates expert feedback, which improves the accuracy. We evaluate the effectiveness of ADQuaTe2 on real-world datasets from health and plant domains. We also use datasets from the UCI repository to evaluate the improvement in the accuracy after incorporating ground truth knowledge.
- Datasets are often derived by manipulating raw data with statistical software packages. The derivation of a dataset must be recorded in terms of both the raw input and the manipulations applied to it. Statistics packages typically provide limited help in documenting provenance for the resulting derived data. At best, the operations performed by the statistical package are described in a script. Disparate representations make these scripts hard for users to understand. To address these challenges, we created Continuous Capture of Metadata (C2Metadata), a system to capture data transformations in scripts for statistical packages and represent them as metadata in a standard format that is easy to understand. We do so by devising a Structured Data Transformation Algebra (SDTA), which uses a small set of algebraic operators to express a large fraction of data manipulation performed in practice. We then implement SDTA, inspired by relational algebra, in a data transformation specification language we call SDTL. In this demonstration, we showcase C2Metadata’s capture of data transformations from a pool of sample transformation scripts in at least two languages, SPSS® and Stata® (SAS® and R are under development), for social science data in a large academic repository. We will allow the audience to explore C2Metadata using a web-based interface, visualize the intermediate steps and trace the provenance and changes of data at different levels for better understanding of the process.
- Among the sustainable initiatives for renewable energy technologies, anaerobic digestion (AD) is a potential contender to replace fossil fuels. The anaerobic co-digestions of goat manure (GM) with sorghum (SG), cotton gin trash (CGT), and food waste (FW) having different mixing ratios, volumes, temperatures, and additives were optimized in single and two-stage bioreactors. The biochemical methane potential (BMP) assays (having different mixing ratios of double and triple substrates) were run in 250 mL serum bottles in triplicate. The best-yielding ratio was up-scaled to fabricated 2 L bioreactors. The biodegradability, biomethane recovery, and process efficacy are discussed. The co-digestion of GM with SG in a 70:30 ratio yielded the highest biomethane of 239.3 ± 15.6 mL/gvs, and it was further up-scaled to a two-stage temperature-phased process supplemented with an anaerobic medium and fly ash (FA) in fabricated 2 L bioreactors. This system yielded the highest biomethane of 266.0 mL/gvs, having an anaerobic biodegradability of 67.3% in 70:30 GM:SG co-digestion supplemented with an anaerobic medium. The BMP of the FA-amended treatment may be lower because of its high Ca concentration of 205.74 ± 3.6. The liquid fraction of the effluents can be applied as N and P fertigation. The Ca concentration was found to be 24.3, 25.1, and 6.3 g/kg in GM and GM:SG (TS) and SG solid fractions, respectively, whereas K was found to be 26.6, 10.8, and 7.4 g/kg. The carbon to nitrogen ratio of the solid fraction varied between 2.0 and 24.8 for return to the soils to enhance their quality. This study involving feedstock acquisition, characterization, and their anaerobic digestion optimization provides comprehensive information and may assist small farmers operating on-farm anaerobic digesters.