Title: C2Metadata: Automating the Capture of Data Transformations from Statistical Scripts in Data Documentation
Datasets are often derived by manipulating raw data with statistical software packages. The derivation of a dataset must be recorded in terms of both the raw input and the manipulations applied to it. Statistics packages typically provide limited help in documenting provenance for the resulting derived data. At best, the operations performed by the statistical package are described in a script. Disparate representations make these scripts hard for users to understand. To address these challenges, we created Continuous Capture of Metadata (C2Metadata), a system that captures data transformations in scripts for statistical packages and represents them as metadata in a standard format that is easy to understand. We do so by devising a Structured Data Transformation Algebra (SDTA), which uses a small set of algebraic operators to express a large fraction of the data manipulation performed in practice. We then implement SDTA, inspired by relational algebra, in a data transformation specification language we call SDTL. In this demonstration, we showcase C2Metadata’s capture of data transformations from a pool of sample transformation scripts in at least two languages, SPSS® and Stata® (SAS® and R are under development), for social science data in a large academic repository. We will allow the audience to explore C2Metadata using a web-based interface, visualize the intermediate steps, and trace the provenance and changes of data at different levels for a better understanding of the process.
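As a rough illustration of the idea in the abstract, the sketch below maps one variable-creation command from Stata and from SPSS onto the same language-neutral record. It is a minimal sketch under stated assumptions: the `capture` function, the regular expressions, and the record field names (`operation`, `target_variable`, `expression`, `source`) are hypothetical and are not the SDTL schema or the C2Metadata implementation, which would rely on full parsers for each language.

```python
import json
import re

# Hypothetical patterns covering a single command type per language.
# The field names used below are illustrative only, NOT the SDTL schema.
PATTERNS = {
    "stata": re.compile(r"^\s*gen(?:erate)?\s+(\w+)\s*=\s*(.+?)\s*$", re.IGNORECASE),
    "spss": re.compile(r"^\s*compute\s+(\w+)\s*=\s*(.+?)\s*\.\s*$", re.IGNORECASE),
}


def capture(command: str, language: str) -> dict:
    """Return a language-neutral record for a variable-creation command."""
    match = PATTERNS[language].match(command)
    if match is None:
        raise ValueError(f"unsupported {language} command: {command!r}")
    target, expression = match.groups()
    return {
        "operation": "compute",
        "target_variable": target,
        "expression": expression,
        # Keep the original text so the derivation can be traced back.
        "source": {"language": language, "text": command.strip()},
    }


stata_record = capture("generate income_k = income / 1000", "stata")
spss_record = capture("COMPUTE income_k = income / 1000.", "spss")

# Both commands yield the same operation, target, and expression;
# only the recorded source differs.
print(json.dumps(stata_record, indent=2))
print(json.dumps(spss_record, indent=2))
```

Keeping the source text next to the normalized fields is one way to let a reader trace a derived variable back to the command that produced it, regardless of the package the script was written for.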
Award ID(s):
1640575
PAR ID:
10298546
Author(s) / Creator(s):
; ;
Date Published:
Journal Name:
Proceedings of the 2019 International Conference on Management of Data
Page Range / eLocation ID:
2005 to 2008
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Statistical data manipulation is a crucial component of many data science analytic pipelines, particularly as part of data ingestion. This task is generally accomplished by writing transformation scripts in languages such as SPSS, Stata, SAS, R, Python (Pandas), etc. The disparate data models, language representations and transformation operations supported by these tools make it hard for end users to understand and document the transformations performed, and for developers to port transformation code across languages. Tackling these challenges, we present a formal paradigm for statistical data transformation. It consists of a data model, called Structured Data Transformation Data Model (SDTDM), inspired by the data models of multiple statistical transformation frameworks; an algebra, Structural Data Transformation Algebra (SDTA), with the ability to transform not only data within SDTDM but also metadata at multiple structural levels; and an equivalent descriptive counterpart, called Structured Data Transformation Language (SDTL), recently adopted by the DDI Alliance, which maintains international standards for metadata, as part of its suite of products. Experiments with real statistical transformations on socio-economic data show that SDTL can successfully represent 86.1% and 91.6% respectively of 4,185 commands in SAS and 9,087 commands in SPSS obtained from a repository. We illustrate with examples how SDTA/SDTL could assist with the documentation of statistical data transformation, an important aspect often neglected in metadata of datasets. We propose a system called C2Metadata that automatically captures the transformation and provenance information in SDTL as a part of the metadata. Moreover, given the conversion mechanism from a source statistical language to SDTA/SDTL, we show how a data transformation program could be converted to other functionally equivalent programs, in the same or a different language, permitting code reuse and result reproducibility. We also illustrate the possibility of using SDTA to optimize SDTL transformations using rule-based rewrites similar to SQL optimizations.
  2. Structured Data Transformation Language (SDTL) provides structured, machine actionable representations of data transformation commands found in statistical analysis software. The Continuous Capture of Metadata for Statistical Data Project (C2Metadata) created SDTL as part of an automated system that captures provenance metadata from data transformation scripts and adds variable derivations to standard metadata files. SDTL also has potential for auditing scripts and for translating scripts between languages. SDTL is expressed in a set of JSON schemas, which are machine actionable and easily serialized to other formats. Statistical software languages have a number of special features that have been carried into SDTL. We explain how SDTL handles differences among statistical languages and complex operations, such as merging files and reshaping data tables from “wide” to “long”.
  3. Statistical analysis is a crucial component of many data science analytic pipelines, and preparing data for such analysis is a large part of the data ingestion step. This task is generally accomplished by writing transformation scripts in languages such as SPSS, Stata, SAS, R, Python (Pandas), etc. The disparate data models, language representations and transformation operations supported by these tools make it hard for end users to understand and document the transformations performed, and for developers to port transformation code across languages. Tackling these challenges, we present a formal paradigm for statistical data transformation called SDTA and embody it in a language called SDTL. Experiments with real statistical transformations on socio-economic data show that SDTL can successfully represent 86.1% and 91.6% respectively of 4,185 commands in SAS and 9,087 commands in SPSS obtained from a repository. We illustrate how SDTA/SDTL could assist with the documentation of statistical data transformation, an important aspect often neglected in metadata of datasets. We propose a system called C2Metadata that automatically captures the transformation and provenance information in SDTL as a part of the metadata. Moreover, given the conversion mechanism from a source statistical language to SDTA/SDTL, we show how a data transformation program could be converted to other functionally equivalent programs, permitting code reuse and result reproducibility. We also illustrate the possibility of using SDTA to optimize SDTL transformations using rule-based rewrites similar to SQL optimizations.
  4. This repository contains our raw datasets from channel measurements performed at the University of Utah campus. In addition, we have included a document that explains the setup and methodology used to collect this data, as well as a very brief discussion of results. File organization: * documentation/ - Contains a .docx with the description of the setup and evaluation. * data/ - HDF5 files containing both metadata and raw IQ samples for each location at which data was collected. Notice we collected data at 14 different client locations. See map in the attached docx (skipped locations 12 and 16). We deployed 5 different receivers at 5 different rooftops. Due to resource constraints, one set of files contains data from 4 different locations whereas another set contains information from the single remaining location. We have developed a set of Python scripts that allow us to parse and analyze the data. Although not included here, they can be found in our public repository: https://github.com/renew-wireless/RENEWLab You can find the top script here. For more information on the POWDER-RENEW project please visit the POWDER website. The RENEW part of the project focuses on the deployment of an open-source massive MIMO system. Please visit our website for more information.
  5. This is the data archive for: Meyer et al. 2022. Plant neighborhood shapes diversity and reduces interspecific variation of the phyllosphere microbiome. ISME-J. Please cite this article when using these archived data. DOI: 10.1038/s41396-021-01184-6. Included are raw genetic sequences of the V5-V7 region of the 16S rRNA gene derived from experimental leaf surfaces of tomato, pepper, and bean plants. Included in this archive are:
    * Raw sequence data (RawFASTQ.zip)
    * Reproducible R scripts (MeyerEtAl2021_RScript.R, VarPartSupplement.R)
    * R objects corresponding to archived scripts (.RDS)
    * Data for generating certain plots (PermanovaRValues.txt, PermanovaValuesByHost.txt, NeutralModelRValuesByHarvest.txt, VarPartHostEffects.txt)
    * Sample metadata (NeighborhoodMetaData.txt)
    * Phylogenetic Tree file for sample ASVs (PhyloTree.tre)
    * Geographic distance matrix for distances between plots (GeodistNeighborhood.txt)
    * ddPCR (microbial abundance) data (ddPCR_Neighborhood.csv)
    * R script for rarefaction function (Rarefy_mean.R)
    * Taxonomic assignments for all ASVs in study (Taxonomy_Neighborhood.txt)
    * R image files to load R environment instead of running script (MeyerEtAl2021_RScript.RData, VarPartSupplement.RData)