skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Structured data transformation algebra (SDTA) and its applications
Statistical analysis is a crucial component of many data science analytic pipelines, and preparing data for such analysis is a large part of the data ingestion step. This task is generally accomplished by writing transformation scripts in languages such as SPSS, Stata, SAS, R, Python (Pandas) etc. The disparate data models, language representations and transformation operations supported by these tools make it hard for end users to understand and document the transformations performed, and for developers to port transformation code across languages. Tackling these challenges, we present a formal paradigm for statistical data transformation called SDTA and embody in a language called SDTL. Experiments with real statistical transformations on socio-economic data show that SDTL can successfully represent 86.1% and 91.6% respectively of 4,185 commands in SAS and 9,087 commands in SPSS obtained from a repository. We illustrate how SDTA/SDTL could assist with the documentation of statistical data transformation, an important aspect often neglected in metadata of datasets. We propose a system called C2Metadata that automatically captures the transformation and provenance information in SDTL as a part of the metadata. Moreover, given the conversion mechanism from a source statistical language to SDTA/SDTL, we show how a data transformation program could be converted to other functionally equivalent programs, permitting code reuse and result reproducibility. We also illustrate the possibility of using SDTA to optimize SDTL transformations using rule-based rewrites similar to SQL optimizations.  more » « less
Award ID(s):
1640575
PAR ID:
10514927
Author(s) / Creator(s):
; ;
Publisher / Repository:
Springer Nature
Date Published:
Journal Name:
Distributed and Parallel Databases
Volume:
40
Issue:
2-3
ISSN:
0926-8782
Page Range / eLocation ID:
373 to 408
Subject(s) / Keyword(s):
data transformation algebra statistical data data documentation
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. null (Ed.)
    Statistical data manipulation is a crucial component of many data science analytic pipelines, particularly as part of data ingestion. This task is generally accomplished by writing transformation scripts in languages such as SPSS, Stata, SAS, R, Python (Pandas) and etc. The disparate data models, language representations and transformation operations supported by these tools make it hard for end users to understand and document the transformations performed, and for developers to port transformation code across languages. Tackling these challenges, we present a formal paradigm for statistical data transformation. It consists of a data model, called Structured Data Transformation Data Model (SDTDM), inspired by the data models of multiple statistical transformations frameworks; an algebra, Structural Data Transformation Algebra (SDTA), with the ability to transform not only data within SDTDM but also metadata at multiple structural levels; and an equivalent descriptive counterpart, called Structured Data Transformation Language (SDTL), recently adopted by the DDI Alliance that maintains international standards for metadata as part of its suite of products. Experiments with real statistical transformations on socio-economic data show that SDTL can successfully represent 86.1% and 91.6% respectively of 4,185 commands in SAS and 9,087 commands in SPSS obtained from a repository. We illustrate with examples how SDTA/SDTL could assist with the documentation of statistical data transformation, an important aspect often neglected in metadata of datasets.We propose a system called C2Metadata that automatically captures the transformation and provenance information in SDTL as a part of the metadata. Moreover, given the conversion mechanism from a source statistical language to SDTA/SDTL, we show how functional-equivalent transformation programs could be converted to other functionally equivalent programs, in the same or different language, permitting code reuse and result reproducibility, We also illustrate the possibility of using of SDTA to optimize SDTL transformations using rule-based rewrites similar to SQL optimizations. 
    more » « less
  2. null (Ed.)
    Structured Data Transformation Language (SDTL) provides structured, machine actionable representations of data transformation commands found in statistical analysis software.   The Continuous Capture of Metadata for Statistical Data Project (C2Metadata) created SDTL as part of an automated system that captures provenance metadata from data transformation scripts and adds variable derivations to standard metadata files.  SDTL also has potential for auditing scripts and for translating scripts between languages.  SDTL is expressed in a set of JSON schemas, which are machine actionable and easily serialized to other formats.  Statistical software languages have a number of special features that have been carried into SDTL.  We explain how SDTL handles differences among statistical languages and complex operations, such as merging files and reshaping data tables from “wide” to “long”. 
    more » « less
  3. null (Ed.)
    Datasets are often derived by manipulating raw data with statistical software packages. The derivation of a dataset must be recorded in terms of both the raw input and the manipulations applied to it. Statistics packages typically provide limited help in documenting provenance for the resulting derived data. At best, the operations performed by the statistical package are described in a script. Disparate representations make these scripts hard to understand for users. To address these challenges, we created Continuous Capture of Metadata (C2Metadata), a system to capture data transformations in scripts for statistical packages and represent it as metadata in a standard format that is easy to understand. We do so by devising a Structured Data Transformation Algebra (SDTA), which uses a small set of algebraic operators to express a large fraction of data manipulation performed in practice. We then implement SDTA, inspired by relational algebra, in a data transformation specification language we call SDTL. In this demonstration, we showcase C2Metadata’s capture of data transformations from a pool of sample transformation scripts in at least two languages: SPSS®and Stata®(SAS®and R are under development), for social science data in a large academic repository. We will allow the audience to explore C2Metadata using a web-based interface, visualize the intermediate steps and trace the provenance and changes of data at different levels for better understanding of the process. 
    more » « less
  4. Although language-family specific traits which do not find direct counterparts outside a given language family are usually ignored in quantitative phylogenetic studies, scholars have made ample use of them in qualitative investigations, revealing their potential for identifying language relationships. An example of such a family specific trait are body-part expressions in Pano languages, which are often lexicalized forms, composed of bound roots (also called body-part prefixes in the literature) and non-productive derivative morphemes (called here body-part formatives). We use various statistical methods to demonstrate that whereas body-part roots are generally conservative, body-part formatives exhibit diverse chronologies and are often the result of recent and parallel innovations. In line with this, the phylogenetic structure of body-part roots projects the major branches of the family, while formatives are highly non-tree-like. Beyond its contribution to the phylogenetic analysis of Pano languages, this study provides significative insights into the role of grammatical innovations for language classification, the origin of morphological complexity in the Amazon and the phylogenetic signal of specific grammatical traits in language families. 
    more » « less
  5. In recent years, the FAIR guiding principles and the broader concept of open science has grown in importance in academic research, especially as funding entities have aggressively promoted public sharing of research products. Key to public research sharing is deposition of datasets into online data repositories, but it can be a chore to transform messy unstructured data into the forms required by these repositories. To help generate Metabolomics Workbench depositions, we have developed the MESSES (Metadata from Experimental SpreadSheets Extraction System) software package, implemented in the Python 3 programming language and supported on Linux, Windows, and Mac operating systems. MESSES helps transform tabular data from multiple sources into a Metabolomics Workbench specific deposition format. The package provides three commands, extract, validate, and convert, that implement a natural data transformation workflow. Moreover, MESSES facilitates richer metadata capture than is typically attempted by manual efforts. The source code and extensive documentation is hosted on GitHub and is also available on the Python Package Index for easy installation. 
    more » « less