Statistical data manipulation is a crucial component of many data
science analytic pipelines, particularly as part of data ingestion. This
task is generally accomplished by writing transformation scripts in
languages such as SPSS, Stata, SAS, R, and Python (Pandas). The
disparate data models, language representations and transformation
operations supported by these tools make it hard for end users to
understand and document the transformations performed, and for
developers to port transformation code across languages.
Tackling these challenges, we present a formal paradigm for
statistical data transformation. It consists of a data model, called
Structured Data Transformation Data Model (SDTDM), inspired by
the data models of multiple statistical transformation frameworks;
an algebra, Structural Data Transformation Algebra (SDTA), with the
ability to transform not only data within SDTDM but also metadata
at multiple structural levels; and an equivalent descriptive counterpart,
called Structured Data Transformation Language (SDTL),
recently adopted by the DDI Alliance, which maintains international
standards for metadata, as part of its suite of products. Experiments
with real statistical transformations on socio-economic data show
that SDTL can successfully represent 86.1% of 4,185 SAS commands
and 91.6% of 9,087 SPSS commands obtained from a repository.
We illustrate with examples how SDTA/SDTL could assist with
the documentation of statistical data transformation, an important
aspect often neglected in the metadata of datasets. We propose a system
called C2Metadata that automatically captures the transformation
and provenance information in SDTL as a part of the metadata.
Moreover, given the conversion mechanism from a source statistical
language to SDTA/SDTL, we show how transformation programs
could be converted to functionally equivalent programs, in the
same or a different language, permitting code reuse and result
reproducibility. We also illustrate the possibility of using SDTA
to optimize SDTL transformations using rule-based rewrites
similar to SQL optimizations.