Structured data transformation algebra (SDTA) and its applications

Song, Jie; Alter, George; Jagadish, H V

doi:10.1007/s10619-022-07418-6

Citation Details

Structured data transformation algebra (SDTA) and its applications

Statistical analysis is a crucial component of many data science analytic pipelines, and preparing data for such analysis is a large part of the data ingestion step. This task is generally accomplished by writing transformation scripts in languages such as SPSS, Stata, SAS, R, Python (Pandas) etc. The disparate data models, language representations and transformation operations supported by these tools make it hard for end users to understand and document the transformations performed, and for developers to port transformation code across languages. Tackling these challenges, we present a formal paradigm for statistical data transformation called SDTA and embody in a language called SDTL. Experiments with real statistical transformations on socio-economic data show that SDTL can successfully represent 86.1% and 91.6% respectively of 4,185 commands in SAS and 9,087 commands in SPSS obtained from a repository. We illustrate how SDTA/SDTL could assist with the documentation of statistical data transformation, an important aspect often neglected in metadata of datasets. We propose a system called C2Metadata that automatically captures the transformation and provenance information in SDTL as a part of the metadata. Moreover, given the conversion mechanism from a source statistical language to SDTA/SDTL, we show how a data transformation program could be converted to other functionally equivalent programs, permitting code reuse and result reproducibility. We also illustrate the possibility of using SDTA to optimize SDTL transformations using rule-based rewrites similar to SQL optimizations. more »