Subtle Bugs Everywhere: Generating Documentation for Data Wrangling Code

Yang, Chenyang; Zhou, Shurui; Guo, Jin L.C.; Kästner, Christian

doi:10.1109/ASE51524.2021.9678520

Data scientists reportedly spend 60 to 80 percent of their time on data wrangling, i.e., cleaning data and extracting features. However, data wrangling code is often repetitive and error-prone to write. Moreover, it is easy to introduce subtle bugs when reusing and adopting existing code; such bugs do not cause crashes but reduce model quality. To support data scientists with data wrangling, we present a technique to generate interactive documentation for data wrangling code. We use (1) program synthesis techniques to automatically summarize data transformations and (2) test case selection techniques to purposefully select representative examples from the data based on execution information collected with tailored dynamic program analysis. We demonstrate that a JupyterLab extension with our technique can provide documentation for many cells in popular notebooks, and we find in a user study that users with our plugin are faster and more effective at finding realistic bugs in data wrangling code.
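To make the kind of bug described in the abstract concrete, the following is a minimal sketch in plain Python. It is not the authors' tool or algorithm: the function names (clean_income, change_pattern, representative_examples) and the sample data are hypothetical, and grouping rows by an observed before/after pattern is only a crude stand-in for the execution information the paper collects with dynamic program analysis. The sketch shows a cleaning step that never crashes but silently turns missing values into 0, and how showing one example per change pattern can surface that behavior.

```python
def clean_income(value):
    """Hypothetical wrangling step: parse an income string into an integer."""
    text = value.strip().replace("$", "").replace(",", "")
    if text.lower().endswith("k"):
        return int(float(text[:-1]) * 1000)
    if not text.isdigit():
        return 0  # subtle bug: missing or malformed values silently become 0
    return int(text)


def change_pattern(before, after):
    """Classify how one value was transformed (assumed, simplified categories)."""
    if after == 0 and before.strip() not in ("0", "$0"):
        return "replaced by 0"
    if before.strip().isdigit():
        return "parsed unchanged"
    return "reformatted"


def representative_examples(values, transform):
    """Return the first (before, after) pair seen for each change pattern."""
    examples = {}
    for before in values:
        after = transform(before)
        examples.setdefault(change_pattern(before, after), (before, after))
    return examples


raw = ["52000", "$48,500", "61k", "N/A", "", "39000", "-"]
summary = representative_examples(raw, clean_income)
for pattern, (before, after) in summary.items():
    print(f"{pattern}: {before!r} -> {after}")
```

A reader scanning the "replaced by 0" example sees at a glance that "N/A" became a valid-looking income of 0, which would skew a downstream model without raising any error.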

Award ID(s): 1852260

PAR ID: 10313199

Author(s) / Creator(s): Yang, Chenyang; Zhou, Shurui; Guo, Jin L.C.; Kästner, Christian

Date Published: 2021-11-17

Journal Name: Proceedings of the 36th IEEE/ACM International Conference on Automated Software Engineering (ASE), Los Alamitos, CA: IEEE Computer Society

Format(s): Medium: X

Sponsoring Org: National Science Foundation

Free Publicly Accessible Full Text
Accepted Manuscript 1.0
Conference Paper: https://doi.org/10.1109/ASE51524.2021.9678520
