Large-scale analysis of source code, and in particular scientific source code, holds the promise of better understanding the data science process, identifying analytical best practices, and providing insights to the builders of scientific toolkits. However, large corpora have remained unanalyzed in depth, as descriptive labels are absent and require expert domain knowledge to generate. We propose a novel weakly supervised transformer-based architecture for computing joint representations of code from both abstract syntax trees and surrounding natural language comments. We then evaluate the model on a new classification task for labeling computational notebook cells as stages in the data analysis process, from data import to wrangling, exploration, modeling, and evaluation. We show that our model, leveraging only easily available weak supervision, achieves a 38% increase in accuracy over expert-supplied heuristics and outperforms a suite of baselines. Our model enables us to examine a set of 118,000 Jupyter Notebooks to uncover common data analysis patterns. Focusing on notebooks with relationships to academic articles, we conduct the largest study of scientific code to date and find that notebooks that devote a higher fraction of code to the typically labor-intensive process of wrangling data are associated, in expectation, with lower citation counts for the corresponding papers. We also show significant differences between academic and non-academic notebooks, including that academic notebooks devote substantially more code to wrangling and exploring data, and less to modeling.
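As an illustration of the kind of weak supervision this abstract refers to, the sketch below assigns a weak analysis-stage label to a notebook cell from simple API-usage heuristics. It is a minimal, assumed example: the keyword lists and the labeling function are hypothetical and are not the paper's implementation; only the stage names follow the pipeline described above.

```python
# Minimal illustrative sketch: weak stage labels for notebook cells from
# API-usage heuristics. The keyword patterns below are assumptions, not the
# paper's actual supervision source.
import re
from typing import Optional

WEAK_LABEL_RULES = {
    "import":   [r"\bread_csv\b", r"\bread_sql\b", r"\bload_dataset\b"],
    "wrangle":  [r"\bmerge\b", r"\bdropna\b", r"\bfillna\b", r"\bgroupby\b"],
    "explore":  [r"\bdescribe\b", r"\bplot\b", r"\bhist\b", r"\bscatter\b"],
    "model":    [r"\bfit\(", r"\bLogisticRegression\b", r"\bRandomForest"],
    "evaluate": [r"\bscore\(", r"\baccuracy_score\b", r"\bcross_val_score\b"],
}

def weak_label(cell_source: str) -> Optional[str]:
    """Return a weak stage label for a code cell, or None if no rule fires."""
    for stage, patterns in WEAK_LABEL_RULES.items():
        if any(re.search(p, cell_source) for p in patterns):
            return stage
    return None

print(weak_label("df = pd.read_csv('data.csv')"))  # -> 'import'
```

Labels of this kind are cheap to produce at corpus scale, which is what makes the weakly supervised setup attractive compared with expert annotation.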
Automated Assessment of Quality of Jupyter Notebooks Using Artificial Intelligence and Big Code
We present in this paper an automated method to assess the quality of Jupyter notebooks. The quality of notebooks is assessed in terms of reproducibility and executability. Specifically, we automatically extract a number of expert-defined features for each notebook, perform a feature selection step, and then train supervised binary classifiers to predict whether a notebook is reproducible and executable, respectively. We also experiment with semantic code embeddings to capture the notebooks' semantics. We evaluated these methods on a dataset of 306,539 notebooks and achieved an F1 score of 0.87 for reproducibility and 0.96 for executability (using expert-defined features), and an F1 score of 0.81 for reproducibility and 0.78 for executability (using code embeddings). Our results suggest that semantic code embeddings can determine, with good performance, the reproducibility and executability of Jupyter notebooks, and since they can be derived automatically, they do not require expert involvement to define features.
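As a rough sketch of the feature-based pipeline described above (expert-defined features, a feature-selection step, then a supervised binary classifier evaluated with F1), the snippet below uses scikit-learn with placeholder data. The feature matrix, the choice of estimator, and the selection method are assumptions for illustration only, not the authors' configuration.

```python
# Sketch only: expert-defined notebook features -> feature selection ->
# binary classifier for reproducibility. Data and model choices are assumed.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Hypothetical per-notebook features, e.g. cell count, markdown ratio,
# number of imports, whether execution counts are monotonic.
X = np.random.rand(200, 10)           # placeholder feature matrix
y = np.random.randint(0, 2, 200)      # 1 = reproducible, 0 = not

clf = Pipeline([
    ("select", SelectKBest(f_classif, k=5)),      # feature selection step
    ("model", LogisticRegression(max_iter=1000)),  # supervised binary classifier
])

# F1 is the metric reported in the abstract.
scores = cross_val_score(clf, X, y, cv=5, scoring="f1")
print(scores.mean())
```

An analogous classifier for executability would be trained separately on the same features, matching the paper's description of one binary predictor per quality dimension.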
- Award ID(s): 1918751
- PAR ID: 10311570
- Date Published:
- Journal Name: The International FLAIRS Conference Proceedings
- Volume: 34
- Issue: 1
- ISSN: 2334-0762
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
- Jupyter Notebooks are an enormously popular tool for creating and narrating computational research projects. They also have enormous potential for creating reproducible scientific research artifacts. Capturing the complete state of a notebook has additional benefits; for instance, the notebook execution may be split between local and remote resources, where the latter may have more powerful processing capabilities or store large or access-limited data. When examined in detail, there are several challenges to making notebooks fully reproducible: the notebook code must be replicated entirely, and the underlying Python runtime environments must be identical. More subtle problems arise in replicating referenced data, external library dependencies, and runtime variable states. This paper presents solutions to these problems using Jupyter's standard extension mechanisms to create an archivable system state for a running notebook. We show that these additional mechanisms, which involve interacting with the underlying Linux kernel, do not introduce substantial execution-time overhead, demonstrating the approach's feasibility. (See the dependency-snapshot sketch after this list.)
- Current computational notebooks, such as Jupyter, are a popular tool for data science and analysis. However, they use a 1D list structure for cells that introduces and exacerbates user issues, such as messiness, tedious navigation, inefficient use of large screen space, performance of non-linear analyses, and presentation of non-linear narratives. To ameliorate these issues, we designed a prototype extension for Jupyter Notebooks that enables 2D organization of computational notebook cells into multiple columns. In this paper, we present two evaluative studies to determine whether such “2D computational notebooks” provide advantages over the current computational notebook structure. From these studies, we found empirical evidence that our multi-column 2D computational notebooks provide enhanced efficiency and usability. We also gathered design feedback which may inform future work. Overall, the prototype was positively received, with some users expressing a clear preference for 2D computational notebooks even at this early stage of development.
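Relating to the notebook-state-capture abstract in the list above: one small piece of the state that must be archived to replicate a notebook's environment is the exact set of installed package versions. The sketch below records them using only the Python standard library; the output file name and JSON format are assumptions, not the paper's mechanism.

```python
# Illustrative sketch: snapshot installed package versions so a notebook's
# library dependencies can later be reconstructed. Not the paper's extension.
import json
from importlib import metadata

def snapshot_dependencies(path: str = "environment_snapshot.json") -> dict:
    """Record installed package versions to a JSON file and return them."""
    versions = {dist.metadata["Name"]: dist.version
                for dist in metadata.distributions()}
    with open(path, "w") as fh:
        json.dump(versions, fh, indent=2, sort_keys=True)
    return versions

snapshot_dependencies()
```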