Computational notebooks promote exploration by structuring code, output, and explanatory text into cells. The input code and rich outputs help users iteratively investigate ideas as they explore or analyze data. The links between these cells, that is, how the cells depend on each other, are important for understanding how analyses were developed and how the results can be reproduced. Specifically, a code cell that uses a particular identifier depends on the cell where that identifier is defined or mutated. Because notebooks promote fluid editing in which cells can be moved and run in any order, cell dependencies are not always clear or easy to follow. We examine different tools that seek to address this problem by extending Jupyter notebooks and evaluate how well they support users in accomplishing tasks that require understanding dependencies. We also evaluate visualization techniques that provide views of the dependencies to help users navigate them.
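The dependency rule stated above (a cell that uses an identifier depends on the cell where it was last defined or mutated) can be sketched in a few lines of Python. This is only an illustrative approximation, not the method of any tool examined in the abstract: the function names `names_defined_and_used` and `cell_dependencies` are hypothetical, cells are assumed to be given in execution order, and scoping inside functions and ordering within a single cell are ignored.

```python
import ast


def names_defined_and_used(source):
    """Return (defined, used) identifier sets for one cell's source code."""
    defined, used = set(), set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                defined.add(node.id)
            else:
                used.add(node.id)
        elif isinstance(node, ast.AugAssign) and isinstance(node.target, ast.Name):
            # `x += 1` mutates x, so it also reads the previous value.
            used.add(node.target.id)
        elif isinstance(node, (ast.FunctionDef, ast.ClassDef)):
            defined.add(node.name)
        elif isinstance(node, (ast.Import, ast.ImportFrom)):
            for alias in node.names:
                defined.add((alias.asname or alias.name).split(".")[0])
    return defined, used


def cell_dependencies(cells):
    """Map each cell index to the earlier cells it depends on.

    `cells` is a list of source strings, assumed to be in execution order.
    """
    last_definer = {}  # identifier -> index of the latest cell that defined it
    deps = {}
    for index, source in enumerate(cells):
        defined, used = names_defined_and_used(source)
        deps[index] = sorted({last_definer[n] for n in used if n in last_definer})
        for name in defined:
            last_definer[name] = index
    return deps


cells = [
    "import math\nradius = 2.0",
    "area = math.pi * radius ** 2",
    "radius = 3.0",
    "print(area, radius)",
]
print(cell_dependencies(cells))  # {0: [], 1: [0], 2: [], 3: [1, 2]}
```

In this example the last cell depends on cell 2 for `radius`, while `area` still reflects the value set in cell 0. That kind of stale result is what makes dependencies hard to follow once cells are run out of order.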
Reliable Data Use In R, 10.17605/OSF.IO/VKJ9Q
As R is becoming a standard research tool, a basic question remains: how to reliably reference data used in R programs? Almost any R user can sympathize with the problems of local paths (e.g., read.csv("path/to/file.csv")), and URLs aren't much better (and are worse when bandwidth is limited). R developers have largely sidestepped this problem by packaging the data with the code, which has made datasets like "iris" and "mtcars" famous. However, this approach is of little help for real-world data. With the help of code examples and the R package "contentid", we show how to write R code that is agnostic to the location of data, works with local and remote data, and ensures that the requested data is used.
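The idea of referencing data by its content rather than its location can be illustrated with a small Python sketch. This is not the API of the "contentid" package. The `Registry` class and its `register` and `resolve` methods are hypothetical names chosen for illustration, and the `hash://sha256/` prefix is assumed here only as a readable identifier format. The sketch uses local files only. A remote location could be handled in the same way by downloading the bytes before hashing them.

```python
import hashlib
import tempfile
from pathlib import Path


def content_id(data: bytes) -> str:
    """Derive an identifier from the bytes themselves (assumed format)."""
    return "hash://sha256/" + hashlib.sha256(data).hexdigest()


class Registry:
    """Hypothetical registry: remembers where content with a given id was seen."""

    def __init__(self):
        self._locations = {}  # identifier -> list of known paths

    def register(self, path):
        path = Path(path)
        identifier = content_id(path.read_bytes())
        self._locations.setdefault(identifier, []).append(path)
        return identifier

    def resolve(self, identifier):
        """Return the first known location whose current bytes still match."""
        for path in self._locations.get(identifier, []):
            if path.exists() and content_id(path.read_bytes()) == identifier:
                return path
        raise LookupError(f"no verified copy of {identifier}")


with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "original.csv"
    mirror = Path(tmp) / "mirror.csv"
    original.write_bytes(b"species,count\nsetosa,50\n")
    mirror.write_bytes(original.read_bytes())

    registry = Registry()
    data_id = registry.register(original)
    registry.register(mirror)

    # Simulate the original location silently changing its content.
    original.write_bytes(b"species,count\nsetosa,49\n")

    resolved = registry.resolve(data_id)
    print(resolved.name)  # mirror.csv
```

Because the identifier is checked against the bytes on every resolve, the analysis code names what data it wants rather than where it lives, and a changed or missing copy is skipped instead of being used by mistake.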
- Award ID(s): 1839201
- PAR ID: 10192238
- Date Published:
- Journal Name: 4th Annual Digital Data in Biodiversity Research, 1-3 June 2020
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
-
Summary: Efficient tools and applications for analyzing genomic diversity are essential for any genetics research program. One such tool, TASSEL (Trait Analysis by aSSociation, Evolution and Linkage), provides many core methods for genomic analyses. Despite its efficiency, TASSEL has limited means to use scripting languages for reproducible research and for interacting with other analytical tools. Here we present an R package, rTASSEL, a front-end that connects to a variety of highly used TASSEL methods and analytical tools. The goal of this package is to create a unified scripting workflow that exploits the analytical prowess of TASSEL in conjunction with R’s popular data handling and parsing capabilities without requiring the user to switch between these two environments. By implementing this workflow, we can achieve performance ranging from approximately 2 to 20 times faster than other widely used R packages for various functionalities. Availability and implementation: rTASSEL is implemented in R using core TASSEL methods written in Java. The source code for rTASSEL can be found at https://bitbucket.org/bucklerlab/rtassel/src/master/. The source code for TASSEL can be found at https://bitbucket.org/tasseladmin/tassel-5-source/src/master/.
-
The R programming language is widely used for statistical computing. To enable interactive data exploration and rapid prototyping, R encourages a dynamic programming style. This programming style is supported by features such as first-class environments. Amongst widely used languages, R has the richest interface for programmatically manipulating environments. With the flexibility afforded by reflective operations on first-class environments come significant challenges for reasoning about and optimizing user-defined code. This paper documents the reflective interface used to operate over first-class environments. We explain the rationale behind its design and conduct a large-scale study of how the interface is used in popular libraries.
-
Data science pipelines to train and evaluate models with machine learning may contain bugs just like any other code. Leakage between training and test data can lead to overestimating the model’s accuracy during offline evaluations, possibly leading to deployment of low-quality models in production. Such leakage can happen easily by mistake or by following poor practices, but may be tedious and challenging to detect manually. We develop a static analysis approach to detect common forms of data leakage in data science code. Our evaluation shows that our analysis accurately detects data leakage and that such leakage is pervasive among over 100,000 analyzed public notebooks. We discuss how our static analysis approach can help both practitioners and educators, and how leakage prevention can be designed into the development process.
-
A. Oh; T. Naumann; A. Globerson; K. Saenko; M. Hardt; S. Levine (Ed.) In the theory of lossy compression, the rate-distortion (R-D) function R(D) describes how much a data source can be compressed (in bit-rate) at any given level of fidelity (distortion). Obtaining R(D) for a given data source establishes the fundamental performance limit for all compression algorithms. We propose a new method to estimate R(D) from the perspective of optimal transport. Unlike the classic Blahut-Arimoto algorithm, which fixes the support of the reproduction distribution in advance, our Wasserstein gradient descent algorithm learns the support of the optimal reproduction distribution by moving particles. We prove its local convergence and analyze the sample complexity of our R-D estimator based on a connection to entropic optimal transport. Experimentally, we obtain comparable or tighter bounds than state-of-the-art neural network methods on low-rate sources while requiring considerably less tuning and computation effort. We also highlight a connection to maximum-likelihood deconvolution and introduce a new class of sources that can be used as test cases with known solutions to the R-D problem.

