Title: Data Jamboree: A Party of Open-Source Software Solving Real-World Data Science Problems
The evolving focus in statistics and data science education highlights the growing importance of computing. This paper presents the Data Jamboree, a live event that combines computational methods with traditional statistical techniques to address real-world data science problems. Participants, ranging from novices to experienced users, followed workshop leaders in using open-source tools like Julia, Python, and R to perform tasks such as data cleaning, manipulation, and predictive modeling. The Jamboree showcased the educational benefits of working with open data, providing participants with practical, hands-on experience. We compared the tools in terms of efficiency, flexibility, and statistical power, with Julia excelling in performance, Python in versatility, and R in statistical analysis and visualization. The paper concludes with recommendations for designing similar events to encourage collaborative learning and critical thinking in data science.
Award ID(s):
2105571
PAR ID:
10596335
Author(s) / Creator(s):
; ; ; ;
Publisher / Repository:
New England Statistical Society
Date Published:
Journal Name:
The New England Journal of Statistics in Data Science
ISSN:
2693-7166
Page Range / eLocation ID:
1 to 9
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Much of modern science takes place in a computational environment, and, increasingly, that environment is programmed using R, Python, or Julia. Furthermore, most scientific data now live on the cloud, so the first step in many workflows is to query a cloud database and load the response into a computational environment for further analysis. Thus, tools that facilitate programmatic data retrieval represent a critical component in reproducible scientific workflows. Earth science is no different in this regard. To fulfill that basic need, we developed R, Python, and Julia packages providing programmatic access to the U.S. Geological Survey’s National Water Information System database and the multi-agency Water Quality Portal. Together, these packages create a common interface for retrieving hydrologic data in the Jupyter ecosystem, which is widely used in water research, operations, and teaching. Source code, documentation, and tutorials for the packages are available on GitHub. Users can go there to learn, raise issues, or contribute improvements within a single platform, which helps foster better engagement and collaboration between data providers and their users. 
  2. Context: Practitioners prefer to achieve performance without sacrificing productivity when developing scientific software. The Julia programming language is designed to develop performant computer programs without sacrificing productivity by providing a syntax that is scripting in nature. According to the Julia programming language website, the common projects are data science, machine learning, scientific domains, and parallel computing. While Julia has yielded benefits with respect to productivity, programs written in Julia can include security weaknesses, which can hamper the security of Julia-based scientific software. A systematic derivation of security weaknesses can facilitate secure development of Julia programs, an area that remains under-explored. Objective: The goal of this paper is to help practitioners securely develop Julia programs by conducting an empirical study of security weaknesses found in Julia programs. Method: We apply qualitative analysis on 4,592 Julia programs used in 126 open-source Julia projects to identify security weakness categories. Next, we construct a static analysis tool called JuliaStaticAnalysisTool (JSAT) that automatically identifies security weaknesses in Julia programs. We apply JSAT to automatically identify security weaknesses in 558 open-source Julia projects consisting of 25,008 Julia programs. Results: We identify 7 security weakness categories, which include the usage of hard-coded passwords and unsafe invocation. From our empirical study we identify 23,839 security weaknesses. On average, we observe 24.9% of Julia source code files to include at least one of the 7 security weakness categories. Conclusion: Based on our research findings, we recommend rigorous inspection efforts during code reviews. We also recommend further development and application of security static analysis tools so that security weaknesses in Julia programs can be detected before execution.
  3. Statistical analysis is a crucial component of many data science analytic pipelines, and preparing data for such analysis is a large part of the data ingestion step. This task is generally accomplished by writing transformation scripts in languages such as SPSS, Stata, SAS, R, and Python (Pandas). The disparate data models, language representations and transformation operations supported by these tools make it hard for end users to understand and document the transformations performed, and for developers to port transformation code across languages. Tackling these challenges, we present a formal paradigm for statistical data transformation called SDTA and embody it in a language called SDTL. Experiments with real statistical transformations on socio-economic data show that SDTL can successfully represent 86.1% and 91.6% respectively of 4,185 commands in SAS and 9,087 commands in SPSS obtained from a repository. We illustrate how SDTA/SDTL could assist with the documentation of statistical data transformation, an important aspect often neglected in metadata of datasets. We propose a system called C2Metadata that automatically captures the transformation and provenance information in SDTL as a part of the metadata. Moreover, given the conversion mechanism from a source statistical language to SDTA/SDTL, we show how a data transformation program could be converted to other functionally equivalent programs, permitting code reuse and result reproducibility. We also illustrate the possibility of using SDTA to optimize SDTL transformations using rule-based rewrites similar to SQL optimizations.
  4. Statistical data manipulation is a crucial component of many data science analytic pipelines, particularly as part of data ingestion. This task is generally accomplished by writing transformation scripts in languages such as SPSS, Stata, SAS, R, and Python (Pandas). The disparate data models, language representations and transformation operations supported by these tools make it hard for end users to understand and document the transformations performed, and for developers to port transformation code across languages. Tackling these challenges, we present a formal paradigm for statistical data transformation. It consists of a data model, called Structured Data Transformation Data Model (SDTDM), inspired by the data models of multiple statistical transformation frameworks; an algebra, Structural Data Transformation Algebra (SDTA), with the ability to transform not only data within SDTDM but also metadata at multiple structural levels; and an equivalent descriptive counterpart, called Structured Data Transformation Language (SDTL), recently adopted by the DDI Alliance, which maintains international standards for metadata, as part of its suite of products. Experiments with real statistical transformations on socio-economic data show that SDTL can successfully represent 86.1% and 91.6% respectively of 4,185 commands in SAS and 9,087 commands in SPSS obtained from a repository. We illustrate with examples how SDTA/SDTL could assist with the documentation of statistical data transformation, an important aspect often neglected in metadata of datasets. We propose a system called C2Metadata that automatically captures the transformation and provenance information in SDTL as a part of the metadata. Moreover, given the conversion mechanism from a source statistical language to SDTA/SDTL, we show how transformation programs could be converted to other functionally equivalent programs, in the same or a different language, permitting code reuse and result reproducibility. We also illustrate the possibility of using SDTA to optimize SDTL transformations using rule-based rewrites similar to SQL optimizations.
  5. Open data programs have become increasingly established at national and local levels of government. While the degree of success these programs have had in achieving their objectives remains open to question, one factor that has been identified as important to any success is the role of open data intermediaries, individuals and organizations that help others to make use of open data. In this paper we investigate how people become engaged with open data, what their motivations are, and the barriers and facilitators program participants perceive with regard to using open data effectively. We interview participants from a variety of backgrounds with differing levels of experience and engagement with open data. Participants include students learning how to train others in open data techniques and tools; people who attend open data events and use open data for commercial or social benefit; and representatives from local government, municipal agencies and a civic tech non-profit. We identify pathways to successfully developing and nurturing a community of open data intermediaries, and make five recommendations for organizations planning and managing open data programs. 