skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Reproducibility Starts at the Source: R, Python, and Julia Packages for Retrieving USGS Hydrologic Data
Much of modern science takes place in a computational environment, and, increasingly, that environment is programmed using R, Python, or Julia. Furthermore, most scientific data now live on the cloud, so the first step in many workflows is to query a cloud database and load the response into a computational environment for further analysis. Thus, tools that facilitate programmatic data retrieval represent a critical component in reproducible scientific workflows. Earth science is no different in this regard. To fulfill that basic need, we developed R, Python, and Julia packages providing programmatic access to the U.S. Geological Survey’s National Water Information System database and the multi-agency Water Quality Portal. Together, these packages create a common interface for retrieving hydrologic data in the Jupyter ecosystem, which is widely used in water research, operations, and teaching. Source code, documentation, and tutorials for the packages are available on GitHub. Users can go there to learn, raise issues, or contribute improvements within a single platform, which helps foster better engagement and collaboration between data providers and their users.  more » « less
Award ID(s):
1931297
PAR ID:
10549842
Author(s) / Creator(s):
; ; ; ; ;
Publisher / Repository:
Water
Date Published:
Journal Name:
Water
Volume:
15
Issue:
24
ISSN:
2073-4441
Page Range / eLocation ID:
4236
Subject(s) / Keyword(s):
packaged workflows water data reproducibility open science open data open source R Python Julia Jupyter USGS
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. This paper introduces a lidar and radar meteorology science gateway deployed on the NSF Jetstream2 cloud, designed to enhance educational and research activities in atmospheric science. Utilizing the "Zero to JupyterHub with Kubernetes" workflow, we have created a science gateway that integrates lidar and radar meteorology software packages, notably the Lidar Radar Open Software Environment (LROSE). This integration allows users to execute applications directly from the JupyterLab terminal, streamlining the creation of datasets for further anal- ysis and visualization within Jupyter notebooks. By combining traditional command-line operations with modern Python-based tools for data analysis and visualization, this gateway provides a robust end-to-end solution that caters to both educational and research needs. The gateway has already been vital in facilitating LROSE instructional workshops and will see future enhancements such as GPU acceleration to boost performance. Our work demonstrates the significant potential of merging established scientific computing techniques with advanced Python environments, opening new avenues for computational science education and research. 
    more » « less
  2. The evolving focus in statistics and data science education highlights the growing importance of computing. This paper presents the Data Jamboree, a live event that combines computational methods with traditional statistical techniques to address real-world data science problems. Participants, ranging from novices to experienced users, followed workshop leaders in using open-source tools like Julia, Python, and R to perform tasks such as data cleaning, manipulation, and predictive modeling. The Jamboree showcased the educational benefits of working with open data, providing participants with practical, hands-on experience. We compared the tools in terms of efficiency, flexibility, and statistical power, with Julia excelling in performance, Python in versatility, and R in statistical analysis and visualization. The paper concludes with recommendations for designing similar events to encourage collaborative learning and critical thinking in data science. 
    more » « less
  3. To improve accessibility and community knowledge of applications in the Lidar Radar Open Software Environment (LROSE), a team from the National Science Foundation (NSF) National Center for Atmospheric Research, Colorado State University, and NSF Unidata has developed a lidar and radar meteorology science gateway deployed on the NSF Jetstream2 cloud. Utilizing the “Zero to JupyterHub with Kubernetes” workflow, the science gateway integrates LROSE with other lidar and radar meteorology software packages. This integration allows users to execute applications directly from the JupyterLab terminal, streamlining the creation of datasets for further analysis and visualization within Jupyter notebooks. By combining traditional command-line operations with modern Python-based tools for data analysis and visualization, this gateway provides a robust end-to-end solution that caters to both educational and research needs. The gateway has already facilitated LROSE instructional workshops and classroom exercises. Our work demonstrates the significant potential of merging established scientific computing techniques with advanced Python environments, opening new avenues for computational science education and research. The LROSE team has acquired successive allocations on the NSF Jetstream2 cloud at Indiana University through ACCESS. To develop the LROSE Science Gateway, we employed the “Zero to JupyterHub with Kubernetes” workflow ported to the NSF Jetstream2 cloud, enabling rapid and scalable deployment to accommodate a variable number of users. Authentication is managed through either GitHub OAuth or temporary credentials, depending on the situation. Since LROSE is a collection of C/C++ applications, we configured Docker containers based on the Jupyter Docker Stack to integrate the LROSE software, available via the JupyterLab terminal. These containers also include Conda package manager environments equipped with Python packages like Py-ART, CSU RadarTools, and Metpy for further data analysis. A shared drive accessible to all participants contains instructional datasets for lidar and radar data analysis. Tutorials take the form of Jupyter notebooks for use by individuals, in classroom exercises, or at instructional workshops. Some tutorials are complete with pre-loaded examples to quickly visualize workflows and results. Other tutorials guide students how to run the applications independently. All tutorials are hosted on the LROSE Science Gateway GitHub repository, which is open to contributions from colleagues and community members. Future plans include an "intermediate" level workshop on SAMURAI, one of the multi-Doppler wind applications of the LROSE suite. Additionally, work is currently underway to run GUI applications in the same browser-based JupyterLab environment. GUI applications for radar and lidar data visualization utilize the QT framework and present unique technical challenges. The techniques to accomplish GUI access have immediate applications for other GUI programs, such as NSF Unidata's IDV and their version of the AWIPS CAVE data visualization tools. Lastly, as demand for the resources found on the gateway increases, it becomes increasingly important to efficiently manage the Jetstream2 resources allocated by the ACCESS program. LROSE, NSF Unidata, San Diego Supercomputing Center (SDSC), and Indiana University staff are working together to deploy and evaluate Kubernetes cluster auto-scaling. With auto-scaling, resources will no longer sit idle while awaiting new logins and will instead be provisioned on-demand. 
    more » « less
  4. Workflow management systems (WMSs) are commonly used to organize/automate sequences of tasks as workflows to accelerate scientific discoveries. During complex workflow modeling, a local interactive workflow environment is desirable, as users usually rely on their rich, local environments for fast prototyping and refinements before they consider using more powerful computing resources. However, existing WMSs do not simultaneously support local interactive workflow environments and HPC resources. In this paper, we present an on-demand access mechanism to remote HPC resources from desktop/laptopbased workflow management software to compose, monitor and analyze scientific workflows in the CyberWater project. Cyber- Water is an open-data and open-modeling software framework for environmental and water communities. In this work, we extend the open-model, open-data design of CyberWater with on-demand HPC accessing capacity. In particular, we design and implement the LaunchAgent library, which can be integrated into the local desktop environment to allow on-demand usage of remote resources for hydrology-related workflows. LaunchAgent manages authentication to remote resources, prepares the computationally-intensive or data-intensive tasks as batch jobs, submits jobs to remote resources, and monitors the quality of services for the users. LaunchAgent interacts seamlessly with other existing components in CyberWater, which is now able to provide advantages of both feature-rich desktop software experience and increased computation power through on-demand HPC/Cloud usage. In our evaluations, we demonstrate how a hydrology workflow that consists of both local and remote tasks can be constructed and show that the added on-demand HPC/Cloud usage helps speeding up hydrology workflows while allowing intuitive workflow configurations and execution using a desktop graphical user interface. 
    more » « less
  5. Abstract AQME, automated quantum mechanical environments, is a free and open‐source Python package for the rapid deployment of automated workflows using cheminformatics and quantum chemistry. AQME workflows integrate tasks performed across multiple computational chemistry packages and data formats, preserving all computational protocols, data, and metadata for machine and human users to access and reuse. AQME has a modular structure of independent modules that can be implemented in any sequence, allowing the users to use all or only the desired parts of the program. The code has been developed for researchers with basic familiarity with the Python programming language. The CSEARCH module interfaces to molecular mechanics and semi‐empirical QM (SQM) conformer generation tools (e.g., RDKit and Conformer–Rotamer Ensemble Sampling Tool, CREST) starting from various initial structure formats. The CMIN module enables geometry refinement with SQM and neural network potentials, such as ANI. The QPREP module interfaces with multiple QM programs, such as Gaussian, ORCA, and PySCF. The QCORR module processes QM results, storing structural, energetic, and property data while also enabling automated error handling (i.e., convergence errors, wrong number of imaginary frequencies, isomerization, etc.) and job resubmission. The QDESCP module provides easy access to QM ensemble‐averaged molecular descriptors and computed properties, such as NMR spectra. Overall, AQME provides automated, transparent, and reproducible workflows to produce, analyze and archive computational chemistry results. SMILES inputs can be used, and many aspects of tedious human manipulation can be avoided. Installation and execution on Windows, macOS, and Linux platforms have been tested, and the code has been developed to support access through Jupyter Notebooks, the command line, and job submission (e.g., Slurm) scripts. Examples of pre‐configured workflows are available in various formats, and hands‐on video tutorials illustrate their use. This article is categorized under:Data Science > ChemoinformaticsData Science > Computer Algorithms and ProgrammingSoftware > Quantum Chemistry 
    more » « less