This content will become publicly available on January 1, 2026

Title: Toward reproducible and interoperable environmental modeling: Integration of HydroShare with server-side methods for exposing large-extent spatial datasets to models
Reproducible environmental modelling often relies on spatial datasets as inputs, typically manually subset for specific areas. Yet, models can benefit from a data distribution approach facilitated by online repositories, and automating processes to foster reproducibility. This study introduces a method leveraging diverse state-scale spatial datasets to create cohesive packages for GIS-based environmental modelling. These datasets were generated and shared via GeoServer and THREDDS Data Server connected to HydroShare, contrasting with conventional distribution methods. Using the Regional Hydro-Ecologic Simulation System (RHESSys) across three U.S. catchment-scale watersheds, we demonstrate minimal errors in spatial inputs and model streamflow outputs compared to traditional approaches. This spatial data-sharing method facilitates consistent model creation, fostering reproducibility. Its broader impact allows scientists to tailor the method to various use cases, such as exploring different scales beyond state-scale or applying it to other online repositories using existing data distribution systems, eliminating the need to develop their own.
Award ID(s):
1664061 2118329
PAR ID:
10555226
Publisher / Repository:
Elsevier
Journal Name:
Environmental Modelling & Software
Volume:
183
Issue:
C
ISSN:
1364-8152
Page Range / eLocation ID:
106239
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Reporting specific modelling methods and metadata is essential to the reproducibility of ecological studies, yet guidelines rarely exist regarding what information should be noted. Here, we address this issue for ecological niche modelling or species distribution modelling, a rapidly developing toolset in ecology used across many aspects of biodiversity science. Our quantitative review of the recent literature reveals a general lack of sufficient information to fully reproduce the work. Over two-thirds of the examined studies neglected to report the version or access date of the underlying data, and only half reported model parameters. To address this problem, we propose adopting a checklist to guide studies in reporting at least the minimum information necessary for ecological niche modelling reproducibility, offering a straightforward way to balance efficiency and accuracy. We encourage the ecological niche modelling community, as well as journal reviewers and editors, to utilize and further develop this framework to facilitate and improve the reproducibility of future work. The proposed checklist framework is generalizable to other areas of ecology, especially those utilizing biodiversity data, environmental data and statistical modelling, and could also be adopted by a broader array of disciplines.
  2. Arthropods contribute importantly to ecosystem functioning but remain understudied. This undermines the validity of conservation decisions. Modern methods are now making arthropods easier to study, since arthropods can be mass-trapped, mass-identified, and semi-mass-quantified into ‘many-row (observation), many-column (species)’ datasets, with homogeneous error, high resolution, and copious environmental-covariate information. These ‘novel community datasets’ let us efficiently generate information on arthropod species distributions, conservation values, uncertainty, and the magnitude and direction of human impacts. We use a DNA-based method (barcode mapping) to produce an arthropod-community dataset from 121 Malaise-trap samples, and combine it with 29 remote-imagery layers using a deep neural net in a joint species distribution model. With this approach, we generate distribution maps for 76 arthropod species across a 225 km² temperate-zone forested landscape. We combine the maps to visualize the fine-scale spatial distributions of species richness, community composition, and site irreplaceability. Old-growth forests show distinct community composition and higher species richness, and stream courses have the highest site-irreplaceability values. With this ‘sideways biodiversity modelling’ method, we demonstrate the feasibility of biodiversity mapping at sufficient spatial resolution to inform local management choices, while also being efficient enough to scale up to thousands of square kilometres. This article is part of the theme issue ‘Towards a toolkit for global insect biodiversity monitoring’.
  3. In machine learning research, it is common to evaluate algorithms via their performance on standard benchmark datasets. While a growing body of work establishes guidelines for—and levies criticisms at—data and benchmarking practices in machine learning, comparatively less attention has been paid to the data repositories where these datasets are stored, documented, and shared. In this paper, we analyze the landscape of these benchmark data repositories and the role they can play in improving benchmarking. This role includes addressing issues with both datasets themselves (e.g., representational harms, construct validity) and the manner in which evaluation is carried out using such datasets (e.g., overemphasis on a few datasets and metrics, lack of reproducibility). To this end, we identify and discuss a set of considerations surrounding the design and use of benchmark data repositories, with a focus on improving benchmarking practices in machine learning. 
  4. Aldrich, Jonathan; Salvaneschi, Guido (Ed.)
    Large-scale software repositories are a source of insights for software engineering. They offer an unmatched window into the software development process at scale. Their sheer number and size hold the promise of broadly applicable results. At the same time, that very size presents practical challenges for scaling tools and algorithms to millions of projects. A reasonable approach is to limit studies to representative samples of the population of interest. Broadly applicable conclusions can then be obtained by generalizing to the entire population. The contribution of this paper is a standardized experimental design methodology for choosing the inputs of studies working with large-scale repositories. We advocate for a methodology that clearly lays out what the population of interest is and how to sample it, and that fosters reproducibility. Along the way, we discourage researchers from using extrinsic attributes of projects, such as stars, that measure some unclear notion of popularity.
  5. Integrated hydrological modeling is an effective method for understanding interactions between parts of the hydrologic cycle, quantifying water resources, and furthering knowledge of hydrologic processes. However, these models are dependent on robust and accurate datasets that physically represent spatial characteristics as model inputs. This study evaluates multiple data‐driven approaches for estimating hydraulic conductivity and subsurface properties at the continental scale, constructed from existing subsurface dataset components. Each subsurface configuration represents upper (unconfined) hydrogeology, lower (confined) hydrogeology, and the presence of a vertical flow barrier. Configurations are tested in two large‐scale U.S. watersheds using an integrated model. Model results are compared to observed streamflow and steady state water table depth (WTD). We provide model results for a range of configurations and show that both WTD and surface water partitioning are important indicators of performance. We also show that geology data source, total subsurface depth, anisotropy, and inclusion of a vertical flow barrier are the most important considerations for subsurface configurations. While a range of configurations proved viable, we provide a recommended Selected National Configuration 1 km resolution subsurface dataset for use in distributed large‐ and continental‐scale hydrologic modeling.