There is strong agreement across the sciences that replicable workflows are needed for computational modeling. Open and replicable workflows not only strengthen public confidence in the sciences, but also result in more efficient community science. However, the massive size and complexity of geoscience simulation outputs, as well as the large cost to produce and preserve these outputs, present problems related to data storage, preservation, duplication, and replication. The simulation workflows themselves present additional challenges related to usability, understandability, documentation, and citation. These challenges make it difficult for researchers to meet the bewildering variety of data management requirements and recommendations across research funders and scientific journals. This paper introduces initial outcomes and emerging themes from the EarthCube Research Coordination Network project titled “What About Model Data? Best Practices for Preservation and Replicability,” which is working to develop tools to assist researchers in determining what elements of geoscience modeling research should be preserved and shared to meet evolving community open science expectations. Specifically, the paper offers approaches to address the following key questions:
- How should preservation of model software and outputs differ for projects that are oriented toward knowledge production vs. projects oriented toward data production?
- What components of dynamical geoscience modeling research should be preserved and shared?
- What curation support is needed to enable sharing and preservation for geoscience simulation models and their output?
- What cultural barriers impede geoscience modelers from making progress on these topics?
What about Model Data? Best Practices for Preservation and Replicability
Abstract It has become common for researchers to make their data publicly available to meet the data management and accessibility requirements of funding agencies and scientific publishers. However, many researchers face the challenge of determining what data to preserve and share and where to preserve and share those data. This can be especially challenging for those who run dynamical models, which can produce complex, voluminous data outputs, and who have not considered as part of the project design what outputs may need to be preserved and shared. This manuscript presents findings from the NSF EarthCube Research Coordination Network project titled “What About Model Data? Best Practices for Preservation and Replicability” (https://modeldatarcn.github.io/). These findings suggest that if the primary goal of sharing data is to communicate knowledge, most simulation-based research projects only need to preserve and share selected model outputs along with the full simulation experiment workflow. One major result of this project has been the development of a rubric designed to provide guidance for deciding what simulation output needs to be preserved and shared in trusted community repositories to achieve the goal of knowledge communication. This rubric, along with use cases for selected projects, provides scientists with guidance on data accessibility requirements during the research planning process, allowing for more thoughtful development of data management plans and funding requests. Additionally, publishers can refer to this rubric when setting expectations for data accessibility in publications.
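The paper describes the rubric qualitatively. As a purely illustrative sketch of how such a weighted-criteria decision aid could be encoded, the snippet below scores a project along the knowledge-production vs. data-production spectrum; the criterion names, weights, and thresholds are assumptions made here for illustration and are not the actual rubric published at https://modeldatarcn.github.io/.

```python
# Illustrative sketch only: hypothetical criteria, weights, and thresholds,
# NOT the actual rubric from https://modeldatarcn.github.io/.
from dataclasses import dataclass


@dataclass
class Criterion:
    """One rubric question, scored 0.0 (knowledge-oriented) to 1.0 (data-oriented)."""
    name: str
    score: float   # assigned by the researcher for their project
    weight: float  # relative importance (hypothetical)


def data_production_index(criteria: list[Criterion]) -> float:
    """Weighted average indicating how data-production-oriented a project is."""
    total_weight = sum(c.weight for c in criteria)
    return sum(c.score * c.weight for c in criteria) / total_weight


def recommendation(index: float) -> str:
    """Map the index to a coarse preservation recommendation (illustrative thresholds)."""
    if index < 0.4:
        return ("Knowledge production: preserve selected outputs plus the full "
                "simulation experiment workflow (code, configuration, provenance).")
    if index < 0.7:
        return "Mixed: preserve the workflow plus outputs other groups are likely to reuse."
    return "Data production: preserve and document the full output dataset in a trusted repository."


if __name__ == "__main__":
    example = [
        Criterion("Outputs intended for reuse by other groups", score=0.3, weight=2.0),
        Criterion("Cost/feasibility of re-running the simulations", score=0.5, weight=1.0),
        Criterion("Community demand for the raw model output", score=0.2, weight=1.5),
    ]
    idx = data_production_index(example)
    print(f"Data-production index: {idx:.2f}")
    print(recommendation(idx))
```

For real planning decisions the published rubric and its use cases should be consulted directly; this sketch only shows how the knowledge-production vs. data-production distinction can drive a concrete preservation choice.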
- Award ID(s): 1929757
- PAR ID: 10477877
- Publisher / Repository: American Meteorological Society
- Date Published:
- Journal Name: Bulletin of the American Meteorological Society
- Volume: 104
- Issue: 11
- ISSN: 0003-0007
- Page Range / eLocation ID: E2053 to E2064
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation