

This content will become publicly available on November 1, 2024

Title: What about Model Data? Best Practices for Preservation and Replicability
Abstract

It has become common for researchers to make their data publicly available to meet the data management and accessibility requirements of funding agencies and scientific publishers. However, many researchers face the challenge of determining what data to preserve and share, and where to preserve and share those data. This can be especially challenging for researchers who run dynamical models, which can produce complex, voluminous outputs, and who have not considered as part of the project design which outputs may need to be preserved and shared. This manuscript presents findings from the NSF EarthCube Research Coordination Network project titled “What About Model Data? Best Practices for Preservation and Replicability” (https://modeldatarcn.github.io/). These findings suggest that if the primary goal of sharing data is to communicate knowledge, most simulation-based research projects only need to preserve and share selected model outputs along with the full simulation experiment workflow. One major result of this project has been the development of a rubric designed to provide guidance for deciding what simulation output needs to be preserved and shared in trusted community repositories to achieve the goal of knowledge communication. This rubric, along with use cases for selected projects, provides scientists with guidance on data accessibility requirements during the research planning process, allowing for more thoughtful development of data management plans and funding requests. Additionally, publishers can refer to this rubric when setting expectations for data accessibility at publication.

 
Award ID(s):
1929757
NSF-PAR ID:
10477877
Author(s) / Creator(s):
Publisher / Repository:
American Meteorological Society
Date Published:
Journal Name:
Bulletin of the American Meteorological Society
Volume:
104
Issue:
11
ISSN:
0003-0007
Page Range / eLocation ID:
E2053 to E2064
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. There is strong agreement across the sciences that replicable workflows are needed for computational modeling. Open and replicable workflows not only strengthen public confidence in the sciences, but also result in more efficient community science. However, the massive size and complexity of geoscience simulation outputs, as well as the large cost to produce and preserve these outputs, present problems related to data storage, preservation, duplication, and replication. The simulation workflows themselves present additional challenges related to usability, understandability, documentation, and citation. These challenges make it difficult for researchers to meet the bewildering variety of data management requirements and recommendations across research funders and scientific journals. This paper introduces initial outcomes and emerging themes from the EarthCube Research Coordination Network project titled “What About Model Data? - Best Practices for Preservation and Replicability,” which is working to develop tools to assist researchers in determining what elements of geoscience modeling research should be preserved and shared to meet evolving community open science expectations. Specifically, the paper offers approaches to address the following key questions:
     • How should preservation of model software and outputs differ for projects that are oriented toward knowledge production vs. projects oriented toward data production?
     • What components of dynamical geoscience modeling research should be preserved and shared?
     • What curation support is needed to enable sharing and preservation for geoscience simulation models and their output?
     • What cultural barriers impede geoscience modelers from making progress on these topics?
  2.
    The Twitter-Based Knowledge Graph for Researchers project is an effort to construct a knowledge graph of computation-based tasks and corresponding outputs. It will be utilized by subject matter experts, statisticians, and developers. A knowledge graph is a directed graph of knowledge accumulated from a variety of sources. For our application, Subject Matter Experts (SMEs) are experts in their respective non-computer-science fields, but are not necessarily experienced with running heavy computation on datasets. As a result, they find it difficult to generate workflows for their projects involving Twitter data and advanced analysis. Workflow management systems and libraries that facilitate computation are only practical when the users of these systems understand what analysis they need to perform. Our goal is to bridge this gap in understanding. Our queryable knowledge graph will generate a visual workflow for these experts and researchers to achieve their project goals. After meeting with our client, we established two primary deliverables. First, we needed to create an ontology of all Twitter-related information that an SME might want to query. Second, we needed to build a knowledge graph based on this ontology and produce a set of APIs to trigger a set of network algorithms based on the information queried from the graph. An ontology is simply the class structure/schema for the graph. Throughout future meetings, we established some more specific additional requirements. Most importantly, the client stressed that users should be able to bring their own data and add it to our knowledge graph. As more research is completed and new technologies are released, it will be important to be able to edit and add to the knowledge graph. Next, we must be able to provide metrics about the data itself. These metrics will be useful both for our own work and for future research surrounding graph search problems and search optimization.
Additionally, our system should provide users with information regarding the original domain that the algorithms and workflows were run against, so that they can choose the best workflow for their data. The project team first conducted a literature review, reading reports from the CS5604 Information Retrieval courses in 2016 and 2017 to extract information related to Twitter data and algorithms. This information was used to construct our raw ontology in Google Sheets, which contained a set of dataset-algorithm-dataset tuples. The raw ontology was then converted into nodes and edges CSV files for building the knowledge graph. After implementing our original solution on a CentOS virtual machine hosted by the Virginia Tech Department of Computer Science, we transitioned our solution to Grakn, an open-source knowledge graph database that supports hypergraph functionality. When finalizing our workflow paths, we noted that some nodes depended on the completion of two or more inputs, representing an "AND" edge. This phenomenon is modeled as a hyperedge in Grakn, motivating our transition from Neo4j to Grakn. Currently, our system supports queries through the console, where a user can type a Graql statement to retrieve information about data in the graph, from relationships to entities to derived rules. The user can also interact with the data via Grakn's data visualizer, Workbase, entering Graql queries to visualize connections within the knowledge graph.
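The dataset-algorithm-dataset tuples and the workflow-path lookup described above can be sketched in plain Python. This is only an illustrative sketch: the project itself loaded its nodes/edges CSV files into Neo4j and later Grakn, and the column names, node IDs, and sample records below are hypothetical assumptions, not the project's actual schema.

```python
# Hypothetical sketch of the nodes/edges CSV stage described above.
# Column names, node IDs, and records are illustrative assumptions.
import csv
import io
from collections import deque

NODES_CSV = """id,type
tweets_raw,dataset
pagerank,algorithm
influencer_scores,dataset
"""

EDGES_CSV = """source,target
tweets_raw,pagerank
pagerank,influencer_scores
"""

def build_graph(nodes_csv, edges_csv):
    """Read dataset/algorithm nodes and directed edges into adjacency lists."""
    nodes = {r["id"]: r["type"] for r in csv.DictReader(io.StringIO(nodes_csv))}
    adj = {n: [] for n in nodes}
    for r in csv.DictReader(io.StringIO(edges_csv)):
        adj[r["source"]].append(r["target"])
    return nodes, adj

def workflow_path(adj, start, goal):
    """BFS for one dataset -> algorithm -> dataset chain an SME could follow."""
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in adj.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

nodes, adj = build_graph(NODES_CSV, EDGES_CSV)
print(workflow_path(adj, "tweets_raw", "influencer_scores"))
# → ['tweets_raw', 'pagerank', 'influencer_scores']
```

Note that an "AND" dependency, where a node requires two or more completed inputs, has no single-edge analogue in a plain digraph like this one; that is exactly the hyperedge case that motivated the project's move to Grakn.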
  3. Abstract

    Open science and open data within scholarly research programs are growing both in popularity and by requirement from grant funding agencies and journal publishers. A central component of open data management, especially on collaborative, multidisciplinary, and multi-institutional science projects, is documentation of complete and accurate metadata, workflow, and source code, in addition to access to raw data and data products, to uphold FAIR (Findable, Accessible, Interoperable, Reusable) principles. Although best practice in data/metadata management is to use established, internationally accepted metadata schemata, many of these standards are discipline-specific, making it difficult to catalog multidisciplinary data and data products in a way that is easily findable and accessible. Consequently, scattered and incompatible metadata records create a barrier to scientific innovation, as researchers are burdened to find and link multidisciplinary datasets. One possible solution to increase data findability, accessibility, interoperability, reproducibility, and integrity within multi-institutional and interdisciplinary projects is a centralized and integrated data management platform. Overall, this type of interoperable framework supports reproducible open science and its dissemination to various stakeholders and the public in a FAIR manner by providing direct access to raw data and linking protocols, metadata, and supporting workflow materials.

     
  4. This is a story about the challenges and opportunities that surfaced while answering a deceptively complex question - where's the data? As faculty and researchers publish articles, datasets, and other research outputs to meet promotion and tenure requirements, address federal funding policies, and satisfy institutional open access and data sharing policies, many online locations for publishing these materials have developed over time. How can we capture where all of the research generated on an academic campus is shared and preserved? This presentation will discuss how our multi-institution collaboration, the Reality of Academic Data Sharing (RADS) Initiative, sought to answer this question. We programmatically pulled DOIs from DataCite and CrossRef, making the naive assumption that these platforms, the two predominant DOI registration agencies for US data, would present us with a neutral and unbiased view of where data from our affiliated researchers were shared. However, as we dug into the data, we found inconsistencies in the use and completeness of the metadata fields necessary for our questions, as well as differences in how DOIs were assigned across repositories. Additionally, we recognized the systematic and privileged bias introduced by our choice of data sources. Specifically, while DataCite and CrossRef provide easy discovery of research outputs because they aggregate DOIs, they are also costly commercial services. Many repositories that cannot afford such services, or lack the local staffing and knowledge required to use them, are left out of the technology that has recently been labeled “global research infrastructure”. Our presentation will identify the challenges we encountered in conducting this research, specifically around finding, cleaning, and interpreting the data.
We will further engage the audience in a discussion around increasing representation in the global research infrastructure to discover and account for more research outputs. 
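The metadata-completeness problem described above can be sketched with a small, offline check. The record shape below loosely imitates the "attributes" envelope of DataCite's REST API responses, but the field list, sample DOIs, and function names are illustrative assumptions, not a faithful client for either registry:

```python
# Sketch of a metadata-completeness check like the one implied above.
# Field names and sample DOIs are hypothetical, for illustration only.
REQUIRED_FIELDS = ("publisher", "url", "resourceTypeGeneral")

def completeness(record):
    """Report which fields needed to locate a dataset are present and non-empty."""
    attrs = record.get("attributes", {})
    return {f: bool(attrs.get(f)) for f in REQUIRED_FIELDS}

def missing(records):
    """Return DOIs whose metadata cannot answer 'where's the data?'."""
    return [r["id"] for r in records if not all(completeness(r).values())]

sample = [
    {"id": "10.1234/abcd", "attributes": {
        "publisher": "Zenodo",
        "url": "https://zenodo.org/record/1",
        "resourceTypeGeneral": "Dataset"}},
    {"id": "10.1234/efgh", "attributes": {"publisher": "", "url": None}},
]
print(missing(sample))  # only the second DOI lacks required fields
```

A check along these lines makes the cleaning step concrete: incomplete records are flagged per field rather than silently dropped, so the bias introduced by missing metadata can itself be measured.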
  5. Abstract

    Having the means to share research data openly is essential to modern science. For human research, a key aspect in this endeavor is obtaining consent from participants, not just to take part in a study, which is a basic ethical principle, but also to share their data with the scientific community. To ensure that the participants' privacy is respected, national and/or supranational regulations and laws are in place. It is, however, not always clear to researchers what the implications of those are, nor how to comply with them. The Open Brain Consent (https://open-brain-consent.readthedocs.io) is an international initiative that aims to provide researchers in the brain imaging community with information about data sharing options and tools. We present here a short history of this project and its latest developments, and share pointers to consent forms, including a template consent form that is compliant with the EU general data protection regulation. We also share pointers to an associated data user agreement that is not only useful in the EU context, but also for any researchers dealing with personal (clinical) data elsewhere.

     