

Title: MetaFAIR: A Metadata Application Profile for Managing Research Data
Abstract

This paper reports on the development of MetaFAIR, a metadata application profile (AP) designed to support research data management (RDM) by making research data findable, accessible, interoperable, and reusable. The development of MetaFAIR followed a three‐step process: learning about the characteristics of datasets from researchers to establish their context and requirements, followed by iterative design and testing informed by researchers' feedback. Guided by the FAIR principles (Findability, Accessibility, Interoperability, and Reusability), MetaFAIR focuses on accommodating description needs particular to computational social science datasets while seeking to provide elements general enough to describe data collections across many different domains. The paper first places MetaFAIR in the context of historical and recent developments in RDM and application profile creation, then describes the central considerations and challenges of the MetaFAIR development process and discusses its significance for future work in RDM.

 
NSF-PAR ID:
10305739
Publisher / Repository:
Wiley Blackwell (John Wiley & Sons)
Journal Name:
Proceedings of the Association for Information Science and Technology
Volume:
58
Issue:
1
ISSN:
2373-9231
Page Range / eLocation ID:
p. 337-345
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    Why the new findings matter

    The process of teaching and learning is complex, multifaceted and dynamic. This paper contributes a seminal resource to highlight the digitisation of the educational sciences by demonstrating how new machine learning methods can be effectively and reliably used in research, education and practical application.

    Implications for educational researchers and policy makers

    The progressing digitisation of societies around the globe and the impact of the SARS‐CoV‐2 pandemic have highlighted the vulnerabilities and shortcomings of educational systems. These developments have shown the necessity of effective educational processes that can support sometimes overwhelmed teachers in imparting knowledge digitally, a need now on the agenda of many governments and policy makers. Educational scientists, corporate partners and stakeholders can make use of machine learning techniques to develop advanced, scalable educational processes that account for the individual needs of learners and that can complement and support existing learning infrastructure. The proper use of machine learning methods can contribute essential applications to the educational sciences, such as (semi‐)automated assessments, algorithmic grading, personalised feedback and adaptive learning approaches. However, these promises depend on at least a basic understanding of the concepts of machine learning and a degree of data literacy, which has to become the standard in education and the educational sciences.

    Demonstrating both the promises and the challenges that are inherent to the collection and the analysis of large educational data with machine learning, this paper covers the essential topics that their application requires and provides easy‐to‐follow resources and code to facilitate the process of adoption.

     
  2. Abstract  
  3. Abstract

    Large-scale genotype and phenotype data have been increasingly generated to identify genetic markers, understand gene function and evolution and facilitate genomic selection. These datasets hold immense value for both current and future studies, as they are vital for crop breeding, yield improvement and overall agricultural sustainability. However, integrating these datasets from heterogeneous sources presents significant challenges and hinders their effective utilization. We established the Genotype-Phenotype Working Group in November 2021 as a part of the AgBioData Consortium (https://www.agbiodata.org) to review current data types and resources that support archiving, analysis and visualization of genotype and phenotype data, and to understand the needs and challenges of the plant genomic research community. For 2021–22, we identified different types of datasets and examined metadata annotations related to experimental design, methods, sample collection, etc. Furthermore, we thoroughly reviewed publicly funded repositories for raw and processed data as well as secondary databases and knowledgebases that enable the integration of heterogeneous data in the context of the genome browser, pathway networks and tissue-specific gene expression. Based on our survey, we identify a need for (i) additional infrastructural support for archiving many new data types, (ii) development of community standards for data annotation and formatting, (iii) resources for biocuration and (iv) analysis and visualization tools to connect genotype data with phenotype data to enhance knowledge synthesis and to foster translational research. Although this paper only covers the data and resources relevant to the plant research community, we expect that similar issues and needs are shared by researchers working on animals.

    Database URL: https://www.agbiodata.org.

     
  4. Abstract

    The Rocky Mountain Biological Laboratory (RMBL; Colorado, USA) is the site for many research projects spanning decades, taxa, and research fields from ecology to evolutionary biology to hydrology and beyond. Climate is the focus of much of this work and provides important context for the rest. There are five major sources of data on climate in the RMBL vicinity, each with unique variables, formats, and temporal coverage. These data sources include (1) RMBL resident billy barr, (2) the National Oceanic and Atmospheric Administration (NOAA), (3) the United States Geological Survey (USGS), (4) the United States Department of Agriculture (USDA), and (5) Oregon State University's PRISM Climate Group. Both the NOAA and the USGS have automated meteorological stations in Crested Butte, CO, ~10 km from the RMBL, while the USDA has an automated meteorological station on Snodgrass Mountain, ~2.5 km from the RMBL. Each of these data sets has unique spatial and temporal coverage and formats. Despite the wealth of work on climate‐related questions using data from the RMBL, previous researchers have each had to access and format their own climate records, make decisions about handling missing data, and recreate data summaries. Here we provide a single curated climate data set of daily observations covering the years 1975–2022 that blends information from all five sources and includes annotated scripts documenting decisions for handling data. These synthesized climate data will facilitate future research, reduce duplication of effort, and increase our ability to compare results across studies. The data set includes information on precipitation (water and snow), snowmelt date, temperature, wind speed, soil moisture and temperature, and stream flows, all publicly available from a combination of sources. 
    In addition to the formatted raw data, we provide several new variables that are commonly used in ecological analyses, including growing degree days, growing season length, a cold severity index, hard frost days, an index of El Niño‐Southern Oscillation, and aridity (standardized precipitation evapotranspiration index). These new variables are calculated from the daily weather records. As appropriate, data are also presented as minima, maxima, means, residuals, and cumulative measures for various time scales including days, months, seasons, and years. The RMBL is a global research hub. Scientists on site at the RMBL come from many countries and produce about 50 peer‐reviewed publications each year. Researchers from around the world also routinely use data from the RMBL for synthetic work, and educators around the United States use data from the RMBL for teaching modules. This curated and combined data set will be useful to a wide audience. Along with the synthesized combined data set we include the raw data and the R code for cleaning the raw data and creating the monthly and yearly data sets, which make it easier to add further years or data using the same standardized protocols. No copyright or proprietary restrictions are associated with using this data set; please cite this data paper when the data are used in publications or scientific events.

     
  5. SUMMARY

    Cis‐regulatory elements (CREs) are important sequences for gene expression and for plant biological processes such as development, evolution, domestication, and stress response. However, studying CREs in plant genomes has been challenging. The totipotent nature of plant cells, coupled with the inability to maintain plant cell types in culture and the inherent technical challenges posed by the cell wall, has limited our understanding of how plant cell types acquire and maintain their identities and respond to the environment via CRE usage. Advances in single‐cell epigenomics have revolutionized the identification of cell‐type‐specific CREs. These new technologies have the potential to significantly advance our understanding of plant CRE biology and to shed light on how the regulatory genome gives rise to diverse plant phenomena. However, there are significant biological and computational challenges associated with analyzing single‐cell epigenomic datasets. In this review, we discuss the historical and foundational underpinnings of plant single‐cell research, describe the challenges and common pitfalls in the analysis of plant single‐cell epigenomic data, and highlight biological challenges unique to plants. Additionally, we discuss how the application of single‐cell epigenomic data in various contexts stands to transform our understanding of the importance of CREs in plant genomes.

     