Title: Bridging data silos to holistically model plant macrophenology
Summary Phenological response to global climate change can impact ecosystem functions. There are various data sources from which spatiotemporal and taxonomic phenological data may be obtained: mobilized herbaria, community science initiatives, observatory networks, and remote sensing. However, analyses conducted to date have generally relied on single sources of these data. Siloed treatment of data in analyses may be due to the lack of harmonization across different data sources that offer partially nonoverlapping information and are often complementary. Such treatment precludes a deeper understanding of phenological responses at varying macroecological scales. Here, we describe a detailed vision for the harmonization of phenological data, including the direct integration of disparate sources of phenological data using a common schema. Specifically, we highlight existing methods for data harmonization that can be applied to phenological data: data design patterns, metadata standards, and ontologies. We describe how harmonized data from multiple sources can be integrated into analyses using existing methods and discuss the use of automated extraction techniques. Data harmonization is not a new concept in ecology, but the harmonization of phenological data is overdue. We aim to highlight the need for better data harmonization, providing a roadmap for how harmonized phenological data may fill gaps while simultaneously being integrated into analyses.
Award ID(s):
2105932 2217817 2101884
PAR ID:
10598902
Author(s) / Creator(s):
 ;  ;  ;  ;  ;  ;  ;  ;  ;  
Publisher / Repository:
Wiley-Blackwell
Date Published:
Journal Name:
New Phytologist
ISSN:
0028-646X
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Phenological response to global climate change can impact ecosystem functions. There are various data sources from which spatiotemporal and taxonomic phenological data may be obtained: mobilized herbaria, community science initiatives, observatory networks, and remote sensing. However, analyses conducted to date have generally relied on single sources of these data. Siloed treatment of data in analyses may be due to the lack of harmonization across different data sources that offer partially nonoverlapping information and are often complementary. Such treatment precludes a deeper understanding of phenological responses at varying macroecological scales. Here, we describe a detailed vision for the harmonization of phenological data, including the direct integration of disparate sources of phenological data using a common schema. Specifically, we highlight existing methods for data harmonization that can be applied to phenological data: data design patterns, metadata standards, and ontologies. We describe how harmonized data from multiple sources can be integrated into analyses using existing methods and discuss the use of automated extraction techniques. Data harmonization is not a new concept in ecology, but the harmonization of phenological data is overdue. We aim to highlight the need for better data harmonization, providing a roadmap for how harmonized phenological data may fill gaps while simultaneously being integrated into analyses.
  2. This SOils DAta Harmonization (SoDaH) database is designed to bring together soil carbon data from diverse research networks into a harmonized dataset that can be used for synthesis activities and model development. The research network sources for SoDaH span different biomes and climates, encompass multiple ecosystem types, and have collected data across a range of spatial, temporal, and depth gradients. The rich data sets assembled in SoDaH consist of observations from monitoring efforts and long-term ecological experiments. The SoDaH database also incorporates related environmental covariate data pertaining to climate, vegetation, soil chemistry, and soil physical properties. The data are harmonized and aggregated using open-source code that enables a scripted, repeatable approach for soil data synthesis. 
  3. Abstract The use of external controls in genome-wide association study (GWAS) can significantly increase the size and diversity of the control sample, enabling high-resolution ancestry matching and enhancing the power to detect association signals. However, the aggregation of controls from multiple sources is challenging due to batch effects, difficulty in identifying genotyping errors and the use of different genotyping platforms. These obstacles have impeded the use of external controls in GWAS and can lead to spurious results if not carefully addressed. We propose a unified data harmonization pipeline that includes an iterative approach to quality control and imputation, implemented before and after merging cohorts and arrays. We apply this harmonization pipeline to aggregate 27 517 European control samples from 16 collections within dbGaP. We leverage these harmonized controls to conduct a GWAS of Crohn’s disease. We demonstrate a boost in power over using the cohort samples alone, and that our procedure results in summary statistics free of any significant batch effects. This harmonization pipeline for aggregating genotype data from multiple sources can also serve other applications where individual level genotypes, rather than summary statistics, are required. 
  4. Metabolic models have proven to be useful tools in systems biology and have been successfully applied to various research fields in a wide range of organisms. A relatively complete metabolic network is a prerequisite for deriving reliable metabolic models. The first step in constructing a metabolic network is to harmonize compounds and reactions across different metabolic databases. However, effectively integrating data from various sources remains a major challenge. Incomplete and inconsistent atomistic details in compound representations across databases are an important limiting factor. Here, we optimized a subgraph isomorphism detection algorithm to validate generic compound pairs. Moreover, we defined a set of harmonization relationship types between compounds to deal with inconsistent chemical details while successfully capturing atom-level characteristics, enabling more complete compound harmonization across metabolic databases. In total, 15,704 compound pairs across KEGG (Kyoto Encyclopedia of Genes and Genomes) and MetaCyc databases were detected. Furthermore, utilizing the classification of compound pairs and EC (Enzyme Commission) numbers of reactions, we established hierarchical relationships between metabolic reactions, enabling the harmonization of 3856 reaction pairs. In addition, we created and used atom-specific identifiers to evaluate the consistency of atom mappings within and between harmonized reactions, detecting some consistency issues between the reaction and compound descriptions in these metabolic databases.
  5. Abstract The number and diversity of phenological studies has increased rapidly in recent years. Innovative experiments, field studies, citizen science projects, and analyses of newly available historical data are contributing insights that advance our understanding of ecological and evolutionary responses to the environment, particularly climate change. However, many phenological data sets have peculiarities that are not immediately obvious and can lead to mistakes in analyses and interpretation of results. This paper aims to help researchers, especially those new to the field of phenology, understand challenges and practices that are crucial for effective studies. For example, researchers may fail to account for sampling biases in phenological data, struggle to choose or design a volunteer data collection strategy that adequately fits their project’s needs, or combine data sets in inappropriate ways. We describe ten best practices for designing studies of plant and animal phenology, evaluating data quality, and analyzing data. Practices include accounting for common biases in data, using effective citizen or community science methods, and employing appropriate data when investigating phenological mismatches. We present these best practices to help researchers entering the field take full advantage of the wealth of available data and approaches to advance our understanding of phenology and its implications for ecology. 