Title: SOils DAta Harmonization database (SoDaH): an open-source synthesis of soil data from research networks
This SOils DAta Harmonization (SoDaH) database is designed to bring together soil carbon data from diverse research networks into a harmonized dataset that can be used for synthesis activities and model development. The research network sources for SoDaH span different biomes and climates, encompass multiple ecosystem types, and have collected data across a range of spatial, temporal, and depth gradients. The rich data sets assembled in SoDaH consist of observations from monitoring efforts and long-term ecological experiments. The SoDaH database also incorporates related environmental covariate data pertaining to climate, vegetation, soil chemistry, and soil physical properties. The data are harmonized and aggregated using open-source code that enables a scripted, repeatable approach for soil data synthesis.
Award ID(s):
1929393
NSF-PAR ID:
10328143
Publisher / Repository:
Environmental Data Initiative
Date Published:
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Public opinion surveys constitute a widespread, powerful tool to study people’s attitudes and behaviors from comparative perspectives. However, even global surveys can have limited geographic and temporal coverage, which can hinder the production of comprehensive knowledge. To expand the scope of comparison, social scientists turn to ex-post harmonization of variables from datasets that cover similar topics but in different populations and/or at different times. These harmonized datasets can be analyzed as a single source and accessed through various data portals. However, the Survey Data Recycling (SDR) research project has identified three challenges faced by social scientists when using data portals: the lack of capability to explore data in-depth or query data based on customized needs, the difficulty in efficiently identifying related data for studies, and the inability to evaluate theoretical models using sliced data. To address these issues, the SDR research project has developed the SDR Querier, which is applied to the harmonized SDR database. The SDR Querier includes a BERT-based model that allows for customized data queries through research questions or keywords (Query-by-Question), a visual design that helps users determine the availability of harmonized data for a given research question (Query-by-Condition), and the ability to reveal the underlying relational patterns among substantive and methodological variables in the database (Query-by-Relation), aiding in the rigorous evaluation or improvement of regression models. Case studies with multiple social scientists have demonstrated the usefulness and effectiveness of the SDR Querier in addressing daily challenges.
  2. Abstract. Data collected from research networks present opportunities to test theories and develop models about factors responsible for the long-term persistence and vulnerability of soil organic matter (SOM). Synthesizing datasets collected by different research networks presents opportunities to expand the ecological gradients and scientific breadth of information available for inquiry. Synthesizing these data is challenging, especially considering the legacy of soil data that have already been collected and an expansion of new network science initiatives. To facilitate this effort, here we present the SOils DAta Harmonization database (SoDaH; https://lter.github.io/som-website, last access: 22 December 2020), a flexible database designed to harmonize diverse SOM datasets from multiple research networks. SoDaH is built on several network science efforts in the United States, but the tools built for SoDaH aim to provide an open-access resource to facilitate synthesis of soil carbon data. Moreover, SoDaH allows for individual locations to contribute results from experimental manipulations, repeated measurements from long-term studies, and local- to regional-scale gradients across ecosystems or landscapes. Finally, we also provide data visualization and analysis tools that can be used to query and analyze the aggregated database. The SoDaH v1.0 dataset is archived and available at https://doi.org/10.6073/pasta/9733f6b6d2ffd12bf126dc36a763e0b4 (Wieder et al., 2020).
  3. Abstract. In the age of big data, soil data are more available and richer than ever, but – outside of a few large soil survey resources – they remain largely unusable for informing soil management and understanding Earth system processes beyond the original study. Data science has promised a fully reusable research pipeline where data from past studies are used to contextualize new findings and reanalyzed for new insight. Yet synthesis projects encounter challenges at all steps of the data reuse pipeline, including unavailable data, labor-intensive transcription of datasets, incomplete metadata, and a lack of communication between collaborators. Here, using insights from a diversity of soil, data, and climate scientists, we summarize current practices in soil data synthesis across all stages of database creation: availability, input, harmonization, curation, and publication. We then suggest new soil-focused semantic tools to improve existing data pipelines, such as ontologies, vocabulary lists, and community practices. Our goal is to provide the soil data community with an overview of current practices in soil data and where we need to go to fully leverage big data to solve soil problems in the next century.
  4. Abstract. Estimates of soil organic carbon (SOC) stocks are essential for many environmental applications. However, significant inconsistencies exist in SOC stock estimates for the U.S. across current SOC maps. We propose a framework that combines unsupervised multivariate geographic clustering (MGC) and supervised Random Forests regression, improving SOC maps by capturing heterogeneous relationships with SOC drivers. We first used MGC to divide the U.S. into 20 SOC regions based on the similarity of covariates (soil biogeochemical, bioclimatic, biological, and physiographic variables). Subsequently, separate Random Forests models were trained for each SOC region, utilizing environmental covariates and SOC observations. Our estimated SOC stocks for the U.S. (52.6 ± 3.2 Pg for 0–30 cm and 108.3 ± 8.2 Pg for 0–100 cm depth) were within the range estimated by existing products like Harmonized World Soil Database, HWSD (46.7 Pg for 0–30 cm and 90.7 Pg for 0–100 cm depth) and SoilGrids 2.0 (45.7 Pg for 0–30 cm and 133.0 Pg for 0–100 cm depth). However, independent validation with soil profile data from the National Ecological Observatory Network showed that our approach (R² = 0.51) outperformed the estimates obtained from Harmonized World Soil Database (R² = 0.23) and SoilGrids 2.0 (R² = 0.39) for the topsoil (0–30 cm). Uncertainty analysis (e.g., low representativeness and high coefficients of variation) identified regions requiring more measurements, such as Alaska and the deserts of the U.S. Southwest. Our approach effectively captures the heterogeneous relationships between widely available predictors and the current SOC baseline across regions, offering reliable SOC estimates at 1 km resolution for benchmarking Earth system models.
  5. Abstract. The National Ecological Observatory Network (NEON) is a multidecadal and continental-scale observatory with sites across the United States. Having entered its operational phase in 2018, NEON data products, software, and services become available to facilitate research on the impacts of climate change, land-use change, and invasive species. An essential component of NEON are its 47 tower sites, where eddy-covariance (EC) sensors are operated to determine the surface–atmosphere exchange of momentum, heat, water, and CO₂. EC tower networks such as AmeriFlux, the Integrated Carbon Observation System (ICOS), and NEON are vital for providing the distributed observations to address interactions at the soil–vegetation–atmosphere interface. NEON represents the largest single-provider EC network globally, with standardized observations and data processing explicitly designed for intersite comparability and analysis of feedbacks across multiple spatial and temporal scales. Furthermore, EC is tightly integrated with soil, meteorology, atmospheric chemistry, isotope, phenology, and rich contextual observations such as airborne remote sensing and in situ sampling bouts. Here, we present an overview of NEON’s observational design, field operation, and data processing that yield community resources for the study of surface–atmosphere interactions. Near-real-time data products become available from the NEON Data Portal, and EC and meteorological data are ingested into AmeriFlux and FLUXNET globally harmonized data releases. Open-source software for reproducible, extensible, and portable data analysis includes the eddy4R family of R packages underlying the EC data product generation. These resources strive to integrate with existing infrastructures and networks, to suggest novel systemic solutions, and to synergize ongoing research efforts across science communities.