Title: Reviews and syntheses: The promise of big diverse soil data, moving current practices towards future potential
Abstract. In the age of big data, soil data are more available and richer than ever, but – outside of a few large soil survey resources – they remain largely unusable for informing soil management and understanding Earth system processes beyond the original study. Data science has promised a fully reusable research pipeline where data from past studies are used to contextualize new findings and reanalyzed for new insight. Yet synthesis projects encounter challenges at all steps of the data reuse pipeline, including unavailable data, labor-intensive transcription of datasets, incomplete metadata, and a lack of communication between collaborators. Here, using insights from a diversity of soil, data, and climate scientists, we summarize current practices in soil data synthesis across all stages of database creation: availability, input, harmonization, curation, and publication. We then suggest new soil-focused semantic tools to improve existing data pipelines, such as ontologies, vocabulary lists, and community practices. Our goal is to provide the soil data community with an overview of current practices in soil data and where we need to go to fully leverage big data to solve soil problems in the next century.
Award ID(s):
1655622
NSF-PAR ID:
10352560
Date Published:
Journal Name:
Biogeosciences
Volume:
19
Issue:
14
ISSN:
1726-4189
Page Range / eLocation ID:
3505 to 3522
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. As we look to the future of natural history collections and a global integration of biodiversity data, we are reliant on a diverse workforce with the skills necessary to build, grow, and support the data, tools, and resources of the Digital Extended Specimen (DES; Webster 2019, Lendemer et al. 2020, Hardisty 2020). Future “DES Data Curators” – those who will be charged with maintaining resources created through the DES – will require skills and resources beyond what is currently available to most natural history collections staff. In training the workforce to support the DES we have an opportunity to broaden our community and ensure that, through the expansion of biodiversity data, the workforce landscape itself is diverse, equitable, inclusive, and accessible. A fully implemented DES will provide training that encapsulates capacity building, skills development, unifying protocols and best practices guidance, and cutting-edge technology that also creates inclusive, equitable, and accessible systems, workflows, and communities. As members of the biodiversity community and the current workforce, we can leverage our knowledge and skills to develop innovative training models that: include a range of educational settings and modalities; address the needs of new communities not currently engaged with digital data; and, from their onset, provide attribution for past and future work and do not perpetuate the legacy of colonial practices and historic inequalities found in many physical natural history collections. Recent reports from the Biodiversity Collections Network (BCoN 2019) and the National Academies of Sciences, Engineering, and Medicine (National Academies of Sciences, Engineering, and Medicine 2020) specifically address workforce needs in support of the DES. 
To address workforce training and inclusivity within the context of global data integration, the Alliance for Biodiversity Knowledge included a topic on Workforce capacity development and inclusivity in Phase 2 of the consultation on Converging Digital Specimens and Extended Specimens - Towards a global specification for data integration. Across these efforts, several common themes have emerged relative to workforce training and the DES. A call for a community needs assessment: As a community, we have several unknowns related to the current collections workforce and training needs. We would benefit from a baseline assessment of collections professionals to define current job responsibilities, demographics, education and training, incentives, compensation, and benefits. This includes an evaluation of current employment prospects and opportunities. Defined skills and training for the 21st century collections professional: We need to be proactive and define the 21st century workforce skills necessary to support the development and implementation of the DES. When we define the skills and content needs we can create appropriate training opportunities that include scalable materials for capacity building, educational materials that develop relevant skills, unifying protocols across the DES network, and best practices guidance for professionals. Training for data end-users: We need to train data end-users in biodiversity and data science at all levels of formal and informal education from primary and secondary education through the existing workforce. This includes developing training and educational materials, creating data portals, and building analyses that are inclusive, accessible, and engage the appropriate community of science educators, data scientists, and biodiversity researchers. 
Foster a diverse, equitable, inclusive, accessible, and professional workforce: As the DES develops and new tools and resources emerge, we need to be intentional in our commitment to building tools that are accessible and in assuring that access is equitable. This includes establishing best practices to ensure the community providing and accessing data is inclusive and representative of the diverse global community of potential data providers and users. Upfront, we must acknowledge and address issues of historic inequalities and colonial practices and provide appropriate attribution for past and future work while ensuring legal and regulatory compliance. Efforts must include creating transparent linkages between the data that drive the DES and the people who create them. In this presentation, we will highlight recommendations for building workforce capacity within the DES that is diverse, inclusive, equitable, and accessible, that takes into account the requirements of the biodiversity science community, and that is flexible enough to meet the needs of an evolving field. 
  2. The theoretical foundations of Big Data Science are not yet fully developed. This study proposes a new scalable framework for Big Data representation, high-throughput analytics (variable selection and noise reduction), and model-free inference. Specifically, we explore the core principles of distribution-free and model-agnostic methods for scientific inference based on Big Data sets. Compressive Big Data analytics (CBDA) iteratively generates random (sub)samples from a big and complex dataset. This subsampling with replacement is conducted on the feature and case levels and results in samples that are not necessarily consistent or congruent across iterations. The approach relies on an ensemble predictor where established model-based or model-free inference techniques are iteratively applied to preprocessed and harmonized samples. Repeating the subsampling and prediction steps many times yields derived likelihoods, probabilities, or parameter estimates, which can be used to assess the algorithm's reliability and the accuracy of findings via bootstrapping methods, or to extract important features via controlled variable selection. CBDA provides a scalable algorithm for addressing some of the challenges associated with handling complex, incongruent, incomplete, and multi-source data and analytics. Although not yet fully developed, a CBDA mathematical framework will enable the study of the ergodic properties and the asymptotics of the specific statistical inference approaches via CBDA. We implemented the high-throughput CBDA method using pure R as well as via the graphical pipeline environment. To validate the technique, we used several simulated datasets as well as a real neuroimaging-genetics case study of Alzheimer’s disease. The CBDA approach may be customized to provide generic representation of complex multimodal datasets and to provide stable scientific inference for large, incomplete, and multisource datasets. 
  3. Big data, the “new oil” of the modern data science era, has attracted much attention in the GIScience community. However, we have ignored the role of code in enabling the big data revolution in this modern gold rush. Instead, what attention code has received has focused on computational efficiency and scalability issues. In contrast, we have missed the opportunities that the more transformative aspects of code afford as ways to organize our science. These “big code” practices hold the potential for addressing some ill effects of big data that have been rightly criticized, such as algorithmic bias, lack of representation, gatekeeping, and issues of power imbalances in our communities. In this article, I consider areas where lessons from the open source community can help us evolve a more inclusive, generative, and expansive GIScience. These concern best practices for codes of conduct, data pipelines and reproducibility, refactoring our attribution and reward systems, and a reinvention of our pedagogy.

     
  4. Abstract

    Currently, time-domain astronomy can scan the entire sky on a daily basis, discovering thousands of interesting transients every night. Classifying the ever-increasing number of new transients is one of the main challenges for the astronomical community. One solution that addresses this issue is the robotically controlled Spectral Energy Distribution Machine (SEDM), which supports the Zwicky Transient Facility (ZTF). SEDM with its pipeline pysedm demonstrates that real-time robotic spectroscopic classification is feasible. In an effort to improve the quality of the current SEDM data, we present here two new modules, byecr and contsep. The first removes contamination from cosmic rays, and the second removes contamination from non-target light. These new modules are part of the automated pysedm pipeline and fully integrated with the whole process. Employing the byecr and contsep modules together automatically extracts more spectra than the current pysedm pipeline. Using SNID classification results, the new modules show an improvement in the classification rate and accuracy of 2.8% and 1.7%, respectively, while the strength of the cross-correlation remains the same. Improvements to the SEDM astrometry would further boost the improvement of the contsep module. This kind of robotic follow-up with a fully automated pipeline has the potential to provide the spectroscopic classifications for the transients discovered by ZTF and also by the Rubin Observatory’s Legacy Survey of Space and Time.

     
  5. Abstract Background

    Antarctica and its unique biodiversity are increasingly at risk from the effects of global climate change and other human influences. A significant recent element underpinning strategies for Antarctic conservation has been the development of a system of Antarctic Conservation Biogeographic Regions (ACBRs). The datasets supporting this classification are, however, dominated by eukaryotic taxa, with contributions from the bacterial domain restricted to Actinomycetota and Cyanobacteriota. Nevertheless, the ice-free areas of the Antarctic continent and the sub-Antarctic islands are dominated in terms of diversity by bacteria. Our study aims to generate a comprehensive phylogenetic dataset of Antarctic bacteria with wide geographical coverage on the continent and sub-Antarctic islands, to investigate whether bacterial diversity and distribution is reflected in the current ACBRs.

    Results

    Soil bacterial diversity and community composition did not fully conform with the ACBR classification. Although 19% of the variability was explained by this classification, the largest differences in bacterial community composition were between the broader continental and maritime Antarctic regions, where a degree of structural overlapping within continental and maritime bacterial communities was apparent, not fully reflecting the division into separate ACBRs. Strong divergence in soil bacterial community composition was also apparent between the Antarctic/sub-Antarctic islands and the Antarctic mainland. Bacterial communities were partially shaped by bioclimatic conditions, with 28% of dominant genera showing habitat preferences connected to at least one of the bioclimatic variables included in our analyses. These genera were also reported as indicator taxa for the ACBRs.

    Conclusions

    Overall, our data indicate that the current ACBR subdivision of the Antarctic continent does not fully reflect bacterial distribution and diversity in Antarctica. We observed considerable overlap in the structure of soil bacterial communities within the maritime Antarctic region and within the continental Antarctic region. Our results also suggest that bacterial communities might be impacted by regional climatic and other environmental changes. The dataset developed in this study provides a comprehensive baseline and a valuable tool for biodiversity conservation efforts on the continent. Further studies are clearly required, and we emphasize the need for more extensive campaigns to systematically sample and characterize Antarctic and sub-Antarctic soil microbial communities.
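
Illustrative note on item 2 above: the subsample-and-ensemble loop that the Compressive Big Data analytics (CBDA) abstract describes can be sketched roughly as follows. This is a minimal Python sketch under stated assumptions, not the authors' implementation (which the abstract says was written in R). The names `make_demo_data`, `fit_centroids`, `predict`, and `cbda_sketch`, the nearest-centroid base learner, the held-out accuracy score, and all parameter defaults are hypothetical. Cases are drawn with replacement; features are drawn without replacement here purely for simplicity.

```python
import random


def make_demo_data(n=60, p=5, seed=1):
    """Hypothetical toy data: only feature 0 carries the class signal."""
    rng = random.Random(seed)
    y = [i % 2 for i in range(n)]
    X = [[10.0 * label + rng.gauss(0, 1)] + [rng.gauss(0, 1) for _ in range(p - 1)]
         for label in y]
    return X, y


def fit_centroids(X, y, rows, feats):
    """Base learner (an assumption): per-class mean of the chosen features."""
    centroids = {}
    for cls in {y[i] for i in rows}:
        members = [X[i] for i in rows if y[i] == cls]
        centroids[cls] = [sum(r[f] for r in members) / len(members) for f in feats]
    return centroids


def predict(centroids, row, feats):
    """Assign the class whose centroid is closest in squared distance."""
    def sq_dist(cls):
        return sum((row[f] - c) ** 2 for f, c in zip(feats, centroids[cls]))
    return min(centroids, key=sq_dist)


def cbda_sketch(X, y, n_iter=200, case_frac=0.6, n_feats=2, seed=0):
    """Score each feature by the mean held-out accuracy of the random
    subsamples in which it took part."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    scores = {f: [] for f in range(p)}
    for _ in range(n_iter):
        # Case-level subsample, drawn with replacement.
        rows = [rng.randrange(n) for _ in range(int(case_frac * n))]
        # Feature-level subsample (without replacement in this sketch).
        feats = rng.sample(range(p), n_feats)
        seen = set(rows)
        held_out = [i for i in range(n) if i not in seen]
        if not held_out or len({y[i] for i in rows}) < 2:
            continue  # skip degenerate draws
        centroids = fit_centroids(X, y, rows, feats)
        hits = sum(predict(centroids, X[i], feats) == y[i] for i in held_out)
        acc = hits / len(held_out)
        for f in feats:
            scores[f].append(acc)
    return {f: sum(v) / len(v) for f, v in scores.items() if v}


X, y = make_demo_data()
ranking = cbda_sketch(X, y)
best_feature = max(ranking, key=ranking.get)
```

Ranking features by the average performance of the subsamples they appeared in is one simple way to read "controlled variable selection" from repeated subsampling; the published method may aggregate results differently.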

     