Abstract Species occurrence data are foundational for research, conservation, and science communication, but the limited availability and accessibility of reliable data represents a major obstacle, particularly for insects, which face mounting pressures. We presentBeeBDC, a newRpackage, and a global bee occurrence dataset to address this issue. We combined >18.3 million bee occurrence records from multiple public repositories (GBIF, SCAN, iDigBio, USGS, ALA) and smaller datasets, then standardised, flagged, deduplicated, and cleaned the data using the reproducibleBeeBDC R-workflow. Specifically, we harmonised species names (following established global taxonomy), country names, and collection dates and, we added record-level flags for a series of potential quality issues. These data are provided in two formats, “cleaned” and “flagged-but-uncleaned”. TheBeeBDCpackage with online documentation provides end users the ability to modify filtering parameters to address their research questions. By publishing reproducibleRworkflows and globally cleaned datasets, we can increase the accessibility and reliability of downstream analyses. This workflow can be implemented for other taxa to support research and conservation.
more »
« less
Database interoperability, uncertainty quantification and reproducible workflows in the paleogeosciences
The paleogeosciences are becoming more and more interdisciplinary, and studies increasingly rely on large collections of data derived from multiple data repositories. Integrating diverse datasets from multiple sources into complex workflows increases the challenge of creating reproducible and open science, as data formats and tools are often noninteroperable, requiring manual manipulation of data into standardized formats, resulting in a disconnect in data provenance and confounding reproducibility. Here we present a notebook that demonstrates how the Linked PaleoData (LiPD) framework is used as an interchange format to allow data from multiple data sources to be integrated in a complex workflow using emerging packages in R for geochronological uncertainty quantification and abrupt change detection. Specifically, in this notebook, we use the neotoma2 and lipdR packages to access paleoecological data from the Neotoma Database, and paleoclimate data from compilations hosted on Lipdverse. Age uncertainties for datasets from both sources are then quantified using the geoChronR package, and those data, and their associated age uncertainties, are then investigated for abrupt changes using the actR package, with accompanying visualizations. The result is an integrated, reproducible workflow in R that demonstrates how this complex series of multisource data integration, analysis and visualization can be integrated into an efficient, open scientific narrative.
more »
« less
- Award ID(s):
- 1929460
- PAR ID:
- 10337239
- Editor(s):
- McHenry, K; Schreiber, L
- Date Published:
- Journal Name:
- Earthcube Annual Meeting 2022
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
Abstract Understanding patterns and drivers of species distribution and abundance, and thus biodiversity, is a core goal of ecology. Despite advances in recent decades, research into these patterns and processes is currently limited by a lack of standardized, high‐quality, empirical data that span large spatial scales and long time periods. The NEON fills this gap by providing freely available observational data that are generated during robust and consistent organismal sampling of several sentinel taxonomic groups within 81 sites distributed across the United States and will be collected for at least 30 years. The breadth and scope of these data provide a unique resource for advancing biodiversity research. To maximize the potential of this opportunity, however, it is critical that NEON data be maximally accessible and easily integrated into investigators' workflows and analyses. To facilitate its use for biodiversity research and synthesis, we created a workflow to process and format NEON organismal data into the ecocomDP (ecological community data design pattern) format that were available through the ecocomDP R package; we then provided the standardized data as an R data package (neonDivData). We briefly summarize sampling designs and data wrangling decisions for the major taxonomic groups included in this effort. Our workflows are open‐source so the biodiversity community may: add additional taxonomic groups; modify the workflow to produce datasets appropriate for their own analytical needs; and regularly update the data packages as more observations become available. Finally, we provide two simple examples of how the standardized data may be used for biodiversity research. By providing a standardized data package, we hope to enhance the utility of NEON organismal data in advancing biodiversity research and encourage the use of the harmonized ecocomDP data design pattern for community ecology data from other ecological observatory networks.more » « less
-
Detection of differential transcript usage (DTU) from RNA-seq data is an important bioinformatic analysis that complements differential gene expression analysis. Here we present a simple workflow using a set of existing R/Bioconductor packages for analysis of DTU. We show how these packages can be used downstream of RNA-seq quantification using the Salmon software package. The entire pipeline is fast, benefiting from inference steps by Salmon to quantify expression at the transcript level. The workflow includes live, runnable code chunks for analysis using DRIMSeq and DEXSeq, as well as for performing two-stage testing of DTU using the stageR package, a statistical framework to screen at the gene level and then confirm which transcripts within the significant genes show evidence of DTU. We evaluate these packages and other related packages on a simulated dataset with parameters estimated from real data.more » « less
-
GRASS is an open-source geospatial processing engine. With over 400 tools available in the core distribution and an additional 400+ tools available as extensions, GRASS has broad applicability in the Earth Sciences and geomorphometry in particular. In this workshop, we will give an introduction to GRASS and demonstrate some of the geomorphometry tools available in GRASS. Specifically, we will show how to compute stream extraction uncertainty using a workflow adapted from Hengl (2007) [1] and Hengl (2010) [2]. We will begin by downloading publicly available LiDAR data of the Perugia area using GRASS data fetching tools. Then, we will use R’s kriging functions (gstat) to create 100 iterations of a DEM. After exploring some of the stream extraction and flow routing methods available in GRASS, we will extract streams from each of the 100 DEMs to compute stream uncertainty. The workshop will be conducted in a Jupyter Notebook hosted in Google Colab. By the end of the workshop, participants will have hands on experience with: Creating GRASS projects and importing datasets, Adjusting the computational extent and resolution, Creating DEMs from point data using a variety of methods implemented in GRASS and using a stochastic kriging approach in R, Using the R interface for GRASS and R packages with GRASS data, Computing Stream Uncertainty, Developing publication-quality figures with grass.jupytermore » « less
-
Abstract. Technologies such as machine learning and deep learning are powering the discovery of meaningful patterns in Earth science big data. In the field of mineralogy, Mindat (“mindat.org”) is one of the largest databases. Although its front-end website is open and free, a machine interface for bulk data query and download had never been set up before 2022. Through a project called OpenMindat, an application programming interface (API) to enable open data query and access from Mindat was set up in 2023. To further lower the barrier between Mindat open data and geoscientists with limited coding skills, we developed an R package (OpenMindat v1.0.0) on top of the API. The Mindat API includes multiple data subjects such as geomaterials (e.g., rocks, minerals, synonyms, variety, mixture, and commodity), localities, and the IMA-approved (International Mineralogical Association) mineral list. The OpenMindat v1.0.0 package wraps the capabilities of the Mindat API and is designed to be user-friendly and extensible. In addition to providing functions for querying data subjects on the API, the package supports exporting data to various formats. In real-world applications, these functions only require minor coding for users to get desired datasets, and various other packages in the R environment can be used to analyze and visualize the data. The OpenMindat v1.0.0 package, which includes detailed tutorials and examples, is available on GitHub under the MIT license. The field of mineralogy and many other geoscience disciplines are facing opportunities enabled by open data. Various research topics such as mineral network analysis, mineral association rule mining, mineral ecology, mineral evolution, and critical minerals have already benefited from Mindat's open data efforts in recent years. We hope this R package can help accelerate those data-intensive studies and lead to more scientific discoveries.more » « less
An official website of the United States government

