skip to main content


This content will become publicly available on October 11, 2025

Title: An Information Ecosystem Map of Resources Supporting the Mobilization and Discovery of Paleontological Specimen Data

Over the last decade, the United States paleontological collections community has invested heavily in the digitization of specimen-based data, including over 10 million USD funded through the National Science Foundation’s Advancing Digitization of Biodiversity Collections program. Fossil specimen data—9.0 million records and counting (Global Biodiversity Information Facility 2024)—are now accessible on open science platforms such as the Global Biodiversity Information Facility (GBIF). However, the full potential of this data is far from realized due to fundamental challenges associated with mobilization, discoverability, and interoperability of paleontological information within the existing cyberinfrastructure landscape and data pipelines. Additionally, it can be difficult for individuals with varying expertise to develop a comprehensive understanding of the existing landscape due to its breadth and complexity. Here, we present preliminary results from a project aiming to explore how we might address these problems.

Funding from the US National Science Foundation (NSF) to the University of Colorado Museum of Natural History, Smithsonian National Museum of Natural History, and Arizona State University will result in, among other products, an “ecosystem map” for the paleontological collections community. This map will be an information-rich visualization of entities (e.g. concepts, systems, platforms, mechanisms, drivers, tools, documentation, data, standards, people, organizations) operating in, intersecting with, or existing in parallel to our domain. We are inspired and informed by similar efforts to map the biodiversity informatics landscape (Bingham et al. 2017) and the research infrastructure landscape (Distributed System of Scientific Collections 2024), as well as by many ongoing metadata cataloging projects, e.g. re3data and the Global Registry of Scientific Collections (GRSciColl). Our strategy for developing this ecosystem map is to model the existing information and systems landscape by characterizing entities, e.g. potentially in a graph database as nodes with relationships to other nodes.

The ecosystem map will enable us to provide guidance for communities workingacrossdifferent sectors of the landscape, promoting a shared understanding of the ecosystem that everyone works in together. We can also use the map to identify points of entry and engagement at various stages of the paleontological data process, and to engage diverse memberswithinthe paleontological community. We see three primary user types for this map: people new(er) to the community, people with expertise in a subset of the community, and people working to integrate initiatives and systems across communities. Each of these user types needs tailored access to the ecosystem map and its community knowledge. By promoting shared knowledge with the map, users will be able to identify their own space within the ecosystem and the connections or partnerships that they can utilize to expand their knowledge or resources, relieving the burden on any single individual to hold a comprehensive understanding.

For example, the flow of taxonomic information between publications, collections, digital resources, and biodiversity aggregators is not straightforward or easy to understand. A person with expertise in collections care may want to use the ecosystem map to understand why taxonomic identifications associated with their specimen occurrence records are showing up incorrectly when published to GBIF. We envision that our final ecosystem map will visualize the flow of taxonomic information and how it is used to interpret specimen occurrence data, thereby highlighting to this user where problems may be happening and whom to ask for help in addressing them (Fig. 1).

Ultimately, development of this map will allow us to identify mobilization pathways for paleontological data, highlight core cyberinfrastructure resources, define cyberinfrastructure gaps, strategize future partnerships, promote shared knowledge, and engage a broader array of expertise in the process. Contributing domain-based evidence FAIRly*2 requires expertise that bridges the content (e.g. paleontology) and the mechanics (e.g. informatics). By centering the role of humans in open science cyberinfrastructure throughout our process, we hope to develop systems that create and sustain such expertise.

 
more » « less
Award ID(s):
2324688 2324690 2324689
PAR ID:
10548331
Author(s) / Creator(s):
; ; ;
Publisher / Repository:
PenSoft
Date Published:
Journal Name:
Biodiversity Information Science and Standards
Volume:
8
ISSN:
2535-0897
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Over 300 million arthropod specimens are housed in North American natural history collections. These collections represent a “vast hidden treasure trove” of biodiversity −95% of the specimen label data have yet to be transcribed for research, and less than 2% of the specimens have been imaged. Specimen labels contain crucial information to determine species distributions over time and are essential for understanding patterns of ecology and evolution, which will help assess the growing biodiversity crisis driven by global change impacts. Specimen images offer indispensable insight and data for analyses of traits, and ecological and phylogenetic patterns of biodiversity. Here, we review North American arthropod collections using two key metrics, specimen holdings and digitization efforts, to assess the potential for collections to provide needed biodiversity data. We include data from 223 arthropod collections in North America, with an emphasis on the United States. Our specific findings are as follows: (1) The majority of North American natural history collections (88%) and specimens (89%) are located in the United States. Canada has comparable holdings to the United States relative to its estimated biodiversity. Mexico has made the furthest progress in terms of digitization, but its specimen holdings should be increased to reflect the estimated higher Mexican arthropod diversity. The proportion of North American collections that has been digitized, and the number of digital records available per species, are both much lower for arthropods when compared to chordates and plants. (2) The National Science Foundation’s decade-long ADBC program (Advancing Digitization of Biological Collections) has been transformational in promoting arthropod digitization. However, even if this program became permanent, at current rates, by the year 2050 only 38% of the existing arthropod specimens would be digitized, and less than 1% would have associated digital images. (3) The number of specimens in collections has increased by approximately 1% per year over the past 30 years. We propose that this rate of increase is insufficient to provide enough data to address biodiversity research needs, and that arthropod collections should aim to triple their rate of new specimen acquisition. (4) The collections we surveyed in the United States vary broadly in a number of indicators. Collectively, there is depth and breadth, with smaller collections providing regional depth and larger collections providing greater global coverage. (5) Increased coordination across museums is needed for digitization efforts to target taxa for research and conservation goals and address long-term data needs. Two key recommendations emerge: collections should significantly increase both their specimen holdings and their digitization efforts to empower continental and global biodiversity data pipelines, and stimulate downstream research. 
    more » « less
  2. Abstract Natural history collections (NHCs) are the foundation of historical baselines for assessing anthropogenic impacts on biodiversity. Along these lines, the online mobilization of specimens via digitization—the conversion of specimen data into accessible digital content—has greatly expanded the use of NHC collections across a diversity of disciplines. We broaden the current vision of digitization (Digitization 1.0)—whereby specimens are digitized within NHCs—to include new approaches that rely on digitized products rather than the physical specimen (Digitization 2.0). Digitization 2.0 builds on the data, workflows, and infrastructure produced by Digitization 1.0 to create digital-only workflows that facilitate digitization, curation, and data links, thus returning value to physical specimens by creating new layers of annotation, empowering a global community, and developing automated approaches to advance biodiversity discovery and conservation. These efforts will transform large-scale biodiversity assessments to address fundamental questions including those pertaining to critical issues of global change. 
    more » « less
  3. Abstract Premise

    Digitized biodiversity data offer extensive information; however, obtaining and processing biodiversity data can be daunting. Complexities arise during data cleaning, such as identifying and removing problematic records. To address these issues, we created the R package Geographic And Taxonomic Occurrence R‐based Scrubbing (gatoRs).

    Methods and Results

    The gatoRs workflow includes functions that streamline downloading records from the Global Biodiversity Information Facility (GBIF) and Integrated Digitized Biocollections (iDigBio). We also created functions to clean downloaded specimen records. Unlike previous R packages, gatoRs accounts for differences in download structure between GBIF and iDigBio and allows for user control via interactive cleaning steps.

    Conclusions

    Our pipeline enables the scientific community to process biodiversity data efficiently and is accessible to the R coding novice. We anticipate that gatoRs will be useful for both established and beginning users. Furthermore, we expect our package will facilitate the introduction of biodiversity‐related concepts into the classroom via the use of herbarium specimens.

     
    more » « less
  4. Thanks to substantial support for biodiversity data mobilization in recent decades, billions of occurrence records are openly available, documenting life on Earth and enabling timely research, awareness raising, and policy-making. Initiatives across local to global scales have been separately funded to serve different, yet often overlapping audiences of data users, and have developed a variety of platforms and infrastructures to meet the needs of these audiences. The independent progress of biodiversity data providers has led to innovations as well as challenges for the community at large as we move towards connecting and linking a diversity of information from disparate sources as Digital Extended Specimens (DES).

    Recognizing a need for deeper and more frequent opportunities for communication and collaboration across the globe, an ad-hoc group of representatives of various international, national, and regional organizations have been meeting virtually since 2020 to provide a forum for updates, announcements, and shared progress. This group is provisionally named International Partners for the Digital Extended Specimen (IPDES), and is guided by these four concepts: Biodiversity, Connection, Knowledge and Agency. Participants in IPDES include representatives of the Global Biodiversity Information Facility (GBIF), Integrated Digitized Biocollections (iDigBio), American Institute of Biological Sciences (AIBS), Biodiversity Collections Network (BCoN), Natural Science Collections Alliance (NSCA), Distributed System of Scientific Collections (DiSSCo), Atlas of Living Australia (ALA), Biodiversity Information Standards (TDWG), Society for the Preservation of Natural History Collections (SPNHC), National Specimen Information Infrastructure of China (NSII), and South African National Biodiversity Institute (SANBI), as well as individuals involved with biodiversity informatics initiatives, natural science collections, museums, herbaria, and universities. Our global partners group strives to increase representation from around the globe as we aim to enable research that contributes to novel discoveries and addresses the societal challenges leading to the biodiversity crisis. Our overarching mission is to expand on the community-driven successes to connect biodiversity data and knowledge through coordination of a globally integrated network of stakeholders to enable an extensible technical and social infrastructure of data, tools, and working practices in support of our vision.

    The main work of our group thus far includes publishing a paper on the Digital Extended Specimen (Hardisty et al. 2022), organizing and hosting an array of activities at conferences, and asynchronous online work and forum-based exchanges. We aim to advance discussion on topics of broad interest to our community such as social and technical capacity building, broadening participation, expanding social and data networks, improving data models and building a backbone for the DES, and identifying international funding solutions.

    This presentation will highlight some of these activities and detail progress towards a roadmap for the development of the human network and technical infrastructure necessary to support the DES. It provides an opportunity for feedback from and engagement by stakeholder communities such as TDWG and other initiatives with a focus on data standards and biodiversity informatics, as we solidify our plans for the future in support of integrated and interconnected biodiversity data and credit for those doing the work.

     
    more » « less
  5. A stakeholder engagement workshop was held in May 2024 as part of the Community-driven enhancement of information ecosystems for the discovery and use of paleontological specimen data project, which is funded under the United States National Science Foundation (NSF) Geosciences Open Science Ecosystem (GEO OSE) program. This report describes the activites and outcomes of the workshop.

     
    more » « less