
Title: Reliable Biodiversity Dataset References
No systematic approach has yet been adopted to reliably reference and provide access to digital biodiversity datasets. Based on accumulated evidence, we argue that location-based identifiers such as URLs are not sufficient to ensure long-term data access. We introduce a method that uses dedicated data observatories to evaluate long-term URL reliability. From March 2019 through May 2020, we took periodic inventories of the data provided to major biodiversity aggregators, including GBIF, iDigBio, DataONE, and BHL, by accessing the URL-based dataset references from which the aggregators retrieve data. Over the period of observation, we found that, for the URL-based dataset references available in each of the aggregators' data provider registries, 5% to 70% of URLs were intermittently or consistently unresponsive, 0% to 66% produced unstable content, and 20% to 75% became either unresponsive or unstable. We propose the use of cryptographic hashing to generate content-based identifiers that can reliably reference datasets. We show that content-based identifiers facilitate decentralized archiving and reliable distribution of biodiversity datasets, enabling long-term accessibility of the referenced datasets.
Journal Name: iDigBio Communications Luncheon
Sponsoring Org: National Science Foundation
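As a minimal sketch of the approach the abstract proposes: hashing a downloaded dataset with sha256 yields an identifier that any holder of the bytes can independently verify. The dataset URL below is hypothetical, and the hash:// prefix is one common convention for naming such identifiers.

```python
import hashlib
import urllib.request

def content_id(data: bytes) -> str:
    """Derive a location-independent identifier from the bytes themselves."""
    return "hash://sha256/" + hashlib.sha256(data).hexdigest()

def verify(data: bytes, expected_id: str) -> bool:
    """Check a copy retrieved from any location against a recorded identifier."""
    return content_id(data) == expected_id

# Hypothetical dataset location; any faithful copy yields the same identifier.
with urllib.request.urlopen("https://example.org/occurrences.zip") as response:
    dataset = response.read()
print(content_id(dataset))
```

Unlike a URL, such an identifier never drifts: if a mirror serves different bytes, verification fails, which is exactly the kind of content instability the data observatories measured.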
More Like this
  1. DOI: 10.17605/OSF.IO/AT4XE. Despite increased use of digital biodiversity data in research, reliable methods to identify datasets are not widely adopted. While commonly used location-based dataset identifiers such as URLs make it easy to download data today, additional identification schemes are needed to ensure long-term access to datasets. We propose to augment existing location- and DOI-based identification schemes with cryptographic content-based identifiers. These content-based identifiers can be calculated from the datasets themselves using available cryptographic hashing algorithms (e.g., sha256). These algorithms take only the digital content as input to generate a unique identifier, without the need for a centralized identification administration. The use of content-based identifiers is not new; it re-applies change-management techniques used in the popular version control system Git. We show how content-based identifiers can be used to version datasets, track dataset locations, monitor their reliability, and efficiently detect dataset changes (see the sketch below). We discuss the results of using our approach on datasets registered in GBIF and iDigBio from September 2018 to May 2020. We also propose how reliable, decentralized dataset indexing and archiving systems can be devised. Lastly, we outline a modification to existing data citation practices to help work towards more reproducible and reusable research workflows.
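A hedged sketch of the change-detection idea, in the spirit of Git: re-fetch a registered URL, hash the bytes, and record a new version only when the identifier changes. The function names and history structure are illustrative, not the authors' implementation.

```python
import hashlib
from datetime import datetime, timezone

def sha256_id(data: bytes) -> str:
    return "hash://sha256/" + hashlib.sha256(data).hexdigest()

def track(url: str, fetch, history: list) -> list:
    """Append a (timestamp, url, content id) record only when the content changed."""
    observed = sha256_id(fetch(url))
    if not history or history[-1][2] != observed:
        history.append((datetime.now(timezone.utc), url, observed))
    return history
```

Because identical bytes always hash to the same identifier, unchanged datasets cost nothing to re-record, and any change is detected without comparing full copies.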
  2. EarthCube Data Discovery Studio (DDStudio) is a cross-domain geoscience data discovery and exploration portal. It indexes over 1.65 million metadata records harvested from more than 40 sources and uses a configurable metadata augmentation pipeline to enhance metadata content, applying text analytics and an integrated geoscience ontology. Metadata enhancers add keywords with identifiers that map resources to science domains, geospatial features, measured variables, and other characteristics. The pipeline extracts spatial locations and temporal references from metadata to generate structured spatial and temporal extents, maintains the provenance of each metadata enhancement, and allows user validation. The semantically enhanced metadata records are accessible as standard ISO 19115/19139 XML documents via standard search interfaces. A search interface supports spatial, temporal, and text-based search, as well as functionality for users to contribute, standardize, and update resource descriptions and to organize search results into shareable collections. DDStudio bridges resource discovery and exploration by letting users launch Jupyter notebooks residing on several platforms for any discovered dataset or dataset collection. DDStudio demonstrates how linking search results from the catalog directly to software tools and environments reduces time to science, in a series of examples from several geoscience domains.
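The DDStudio pipeline itself is not shown in the abstract; as a toy illustration of the enhancer pattern it describes (match ontology terms, append keyword identifiers, keep provenance), with an invented ontology mapping and record fields:

```python
# Toy metadata enhancer: tags a record with ontology-derived keyword
# identifiers and records provenance for each enhancement. All names,
# fields, and URIs here are invented for illustration.
TOY_ONTOLOGY = {
    "sea surface temperature": "http://example.org/ont/SST",
    "seismic": "http://example.org/ont/Seismology",
}

def enhance(record: dict) -> dict:
    text = (record.get("title", "") + " " + record.get("abstract", "")).lower()
    for term, uri in TOY_ONTOLOGY.items():
        if term in text:
            record.setdefault("keywords", []).append(
                {"label": term, "uri": uri, "source": "toy-enhancer-0.1"}
            )
    return record

print(enhance({"title": "Global sea surface temperature grids"}))
```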

  3. International collaboration between collections, aggregators, and researchers within the biodiversity community and beyond is becoming increasingly important in our efforts to support biodiversity, conservation, and the life of the planet. The social, technical, logistical, and financial aspects of an equitable biodiversity data landscape, from workforce training and mobilization of linked specimen data to data integration, use, and publication, must be considered globally and within the context of a growing biodiversity crisis. In recent years, several initiatives have outlined paths forward that describe how digital versions of natural history specimens can be extended and linked with associated data. In the United States, Webster (2017) presented the "extended specimen", which was expanded upon by Lendemer et al. (2019) through the work of the Biodiversity Collections Network (BCoN). At the same time, a "digital specimen" concept was developed by DiSSCo in Europe (Hardisty 2020). Both concepts depict a digital proxy of an analog natural history specimen and vary only slightly in how the model would be executed. The digital form provides greater capabilities: machine processability; linkage with associated data; globally accessible, information-rich biodiversity data; improved tracking, attribution, and annotation; and additional opportunities for data use and cross-disciplinary collaboration, forming the basis for FAIR (Findable, Accessible, Interoperable, Reusable) and equitable sharing of benefits worldwide. Recognizing the need to align the two closely related concepts, and to provide a place for open discussion around various topics of the Digital Extended Specimen (DES; the current working name for the joined concepts), we initiated a virtual consultation on the Discourse platform hosted by the Alliance for Biodiversity Knowledge through GBIF. This platform provided a forum for threaded discussions around topics related and relevant to the DES. The goals of the consultation align with those of the Alliance for Biodiversity Knowledge: expand participation in the process, build support for further collaboration, identify use cases, identify significant challenges and obstacles, and develop a comprehensive roadmap towards a global specification for data integration. In early 2021, Phase 1 launched with five topics: Making FAIR data for specimens accessible; Extending, enriching and integrating data; Annotating specimens and other data; Data attribution; and Analyzing/mining specimen data for novel applications. This round of discussion was productive and engaged dozens of contributors, with hundreds of posts and thousands of views. During Phase 1, several deeper, more technical, or additional topics of relevance were identified and formed the foundation for Phase 2, which began in May 2021 with the following topics: Robust access points and data infrastructure alignment; Persistent identifier (PID) scheme(s); Meeting legal/regulatory, ethical and sensitive data obligations; Workforce capacity development and inclusivity; Transactional mechanisms and provenance; and Partnerships to collaborate more effectively. In Phase 2, fruitful progress was made towards solutions to some of these complex functional and technical long-term goals.
Simultaneously, our commitment to open participation was reinforced through increased efforts to involve new voices from allied and complementary fields. Among the wealth of ideas expressed, the community highlighted the need for: unambiguous persistent identifiers and a dedicated agent to assign them; a fully linked system with robust publishing mechanisms; strong social structures that build trustworthiness of the system; appropriate attribution of legacy and new work; a system that is inclusive, removed from colonial practices, and supportive of creative uses of biodiversity data; a truly global data infrastructure; a balance between open access and legal obligations and ethical responsibilities; and the partnerships necessary for success. These two consultation periods, and the myriad activities surrounding the online discussion, produced a wide variety of perspectives, strategies, and approaches to converging the digital and extended specimen concepts and to progressing plans for the DES: steps necessary to improve access to research-ready data and advance our understanding of the diversity and distribution of life. Discussions continue, and we hope to include your contributions to the DES in future implementation plans.
  4. Phishing websites remain a persistent security threat. Thus far, machine learning approaches appear to have the best potential as defenses, but there are two main concerns with existing machine learning approaches to phishing detection. The first is the large number of training features used and the lack of validating arguments for these feature choices. The second is that the datasets used in the literature are inadvertently biased with respect to features based on the website URL or content. To address these concerns, we put forward the intuition that the domain name of a phishing website is the tell-tale sign of phishing and holds the key to successful detection. Accordingly, we design features that model the relationships, visual as well as statistical, of the domain name to the key elements of a phishing website that are used to snare end users. The main value of our feature design is that, to bypass detection, an attacker will find it very difficult to tamper with the visual content of the phishing website without arousing the suspicion of the end user. Our feature set ensures that there is minimal or no bias with respect to a dataset. Our learning model trains with only seven features and achieves a true positive rate of 98% and a classification accuracy of 97% on a sample dataset. Compared to state-of-the-art work, our per-instance classification is 4 times faster for legitimate websites and 10 times faster for phishing websites. Importantly, we demonstrate the shortcomings of using features based on URLs, as they are likely to be biased towards specific datasets. We show the robustness of our learning algorithm by testing on unknown live phishing URLs, achieving a detection accuracy of 99.7%.
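As an illustration of the domain-name intuition only (this is not the paper's actual feature set), one such feature could score the string similarity between the serving domain and a brand term displayed on the page:

```python
# Illustrative feature, not the paper's implementation: similarity between
# the serving domain and a brand term shown on the page. Phishing pages
# often display a trusted brand while being served from an unrelated domain.
from difflib import SequenceMatcher

def domain_brand_similarity(domain: str, brand: str) -> float:
    """Score 1.0 when the displayed brand matches the serving domain."""
    label = domain.lower().split(".")[0]  # e.g. "paypa1" from "paypa1.example.com"
    return SequenceMatcher(None, label, brand.lower()).ratio()

print(domain_brand_similarity("paypal.com", "PayPal"))        # high (~1.0)
print(domain_brand_similarity("secure-login.biz", "PayPal"))  # low
```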
  5. Large systematic revisionary projects incorporating data for hundreds or thousands of taxa require an integrative approach, with a strong biodiversity-informatics core for efficient data management to facilitate research on the group. Our original biodiversity informatics platform, 3i (Internet-accessible Interactive Identification), combined a customized MS Access database backend with ASP-based web interfaces to support revisionary syntheses of several large genera of leafhoppers (Hemiptera: Auchenorrhyncha: Cicadellidae). More recently, for our National Science Foundation sponsored project, "GoLife: Collaborative Research: Integrative genealogy, ecology and phenomics of deltocephaline leafhoppers (Hemiptera: Cicadellidae), and their microbial associates", we selected the new open-source platform TaxonWorks as the cyberinfrastructure. Within the scope of the project, the original "3i World Auchenorrhyncha Database" was imported into TaxonWorks. TaxonWorks now has many tools to automatically import nomenclature, citations, and specimen-based collection data. At the time of the initial migration of the 3i database, many of those tools were still under development, and the complexity of the data required a custom migration script, which is probably still the most efficient solution for importing datasets with a long development history. At present, the World Auchenorrhyncha Database comprehensively covers the nomenclature of the group and includes data on 70 valid families, 6,816 valid genera, and 47,064 valid species, as well as synonymy and subsequent combinations (Fig. 1). In addition, many taxon records include the original citation, bibliography, type information, etymology, etc. The bibliography of the group includes 37,579 sources, about one third of which are associated with PDF files. Species have distribution records, either derived from individual specimens or as country- and state-level asserted distributions, as well as biological associations indicating host plants, predators, and parasitoids. Observation matrices in TaxonWorks are designed to handle morphological data associated with taxa or specimens. The matrices may be used to automatically generate interactive identification keys and taxon descriptions. They can also be downloaded and imported, for example, into Lucid builder, or used to perform phylogenetic analysis with an external application. At the moment, there are 36 matrices associated with the project. The observation matrix from the GoLife project covers 798 taxa by 210 descriptors, most of which are qualitative multi-state morphological descriptors (Fig. 2). Illustrations are provided for 9,886 taxa and organized in a specialized image matrix that can serve as a pictorial key for identification of species and higher-rank taxa. For the phylogenetic analysis, a dataset was constructed for 730 terminal taxa and >160,000 nucleotide positions obtained using anchored hybrid enrichment of genomic DNA for a sample of leafhoppers from the subfamily Deltocephalinae and outgroups. The probe kit targets leafhopper genes, as well as some bacterial genes (endosymbionts and plant pathogens transmitted by leafhoppers). Maximum likelihood analyses of concatenated nucleotide and amino acid sequences, as well as coalescent gene tree analysis, yielded well-resolved phylogenetic trees (Cao et al. 2022). Raw sequence data have been uploaded to the Sequence Read Archive on GenBank.
Occurrence and morphological data, as well as diagnostic images, for voucher specimens have been incorporated into TaxonWorks. Data in TaxonWorks can be exported in raw format, accessed via an Application Programming Interface (API), or shared with external data aggregators such as Catalogue of Life, GBIF, and iDigBio.
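For example, a hedged sketch of reading records from a TaxonWorks instance over its JSON API; the instance URL, endpoint path, and token parameters below are assumptions to be checked against the TaxonWorks API documentation:

```python
# Hedged sketch of pulling data out of a TaxonWorks instance via its JSON API.
# The base URL, endpoint, and token parameter names are illustrative
# assumptions, not confirmed routes.
import requests

BASE = "https://sandbox.taxonworks.org/api/v1"  # hypothetical instance
params = {"token": "YOUR_API_TOKEN", "project_token": "YOUR_PROJECT_TOKEN"}

response = requests.get(f"{BASE}/otus", params=params, timeout=30)
response.raise_for_status()
for otu in response.json():
    print(otu.get("object_label"))
```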