We provide an overview and update on initiatives and approaches to add taxonomic data intelligence to distributed biodiversity knowledge networks. "Taxonomic intelligence" for biodiversity data is defined here as the ability to identify and reconcile source-contextualized taxonomic name-to-meaning relationships (Remsen 2016). We review the scientific opportunities, as well as information-technological and socio-economic pathways - both existing and envisioned - to embed de-centralized taxonomic data intelligence into the biodiversity data publication and knowledge integration processes. We predict that the success of this project will ultimately rest on our ability to up-value the roles and recognition of systematic expertise and experts in large, aggregated data environments. We will argue that these environments will need to adhere to criteria for responsible data science and to the interests of coherent communities of practice (Wenger 2000, Stoyanovich et al. 2017). This means allowing for fair, accountable, and transparent representation and propagation of evolving systematic knowledge and of enduring or newly apparent conflict in systematic perspective (Sterner and Franz 2017, Franz and Sterner 2018, Sterner et al. 2019). We will demonstrate, in principle and through concrete use cases, how to de-centralize systematic knowledge while maintaining alignments between congruent or conflicting taxonomic concept labels (Franz et al. 2016a, Franz et al. 2016b, Franz et al. 2019). The suggested approach uses custom-configured logic representation and reasoning methods, based on the Region Connection Calculus (RCC-5) alignment language. The approach offers syntactic consistency and semantic applicability or scalability across a wide range of biodiversity data products, ranging from occurrence records to phylogenomic trees.
We will also illustrate how this kind of taxonomic data intelligence can be captured and propagated through existing or envisioned metadata conventions and standards (e.g., Senderov et al. 2018). Having established an intellectual opportunity, as well as a technical solution pathway, we turn to the issue of developing an implementation and adoption strategy. Which biodiversity data environments are currently the most taxonomically intelligent, and why? How is this level of taxonomic data intelligence created, maintained, and propagated outward? How are taxonomic data intelligence services motivated or incentivized, both at the level of individuals and organizations? Which "concerned entities" within the greater biodiversity data publication enterprise are best positioned to promote such services? Are the most valuable lessons for biodiversity data science "hidden" in successful social media applications? What are good, feasible, incremental steps towards improving taxonomic data intelligence for a diversity of data publishers?
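As a rough illustration of the RCC-5 alignment language mentioned above, the following minimal sketch derives one of the five RCC-5 relations between two taxonomic concepts. It assumes, purely for illustration, that each concept can be represented as a set of member identifiers; the cited work relies on logic representation and reasoning rather than direct set comparison, and the names `RCC5`, `rcc5_relation`, and the example concept labels are hypothetical.

```python
from enum import Enum


class RCC5(Enum):
    """The five RCC-5 relations between two regions (here: taxonomic concepts)."""
    CONGRUENT = "=="     # both concepts have the same extension
    INCLUDES = ">"       # first concept properly includes the second
    INCLUDED_IN = "<"    # first concept is properly included in the second
    OVERLAPS = "><"      # concepts share some, but not all, members
    DISJOINT = "!"       # concepts share no members


def rcc5_relation(a: frozenset, b: frozenset) -> RCC5:
    """Return the RCC-5 relation of concept a to concept b.

    Assumption: each concept is a non-empty set of member identifiers
    (e.g. specimens or child concepts). This is a simplification.
    """
    if not a or not b:
        raise ValueError("RCC-5 relations are defined for non-empty regions")
    if a == b:
        return RCC5.CONGRUENT
    if a > b:   # proper superset
        return RCC5.INCLUDES
    if a < b:   # proper subset
        return RCC5.INCLUDED_IN
    if a & b:   # non-empty intersection, neither contains the other
        return RCC5.OVERLAPS
    return RCC5.DISJOINT


# Hypothetical concepts: the same name used in a broad and a narrow sense.
broad_usage = frozenset({"pop_east", "pop_west", "pop_north"})   # "Aus bus sec. Author A"
narrow_usage = frozenset({"pop_east"})                           # "Aus bus sec. Author B"
other_usage = frozenset({"pop_west", "pop_south"})               # "Aus cus sec. Author B"

print(rcc5_relation(broad_usage, narrow_usage).value)
print(rcc5_relation(narrow_usage, other_usage).value)
print(rcc5_relation(broad_usage, other_usage).value)
```

The point of the sketch is that the same name string ("Aus bus") can label concepts standing in different relations to one another, so the relation is attached to name-plus-source pairs rather than to names alone.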
Combining Machine Learning & Reasoning for Biodiversity Data Intelligence
The current crisis in global natural resource management makes it imperative that we better leverage the vast data sources associated with taxonomic entities (such as recognized species of plants and animals), which are known collectively as biodiversity data. However, these data pose considerable challenges for artificial intelligence: while growing rapidly in volume, they remain highly incomplete for many taxonomic groups, often show conflicting signals from different sources, and are multi-modal and therefore constantly changing in structure. In this paper, we motivate, describe, and present a novel workflow combining machine learning and automated reasoning, to discover patterns of taxonomic identity and change - i.e. “taxonomic intelligence” - leading to scalable and broadly impactful AI solutions within the bio-data realm.
- Award ID(s): 1827993
- PAR ID: 10302878
- Publisher / Repository: AAAI
- Date Published:
- Journal Name: Proceedings of the AAAI Conference on Artificial Intelligence
- Volume: 35
- Issue: 17
- ISSN: 2159-5399
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
-
Making the most of biodiversity data requires linking observations of biological species from multiple sources both efficiently and accurately (Bisby 2000, Franz et al. 2016). Aggregating occurrence records using taxonomic names and synonyms is computationally efficient but known to experience significant limitations on accuracy when the assumption of one-to-one relationships between names and biological entities breaks down (Remsen 2016, Franz and Sterner 2018). Taxonomic treatments and checklists provide authoritative information about the correct usage of names for species, including operational representations of the meanings of those names in the form of range maps, reference genetic sequences, or diagnostic traits. They increasingly provide taxonomic intelligence in the form of precise descriptions of the semantic relationships between different published names in the literature. Making this authoritative information Findable, Accessible, Interoperable, and Reusable (FAIR; Wilkinson et al. 2016) would be a transformative advance for biodiversity data sharing and help drive adoption and novel extensions of existing standards such as the Taxonomic Concept Schema and the OpenBiodiv Ontology (Kennedy et al. 2006, Senderov et al. 2018). We call for the greater, global Biodiversity Information Standards (TDWG) and taxonomy community to commit to extending and expanding on how FAIR applies to biodiversity data and include practical targets and criteria for the publication and digitization of taxonomic concept representations and alignments in taxonomic treatments, checklists, and backbones. As a motivating case, consider the abundantly sampled North American deer mouse, Peromyscus maniculatus (Wagner 1845), which was recently split from one continental species into five more narrowly defined forms, so that the name P. maniculatus is now only applied east of the Mississippi River (Bradley et al. 2019, Greenbaum et al. 2019).
That single change instantly rendered ambiguous ~7% of North American mammal records in the Global Biodiversity Information Facility (n=242,663, downloaded 2021-06-04; GBIF.org 2021) and ⅓ of all National Ecological Observatory Network (NEON) small mammal samples (n=10,256, downloaded 2021-06-27). While this type of ambiguity is common in name-based databases when species are split, the example of P. maniculatus is particularly striking for its impact upon biological questions ranging from hantavirus surveillance in North America to studies of climate change impacts upon rodent life-history traits. Of special relevance to NEON sampling is recent evidence suggesting deer mice potentially transmit SARS-CoV-2 (Griffin et al. 2021). Automating the updating of occurrence records in such cases and others will require operational representations of taxonomic concepts (e.g., range maps, reference sequences, and diagnostic traits) that are FAIR in addition to taxonomic concept alignment information (Franz and Peet 2009). Despite steady progress, it remains difficult to find, access, and reuse authoritative information about how to apply taxonomic names even when it is already digitized. It can also be difficult to tell without manual inspection whether similar types of concept representations derived from multiple sources, such as range maps or reference sequences selected from different research articles or checklists, are in fact interoperable for a particular application. The issue is therefore different from important ongoing efforts to digitize trait information in species circumscriptions, for example, and focuses on how already digitized knowledge can best be packaged to inform human experts and artificial intelligence applications (Sterner and Franz 2017).
We therefore propose developing community guidelines and criteria for FAIR taxonomic concept representations as "semantic artefacts" of general relevance to linked open data and life sciences research (Le Franc et al. 2020).
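The ambiguity described in the deer mouse example can be sketched in a few lines. This is a minimal illustration, assuming occurrence records carry the Darwin Core terms `scientificName` and, optionally, `nameAccordingTo`; the function `flag_ambiguous`, the record identifiers, and the sample records are hypothetical and are not taken from GBIF or NEON data.

```python
def flag_ambiguous(records, split_names):
    """Return ids of records whose name was affected by a taxonomic split
    and which cite no source (usage) for that name.

    Assumption: a record is unambiguous only if it states which treatment
    its name follows; name-only records cannot be reassigned automatically.
    """
    flagged = []
    for record in records:
        name = record.get("scientificName", "").strip()
        source = (record.get("nameAccordingTo") or "").strip()
        if name in split_names and not source:
            flagged.append(record["id"])
    return flagged


# Hypothetical records for illustration only.
records = [
    {"id": "rec-1", "scientificName": "Peromyscus maniculatus"},
    {"id": "rec-2", "scientificName": "Peromyscus maniculatus",
     "nameAccordingTo": "Bradley et al. 2019"},
    {"id": "rec-3", "scientificName": "Mus musculus"},
]

ambiguous_ids = flag_ambiguous(records, split_names={"Peromyscus maniculatus"})
print(ambiguous_ids)
```

Resolving the flagged records, rather than merely detecting them, is where the operational concept representations discussed above (range maps, reference sequences, diagnostic traits) would come in.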
-
“What is crucial for your ability to communicate with me… pivots on the recipient’s capacity to interpret—to make good inferential sense of the meanings that the declarer is able to send” (Rescher 2000, p148). Conventional approaches to reconciling taxonomic information in biodiversity databases have been based on string matching for unique taxonomic name combinations (Kindt 2020, Norman et al. 2020). However, in their original context, these names pertain to specific usages or taxonomic concepts, which can subsequently vary for the same name as applied by different authors. Name-based synonym matching is a helpful first step (Guala 2016, Correia et al. 2018), but may still leave considerable ambiguity regarding proper usage (Fig. 1). Therefore, developing "taxonomic intelligence" is the bioinformatic challenge to adequately represent, and subsequently propagate, this complex name/usage interaction across trusted biodiversity data networks. How do we ensure that senders and recipients of biodiversity data not only can share messages but do so with “good inferential sense” of their respective meanings? Key obstacles have involved dealing with the complexity of taxonomic name/usage modifications through time, both in terms of accounting for and digitally representing the long histories of taxonomic change in most lineages. An important critique of proposals to use name-to-usage relationships for data aggregation has been the difficulty of scaling them up to reach comprehensive coverage, in contrast to name-based global taxonomic hierarchies (Bisby 2011). The Linnaean system of nomenclature has some unfortunate design limitations in this regard, in that taxonomic names are not unique identifiers, their meanings may change over time, and the names as a string of characters do not encode their proper usage, i.e., the name “Genus species” does not specify a source defining how to use the name correctly (Remsen 2016, Sterner and Franz 2017).
In practice, many people provide taxonomic names in their datasets or publications but not a source specifying a usage. The information needed to map the relationships between names and usages in taxonomic monographs or revisions is typically not presented in a machine-readable format. New approaches are making progress on these obstacles. Theoretical advances in the representation of taxonomic intelligence have made it increasingly possible to implement efficient querying and reasoning methods on name-usage relationships (Chen et al. 2014, Chawuthai et al. 2016, Franz et al. 2015). Perhaps most importantly, growing efforts to produce name-usage mappings on a medium scale by data providers and taxonomic authorities suggest an all-or-nothing approach is not required. Multiple high-profile biodiversity databases have implemented internal tools for explicitly tracking conflicting or dynamic taxonomic classifications, including eBird using concept relationships from AviBase (Lepage et al. 2014); NatureServe in its Biotics database; iNaturalist using its taxon framework (Loarie 2020); and the UNITE database for fungi (Nilsson et al. 2019). Other ongoing projects incorporating taxonomic intelligence include the Flora of Alaska (Flora of Alaska 2020), the Mammal Diversity Database (Mammal Diversity Database 2020) and PollardBase for butterfly population monitoring (Campbell et al. 2020).
-
Classification of the biological diversity on Earth is foundational to all areas of research within the natural sciences. Reliable biological nomenclatural and taxonomic systems facilitate efficient access to information about organisms and their names over time. However, broadly sharing, accessing, delivering, and updating these resources remains a persistent problem. This barrier has been acknowledged by the biodiversity data sharing community, yet concrete efforts to standardize and continually update taxonomic names in a sustainable way remain limited. High diversity groups such as arthropods are especially challenging as available specimen data per number of species is substantially lower than for vertebrate or plant groups. The Terrestrial Parasite Tracker Thematic Collections Network project developed a workflow for gathering expert-verified taxonomic names across all available sources, aligning those sources, and publishing a single resource that provides a model for future endeavors to standardize digital specimen identification data. The process involved gathering expert-verified nomenclature lists representing the full taxonomic scope of terrestrial arthropod parasites, documenting issues experienced, and finding potential solutions for reconciliation of taxonomic resources against large data publishers. Although discordance between our expert resources and the Global Biodiversity Information Facility is relatively low, the impact across all taxa affects thousands of names that correspond to hundreds of thousands of specimen records. Here, we demonstrate a mechanism for the delivery and continued maintenance of these taxonomic resources, while highlighting the current state of taxon name curation for biodiversity data sharing.
-
The taxonomic foundation of a new regional flora or monograph is the reconciliation of pre-existing names and taxonomic concepts (i.e., variation in usage of those names). This reconciliation is traditionally done manually, but the availability of taxonomic resources online and of text manipulation software means that some of the work can now be automated, speeding up the development of new taxonomic products. As a contribution to developing a new Flora of Alaska (floraofalaska.org), we have digitized the main pre-existing flora (Hultén 1968) and combined it with key online taxonomic name sources (Panarctic Flora, Flora of North America, International Plant Names Index - IPNI, Tropicos, Kew’s World Checklist of Selected Plant Families), to build a canonical list of names anchored to external Globally Unique Identifiers (GUIDs) (e.g., IPNI URLs). We developed taxonomically-aware fuzzy-matching software (matchnames, Webb 2020) to identify cognates in different lists. The taxa for which there are variations between different sources in accepted names and synonyms are then flagged for review by taxonomic experts. However, even though names may be consistent across previous monographs and floras, the taxonomic concept (or circumscription) of a name may differ among authors, meaning that the way an accepted name in the flora is applied may be unfamiliar to the users of previous floras. We therefore have begun to manually align taxonomic concepts across five existing floras: Panarctic Flora, Flora of North America, Cody’s Flora of the Yukon (Cody 2000), Welsh’s Flora (Welsh 1974) and Hultén’s Flora (Hultén 1968), analysing usage and recording the Region Connection Calculus (RCC-5) relationships between taxonomic concepts common to each source. So far, we have mapped taxa in 13 genera, containing 557 taxonomic concepts and 482 taxonomic concept relationships.
To facilitate this alignment process we developed software (tcm, Webb 2021) to record publications, names, taxonomic concepts and relationships, and to visualize the taxonomic concept relationships as graphs. These relationship graphs have proved to be accessible and valuable in discussing the frequently complex shifts in circumscription with the taxonomic experts who have reviewed the work. The taxonomic concept data are being integrated into the larger dataset to permit users of the new flora to instantly see both the chain of synonymy and concept map for any name. We have also worked with the developer of the Arctos Collection Management Solution (a database used for the majority of Alaskan collections) on new data tables for storage and display of taxonomic concept data. In this presentation, we will describe some of the ideas and workflows that may be of value to others working to connect across taxonomic resources.
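The idea of identifying cognate names across lists, as described above, can be sketched with generic string similarity. This is not the matchnames software and makes no claim about how it works; it is a stand-in that uses Python's standard-library `difflib`, and the function `match_names`, the `cutoff` value, and the sample name lists are assumptions chosen for illustration.

```python
import difflib


def match_names(query_names, reference_names, cutoff=0.85):
    """Map each query name to its closest reference name, or None.

    Names are compared after lower-casing and collapsing whitespace.
    The cutoff is an arbitrary illustrative threshold on difflib's
    similarity ratio, not a recommended value.
    """
    # Normalized form -> original reference spelling.
    normalized = {" ".join(name.lower().split()): name for name in reference_names}
    matches = {}
    for query in query_names:
        key = " ".join(query.lower().split())
        hits = difflib.get_close_matches(key, list(normalized), n=1, cutoff=cutoff)
        matches[query] = normalized[hits[0]] if hits else None
    return matches


# Hypothetical lists: one misspelled variant and one name with no cognate.
reference_list = ["Salix arctica", "Carex aquatilis"]
incoming_list = ["Salix artica", "Betula nana"]

result = match_names(incoming_list, reference_list)
print(result)
```

Names that find no match, or that match with a different accepted status in another source, would be the ones passed to taxonomic experts for review. A purely string-based match says nothing about whether the two sources circumscribe the taxon in the same way, which is why the concept alignment step remains separate.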

