
Title: Avenues into Integration: Communicating taxonomic intelligence from sender to recipient
“What is crucial for your ability to communicate with me… pivots on the recipient’s capacity to interpret—to make good inferential sense of the meanings that the declarer is able to send” (Rescher 2000, p. 148). Conventional approaches to reconciling taxonomic information in biodiversity databases have been based on string matching for unique taxonomic name combinations (Kindt 2020, Norman et al. 2020). However, in their original context, these names pertain to specific usages or taxonomic concepts, which can subsequently vary for the same name as applied by different authors. Name-based synonym matching is a helpful first step (Guala 2016, Correia et al. 2018), but may still leave considerable ambiguity regarding proper usage (Fig. 1). Therefore, developing "taxonomic intelligence" is the bioinformatic challenge to adequately represent, and subsequently propagate, this complex name/usage interaction across trusted biodiversity data networks. How do we ensure that senders and recipients of biodiversity data not only can share messages but do so with “good inferential sense” of their respective meanings? Key obstacles have involved dealing with the complexity of taxonomic name/usage modifications through time, both in terms of accounting for and digitally representing the long histories of taxonomic change in most lineages. An important critique of proposals to use name-to-usage relationships for data aggregation has been the difficulty of scaling them up to reach comprehensive coverage, in contrast to name-based global taxonomic hierarchies (Bisby 2011). The Linnaean system of nomenclature has some unfortunate design limitations in this regard, in that taxonomic names are not unique identifiers, their meanings may change over time, and the names as a string of characters do not encode their proper usage, i.e., the name “Genus species” does not specify a source defining how to use the name correctly (Remsen 2016, Sterner and Franz 2017).
In practice, many people provide taxonomic names in their datasets or publications but not a source specifying a usage. The information needed to map the relationships between names and usages in taxonomic monographs or revisions is typically not presented in a machine-readable format. New approaches are making progress on these obstacles. Theoretical advances in the representation of taxonomic intelligence have made it increasingly possible to implement efficient querying and reasoning methods on name-usage relationships (Chen et al. 2014, Chawuthai et al. 2016, Franz et al. 2015). Perhaps most importantly, growing efforts to produce name-usage mappings on a medium scale by data providers and taxonomic authorities suggest an all-or-nothing approach is not required. Multiple high-profile biodiversity databases have implemented internal tools for explicitly tracking conflicting or dynamic taxonomic classifications, including eBird using concept relationships from AviBase (Lepage et al. 2014); NatureServe in its Biotics database; iNaturalist using its taxon framework (Loarie 2020); and the UNITE database for fungi (Nilsson et al. 2019). Other ongoing projects incorporating taxonomic intelligence include the Flora of Alaska (Flora of Alaska 2020), the Mammal Diversity Database (Mammal Diversity Database 2020) and PollardBase for butterfly population monitoring (Campbell et al. 2020).
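The difference between matching on a name string and matching on a name together with the source that defines its usage can be illustrated with a minimal Python sketch. The record layout, the source labels, and the helper names `usage_key` and `group_by` are hypothetical and chosen only for illustration; they are not taken from any of the databases cited in the abstract.

```python
from collections import defaultdict

# Hypothetical occurrence records. "sec" holds the source that defines how
# the name is used; it is None when the data provider cited no usage.
records = [
    {"id": 1, "name": "Genus species", "sec": "Author A 1990"},
    {"id": 2, "name": "Genus species", "sec": "Author B 2015"},
    {"id": 3, "name": "Genus species", "sec": None},
]


def usage_key(record):
    """Return a (name, source) pair identifying a name usage."""
    return (record["name"], record["sec"])


def group_by(items, key):
    """Group record ids by the value that `key` returns for each record."""
    groups = defaultdict(list)
    for item in items:
        groups[key(item)].append(item["id"])
    return dict(groups)


# Name-only matching merges all three records into one group.
by_name = group_by(records, lambda r: r["name"])

# Name-plus-source matching keeps the two cited usages apart and leaves the
# record without a source in its own, explicitly unresolved group.
by_usage = group_by(records, usage_key)
```

In this sketch the name-only key hides the fact that records 1 and 2 may refer to different taxonomic concepts, while the usage key makes the missing source on record 3 visible as something to resolve.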
Award ID(s):
1754731
NSF-PAR ID:
10204125
Journal Name:
Biodiversity Information Science and Standards
Volume:
4
ISSN:
2535-0897
Sponsoring Org:
National Science Foundation
More Like this
  1. Making the most of biodiversity data requires linking observations of biological species from multiple sources both efficiently and accurately (Bisby 2000, Franz et al. 2016). Aggregating occurrence records using taxonomic names and synonyms is computationally efficient but known to experience significant limitations on accuracy when the assumption of one-to-one relationships between names and biological entities breaks down (Remsen 2016, Franz and Sterner 2018). Taxonomic treatments and checklists provide authoritative information about the correct usage of names for species, including operational representations of the meanings of those names in the form of range maps, reference genetic sequences, or diagnostic traits. They increasingly provide taxonomic intelligence in the form of precise description of the semantic relationships between different published names in the literature. Making this authoritative information Findable, Accessible, Interoperable, and Reusable (FAIR; Wilkinson et al. 2016) would be a transformative advance for biodiversity data sharing and help drive adoption and novel extensions of existing standards such as the Taxonomic Concept Schema and the OpenBiodiv Ontology (Kennedy et al. 2006, Senderov et al. 2018). We call for the greater, global Biodiversity Information Standards (TDWG) and taxonomy community to commit to extending and expanding on how FAIR applies to biodiversity data and include practical targets and criteria for the publication and digitization of taxonomic concept representations and alignments in taxonomic treatments, checklists, and backbones. As a motivating case, consider the abundantly sampled North American deer mouse, Peromyscus maniculatus (Wagner 1845), which was recently split from one continental species into five more narrowly defined forms, so that the name P. maniculatus is now only applied east of the Mississippi River (Bradley et al. 2019, Greenbaum et al. 2019).
That single change instantly rendered ambiguous ~7% of North American mammal records in the Global Biodiversity Information Facility (n=242,663, downloaded 2021-06-04; GBIF.org 2021) and one third of all National Ecological Observatory Network (NEON) small mammal samples (n=10,256, downloaded 2021-06-27). While this type of ambiguity is common in name-based databases when species are split, the example of P. maniculatus is particularly striking for its impact upon biological questions ranging from hantavirus surveillance in North America to studies of climate change impacts upon rodent life-history traits. Of special relevance to NEON sampling is recent evidence suggesting deer mice potentially transmit SARS-CoV-2 (Griffin et al. 2021). Automating the updating of occurrence records in such cases and others will require operational representations of taxonomic concepts (e.g., range maps, reference sequences, and diagnostic traits) that are FAIR in addition to taxonomic concept alignment information (Franz and Peet 2009). Despite steady progress, it remains difficult to find, access, and reuse authoritative information about how to apply taxonomic names even when it is already digitized. It can also be difficult to tell without manual inspection whether similar types of concept representations derived from multiple sources, such as range maps or reference sequences selected from different research articles or checklists, are in fact interoperable for a particular application. The issue is therefore different from important ongoing efforts to digitize trait information in species circumscriptions, for example, and focuses on how already digitized knowledge can best be packaged to inform human experts and artificial intelligence applications (Sterner and Franz 2017).
We therefore propose developing community guidelines and criteria for FAIR taxonomic concept representations as "semantic artefacts" of general relevance to linked open data and life sciences research (Le Franc et al. 2020).
  2. The taxonomic foundation of a new regional flora or monograph is the reconciliation of pre-existing names and taxonomic concepts (i.e., variation in usage of those names). This reconciliation is traditionally done manually, but the availability of taxonomic resources online and of text manipulation software means that some of the work can now be automated, speeding up the development of new taxonomic products. As a contribution to developing a new Flora of Alaska (floraofalaska.org), we have digitized the main pre-existing flora (Hultén 1968) and combined it with key online taxonomic name sources (Panarctic Flora, Flora of North America, International Plant Names Index - IPNI, Tropicos, Kew’s World Checklist of Selected Plant Families), to build a canonical list of names anchored to external Globally Unique Identifiers (GUIDs) (e.g., IPNI URLs). We developed taxonomically-aware fuzzy-matching software (matchnames, Webb 2020) to identify cognates in different lists. The taxa for which there are variations between different sources in accepted names and synonyms are then flagged for review by taxonomic experts. However, even though names may be consistent across previous monographs and floras, the taxonomic concept (or circumscription) of a name may differ among authors, meaning that the way an accepted name in the flora is applied may be unfamiliar to the users of previous floras. We therefore have begun to manually align taxonomic concepts across five existing floras: Panarctic Flora, Flora of North America, Cody’s Flora of the Yukon (Cody 2000), Welsh’s Flora (Welsh 1974) and Hultén’s Flora (Hultén 1968), analysing usage and recording the Region Connection Calculus (RCC-5) relationships between taxonomic concepts common to each source. So far, we have mapped taxa in 13 genera, containing 557 taxonomic concepts and 482 taxonomic concept relationships.
To facilitate this alignment process we developed software (tcm, Webb 2021) to record publications, names, taxonomic concepts and relationships, and to visualize the taxonomic concept relationships as graphs. These relationship graphs have proved to be accessible and valuable in discussing the frequently complex shifts in circumscription with the taxonomic experts who have reviewed the work. The taxonomic concept data are being integrated into the larger dataset to permit users of the new flora to instantly see both the chain of synonymy and concept map for any name. We have also worked with the developer of the Arctos Collection Management Solution (a database used for the majority of Alaskan collections) on new data tables for storage and display of taxonomic concept data. In this presentation, we will describe some of the ideas and workflows that may be of value to others working to connect across taxonomic resources.
  3. We provide an overview and update on initiatives and approaches to add taxonomic data intelligence to distributed biodiversity knowledge networks. "Taxonomic intelligence" for biodiversity data is defined here as the ability to identify and reconcile source-contextualized taxonomic name-to-meaning relationships (Remsen 2016). We review the scientific opportunities, as well as information-technological and socio-economic pathways - both existing and envisioned - to embed de-centralized taxonomic data intelligence into the biodiversity data publication and knowledge integration processes. We predict that the success of this project will ultimately rest on our ability to up-value the roles and recognition of systematic expertise and experts in large, aggregated data environments. We will argue that these environments will need to adhere to criteria for responsible data science and interests of coherent communities of practice (Wenger 2000, Stoyanovich et al. 2017). This means allowing for fair, accountable, and transparent representation and propagation of evolving systematic knowledge and enduring or newly apparent conflict in systematic perspective (Sterner and Franz 2017, Franz and Sterner 2018, Sterner et al. 2019). We will demonstrate in principle and through concrete use cases, how to de-centralize systematic knowledge while maintaining alignments between congruent or conflicting taxonomic concept labels (Franz et al. 2016a, Franz et al. 2016b, Franz et al. 2019). The suggested approach uses custom-configured logic representation and reasoning methods, based on the Region Connection Calculus (RCC-5) alignment language. The approach offers syntactic consistency and semantic applicability or scalability across a wide range of biodiversity data products, ranging from occurrence records to phylogenomic trees.
We will also illustrate how this kind of taxonomic data intelligence can be captured and propagated through existing or envisioned metadata conventions and standards (e.g., Senderov et al. 2018). Having established an intellectual opportunity, as well as a technical solution pathway, we turn to the issue of developing an implementation and adoption strategy. Which biodiversity data environments are currently the most taxonomically intelligent, and why? How is this level of taxonomic data intelligence created, maintained, and propagated outward? How are taxonomic data intelligence services motivated or incentivized, both at the level of individuals and organizations? Which "concerned entities" within the greater biodiversity data publication enterprise are best positioned to promote such services? Are the most valuable lessons for biodiversity data science "hidden" in successful social media applications? What are good, feasible, incremental steps towards improving taxonomic data intelligence for a diversity of data publishers?
  4. Translating information between the domains of systematics and conservation requires novel information management designs. Such designs should improve interactions across the trading zone between the domains, herein understood as the model according to which knowledge and uncertainty are productively translated in both directions (cf. Collins et al. 2019). Two commonly held attitudes stand in the way of designing a well-functioning systematics-to-conservation trading zone. On one side, there are calls to unify the knowledge signal produced by systematics, underpinned by the argument that such unification is a necessary precondition for conservation policy to be reliably expressed and enacted (e.g., Garnett et al. 2020). As a matter of legal scholarship, the argument for systematic unity by legislative necessity is principally false (Weiss 2003, MacNeil 2009, Chromá 2011), but perhaps effective enough as a strategy to win over audiences unsure about robust law-making practices in light of variable and uncertain knowledge. On the other side, there is an attitude that conservation cannot ever restrict the academic freedom of systematics as a scientific discipline (e.g., Raposo et al. 2017). This otherwise sound argument misses the mark in the context of designing a productive trading zone with conservation. The central interactional challenge is not whether the systematic knowledge can vary at a given time and/or evolve over time, but whether these signal dynamics are tractable in ways that actors can translate into robust maxims for conservation. Redesigning the trading zone should rest on the (historically validated) projection that systematics will continue to attract generations of inspired, productive researchers and broad-based societal support, frequently leading to protracted conflicts and dramatic shifts in how practitioners in the field organize and identify organismal lineages subject to conservation.
This confident outlook for systematics' future, in turn, should refocus the challenge of designing the trading zone as one of building better information services to model the concurrent conflicts and longer-term evolution of systematic knowledge. It would seem unreasonable to expect the International Union for Conservation of Nature (IUCN) Red List Index to develop better data science models for the dynamics of systematic knowledge (cf. Hoffmann et al. 2011) than are operational in the most reputable information systems designed and used by domain experts (Burgin et al. 2018). The reasonable challenge from conservation to systematics is not to stop being a science but to be a better data science. In this paper, we will review advances in biodiversity data science in relation to representing and reasoning over changes in systematic knowledge with computational logic, i.e., modeling systematic intelligence (Franz et al. 2016). We stress-test this approach with a use case where rapid systematic signal change and high stakes for conservation action intersect, i.e., the Malagasy mouse lemurs (Microcebus É. Geoffroy, 1834 sec. Schüßler et al. 2020), where the number of recognized species-level concepts has risen from 2 to 25 in the span of 38 years (1982–2020). As much as scientifically defensible, we extend our modeling approach to the level of individual published occurrence records, where the inability to do so sometimes reflects substandard practice but more importantly reveals systemic inadequacies in biodiversity data science or informational modeling.
In the absence of shared, sound theoretical foundations to assess taxonomic congruence or incongruence across treatments, and in the absence of biodiversity data platforms capable of propagating logic-enabled, scalable occurrence-to-concept identification events to produce alternative and succeeding distribution maps, there is no robust way to provide a knowledge signal from systematics to conservation that is both consistent in its syntax and accurate in its semantics, in the sense of accurately reflecting the variation and uncertainty that exists across multiple systematic perspectives. Translating this diagnosis into new designs for the trading zone is only one "half" of the solution, i.e., a technical advancement that then would need to be socially endorsed and incentivized by systematic and conservation communities motivated to elevate their collaborative interactions and trade robustly in inherently variable and uncertain information.
  5. All life on earth is linked by a shared evolutionary history. Even before Darwin developed the theory of evolution, Linnaeus categorized types of organisms based on their shared traits. We now know these traits derived from these species’ shared ancestry. This evolutionary history provides a natural framework to harness the enormous quantities of biological data being generated today. The Open Tree of Life project is a collaboration developing tools to curate and share evolutionary estimates (phylogenies) covering the entire tree of life (Hinchliff et al. 2015, McTavish et al. 2017). The tree is viewable at https://tree.opentreeoflife.org, and the data is all freely available online. The taxon identifiers used in the Open Tree unified taxonomy (Rees and Cranston 2017) are mapped to identifiers across biological informatics databases, including the Global Biodiversity Information Facility (GBIF), NCBI, and others. Linking these identifiers allows researchers to easily unify data from across these different resources (Fig. 1). Leveraging a unified evolutionary framework across the diversity of life provides new avenues for integrative, wide-scale research. Downstream tools, such as R packages developed by the rOpenSci foundation (rotl, rgbif) (Michonneau et al. 2016, Chamberlain 2017) and other tools (Revell 2012), make accessing and combining this information straightforward for students as well as researchers (e.g., https://mctavishlab.github.io/BIO144/labs/rotl-rgbif.html). Figure 1. Example linking phylogenetic relationships accessed from the Open Tree of Life with specimen location data from Global Biodiversity Information Facility. For example, a recent publication by Santorelli et al. 2018 linked evolutionary information from Open Tree with species locality data gathered from a local field study as well as GBIF species location records to test a river-barrier hypothesis in the Amazon.
By combining these data, the authors were able to test a widely held biogeographic hypothesis across 1952 species in 14 taxonomic groups, and found that a river that had been postulated to drive endemism was in fact not a barrier to gene flow. However, data provenance and taxonomic name reconciliation remain key hurdles to applying data from these large digital biodiversity and evolution community resources to answering biological questions. In the Amazonian river analysis, while the authors used GBIF records as a secondary check on their species records, they relied on an intensive local field study for their major conclusions, and preferred taxon-specific phylogenetic resources over Open Tree where they were available (Santorelli et al. 2018). When Li et al. 2018 assessed large-scale phylogenetic approaches, including Open Tree, for measuring community diversity, they found that synthesis phylogenies were less resolved than purpose-built phylogenies, but also found that these synthetic phylogenies were sufficient for community-level phylogenetic diversity analyses. Nonetheless, data quality concerns have limited adoption of analyses that use data from centralized resources (McTavish et al. 2017). Taxonomic name recognition and reconciliation across databases also remains a hurdle for large-scale analyses, despite several ongoing efforts to improve taxonomic interoperability and unify taxonomies, such as Catalogue of Life + (Bánki et al. 2018). In order to support innovative science, large-scale digital data resources need to facilitate data linkage between resources, and address researchers' data quality and provenance concerns. I will present the model that the Open Tree of Life is using to provide evolutionary data at the scale of the entire tree of life, while maintaining traceable provenance to the publications and taxonomies these evolutionary relationships are inferred from.
I will discuss the hurdles to adoption of these large scale resources by researchers, as well as the opportunities for new research avenues provided by the connections between evolutionary inferences and biodiversity digital databases.
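Two of the abstracts listed above (items 2 and 3) describe recording Region Connection Calculus (RCC-5) relationships between taxonomic concepts. The following Python sketch shows the five RCC-5 relations under one loud simplifying assumption: each concept is modeled as a non-empty set of member identifiers, and the relation is derived from set comparison. In the work described above, experts assert these relationships by analysing usage; deriving them from sets is a substitution made here for illustration. The concept labels, source names, and member identifiers are all hypothetical, and the code does not reflect the data model of any software named in those abstracts.

```python
from enum import Enum


class RCC5(Enum):
    EQ = "is congruent with"
    PP = "is properly included in"
    PPI = "properly includes"
    PO = "partially overlaps with"
    DR = "is disjoint from"


def rcc5(a: frozenset, b: frozenset) -> RCC5:
    """Return the RCC-5 relation of concept a to concept b."""
    if not a or not b:
        raise ValueError("both concepts must have at least one member")
    if a == b:
        return RCC5.EQ
    if a < b:  # a is a proper subset of b
        return RCC5.PP
    if a > b:  # a is a proper superset of b
        return RCC5.PPI
    if a.isdisjoint(b):
        return RCC5.DR
    return RCC5.PO


# Hypothetical concepts keyed by (name, source), each with assumed members.
concepts = {
    ("Aus bus", "Flora X"): frozenset({"s1", "s2", "s3"}),
    ("Aus bus", "Flora Y"): frozenset({"s1", "s2"}),
    ("Aus cus", "Flora Y"): frozenset({"s3", "s4"}),
}


def align(concept_map):
    """List the RCC-5 relation for every unordered pair of concepts."""
    keys = sorted(concept_map)
    return [
        (k1, rcc5(concept_map[k1], concept_map[k2]), k2)
        for i, k1 in enumerate(keys)
        for k2 in keys[i + 1:]
    ]


alignment = align(concepts)
```

Here the same name, "Aus bus", labels a broader concept in one hypothetical source than in the other, which is the kind of shift in circumscription that a name-only comparison would miss.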