skip to main content


Title: TaxonWorks: Character state matrices and identification tools
TaxonWorks is an integrated web-based application for practicing taxonomists and biodiversity specialists. It is focused on promoting collaboration between researchers and developers. TaxonWorks has a modular structure that enables various components of the application to target specific needs and requirements of different groups of users. Specific areas of interest may include nomenclature-related tasks (Yoder and Dmitriev 2021) designed to help assemble and validate scientific name checklists of a target group of organisms; and collection management tasks, including interfaces to create, filter, and edit collecting events, collection objects, and loans. This presentation focuses on matrix-related tools integrated into TaxonWorks. A matrix, which could either be used for phylogenetic analysis or to build an identification key, is structured as a table where columns represent numerous characters that could be used to describe a set of entities, taxa or specimens (presented as rows of the table). Each cell of the table may contain observations for specific character/entity combinations. TaxonWorks does not generate a table for each a particular matrix—all observations are stored as graphs. This structure allows building of a matrix of an unlimited size as well as reuse of individual observations in multiple matrices. For matrix columns, TaxonWorks supports a variety of different kinds of characters or descriptors: qualitative, presence/absence, quantitative, sample, gene, free text, and media. Each character may have specific properties, for example a qualitative descriptor may have numerous characters states, and a quantitative descriptor may have a measurement unit defined. For an entity in a matrix row, TaxonWorks supports either collection objects (specimens) or taxa as Operational Taxonomic Units (OTU). OTUs could either be linked to nomenclature or be stand alone entities (e.g., representing undescribed species). The matrix, once built, could serve several purposes. A matrix based on qualitative and quantitative characters could be used to build an interactive key (Fig. 1), construct standardized natural language descriptions for each entity, and determine a diagnosis (a minimal set of characters that separate one entity from all others). It could also be exported and used for phylogenetic analysis or to build an interactive key in an external application. TaxonWorks supports export files in several formats, including Nexus, TNT, NeXML. Application Programming Interfaces (API) are also available. A matrix based on media descriptors could be used as a pictorial identification tool (Fig. 2).  more » « less
Award ID(s):
1639601
NSF-PAR ID:
10312517
Author(s) / Creator(s):
;
Date Published:
Journal Name:
Biodiversity Information Science and Standards
Volume:
5
ISSN:
2535-0897
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Large systematic revisionary projects incorporating data for hundreds or thousands of taxa require an integrative approach, with a strong biodiversity-informatics core for efficient data management to facilitate research on the group. Our original biodiversity informatics platform, 3i (Internet-accessible Interactive Identification) combined a customized MS Access database backend with ASP-based web interfaces to support revisionary syntheses of several large genera of leafhopers (Hemiptera: Auchenorrhyncha: Cicadellidae). More recently, for our National Science Foundation sponsored project, “GoLife: Collaborative Research: Integrative genealogy, ecology and phenomics of deltocephaline leafhoppers (Hemiptera: Cicadellidae), and their microbial associates”, we selected the new open-source platform TaxonWorks as the cyberinfrastructure. In the scope of the project, the original “3i World Auchenorrhyncha Database” was imported into TaxonWorks. At the present time, TaxonWorks has many tools to automatically import nomenclature, citations, and specimen based collection data. At the time of the initial migration of the 3i database, many of those tools were still under development, and complexity of the data in the database required a custom migration script, which is still probably the most efficient solution for importing datasets with long development history. At the moment, the World Auchenorrhyncha Database comprehensively covers nomenclature of the group and includes data on 70 valid families, 6,816 valid genera, 47,064 valid species as well as synonymy and subsequent combinations (Fig. 1). In addition, many taxon records include the original citation, bibliography, type information, etymology, etc. The bibliography of the group includes 37,579 sources, about 1/3 of which are associated with PDF files. Species have distribution records, either derived from individual specimens or as country and state level asserted distribution, as well as biological associations indicating host plants, predators, and parasitoids. Observation matrices in TaxonWorks are designed to handle morphological data associated with taxa or specimens. The matrices may be used to automatically generate interactive identification keys and taxon descriptions. They can also be downloaded to be imported, for example, into Lucid builder, or to perform phylogenetic analysis using an external application. At the moment there are 36 matrices associated with the project. The observation matrix from GoLife project covers 798 taxa by 210 descriptors (most of which are qualitative multi-state morphological descriptors) (Fig. 2). Illustrations are provided for 9,886 taxa and organized in the specialized image matrix and could be used as a pictorial key for determination of species and taxa of a higher rank. For the phylogenetic analysis, a dataset was constructed for 730 terminal taxa and >160,000 nucleotide positions obtained using anchored hybrid enrichment of genomic DNA for a sample of leafhoppers from the subfamily Deltocephalinae and outgroups. The probe kit targets leafhopper genes, as well as some bacterial genes (endosymbionts and plant pathogens transmitted by leafhoppers). The maximum likelihood analyses of concatenated nucleotide and amino acid sequences as well as coalescent gene tree analysis yielded well-resolved phylogenetic trees (Cao et al. 2022). Raw sequence data have been uploaded to the Sequence Read Archive on GenBank. Occurrence and morphological data, as well as diagnostic images, for voucher specimens have been incorporated into TaxonWorks. Data in TaxonWorks could be exported in raw format, get accessed via Application Programming Interface (API), or be shared with external data aggregators like Catalogue of Life, GBIF, iDigBio. 
    more » « less
  2. It takes great effort to manually or semi-automatically convert free-text phenotype narratives (e.g., morphological descriptions in taxonomic works) to a computable format before they can be used in large-scale analyses. We argue that neither a manual curation approach nor an information extraction approach based on machine learning is a sustainable solution to produce computable phenotypic data that are FAIR (Findable, Accessible, Interoperable, Reusable) (Wilkinson et al. 2016). This is because these approaches do not scale to all biodiversity, and they do not stop the publication of free-text phenotypes that would need post-publication curation. In addition, both manual and machine learning approaches face great challenges: the problem of inter-curator variation (curators interpret/convert a phenotype differently from each other) in manual curation, and keywords to ontology concept translation in automated information extraction, make it difficult for either approach to produce data that are truly FAIR. Our empirical studies show that inter-curator variation in translating phenotype characters to Entity-Quality statements (Mabee et al. 2007) is as high as 40% even within a single project. With this level of variation, curated data integrated from multiple curation projects may still not be FAIR. The key causes of this variation have been identified as semantic vagueness in original phenotype descriptions and difficulties in using standardized vocabularies (ontologies). We argue that the authors describing characters are the key to the solution. Given the right tools and appropriate attribution, the authors should be in charge of developing a project's semantics and ontology. This will speed up ontology development and improve the semantic clarity of the descriptions from the moment of publication. In this presentation, we will introduce the Platform for Author-Driven Computable Data and Ontology Production for Taxonomists, which consists of three components: a web-based, ontology-aware software application called 'Character Recorder,' which features a spreadsheet as the data entry platform and provides authors with the flexibility of using their preferred terminology in recording characters for a set of specimens (this application also facilitates semantic clarity and consistency across species descriptions); a set of services that produce RDF graph data, collects terms added by authors, detects potential conflicts between terms, dispatches conflicts to the third component and updates the ontology with resolutions; and an Android mobile application, 'Conflict Resolver,' which displays ontological conflicts and accepts solutions proposed by multiple experts. a web-based, ontology-aware software application called 'Character Recorder,' which features a spreadsheet as the data entry platform and provides authors with the flexibility of using their preferred terminology in recording characters for a set of specimens (this application also facilitates semantic clarity and consistency across species descriptions); a set of services that produce RDF graph data, collects terms added by authors, detects potential conflicts between terms, dispatches conflicts to the third component and updates the ontology with resolutions; and an Android mobile application, 'Conflict Resolver,' which displays ontological conflicts and accepts solutions proposed by multiple experts. Fig. 1 shows the system diagram of the platform. The presentation will consist of: a report on the findings from a recent survey of 90+ participants on the need for a tool like Character Recorder; a methods section that describes how we provide semantics to an existing vocabulary of quantitative characters through a set of properties that explain where and how a measurement (e.g., length of perigynium beak) is taken. We also report on how a custom color palette of RGB values obtained from real specimens or high-quality specimen images, can be used to help authors choose standardized color descriptions for plant specimens; and a software demonstration, where we show how Character Recorder and Conflict Resolver can work together to construct both human-readable descriptions and RDF graphs using morphological data derived from species in the plant genus Carex (sedges). The key difference of this system from other ontology-aware systems is that authors can directly add needed terms to the ontology as they wish and can update their data according to ontology updates. a report on the findings from a recent survey of 90+ participants on the need for a tool like Character Recorder; a methods section that describes how we provide semantics to an existing vocabulary of quantitative characters through a set of properties that explain where and how a measurement (e.g., length of perigynium beak) is taken. We also report on how a custom color palette of RGB values obtained from real specimens or high-quality specimen images, can be used to help authors choose standardized color descriptions for plant specimens; and a software demonstration, where we show how Character Recorder and Conflict Resolver can work together to construct both human-readable descriptions and RDF graphs using morphological data derived from species in the plant genus Carex (sedges). The key difference of this system from other ontology-aware systems is that authors can directly add needed terms to the ontology as they wish and can update their data according to ontology updates. The software modules currently incorporated in Character Recorder and Conflict Resolver have undergone formal usability studies. We are actively recruiting Carex experts to participate in a 3-day usability study of the entire system of the Platform for Author-Driven Computable Data and Ontology Production for Taxonomists. Participants will use the platform to record 100 characters about one Carex species. In addition to usability data, we will collect the terms that participants submit to the underlying ontology and the data related to conflict resolution. Such data allow us to examine the types and the quantities of logical conflicts that may result from the terms added by the users and to use Discrete Event Simulation models to understand if and how term additions and conflict resolutions converge. We look forward to a discussion on how the tools (Character Recorder is online at http://shark.sbs.arizona.edu/chrecorder/public) described in our presentation can contribute to producing and publishing FAIR data in taxonomic studies. 
    more » « less
  3. Taxonomic treatments start with the creation of taxon-by-character matrices. Systematics authors recognized data ambiguity issues in published phenotypic characters and are willing to adopt an ontology-aware authoring tool (Cui et al. 2022). To promote interoperable and reusable taxonomic treatments, we have developed two research prototypes: a web-based application, Character Recorder (http://chrecorder.lusites.xyz/login), to faciliate the use and addition of ontology terms by Carex systematist authors while building their matrices, and a mobile application, Conflict Resolver (Android, https://tinyurl.com/5cfatrz8), to identify potential conflicts among the terms added by the authors and facilitate the resolution of the conflicts. We have completed two usability studies on Character Recorder. a web-based application, Character Recorder (http://chrecorder.lusites.xyz/login), to faciliate the use and addition of ontology terms by Carex systematist authors while building their matrices, and a mobile application, Conflict Resolver (Android, https://tinyurl.com/5cfatrz8), to identify potential conflicts among the terms added by the authors and facilitate the resolution of the conflicts. We have completed two usability studies on Character Recorder. In the one-hour Student Usabiilty Study, 16 third-year biology students with a general introduction to Carex used Character Recorder and Excel to record a set of 11 given characters for two samples (shape of sheath summits = U-shaped/U shaped). In the three-day Expert Usability Study, 7 established Carex systematists and 1 graduate student with expert-level knowledge used Character Recorder to record characters for 1 sample each of Carex canesens and Carex rostrata as they would in their professional life, using real mounted specimens, microscope, reticles, and rulers. Experts activities were not timed but they spent roughly 1.5 days on recording the characters and the rest of time discussing features and improvements. Features of Character Recorder have been reported in 2021 TDWG meeting and we included here only a few figures to highlight its interoperability and reusability features at the time of the usability studies (Fig. 1, Fig. 2, and Fig. 3). The Carex Ontology accompanying Character Recorder was created by extracting terms from Carex treatments of Flora of China and Flora of North America using Explorer of Taxon Concept (Cui et al. 2016) with subsequent manual edits. The design principle of Character Recorder is to encourage standardization and also leave the authors the freedom to do their work. While it took students an average of 6 minutes to recover all the given characters using Microsoft® Excel®, as opposed to 11 minutes using Character Recorder, the total number of unique meaning-bearing words used in their characters was 116 with Excel versus 30 with Character Recorder, showing the power of the latter in reducing synonyms and spelling variations. All students reported that they learned to use Character Recorder quickly and some even thought their use was as fast or faster than using Excel. All preferred Character Recorder to Excel for teaching students to record character data. Nearly all of the students found Character Recorder was more useful for recording clear and consistent data and all students agreed that participating in this study raised their awareness of data variation issues. The expert group consisted of 3, 2, 1, 3 experts in age ranges 20-49, 50-59, 60-69, and >69, respectively. They each recorded over 100 characters for two or more samples. Detailed analysis of their characters is pending, but we have noticed color characters have more variations than other characters (Fig. 4). All experts reported that they learned to use Character Recorder quickly, and 6 out of 8 believed they would not need a tutorial the next time they used it. One out of 8 experts somewhat disliked the feature of reusing others' values ("Use This" in Fig. 2) as it may undermine the objectivity and independence of an author. All experts used Recommended Set of Characters and they liked the term suggestion and illustration features shown in Figs 2, 3. All experts would recommend that their colleagues try Character Recorder and recommended that it be further developed and integrated into every taxonomist's toolbox. Student and expert responses to the National Aeronautics and Space Administration Task Load Index (NASA-TLX, Hart and Staveland 1988) are summarized in Fig. 5, which suggests that, while Character Recorder may incur in a slightly higher cost, the performance it supports outweighs its cost, especially for students. Every piece of the software prototypes and associated resources are open for anyone to access or further develop. We thank all student and expert participants and US National Science Foundation for their support in this research. We thank Harris & Harris and Presses de l'Université Laval for the permissions to use their phenotype illustrations in Character Recorder. 
    more » « less
  4. We are now over four decades into digitally managing the names of Earth's species. As the number of federating (i.e., software that brings together previously disparate projects under a common infrastructure, for example TaxonWorks) and aggregating (e.g., International Plant Name Index, Catalog of Life (CoL)) efforts increase, there remains an unmet need for both the migration forward of old data, and for the production of new, precise and comprehensive nomenclatural catalogs. Given this context, we provide an overview of how TaxonWorks seeks to contribute to this effort, and where it might evolve in the future. In TaxonWorks, when we talk about governed names and relationships, we mean it in the sense of existing international codes of nomenclature (e.g., the International Code of Zoological Nomenclature (ICZN)). More technically, nomenclature is defined as a set of objective assertions that describe the relationships between the names given to biological taxa and the rules that determine how those names are governed. It is critical to note that this is not the same thing as the relationship between a name and a biological entity, but rather nomenclature in TaxonWorks represents the details of the (governed) relationships between names. Rather than thinking of nomenclature as changing (a verb commonly used to express frustration with biological nomenclature), it is useful to think of nomenclature as a set of data points, which grows over time. For example, when synonymy happens, we do not erase the past, but rather record a new context for the name(s) in question. The biological concept changes, but the nomenclature (names) simply keeps adding up. Behind the scenes, nomenclature in TaxonWorks is represented by a set of nodes and edges, i.e., a mathematical graph, or network (e.g., Fig. 1). Most names (i.e., nodes in the network) are what TaxonWorks calls "protonyms," monomial epithets that are used to construct, for example, bionomial names (not to be confused with "protonym" sensu the ICZN). Protonyms are linked to other protonyms via relationships defined in NOMEN, an ontology that encodes governed rules of nomenclature. Within the system, all data, nodes and edges, can be cited, i.e., linked to a source and therefore anchored in time and tied to authorship, and annotated with a variety of annotation types (e.g., notes, confidence levels, tags). The actual building of the graphs is greatly simplified by multiple user-interfaces that allow scientists to review (e.g. Fig. 2), create, filter, and add to (again, not "change") the nomenclatural history. As in any complex knowledge-representation model, there are outlying scenarios, or edge cases that emerge, making certain human tasks more complex than others. TaxonWorks is no exception, it has limitations in terms of what and how some things can be represented. While many complex representations are hidden by simplified user-interfaces, some, for example, the handling of the ICZN's Family-group name, batch-loading of invalid relationships, and comparative syncing against external resources need more work to simplify the processes presently required to meet catalogers' needs. The depth at which TaxonWorks can capture nomenclature is only really valuable if it can be used by others. This is facilitated by the application programming interface (API) serving its data (https://api.taxonworks.org), serving text files, and by exports to standards like the emerging Catalog of Life Data Package. With reference to real-world problems, we illustrate different ways in which the API can be used, for example, as integrated into spreadsheets, through the use of command line scripts, and serve in the generation of public-facing websites. Behind all this effort are an increasing number of people recording help videos, developing documentation, and troubleshooting software and technical issues. Major contributions have come from developers at many skill levels, from high school to senior software engineers, illustrating that TaxonWorks leads in enabling both technical and domain-based contributions. The health and growth of this community is a key factor in TaxonWork's potential long-term impact in the effort to unify the names of Earth's species. 
    more » « less
  5. TaxonWorks (http://taxonworks.org) in an integrated, open-source, cybertaxonomic web application serving taxonomists and biodiversity scientists. It is designed to facilitate efficient data capture, storage, manipulation, and retrieval. It integrates a wide variety of data types used by biodiversity scientists, including, but not limited to, taxonomy (with validation based on codes of zoological, botanical, bacterial, and viral nomenclature), specimen data, bibliographies, media (images, PDFs, sounds, videos), morphology (character/trait matrices), distribution, biological associations. Available TaxonWorks web interfaces currently provide various data entry forms for simple and advanced querying of the database. TaxonWorks has integrated batch uploader functionality. But, for larger datasets, specialized migration scripts were used. Several projects, historically build in 3i (http://dmitriev.speciesfile.org), MX (http://mx.phenomix.org), SpeciesFiles (http://software.speciesfile.org), and other databases, have been or are being migrated into TaxonWorks. Of the projects moving into TaxonWorks, it is worth mentioning several: 3i World Auchenorrhyncha Database, LepIndex, Universal Chalcidoidea Database, Orthoptera SpeciesFile, Plecoptera SpeciesFile, Illinois Natural History Survey Insect Collection database, and several others. An experience of the data migration will be shared during the presentation. 
    more » « less