Title: Author-Driven Computable Data and Ontology Production for Taxonomists
It takes great effort to manually or semi-automatically convert free-text phenotype narratives (e.g., morphological descriptions in taxonomic works) to a computable format before they can be used in large-scale analyses. We argue that neither a manual curation approach nor an information extraction approach based on machine learning is a sustainable solution to produce computable phenotypic data that are FAIR (Findable, Accessible, Interoperable, Reusable) (Wilkinson et al. 2016). This is because these approaches do not scale to all biodiversity, and they do not stop the publication of free-text phenotypes that would need post-publication curation. In addition, each approach faces a major challenge of its own: inter-curator variation (curators interpret/convert a phenotype differently from each other) in manual curation, and keyword-to-ontology-concept translation in automated information extraction. These challenges make it difficult for either approach to produce data that are truly FAIR. Our empirical studies show that inter-curator variation in translating phenotype characters to Entity-Quality statements (Mabee et al. 2007) is as high as 40% even within a single project. With this level of variation, curated data integrated from multiple curation projects may still not be FAIR. The key causes of this variation have been identified as semantic vagueness in original phenotype descriptions and difficulties in using standardized vocabularies (ontologies). We argue that the authors describing characters are the key to the solution. Given the right tools and appropriate attribution, the authors should be in charge of developing a project's semantics and ontology. This will speed up ontology development and improve the semantic clarity of the descriptions from the moment of publication.
In this presentation, we will introduce the Platform for Author-Driven Computable Data and Ontology Production for Taxonomists, which consists of three components: a web-based, ontology-aware software application called 'Character Recorder,' which features a spreadsheet as the data entry platform and gives authors the flexibility to use their preferred terminology when recording characters for a set of specimens (this application also facilitates semantic clarity and consistency across species descriptions); a set of services that produce RDF graph data, collect terms added by authors, detect potential conflicts between terms, dispatch conflicts to the third component, and update the ontology with resolutions; and an Android mobile application, 'Conflict Resolver,' which displays ontological conflicts and accepts solutions proposed by multiple experts. Fig. 1 shows the system diagram of the platform.
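The service layer described above can be illustrated with a minimal Python sketch. It is not the platform's implementation: the namespace `http://example.org/carex#`, the function names, and the rule that a conflict arises when two authors place the same term under different parent classes are all assumptions made for illustration, since the abstract does not specify how conflicts are detected.

```python
from collections import defaultdict

# Hypothetical namespace; the platform's real IRIs are not given in the abstract.
BASE = "http://example.org/carex#"


def iri(local_name: str) -> str:
    """Build an IRI reference from a human-readable label (spaces become underscores)."""
    return f"<{BASE}{local_name.replace(' ', '_')}>"


def to_ntriples(specimen: str, character: str, value: str) -> str:
    """Serialize one recorded character as a single N-Triples statement."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'{iri(specimen)} {iri(character)} "{escaped}" .'


def find_conflicts(term_additions):
    """Return terms that different authors placed under different parent classes.

    This conflict rule is an assumption for illustration only.
    """
    parents = defaultdict(set)
    for term, parent, _author in term_additions:
        parents[term.lower()].add(parent.lower())
    return {term: sorted(ps) for term, ps in parents.items() if len(ps) > 1}


triple = to_ntriples("specimen 1", "length of perigynium beak", "1.2 mm")

# Illustrative term additions: (term, proposed parent class, author)
additions = [
    ("U-shaped", "shape", "author_a"),
    ("U-shaped", "arrangement", "author_b"),
    ("reddish brown", "coloration", "author_a"),
]
conflicts = find_conflicts(additions)

print(triple)
print(conflicts)
```

In this sketch, the conflicting entries returned by `find_conflicts` stand in for what would be dispatched to Conflict Resolver for expert resolution.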
The presentation will consist of: a report on the findings from a recent survey of 90+ participants on the need for a tool like Character Recorder; a methods section that describes how we provide semantics to an existing vocabulary of quantitative characters through a set of properties that explain where and how a measurement (e.g., length of perigynium beak) is taken, and how a custom color palette of RGB values, obtained from real specimens or high-quality specimen images, can be used to help authors choose standardized color descriptions for plant specimens; and a software demonstration, where we show how Character Recorder and Conflict Resolver can work together to construct both human-readable descriptions and RDF graphs using morphological data derived from species in the plant genus Carex (sedges).
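One way a palette of RGB values could support standardized color descriptions is a nearest-match lookup, sketched below in Python. The palette names and RGB values are invented for illustration and are not taken from the project, and plain Euclidean distance in RGB space is a simplifying assumption; a perceptual color space may match human judgment better.

```python
import math

# Hypothetical palette: names and RGB values are illustrative only,
# not the palette derived from real Carex specimens.
PALETTE = {
    "straw": (228, 217, 111),
    "green": (76, 145, 65),
    "reddish brown": (125, 64, 49),
    "dark brown": (66, 40, 28),
}


def nearest_color(rgb, palette=PALETTE):
    """Return the palette name whose RGB value is closest to the sampled color."""
    return min(palette, key=lambda name: math.dist(rgb, palette[name]))


# A color sampled from a specimen image is mapped to a standardized name.
print(nearest_color((120, 70, 50)))
```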
The key difference of this system from other ontology-aware systems is that authors can directly add needed terms to the ontology as they wish and can update their data according to ontology updates. The software modules currently incorporated in Character Recorder and Conflict Resolver have undergone formal usability studies. We are actively recruiting Carex experts to participate in a 3-day usability study of the entire Platform for Author-Driven Computable Data and Ontology Production for Taxonomists. Participants will use the platform to record 100 characters about one Carex species. In addition to usability data, we will collect the terms that participants submit to the underlying ontology and the data related to conflict resolution. Such data allow us to examine the types and quantities of logical conflicts that may result from the terms added by the users and to use Discrete Event Simulation models to understand if and how term additions and conflict resolutions converge. We look forward to a discussion on how the tools described in our presentation can contribute to producing and publishing FAIR data in taxonomic studies. Character Recorder is online at http://shark.sbs.arizona.edu/chrecorder/public.
Award ID(s):
1661485
NSF-PAR ID:
10339238
Journal Name:
Biodiversity Information Science and Standards
Volume:
5
ISSN:
2535-0897
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Taxonomic treatments start with the creation of taxon-by-character matrices. Systematics authors have recognized data ambiguity issues in published phenotypic characters and are willing to adopt an ontology-aware authoring tool (Cui et al. 2022). To promote interoperable and reusable taxonomic treatments, we have developed two research prototypes: a web-based application, Character Recorder (http://chrecorder.lusites.xyz/login), to facilitate the use and addition of ontology terms by Carex systematist authors while building their matrices, and a mobile application, Conflict Resolver (Android, https://tinyurl.com/5cfatrz8), to identify potential conflicts among the terms added by the authors and facilitate the resolution of the conflicts. We have completed two usability studies on Character Recorder. In the one-hour Student Usability Study, 16 third-year biology students with a general introduction to Carex used Character Recorder and Excel to record a set of 11 given characters for two samples (shape of sheath summits = U-shaped/U shaped). In the three-day Expert Usability Study, 7 established Carex systematists and 1 graduate student with expert-level knowledge used Character Recorder to record characters for 1 sample each of Carex canescens and Carex rostrata as they would in their professional life, using real mounted specimens, microscope, reticles, and rulers. The experts' activities were not timed, but they spent roughly 1.5 days recording the characters and the rest of the time discussing features and improvements.
Features of Character Recorder were reported at the 2021 TDWG meeting, and we include here only a few figures to highlight its interoperability and reusability features at the time of the usability studies (Fig. 1, Fig. 2, and Fig. 3). The Carex Ontology accompanying Character Recorder was created by extracting terms from Carex treatments of Flora of China and Flora of North America using Explorer of Taxon Concepts (Cui et al. 2016) with subsequent manual edits. The design principle of Character Recorder is to encourage standardization while leaving authors the freedom to do their work. While it took students an average of 6 minutes to record all the given characters using Microsoft® Excel®, as opposed to 11 minutes using Character Recorder, the total number of unique meaning-bearing words used in their characters was 116 with Excel versus 30 with Character Recorder, showing the power of the latter in reducing synonyms and spelling variations. All students reported that they learned to use Character Recorder quickly, and some even thought their use was as fast as or faster than using Excel. All preferred Character Recorder to Excel for teaching students to record character data. Nearly all of the students found Character Recorder more useful for recording clear and consistent data, and all students agreed that participating in this study raised their awareness of data variation issues. The expert group consisted of 3, 2, 1, 3 experts in age ranges 20-49, 50-59, 60-69, and >69, respectively. They each recorded over 100 characters for two or more samples. Detailed analysis of their characters is pending, but we have noticed that color characters have more variations than other characters (Fig. 4). All experts reported that they learned to use Character Recorder quickly, and 6 out of 8 believed they would not need a tutorial the next time they used it. One out of 8 experts somewhat disliked the feature of reusing others' values ("Use This" in Fig. 2) as it may undermine the objectivity and independence of an author. All experts used the Recommended Set of Characters, and they liked the term suggestion and illustration features shown in Figs. 2 and 3. All experts would recommend that their colleagues try Character Recorder and recommended that it be further developed and integrated into every taxonomist's toolbox. Student and expert responses to the National Aeronautics and Space Administration Task Load Index (NASA-TLX, Hart and Staveland 1988) are summarized in Fig. 5, which suggests that, while Character Recorder may incur a slightly higher cost, the performance it supports outweighs its cost, especially for students. Every piece of the software prototypes and associated resources is open for anyone to access or further develop. We thank all student and expert participants and the US National Science Foundation for their support in this research. We thank Harris & Harris and Presses de l'Université Laval for the permissions to use their phenotype illustrations in Character Recorder.
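The comparison of unique meaning-bearing words can be made concrete with a small Python sketch. The abstract does not say how the words were counted, so the tokenization rule, the stopword list, and the sample values below are assumptions for illustration only.

```python
import re

# Assumed stopword list; the study's actual criteria for
# "meaning-bearing" words are not given in the abstract.
STOPWORDS = {"of", "the", "a", "an", "and", "or", "to", "in", "with"}


def meaning_bearing_words(values):
    """Collect the distinct non-stopword tokens used across recorded values."""
    words = set()
    for value in values:
        for token in re.findall(r"[a-z]+", value.lower()):
            if token not in STOPWORDS:
                words.add(token)
    return words


# Illustrative values for one character, entered freely versus chosen from a term list.
free_text = ["U-shaped", "U shaped", "u-shape", "shaped like a U"]
controlled = ["U-shaped", "U-shaped", "U-shaped", "U-shaped"]

print(len(meaning_bearing_words(free_text)))
print(len(meaning_bearing_words(controlled)))
```

A smaller count for the controlled entries reflects fewer synonyms and spelling variants for the same meaning.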
  2. Phenotypes are used for a multitude of purposes such as defining species, reconstructing phylogenies, diagnosing diseases or improving crop and animal productivity, but most of this phenotypic data is published in free-text narratives that are not computable. This means that the complex relationship between the genome, the environment and phenotypes is largely inaccessible to analysis and important questions related to the evolution of organisms, their diseases or their response to climate change cannot be fully addressed. It takes great effort to manually convert free-text narratives to a computable format before they can be used in large-scale analyses. We argue that this manual curation approach is not a sustainable solution to produce computable phenotypic data for three reasons: 1) it does not scale to all of biodiversity; 2) it does not stop the publication of free-text phenotypes that will continue to need manual curation in the future; and, most importantly, 3) it does not solve the problem of inter-curator variation (curators interpret/convert a phenotype differently from each other). Our empirical studies have shown that inter-curator variation is as high as 40% even within a single project. With this level of variation, it is difficult to imagine that data integrated from multiple curation projects can be of high quality. The key causes of this variation have been identified as semantic vagueness in original phenotype descriptions and difficulties in using standardised vocabularies (ontologies). We argue that the authors describing phenotypes are the key to the solution. Given the right tools and appropriate attribution, the authors should be in charge of developing a project’s semantics and ontology. This will speed up ontology development and improve the semantic clarity of phenotype descriptions from the moment of publication. A proof of concept project on this idea was funded by NSF ABI in July 2017.
We seek readers' input on or critique of the proposed approaches to help achieve community-based computable phenotype data production in the near future. Results from this project will be accessible through https://biosemantics.github.io/author-driven-production.
  3. Background: When phenotypic characters are described in the literature, they may be constrained or clarified with additional information such as the location or degree of expression; these terms are called “modifiers”. With efforts underway to convert narrative character descriptions to computable data, ontologies for such modifiers are needed. Such ontologies can also be used to guide term usage in future publications. Spatial and method modifiers are the subjects of ontologies that already have been developed or are under development. In this work, frequency (e.g., rarely, usually), certainty (e.g., probably, definitely), degree (e.g., slightly, extremely), and coverage modifiers (e.g., sparsely, entirely) are collected, reviewed, and used to create two modifier ontologies with different design considerations. The basic goal is to express the sequential relationships within a type of modifier, for example, usually is more frequent than rarely, in order to allow data annotated with ontology terms to be classified accordingly. Method: Two designs are proposed for the ontology, both using the list pattern: a closed ordered list (i.e., five-bin design) and an open ordered list design. The five-bin design puts the modifier terms into a set of 5 fixed bins with interval object properties, for example, one_level_more/less_frequently_than, where new terms can only be added as synonyms to existing classes. The open list approach starts with 5 bins, but supports the extensibility of the list via ordinal properties, for example, more/less_frequently_than, allowing new terms to be inserted as a new class anywhere in the list. The consequences of the different design decisions are discussed in the paper. CharaParser was used to extract modifiers from plant, ant, and other taxonomic descriptions. After a manual screening, 130 modifier words were selected as the candidate terms for the modifier ontologies.
Four curators/experts (three biologists and one information scientist specialized in biosemantics) reviewed and categorized the terms into 20 bins using the Ontology Term Organizer (OTO) (http://biosemantics.arizona.edu/OTO). Inter-curator variations were reviewed and expressed in the final ontologies. Results: Frequency, certainty, degree, and coverage terms with complete agreement among all curators were used as class labels or exact synonyms. Terms with different interpretations were either excluded or included using “broader synonym” or “not recommended” annotation properties. These annotations explicitly make the user aware of the semantic ambiguity associated with the terms and whether they should be used with caution or avoided. Expert categorization results showed that 16 out of 20 bins contained terms with full agreement, suggesting that differentiating the modifiers into 5 levels/bins balances the need to differentiate modifiers and the need for the ontology to reflect user consensus. Two ontologies, developed using the Protege ontology editor, are made available as OWL files and can be downloaded from https://github.com/biosemantics/ontologies. Contribution: We built the first two modifier ontologies following a consensus-based approach with terms commonly used in taxonomic literature. The five-bin ontology has been used in the Explorer of Taxon Concepts web toolkit to compute the similarity between characters extracted from literature to facilitate taxon concept alignment. The two ontologies will also be used in an ontology-informed authoring tool for taxonomists to facilitate consistency in modifier term usage.
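The difference between the closed five-bin design and the open ordered list can be sketched with plain Python lists. This is an analogy rather than the OWL modeling used in the ontologies: the class names, the five frequency labels, and the placement of the added terms are illustrative assumptions, not content taken from the published ontologies.

```python
class FiveBinList:
    """Closed ordered list: the bins are fixed, so new terms join only as synonyms."""

    def __init__(self, labels):
        # Position in the list encodes the order (index 0 = least frequent).
        self.bins = [[label] for label in labels]

    def rank(self, term):
        """Return the position of the bin that holds the term."""
        for index, members in enumerate(self.bins):
            if term in members:
                return index
        raise KeyError(term)

    def add_synonym(self, term, existing):
        """Attach a new term to the bin of an existing term."""
        self.bins[self.rank(existing)].append(term)


class OpenOrderedList(FiveBinList):
    """Open ordered list: a new term may also become a new class anywhere in the order."""

    def insert_after(self, term, existing):
        self.bins.insert(self.rank(existing) + 1, [term])


# Hypothetical frequency labels, ordered from least to most frequent.
LABELS = ["never", "rarely", "sometimes", "usually", "always"]

closed = FiveBinList(LABELS)
closed.add_synonym("seldom", "rarely")  # illustrative synonym placement

open_list = OpenOrderedList(LABELS)
open_list.insert_after("often", "sometimes")  # illustrative new class

print(len(closed.bins), closed.rank("seldom"))
print(len(open_list.bins), open_list.rank("often"), open_list.rank("usually"))
```

In the closed design the number of bins never changes, so ranks stay comparable across datasets; in the open design, an insertion shifts the ranks of every later term, which is one consequence of the design choice that the two ontologies trade off.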
  4. Abstract

    Organismal anatomy is a hierarchical system of anatomical entities often imposing dependencies among multiple morphological characters. Ontologies provide a formal and computable framework for incorporating prior biological knowledge about anatomical dependencies in models of trait evolution. They also offer new opportunities for working with semantic representations of morphological data.

    In this work, we present a new R package, rphenoscate, that enables incorporating ontological knowledge in evolutionary analyses and exploring semantic patterns of morphological data. In conjunction with rphenoscape, it allows for assembling synthetic phylogenetic character matrices from semantic phenotypes of morphological data. We showcase the package functionality with data sets from bees and fishes.

    We demonstrate that ontologies can be employed to automatically set up evolutionary models accounting for trait dependencies in stochastic character mapping. We also demonstrate how ontology annotations can be explored to interrogate patterns of morphological evolution. Finally, we demonstrate that synthetic character matrices assembled from semantic phenotypes retain most of the phylogenetic information from their original data sets.

    Ontologies will become important tools for integrating anatomical knowledge into phylogenetic methods and making morphological data FAIR compliant—a critical step of the ongoing ‘phenomics’ revolution. Our new package offers key advancements towards this goal.

     
  5. Abstract Critical to answering large-scale questions in biology is the integration of knowledge from different disciplines into a coherent, computable whole. Controlled vocabularies such as ontologies represent a clear path toward this goal. Using survey questionnaires, we examined the attitudes of biologists toward adopting controlled vocabularies in phenotype publications. Our questions cover current experience and overall attitude with controlled vocabularies, the awareness of the issues around ambiguity and inconsistency in phenotype descriptions and post-publication professional data curation, the preferred solutions and the effort and desired rewards for adopting a new authoring workflow. Results suggest that although the existence of controlled vocabularies is widespread, their use is not common. A majority of respondents (74%) are frustrated with ambiguity in phenotypic descriptions, and there is a strong agreement (mean agreement score 4.21 out of 5) that author curation would better reflect the original meaning of phenotype data. Moreover, the vast majority (85%) of researchers would try a new authoring workflow if resultant data were more consistent and less ambiguous. Even more respondents (93%) suggested that they would try and possibly adopt a new authoring workflow if it required 5% additional effort as compared to normal, but higher rates resulted in a steep decline in likely adoption rates. Among the four different types of rewards, two types of citations were the most desired incentives for authors to produce computable data. Overall, our results suggest the adoption of a new authoring workflow would be accelerated by a user-friendly and efficient software-authoring tool, an increased awareness of the challenges text ambiguity creates for external curators and an elevated appreciation of the benefits of controlled vocabularies. 