Plant Science Knowledge Graph Corpus: a gold standard entity and relation corpus for the molecular plant sciences

Lotreck, Serena (ORCID:0000000172826272); Segura Abá, Kenia (ORCID:000000030329289X); Lehti-Shiu, Melissa D. (ORCID:0000000319852687); Seeger, Abigail (ORCID:0009000401496084); Brown, Brianna N. I. (ORCID:0000000226235583); Ranaweera, Thilanka (ORCID:0000000285664740); Schumacher, Ally (ORCID:0000000224131537); Ghassemi, Mohammad (ORCID:0000000151358588); Shiu, Shin-Han (ORCID:000000016470235X); Marshall-Colon, Amy (ed.)

doi:10.1093/insilicoplants/diad021

Abstract: Natural language processing (NLP) techniques can enhance our ability to interpret plant science literature. Many state-of-the-art algorithms for NLP tasks require high-quality labelled data in the target domain, in which entities like genes and proteins, as well as the relationships between entities, are labelled according to a set of annotation guidelines. While there exist such datasets for other domains, these resources need development in the plant sciences. Here, we present the Plant ScIenCe KnowLedgE Graph (PICKLE) corpus, a collection of 250 plant science abstracts annotated with entities and relations, along with its annotation guidelines. The annotation guidelines were refined by iterative rounds of overlapping annotations, in which inter-annotator agreement was leveraged to improve the guidelines. To demonstrate PICKLE’s utility, we evaluated the performance of pretrained models from other domains and trained a new, PICKLE-based model for entity and relation extraction (RE). The PICKLE-trained models exhibit the second-highest in-domain entity performance of all models evaluated, as well as an RE performance that is on par with other models. Additionally, we found that computer science-domain models outperformed models trained on a biomedical corpus (GENIA) in entity extraction, which was unexpected given the intuition that biomedical literature is more similar to PICKLE than computer science. Upon further exploration, we established that the inclusion of new types on which the models were not trained substantially impacts performance. The PICKLE corpus is, therefore, an important contribution to training resources for entity and RE in the plant sciences.
