Title: PropBank goes Public: Incorporation into Wikidata
This paper presents the first integration of PropBank role information into Wikidata, providing a novel resource for information extraction that combines Wikidata's ontological metadata with PropBank's rich argument structure encoding for event classes. We discuss a technique for augmenting existing eventive Wikidata items with PropBank information, as well as the identification of gaps in Wikidata's coverage based on manual examination of over 11,300 PropBank rolesets. We propose five new Wikidata properties to integrate PropBank structure into Wikidata so that the annotated mappings can be added en masse. We then outline the methodology and challenges of this integration, including annotation with the combined resources.
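As a rough, illustrative sketch of the Wikidata side of this integration (not the paper's own pipeline; the five proposed properties are not named in this record), the following Python snippet queries the public Wikidata Query Service for eventive classes, i.e. subclasses of occurrence (Q1190554), to which PropBank roleset mappings could be attached. Any roleset mentioned in the comments is a hypothetical placeholder.

# Minimal sketch: list eventive Wikidata classes that could be candidates
# for PropBank roleset mappings, using the public Wikidata Query Service.
# Q1190554 ("occurrence") is Wikidata's general event class; the roleset
# mentioned below is a hypothetical placeholder.
import requests

SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"

QUERY = """
SELECT ?item ?itemLabel WHERE {
  ?item wdt:P279* wd:Q1190554 .          # subclass of occurrence (event)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 20
"""

def eventive_items():
    """Yield (QID, label) pairs for a small sample of eventive classes."""
    resp = requests.get(
        SPARQL_ENDPOINT,
        params={"query": QUERY, "format": "json"},
        headers={"User-Agent": "propbank-wikidata-sketch/0.1"},
        timeout=30,
    )
    resp.raise_for_status()
    for row in resp.json()["results"]["bindings"]:
        qid = row["item"]["value"].rsplit("/", 1)[-1]
        yield qid, row["itemLabel"]["value"]

if __name__ == "__main__":
    for qid, label in eventive_items():
        # A mapping step would attach a PropBank roleset (e.g. "buy.01")
        # to a relevant item via one of the proposed properties.
        print(qid, label)

In practice, adding the annotated mappings en masse would go through the Wikidata editing API or a bot framework rather than the query service, which is read-only.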
Award ID(s):
2019805
PAR ID:
10586899
Author(s) / Creator(s):
Editor(s):
Henning, S; Stede, M
Publisher / Repository:
Association for Computational Linguistics
Date Published:
Format(s):
Medium: X
Location:
St. Julians, Malta
Sponsoring Org:
National Science Foundation
More Like this
  1. Vivi Nastase; Ellie Pavlick; Mohammad Taher Pilehvar; Jose Camacho-Collados; Alessandro Raganato (Ed.)
    This paper describes the evolution of the PropBank approach to semantic role labeling over the last two decades. During this time, the PropBank frame files have been expanded to include non-verbal predicates such as adjectives, prepositions, and multi-word expressions. The number of domains, genres, and languages that have been PropBanked has also expanded greatly, creating an opportunity for much more challenging and robust testing of the generalization capabilities of PropBank semantic role labeling systems. We also describe the substantial effort that has gone into ensuring the consistency and reliability of the various annotated datasets and resources, to better support the training and evaluation of such systems.
  2. Cyber Threat Intelligence (CTI) is information describing threat vectors, vulnerabilities, and attacks, and it is often used as training data for AI-based cyber defense systems such as Cybersecurity Knowledge Graphs (CKG). There is a strong need to develop community-accessible datasets to train existing AI-based cybersecurity pipelines to efficiently and accurately extract meaningful insights from CTI. We have created an initial unstructured CTI corpus from a variety of open sources that we are using to train and test cybersecurity entity models with the spaCy framework, and we are exploring self-learning methods to automatically recognize cybersecurity entities. We also describe methods for linking cybersecurity domain entities to existing world knowledge in Wikidata. Our future work will survey and test spaCy NLP tools and create methods for the continuous integration of new information extracted from text. (A minimal spaCy training sketch in this spirit appears after this list.)
  3. Wikidata is a publicly available, crowdsourced knowledge base that contains interlinked concepts structured for use by intelligent systems. While Wikidata has experienced rapid growth, it is far from complete and faces challenges that prevent it from being used to its full potential. In this paper, we propose a novel method for improving Wikidata by engaging undergraduate students to contribute previously missing knowledge via concept mapping assignments. Rather than allow students to edit Wikidata directly, we describe a workflow in which knowledge is constructed by students and then reviewed by an expert. We present a case study in which we deployed this workflow in a large undergraduate course about sustainability, and we find that it contributed a substantial number of high-quality statements that persisted in Wikidata and supplied previously missing knowledge to it. This work provides a preliminary workflow for improving Wikidata based on classroom assignments, as well as recommendations for how future educational projects could continue to improve Wikidata or other public knowledge bases.
  4. Structured data peer production (SDPP) platforms like Wikidata play an important role in knowledge production. Compared to traditional peer production platforms like Wikipedia, Wikidata data is more structured and intended to be used by machines, not (directly) by people; end-user interactions with Wikidata often happen through intermediary "invisible machines." Given this distinction, we wanted to understand Wikidata contributor motivations and how they are affected by usage invisibility caused by the machine intermediaries. Through an inductive thematic analysis of 15 interviews, we find that: (i) Wikidata editors take on two archetypes: Architects, who define the ontological infrastructure of Wikidata, and Masons, who build the database through data entry and editing; (ii) the structured nature of Wikidata reveals novel editor motivations, such as an innate drive for organizational work; (iii) most Wikidata editors have little understanding of how their contributions are used, which may demotivate some. We synthesize these insights to help guide the future design of SDPP platforms in supporting the engagement of different types of editors.
  5. Bonial, Claire; Bonn, Julia; Hwang, Jena D (Ed.)
    We evaluate the ability of large language models (LLMs) to provide PropBank semantic role label annotations across different realizations of the same verbs in transitive, intransitive, and middle voice constructions. In order to assess the meta-linguistic capabilities of LLMs, as well as their ability to glean such capabilities through in-context learning, we evaluate the models in a zero-shot setting, in a setting where the model is given three examples of another verb used in transitive, intransitive, and middle voice constructions, and finally in a setting where it is given those examples as well as the correct sense and roleset information. We find that zero-shot knowledge of PropBank annotation is almost nonexistent. The largest model evaluated, GPT-4, achieves the best performance in the setting where it is given both the examples and the correct roleset in the prompt, demonstrating that larger models can acquire some meta-linguistic capabilities through in-context learning. However, even in this setting, which is simpler than the task faced by human PropBank annotators, the model achieves only 48% accuracy in marking numbered arguments correctly. To ensure transparency and reproducibility, we publicly release our dataset and model responses.
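As a minimal, illustrative sketch of how the zero-shot, few-shot, and roleset-informed prompt conditions described in item 5 might be assembled (the paper's actual instructions, example sentences, and roleset descriptions are not reproduced here, so every string below is a placeholder):

# Minimal sketch of assembling zero-shot and few-shot prompts for PropBank
# semantic role labeling, in the spirit of item 5. The instruction wording,
# the example sentences, and the roleset hint are illustrative placeholders,
# not the prompts used in the paper.
from dataclasses import dataclass

@dataclass
class Example:
    sentence: str
    annotation: str  # gold PropBank-style labeling shown to the model

INSTRUCTION = (
    "Label the PropBank numbered arguments (ARG0, ARG1, ...) of the target "
    "verb in the sentence below."
)

FEW_SHOT_EXAMPLES = [
    Example("The chef broke the egg.",          # transitive
            "break.01: ARG0 = The chef, ARG1 = the egg"),
    Example("The egg broke.",                   # intransitive
            "break.01: ARG1 = The egg"),
    Example("Eggs break easily.",               # middle voice
            "break.01: ARG1 = Eggs"),
]

def build_prompt(sentence: str, *, examples=None, roleset_hint: str = "") -> str:
    """Compose a prompt: zero-shot, few-shot, or few-shot plus roleset info."""
    parts = [INSTRUCTION]
    for ex in examples or []:
        parts.append(f"Sentence: {ex.sentence}\nAnnotation: {ex.annotation}")
    if roleset_hint:
        parts.append(f"Roleset information: {roleset_hint}")
    parts.append(f"Sentence: {sentence}\nAnnotation:")
    return "\n\n".join(parts)

if __name__ == "__main__":
    target = "The window shattered."
    print(build_prompt(target))                              # zero-shot
    print(build_prompt(target, examples=FEW_SHOT_EXAMPLES))  # few-shot
    print(build_prompt(target,
                       examples=FEW_SHOT_EXAMPLES,
                       roleset_hint="shatter.01: ARG0 = agent, ARG1 = thing shattered"))

Building all three conditions from one helper keeps the only difference between settings the presence of the examples and the roleset information, which mirrors the comparison the abstract describes.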
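For item 2, the following sketch shows one way a custom cybersecurity named-entity recognizer could be trained with spaCy; it is not the project's pipeline. The entity labels, the single annotated sentence, and its character offsets are illustrative placeholders, a real system would train on the full CTI corpus, and linking recognized entities to Wikidata would be a separate downstream step.

# Minimal sketch: training a custom cybersecurity NER component with spaCy 3.
# The labels, the single training sentence, and its offsets are placeholders.
import random
import spacy
from spacy.training import Example

TRAIN_DATA = [
    ("APT29 exploited CVE-2021-44228 to deploy Cobalt Strike.",
     {"entities": [(0, 5, "THREAT_ACTOR"),
                   (16, 30, "VULNERABILITY"),
                   (41, 54, "MALWARE")]}),
]

def train_ner(iterations: int = 20):
    nlp = spacy.blank("en")                      # start from an empty English pipeline
    ner = nlp.add_pipe("ner")
    for _, annotations in TRAIN_DATA:
        for _, _, label in annotations["entities"]:
            ner.add_label(label)

    optimizer = nlp.initialize()
    for _ in range(iterations):
        random.shuffle(TRAIN_DATA)
        losses = {}
        for text, annotations in TRAIN_DATA:
            example = Example.from_dict(nlp.make_doc(text), annotations)
            nlp.update([example], sgd=optimizer, losses=losses)
    return nlp

if __name__ == "__main__":
    nlp = train_ner()
    doc = nlp("The actor APT29 weaponized CVE-2021-44228.")
    print([(ent.text, ent.label_) for ent in doc.ents])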