

Title: ConfliBERT: A Pre-trained Language Model for Political Conflict and Violence
Analyzing conflicts and political violence around the world is a persistent challenge in the political science and policy communities due in large part to the vast volumes of specialized text needed to monitor conflict and violence on a global scale. To help advance research in political science, we introduce ConfliBERT, a domain-specific pre-trained language model for conflict and political violence. We first gather a large domain-specific text corpus for language modeling from various sources. We then build ConfliBERT using two approaches: pre-training from scratch and continual pre-training. To evaluate ConfliBERT, we collect 12 datasets and implement 18 tasks to assess the models’ practical application in conflict research. Finally, we evaluate several versions of ConfliBERT in multiple experiments. Results consistently show that ConfliBERT outperforms BERT when analyzing political violence and conflict.
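Both pre-training from scratch and continual pre-training rely on the masked-language-model (MLM) objective from the original BERT recipe. A minimal sketch of the standard BERT corruption step is below; the 15%/80-10-10 numbers are BERT's published defaults, not ConfliBERT's exact configuration, and the token list is an invented example.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=1):
    """BERT-style MLM corruption: each selected token is replaced by [MASK]
    80% of the time, by a random vocabulary token 10% of the time, and left
    unchanged 10% of the time; unselected tokens are not scored."""
    rng = random.Random(seed)
    vocab = sorted(set(tokens))
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)          # model must predict the original
            r = rng.random()
            if r < 0.8:
                corrupted.append(mask_token)
            elif r < 0.9:
                corrupted.append(rng.choice(vocab))
            else:
                corrupted.append(tok)
        else:
            labels.append(None)         # ignored by the MLM loss
            corrupted.append(tok)
    return corrupted, labels

tokens = "rebel forces attacked the village near the border".split()
corrupted, labels = mask_tokens(tokens)
```

Continual pre-training simply resumes this objective from BERT's released weights on the domain corpus, while pre-training from scratch also rebuilds the vocabulary from the conflict texts.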
Award ID(s):
1931541
NSF-PAR ID:
10470310
Publisher / Repository:
Association for Computational Linguistics
Page Range / eLocation ID:
5469 to 5482
Format(s):
Medium: X
Location:
Seattle, United States
Sponsoring Org:
National Science Foundation
More Like this
  1. Recent advances in natural language processing (NLP) and Big Data technologies have been crucial for scientists to analyze political unrest and violence, prevent harm, and promote global conflict management. Government agencies and public security organizations have invested heavily in deep learning-based applications to study global conflicts and political violence. However, such applications involving text classification, information extraction, and other NLP-related tasks require extensive human effort in annotating/labeling texts. While limited labeled data may drastically hurt the models’ performance (over-fitting), large demands on annotation tasks may render real-world applications impractical. To address this problem, we propose Confli-T5, a prompt-based method that leverages the domain knowledge from an existing political science ontology to generate synthetic but realistic labeled text samples in the conflict and mediation domain. Our model generates textual data from the ground up and employs our novel Double Random Sampling mechanism to improve the quality (coherency and consistency) of the generated samples. We conduct experiments over six standard datasets relevant to political science studies to show the superiority of Confli-T5. Our code is publicly available.
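The core of such prompt-based generation is conditioning a text-to-text model on an ontology label so it emits a synthetic training sentence for that label. The sketch below only illustrates that prompting idea; the template wording, the `ASSAULT` label, and the sampling scheme are hypothetical stand-ins, not Confli-T5's actual prompts or its Double Random Sampling mechanism.

```python
import random

# Hypothetical prompt template for conditioning a T5-style generator on a
# CAMEO-like event label plus sampled entities. A real pipeline would feed
# this string to a seq2seq model and keep (generated_text, event_label)
# as a synthetic labeled training pair.
EVENT_LABELS = ["PROTEST", "ASSAULT", "NEGOTIATE"]
ACTORS = ["government forces", "rebel groups", "protesters"]

def build_prompt(rng):
    label = rng.choice(EVENT_LABELS)          # first random draw: the label
    actor, target = rng.sample(ACTORS, 2)     # second draw: the entity pair
    prompt = (f"Generate a news sentence describing a '{label}' event "
              f"in which {actor} acts toward {target}:")
    return label, prompt

label, prompt = build_prompt(random.Random(0))
```

Pairing each generated sentence with the label that conditioned it is what yields labeled data without human annotation.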
  2. Extracting structured metadata from unstructured text in different domains is gaining strong attention from multiple research communities. In Political Science, these metadata play a significant role in studying intra- and inter-state interactions between political entities. The process of extracting such metadata usually relies on domain-specific ontologies and knowledge-based repositories. In particular, Political Scientists regularly use the well-defined ontology CAMEO, which is designed for capturing conflict and mediation relations. Since CAMEO repositories are currently human-maintained, the high cost and extensive human effort associated with updating them make it difficult to include new entries on a regular basis. This paper introduces HANKE: an innovative framework for automatically extracting knowledge representations from unstructured sources, in order to extend the CAMEO ontology both in its own domain and toward other related domains in political science. HANKE combines Hierarchical Attention Networks as the engine for identifying relevant structures in raw text with the novel Frequency-Based Ranker approach to obtain a collection of candidate entries for CAMEO's repositories. To show the efficiency of the proposed framework, we evaluate its performance on capturing existing CAMEO representations in a soft-labelled dataset. We also empirically demonstrate the versatility and superiority of the HANKE method by applying it to two case studies on extending CAMEO within its own domain and toward the organized-crime domain.
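The ranking stage of such a pipeline can be pictured as counting how often a candidate phrase recurs across documents and keeping the most frequent ones. This is only a toy illustration of the general idea; HANKE's actual Frequency-Based Ranker and the example phrases here are not from the paper.

```python
from collections import Counter

# Candidate phrases flagged by an upstream model, one list per document.
# Ranking by cross-document frequency (deduplicated within each document)
# surfaces the most repository-worthy candidates.
def rank_candidates(candidates_per_doc, top_k=3):
    counts = Counter(c for doc in candidates_per_doc for c in set(doc))
    return [phrase for phrase, _ in counts.most_common(top_k)]

docs = [
    ["drug cartel", "armed wing"],
    ["drug cartel", "peace accord"],
    ["drug cartel"],
]
top = rank_candidates(docs)
```

A human curator (or a downstream filter) would then review only this short ranked list instead of every extracted phrase.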
  3. Knowledge discovery and extraction approaches attract special attention across industries and areas moving toward the 5V Era. In the political and social sciences, scholars and governments dedicate considerable resources to developing intelligent systems for monitoring, analyzing, and predicting conflicts and affairs involving political entities across the globe. Such systems rely on background knowledge from external knowledge bases, which conflict experts commonly maintain manually. The high costs and extensive human effort associated with updating and extending these repositories often compromise their correctness. Here we introduce CoMe-KE (Conflict and Mediation Knowledge Extractor) to automatically extend knowledge bases about conflict and mediation events. We explore state-of-the-art natural language models to discover new political entities, their roles, and their status from news. We propose a distantly supervised method and an innovative zero-shot approach based on a dynamic hypothesis procedure. Our methods leverage pre-trained models through transfer learning techniques to obtain excellent results with no need for labeled data. Finally, we demonstrate the superiority of our method through a comprehensive set of experiments involving two case studies in the social sciences. CoMe-KE significantly outperforms the existing baseline, on average doubling its performance at retrieving new political entities.
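Zero-shot role classification with a hypothesis procedure typically casts the problem as natural language inference: each candidate role is rewritten as a hypothesis, and an entailment model scores it against the news sentence. The sketch below builds only the premise/hypothesis pairs; the role list and template are hypothetical illustrations, not CoMe-KE's actual dynamic hypotheses, and the NLI scoring model itself is omitted.

```python
# Candidate roles a political entity might play in a conflict/mediation
# event (illustrative set, not the paper's ontology).
ROLES = ["mediator", "aggressor", "victim"]

def build_nli_pairs(sentence, entity, roles=ROLES):
    """Turn each candidate role into an NLI hypothesis about the entity.
    An entailment model would score each (premise, hypothesis) pair and
    the highest-scoring role would be assigned -- no labeled data needed."""
    return [(sentence, f"{entity} is acting as a {role}.") for role in roles]

pairs = build_nli_pairs(
    "Norway hosted talks between the government and the rebels.", "Norway")
```

Because only the hypothesis text changes per role, new roles can be added by editing the role list, with no retraining.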
  4. Political scientists and security agencies increasingly rely on computerized event data generation to track conflict processes and violence around the world. However, most of these approaches rely on pattern-matching techniques constrained by large dictionaries that are too costly to develop, update, or expand to emerging domains or additional languages. In this paper, we provide an effective solution to those challenges. We develop the 3M-Transformers (Multilingual, Multi-label, Multitask) approach for event coding from domain-specific multilingual corpora, dispensing with large external repositories for this task and expanding the substantive focus of analysis to organized crime, an emerging concern for security research. Our results indicate that our 3M-Transformers configurations outperform standard state-of-the-art Transformer models (BERT and XLM-RoBERTa) for coding events on actors, actions, and locations in English, Spanish, and Portuguese.
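The multi-label part of such event coding reduces, at the data level, to multi-hot target vectors: one binary indicator per ontology entry, since a single news sentence can encode several event categories at once. The label set below is an invented illustration, not CAMEO's or the paper's.

```python
# Illustrative event-code label set (not the actual ontology).
LABELS = ["PROTEST", "ASSAULT", "NEGOTIATE", "ARREST"]

def to_multi_hot(event_labels, label_set=LABELS):
    """Encode a set of event codes as a multi-hot vector, the target
    format a multi-label classification head is trained against."""
    index = {label: i for i, label in enumerate(label_set)}
    vec = [0] * len(label_set)
    for label in event_labels:
        vec[index[label]] = 1
    return vec

# A sentence coded as both a protest and an arrest event:
vec = to_multi_hot(["PROTEST", "ARREST"])
```

A multi-task model then predicts one such vector per subtask (actors, actions, locations) from shared multilingual encoder representations.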
  5. Recent years have witnessed the enormous success of text representation learning in a wide range of text mining tasks. Earlier word embedding learning approaches represent words as fixed low-dimensional vectors to capture their semantics. The word embeddings so learned are used as the input features of task-specific models. Recently, pre-trained language models (PLMs), which learn universal language representations via pre-training Transformer-based neural models on large-scale text corpora, have revolutionized the natural language processing (NLP) field. Such pre-trained representations encode generic linguistic features that can be transferred to almost any text-related application. PLMs outperform previous task-specific models in many applications as they only need to be fine-tuned on the target corpus instead of being trained from scratch. In this tutorial, we introduce recent advances in pre-trained text embeddings and language models, as well as their applications to a wide range of text mining tasks. Specifically, we first overview a set of recently developed self-supervised and weakly-supervised text embedding methods and pre-trained language models that serve as the fundamentals for downstream tasks. We then present several new methods based on pre-trained text embeddings and language models for various text mining applications such as topic discovery and text classification. We focus on methods that are weakly-supervised, domain-independent, language-agnostic, effective, and scalable for mining and discovering structured knowledge from large-scale text corpora. Finally, we demonstrate with real-world datasets how pre-trained text representations help mitigate the human annotation burden and facilitate automatic, accurate, and efficient text analyses.
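The weakly-supervised setup mentioned here means each class is specified by a handful of seed words rather than annotated documents. The toy matcher below conveys only that setup; real weakly-supervised methods in this line of work rely on embeddings and PLM representations rather than exact token overlap, and the classes and seed words are invented for illustration.

```python
# Each class is described only by a few seed words -- no labeled documents.
SEEDS = {
    "sports": {"game", "team", "score"},
    "politics": {"election", "policy", "senate"},
}

def classify(doc, seeds=SEEDS):
    """Assign the class whose seed words overlap the document the most.
    This exact-match overlap stands in for the semantic similarity that
    embedding-based weak supervision would compute."""
    tokens = set(doc.lower().split())
    return max(seeds, key=lambda c: len(tokens & seeds[c]))

label = classify("The senate passed the policy after the election")
```

Replacing exact token overlap with similarity in a pre-trained embedding space is what lets such methods generalize beyond the literal seed words.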