Analyzing conflicts and political violence around the world is a persistent challenge in the political science and policy communities due in large part to the vast volumes of specialized text needed to monitor conflict and violence on a global scale. To help advance research in political science, we introduce ConfliBERT, a domain-specific pre-trained language model for conflict and political violence. We first gather a large domain-specific text corpus for language modeling from various sources. We then build ConfliBERT using two approaches: pre-training from scratch and continual pre-training. To evaluate ConfliBERT, we collect 12 datasets and implement 18 tasks to assess the models’ practical application in conflict research. Finally, we evaluate several versions of ConfliBERT in multiple experiments. Results consistently show that ConfliBERT outperforms BERT when analyzing political violence and conflict.
Bias in Text Analysis for International Relations Research
Abstract: How international is political text-analysis research? In computational text analysis, corpus selection skews heavily toward English-language sources and reflects a Western bias that influences the scope, interpretation, and generalizability of research on international politics. For example, corpus selection bias can affect our understanding of alliances and alignments, internal dynamics of authoritarian regimes, durability of treaties, the onset of genocide, and the formation and dissolution of non-state actor groups. Yet, there are issues along the entire “value chain” of corpus production that affect research outcomes and the conclusions we draw about things in the world. I identify three issues in the data-generating process pertaining to discourse analysis of political phenomena: information deficiencies that lead to corpus selection and analysis bias; problems regarding document preparation, such as the availability and quality of corpora from non-English sources; and gaps in the linguistic analysis pipeline. Short-term interventions for incentivizing this agenda include special journal issues, conference workshops, and mentoring and training students in international relations in this methodology. Longer-term solutions to these issues include promoting multidisciplinary collaboration, training students in computational discourse methods, promoting foreign language proficiency, and co-authorship across global regions that may help scholars to learn more about global problems through primary documents.
- Award ID(s): 2117009
- PAR ID: 10352751
- Date Published:
- Journal Name: Global Studies Quarterly
- Volume: 2
- Issue: 3
- ISSN: 2634-3797
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
-
Rich classroom discussion, or discourse, has long been a recommended pedagogical practice in K-12 math and science education. Research shows that discourse is beneficial for all learners, but especially for English learners and minoritized students in STEM. Discourse helps develop students' agency, academic language, and conceptual understanding. Some K-12 computer science (CS) curricula incorporate student discourse, but we believe it is under-used. In this paper, we review how discourse helps students learn, discuss the use of discourse in CS and math education, share ideas for promoting discourse in CS classrooms, and call on curriculum developers, teacher professional learning providers, and researchers to support the increased use of discourse in K-12 CS education.
-
Multilingual transformer language models have recently attracted much attention from researchers and are used in cross-lingual transfer learning for many NLP tasks such as text classification and named entity recognition. However, similar methods for transfer learning from monolingual text to code-switched text have not been extensively explored, mainly due to the following challenges: (1) a code-switched corpus, unlike a monolingual corpus, consists of more than one language, so existing methods cannot be applied efficiently; (2) a code-switched corpus usually combines resource-rich and low-resource languages, so when multilingual pre-trained language models are used, the final model might be biased toward the resource-rich language. In this paper, we focus on code-switched sentiment analysis where we have a labelled resource-rich language dataset and unlabelled code-switched data. We propose a framework that takes the distinction between resource-rich and low-resource language into account. Instead of training on the entire code-switched corpus at once, we create buckets based on the fraction of words in the resource-rich language and progressively train from resource-rich language dominated samples to low-resource language dominated samples. Extensive experiments across multiple language pairs demonstrate that progressive training helps low-resource language dominated samples.
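The bucketing step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`resource_rich_fraction`, `make_buckets`, `progressive_stages`), the equal-width bucket boundaries, the number of buckets, and the token-level language test are all assumptions made for illustration. The sketch only orders samples from resource-rich dominated to low-resource dominated; the actual training step is left out.

```python
from typing import Callable, Iterator, List, Sequence, Tuple

Sample = Sequence[str]  # one tokenized code-switched sentence


def resource_rich_fraction(tokens: Sample,
                           is_resource_rich: Callable[[str], bool]) -> float:
    """Fraction of tokens judged to be in the resource-rich language."""
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if is_resource_rich(t)) / len(tokens)


def make_buckets(samples: Sequence[Sample],
                 is_resource_rich: Callable[[str], bool],
                 num_buckets: int = 3) -> List[List[Sample]]:
    """Group samples into equal-width buckets (an assumed scheme).

    Bucket 0 holds the most resource-rich dominated samples and the last
    bucket holds the most low-resource dominated ones.
    """
    if num_buckets < 1:
        raise ValueError("num_buckets must be at least 1")
    buckets: List[List[Sample]] = [[] for _ in range(num_buckets)]
    for tokens in samples:
        frac = resource_rich_fraction(tokens, is_resource_rich)
        # frac == 1.0 maps to bucket 0; frac == 0.0 is clamped to the last one.
        index = min(int((1.0 - frac) * num_buckets), num_buckets - 1)
        buckets[index].append(tokens)
    return buckets


def progressive_stages(
        buckets: List[List[Sample]]) -> Iterator[Tuple[int, List[Sample]]]:
    """Yield non-empty buckets in training order, resource-rich first."""
    for stage, bucket in enumerate(buckets):
        if bucket:
            yield stage, bucket


if __name__ == "__main__":
    # Hypothetical toy vocabulary standing in for a real language identifier.
    english = {"the", "movie", "was", "great"}
    data = [["the", "movie", "was", "great"],
            ["movie", "bahut", "accha", "tha"]]
    for stage, bucket in progressive_stages(
            make_buckets(data, lambda t: t in english)):
        print(stage, bucket)  # a real pipeline would train on `bucket` here
```

A training loop would consume `progressive_stages` in order, updating the model on each bucket before moving to the next; how unlabelled code-switched samples receive training signal is not specified here and would follow the paper itself.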
-
Pragmatic markers (PMs) are multifunctional elements that allow language users to organize and coordinate discourse and to express their attitudes and cognitive states. This study compares the discourse-pragmatic functions and distributional features of four PMs in Kwéyòl Donmnik (konsa ‘so’, èben ‘well’, papa/Bondyé ‘papa/God’, la ‘there’) with those of their etyma in French, Kwéyòl’s lexifier ((ou) comme ça, (eh) ben, bon Dieu, là), and with their counterparts in English, the colonial source language with which it has been in contact for more than two centuries (so, well, oh my God, there). The properties of the Kwéyòl PMs are determined through a corpus analysis and are then compared to descriptions of the French and English PMs in previous studies. Each of the four Kwéyòl PMs has functions in common with its French etymon and its English counterpart as well as its own unique functions. In addition, English so performs functions in the Kwéyòl data that are unique to Kwéyòl konsa ‘so’, suggesting that so is being integrated into Kwéyòl. This study expands the limited body of work on Kwéyòl and deepens our understanding of language contact and Creole emergence at the discourse-pragmatic level, particularly in cases involving a second, non-lexifier colonial source language.
-
A common factor in bias measurement methods is the use of hand-curated seed lexicons, but there remains little guidance for their selection. We gather seeds used in prior work, documenting their common sources and rationales, and in case studies of three English-language corpora, we enumerate the different types of social biases and linguistic features that, once encoded in the seeds, can affect subsequent bias measurements. Seeds developed in one context are often re-used in other contexts, but documentation and evaluation remain necessary precursors to relying on seeds for sensitive measurements.