Political and social scientists monitor, analyze and predict political unrest and violence, preventing (or mitigating) harm, and promoting the management of global conflict. They do so using event coder systems, which extract structured representations from news articles to design forecast models and event-driven continuous monitoring systems. Existing methods rely on expensive manual annotated dictionaries and do not support multilingual settings. To advance the global conflict management, we propose a novel model, Multi-CoPED (Multilingual Multi-Task Learning BERT for Coding Political Event Data), by exploiting multi-task learning and state-of-the-art language models for coding multilingual political events. This eliminates the need for expensive dictionaries by leveraging BERT models' contextual knowledge through transfer learning. The multilingual experiments demonstrate the superiority of Multi-CoPED over existing event coders, improving the absolute macro-averaged F1-scores by 23.3% and 30.7% for coding events in English and Spanish corpus, respectively. We believe that such expressive performance improvements can help to reduce harms to people at risk of violence. 
                        more » 
                        « less   
                    
                            
                            Automatic Event Coding Framework for Spanish Political News Articles
                        
                    
    
            Today, Spanish speaking countries face widespread political crisis. These political conflicts are published in a large volume of Spanish news articles from Spanish agencies. Our goal is to create a fully functioning system that parses realtime Spanish texts and generates scalable event code. Rather than translating Spanish text into English text and using English event coders, we aim to create a tool that uses raw Spanish text and Spanish event coders for better flexibility, coverage, and cost.To accommodate the processing of a large number of Spanish articles, we adapt a distributed framework based on Apache Spark. We highlight how to extend the existing ontology to provide support for the automated coding process for Spanish texts. We also present experimental data to provide insight into the data collection process with filtering unrelated articles, scaling the framework, and gathering basic statistics on the dataset. 
        more » 
        « less   
        
    
                            - Award ID(s):
- 1931541
- PAR ID:
- 10376287
- Date Published:
- Journal Name:
- 2020 IEEE 6th Intl Conference on Big Data Security on Cloud (BigDataSecurity), IEEE Intl Conference on High Performance and Smart Computing, (HPSC) and IEEE Intl Conference on Intelligent Data and Security (IDS)
- Page Range / eLocation ID:
- 246 to 253
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
- 
            
- 
            Automated text simplification aims to produce simple versions of complex texts. This task is especially useful in the medical domain, where the latest medical findings are typically communicated via complex and technical articles. This creates barriers for laypeople seeking access to up-to-date medical findings, consequently impeding progress on health literacy. Most existing work on medical text simplification has focused on monolingual settings, with the result that such evidence would be available only in just one language (most often, English). This work addresses this limitation via multilingual simplification, i.e., directly simplifying complex texts into simplified texts in multiple languages. We introduce MultiCochrane, the first sentence-aligned multilingual text simplification dataset for the medical domain in four languages: English, Spanish, French, and Farsi. We evaluate fine-tuned and zero-shot models across these languages with extensive human assessments and analyses. Although models can generate viable simplified texts, we identify several outstanding challenges that this dataset might be used to address.more » « less
- 
            Three studies developed and validated a linguistic dictionary to measure negative affective polarization in English and Spanish political texts. It captures three dimensions: negative affect, delegitimization, and political context. In the first study, two independent judges evaluated the candidate words, and reliability indicators were calculated, showing acceptable values for short texts (.572 in English, .541 in Spanish) and higher values for larger corpora (.964 in English, .957 in Spanish). The second study tested discriminant validity by comparing negative affective polarization scores in social media comments on politics and entertainment. Results showed significantly higher polarization scores in political content, confirming the dictionary's validity. The third study compared the dictionary to an existing online polarization measure, finding greater coverage and alignment with the construct. Additionally, it was observed that polarization scores were higher in texts containing hate speech compared to those where it was absent. The findings suggest that the dictionary in both languages have strong psychometric properties, making it a valuable tool for analyzing online content, particularly social media comments. It can be used as an independent measure or as input for machine and deep learning models.more » « less
- 
            Political scientists and security agencies increasingly rely on computerized event data generation to track conflict processes and violence around the world. However, most of these approaches rely on pattern-matching techniques constrained by large dictionaries that are too costly to develop, update, or expand to emerging domains or additional languages. In this paper, we provide an effective solution to those challenges. Here we develop the 3M-Transformers (Multilingual, Multi-label, Multitask) approach for Event Coding from domain specific multilingual corpora, dispensing external large repositories for such task, and expanding the substantive focus of analysis to organized crime, an emerging concern for security research. Our results indicate that our 3M-Transformers configurations outperform state-of-the-art usual Transformers models (BERT and XLM-RoBERTa) for coding events on actors, actions and locations in English, Spanish, and Portuguese languages.more » « less
- 
            Ideology is at the core of political science research. Yet, there still does not exist general-purpose tools to characterize and predict ideology across different genres of text. To this end, we study Pretrained Language Models using novel ideology-driven pretraining objectives that rely on the comparison of articles on the same story written by media of different ideologies. We further collect a large-scale dataset, consisting of more than 3.6M political news articles, for pretraining. Our model POLITICS outperforms strong baselines and the previous state-of-the-art models on ideology prediction and stance detection tasks. Further analyses show that POLITICS is especially good at understanding long or formally written texts, and is also robust in few-shot learning scenarios.more » « less
 An official website of the United States government
An official website of the United States government 
				
			 
					 
					
 
                                    