Abstract How international is political text-analysis research? In computational text analysis, corpus selection skews heavily toward English-language sources and reflects a Western bias that influences the scope, interpretation, and generalizability of research on international politics. For example, corpus selection bias can affect our understanding of alliances and alignments, internal dynamics of authoritarian regimes, durability of treaties, the onset of genocide, and the formation and dissolution of non-state actor groups. Yet, there are issues along the entire “value chain” of corpus production that affect research outcomes and the conclusions we draw about the world. I identify three issues in the data-generating process pertaining to discourse analysis of political phenomena: information deficiencies that lead to corpus selection and analysis bias; problems regarding document preparation, such as the availability and quality of corpora from non-English sources; and gaps in the linguistic analysis pipeline. Short-term interventions for incentivizing this agenda include special journal issues, conference workshops, and mentoring and training students in international relations in this methodology. Longer-term solutions include promoting multidisciplinary collaboration, training students in computational discourse methods, promoting foreign language proficiency, and encouraging co-authorship across global regions, which may help scholars learn more about global problems through primary documents.
Triangulation of Data Using Textual Data
This essay describes essential considerations and select methods in computational text analysis for use in the study of history. We explore specific approaches that can be used for understanding conceptual change over time in a large corpus of documents. By way of example, using a corpus of 27,977 articles collected on the microbiome, this paper studies: 1) the general microbiome discourse from 2001 to 2010; 2) the usage and sense of the word “human” from 2001 to 2010; and 3) shifts in the microbiome discourse over the same period.
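The diachronic usage tracking described above can be approximated with per-year relative frequencies of a target word. A minimal sketch follows; the toy corpus and the `yearly_relative_frequency` helper are illustrative stand-ins, not the paper's actual data or method:

```python
from collections import Counter

def yearly_relative_frequency(docs, target="human"):
    """Relative frequency of `target` per year across tokenized documents.

    docs: iterable of (year, text) pairs -- a stand-in for a real corpus.
    Returns {year: count(target) / total_tokens_that_year}.
    """
    counts, totals = Counter(), Counter()
    for year, text in docs:
        tokens = text.lower().split()
        totals[year] += len(tokens)
        counts[year] += sum(t == target for t in tokens)
    return {y: counts[y] / totals[y] for y in totals}

# Toy example: two "years" of microbiome-style snippets.
docs = [
    (2001, "the human gut microbiome"),
    (2010, "human microbiome studies of human health"),
]
freqs = yearly_relative_frequency(docs)
```

A real study would use a proper tokenizer and normalize for corpus growth per year; sense-level analysis (as opposed to raw usage) would additionally require word-sense disambiguation or diachronic embeddings.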
- Award ID(s): 1734236
- PAR ID: 10093335
- Date Published:
- Journal Name: Isis
- Volume: 110
- Issue: 3
- ISSN: 0021-1753
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
- Much computational work has been done on identifying and interpreting the meaning of metaphors, but little work has been done on understanding the motivation behind the use of metaphor. To computationally model discourse and social positioning in metaphor, we need a corpus annotated with metaphors relevant to speaker intentions. This paper reports a corpus study as a first step towards computational work on social and discourse functions of metaphor. We use Amazon Mechanical Turk (MTurk) to annotate data from three web discussion forums covering distinct domains. We then compare these to annotations from our own annotation scheme, which distinguishes levels of metaphor with the labels: nonliteral, conventionalized, and literal. Our hope is that this work raises questions about what new work needs to be done in order to address the question of how metaphors are used to achieve social goals in interaction.
- This paper presents a data-driven study focusing on analyzing and predicting sentence deletion — a prevalent but understudied phenomenon in document simplification — on a large English text simplification corpus. We inspect various document and discourse factors associated with sentence deletion, using a new manually annotated sentence alignment corpus we collected. We reveal that professional editors utilize different strategies to meet readability standards of elementary and middle schools. To predict whether a sentence will be deleted during simplification to a certain level, we harness automatically aligned data to train a classification model. Evaluated on our manually annotated data, our best models reached F1 scores of 65.2 and 59.7 for this task at the elementary and middle school levels, respectively. We find that discourse-level factors contribute to the challenging task of predicting sentence deletion for simplification.
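The feature-based prediction setup described above can be illustrated with a toy sketch. The feature names and the rule-based baseline below are hypothetical stand-ins for the paper's trained classifier and feature set, shown only to make the task concrete:

```python
def deletion_features(sentence, position, doc_length):
    """Simple document/discourse features of the kind associated with
    sentence deletion (illustrative, not the paper's feature set)."""
    tokens = sentence.split()
    return {
        "length": len(tokens),
        "relative_position": position / max(doc_length - 1, 1),
        "has_discourse_connective": tokens[0].lower()
        in {"however", "moreover", "therefore"},
    }

def predict_deleted(features, length_threshold=25):
    """Toy rule baseline: long sentences appearing late in a document are
    flagged as deletion candidates (a stand-in for a trained model)."""
    return (features["length"] > length_threshold
            and features["relative_position"] > 0.5)

# Example: a short sentence opening with a discourse connective,
# placed last in a 10-sentence document.
feats = deletion_features(
    "However , the committee ultimately rejected the proposal .",
    position=9, doc_length=10)
```

A real system would feed such features (alongside lexical and alignment signals) into a supervised classifier trained on the automatically aligned data.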
- While there has been substantial progress in text comprehension through simple factoid question answering, more holistic comprehension of a discourse still presents a major challenge (Dunietz et al., 2020). Someone critically reflecting on a text as they read it will pose curiosity-driven, often open-ended questions, which reflect deep understanding of the content and require complex reasoning to answer (Ko et al., 2020; Westera et al., 2020). A key challenge in building and evaluating models for this type of discourse comprehension is the lack of annotated data, especially since collecting answers to such questions imposes a high cognitive load on annotators. This paper presents a novel paradigm that enables scalable data collection targeting the comprehension of news documents, viewing these questions through the lens of discourse. The resulting corpus, DCQA (Discourse Comprehension by Question Answering), captures both discourse and semantic links between sentences in the form of free-form, open-ended questions. On an evaluation set that we annotated using questions from Ko et al. (2020), we show that DCQA provides valuable supervision for answering open-ended questions. We additionally design pre-training methods utilizing existing question-answering resources, and use synthetic data to accommodate unanswerable questions.
- Research has shown that accounting for moral sentiment in natural language can yield insight into a variety of on- and off-line phenomena such as message diffusion, protest dynamics, and social distancing. However, measuring moral sentiment in natural language is challenging, and the difficulty of this task is exacerbated by the limited availability of annotated data. To address this issue, we introduce the Moral Foundations Twitter Corpus, a collection of 35,108 tweets that have been curated from seven distinct domains of discourse and hand-annotated by at least three trained annotators for 10 categories of moral sentiment. To facilitate investigations of annotator response dynamics, we also provide psychological and demographic metadata for each annotator. Finally, we report moral sentiment classification baselines for this corpus using a range of popular methodologies.
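One common family of baselines for moral sentiment measurement counts hits against a lexicon of foundation-associated words. The sketch below is in that spirit; the tiny `MFD_TOY` lexicon and the function name are illustrative only, not the corpus paper's actual baselines:

```python
# Minimal lexicon-count scorer in the spirit of moral foundations
# dictionary methods. Real lexicons contain thousands of entries
# per foundation; this toy version covers two foundations.
MFD_TOY = {
    "care": {"harm", "hurt", "protect", "care", "suffer"},
    "fairness": {"fair", "unfair", "justice", "cheat", "equal"},
}

def moral_foundation_counts(text):
    """Count lexicon hits per foundation in a whitespace-tokenized text."""
    tokens = text.lower().split()
    return {f: sum(t in words for t in tokens) for f, words in MFD_TOY.items()}

counts = moral_foundation_counts("protect the children from harm")
```

Supervised baselines trained on the annotated tweets (e.g., bag-of-words or pretrained-embedding classifiers) typically outperform such lexicon counts, which is one motivation for releasing an annotated corpus.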
 An official website of the United States government