Title: Adaptive Ensembling: Unsupervised Domain Adaptation for Political Document Analysis
Insightful findings in political science often require researchers to analyze documents of a particular subject or type, yet these documents are usually embedded in large corpora that do not distinguish pertinent from non-pertinent documents. Conversely, corpora that do label relevant documents often have limitations (e.g., coverage of a single source or era) that prevent their use for political science research. To bridge this gap, we present adaptive ensembling, an unsupervised domain adaptation framework equipped with a novel text classification model and time-aware training to ensure our methods work well with diachronic corpora. Experiments on an expert-annotated dataset show that our framework outperforms strong benchmarks. Further analysis indicates that our methods are more stable, learn better representations, and extract cleaner corpora for fine-grained analysis.
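The abstract does not detail the training procedure; adaptive ensembling is related to self-ensembling approaches to unsupervised domain adaptation, in which a teacher network maintained as an exponential moving average (EMA) of the student's weights supervises the student on unlabeled target-domain documents through a consistency loss. The sketch below is a minimal, generic illustration of that idea in PyTorch, not the paper's exact method; the decay constant alpha and the squared-error consistency term are illustrative choices.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def update_teacher(student: torch.nn.Module,
                   teacher: torch.nn.Module,
                   alpha: float = 0.99) -> None:
    """EMA update: the teacher becomes a temporal ensemble of past
    student weights (alpha is an illustrative decay constant)."""
    for t_p, s_p in zip(teacher.parameters(), student.parameters()):
        t_p.mul_(alpha).add_(s_p, alpha=1.0 - alpha)

def consistency_loss(student_logits: torch.Tensor,
                     teacher_logits: torch.Tensor) -> torch.Tensor:
    """Penalize student/teacher disagreement on unlabeled target-domain
    documents; no target labels are needed."""
    return F.mse_loss(student_logits.softmax(dim=-1),
                      teacher_logits.softmax(dim=-1).detach())
```

In such a setup, update_teacher would be called after each optimizer step, and the consistency term added to the supervised loss on labeled source-domain documents.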
Award ID(s): 1850153
NSF-PAR ID: 10165663
Journal Name: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)
Page Range / eLocation ID: 4717 to 4729
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like This
  1. How international is political text-analysis research? In computational text analysis, corpus selection skews heavily toward English-language sources and reflects a Western bias that influences the scope, interpretation, and generalizability of research on international politics. For example, corpus selection bias can affect our understanding of alliances and alignments, the internal dynamics of authoritarian regimes, the durability of treaties, the onset of genocide, and the formation and dissolution of non-state actor groups. Yet there are issues along the entire “value chain” of corpus production that affect research outcomes and the conclusions we draw about the world. I identify three issues in the data-generating process pertaining to discourse analysis of political phenomena: information deficiencies that lead to corpus selection and analysis bias; problems regarding document preparation, such as the availability and quality of corpora from non-English sources; and gaps in the linguistic analysis pipeline. Short-term interventions for incentivizing this agenda include special journal issues, conference workshops, and mentoring and training of international relations students in this methodology. Longer-term solutions include promoting multidisciplinary collaboration, training students in computational discourse methods, promoting foreign-language proficiency, and encouraging co-authorship across global regions so that scholars can learn more about global problems through primary documents.
  2. The ubiquitous pollution of the environment with microplastics, a diverse suite of contaminants, is of growing concern and currently receives considerable public, political, and academic attention. The potential impact of microplastics in the environment has prompted a great deal of research in recent years. Many diverse methods have been developed to answer different questions about microplastic pollution, from its sources, transport, and fate in the environment to its effects on humans and wildlife. These methods are often insufficiently described, making studies neither comparable nor reproducible, which hampers both new microplastic investigations and the cross-study syntheses needed to answer larger-scale questions. Our diverse group of 23 researchers believes these issues can begin to be overcome through the adoption of a set of reporting guidelines; the collaboration itself was created using an open science framework that we detail for future use. Here, we suggest harmonized reporting guidelines for microplastic studies in environmental and laboratory settings through all steps of a typical study, including best practices for reporting materials, quality assurance/quality control, data, field sampling, sample preparation, microplastic identification, microplastic categorization, microplastic quantification, and considerations for toxicology studies. We developed three easy-to-use documents (a detailed guide, a checklist, and a mind map) that can be used to reference the reporting guidelines quickly. We intend these reporting guidelines to support the annotation, dissemination, interpretation, reviewing, and synthesis of microplastic research. Through open-access licensing (CC BY 4.0), these documents aim to increase the validity, reproducibility, and comparability of studies in this field for the benefit of the global community.
  3. Background: Text recycling (hereafter TR)—the reuse of one’s own textual materials from one document in a new document—is a common but hotly debated and unsettled practice in many academic disciplines, especially in the context of peer-reviewed journal articles. Although several analytic systems have been used to determine replication of text—for example, for purposes of identifying plagiarism—they do not offer an optimal way to compare documents to determine the nature and extent of TR in order to study and theorize this as a practice in different disciplines. In this article, we first describe TR as a common phenomenon in academic publishing, then explore the challenges associated with trying to study the nature and extent of TR within STEM disciplines. We then describe in detail the complex processes we used to create a system for identifying TR across large corpora of texts, and the sentence-level string-distance lexical methods used to refine and test the system (White & Joy, 2004). The purpose of creating such a system is to identify legitimate cases of TR across large corpora of academic texts in different fields of study, allowing meaningful cross-disciplinary comparisons in future analyses of published work. The findings from such investigations will extend and refine our understanding of discourse practices in academic and scientific settings.

Literature Review: Text-analytic methods have been widely developed and implemented to identify reused textual materials for detecting plagiarism, and there is considerable literature on such methods. (Instead of detailing this literature here, we point readers to several recent reviews: Gupta, 2016; Hiremath & Otari, 2014; and Meuschke & Gipp, 2013.) Such methods include fingerprinting, term occurrence analysis, citation analysis (identifying similarity in references and citations), and stylometry (statistically comparing authors’ writing styles; see Meuschke & Gipp, 2013). Although TR occurs in a wide range of situations, recent debate has focused on recycling from one published research paper to another—particularly in STEM fields (see, for example, Andreescu, 2013; Bouville, 2008; Bretag & Mahmud, 2009; Roig, 2008; Scanlon, 2007). An important step in better understanding the practice is seeing how authors actually recycle material in their published work. Standard methods for detecting plagiarism are not directly suitable for this task, as the objective is not to determine the presence or absence of reuse itself, but to study the types and patterns of reuse, including materials that are syntactically but not substantively distinct—such as “patchwriting” (Howard, 1999). In the present account of our efforts to create a text-analytic system for determining TR, we take a conventional alphabetic approach to text, in part because we did not aim at this stage of our project to analyze non-discursive text such as images or other media. However, although the project adheres to conventional definitions of text, with a focus on lexical replication, we also subscribe to context-sensitive approaches to text production: the results of applying the system to large corpora of published texts can potentially reveal varieties in the practice of TR as a function of different discourse communities and disciplines, since writers’ decisions within what appear to be canonical genres are contingent, based on adherence to or deviation from existing rules and procedures if and when these actually exist.

Our goal is to create a system for analyzing TR in groups of texts produced by the same authors in order to determine the nature and extent of TR, especially across disciplinary areas, without judgment of scholars’ use of the practice.
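The abstract describes the comparison system only at a high level. As a minimal sketch of sentence-level string-distance matching, one could score every cross-document sentence pair and flag pairs above a similarity threshold; here difflib's ratio stands in for whatever lexical distance metric the authors adopted, and the function names and 0.8 threshold are hypothetical.

```python
from difflib import SequenceMatcher

def sentence_similarity(a: str, b: str) -> float:
    """Normalized surface similarity between two sentences in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def flag_recycled(doc_a_sents: list, doc_b_sents: list,
                  threshold: float = 0.8) -> list:
    """Return (i, j, score) for sentence pairs whose similarity exceeds
    the (hypothetical) threshold -- a crude proxy for text recycling."""
    hits = []
    for i, s_a in enumerate(doc_a_sents):
        for j, s_b in enumerate(doc_b_sents):
            score = sentence_similarity(s_a, s_b)
            if score >= threshold:
                hits.append((i, j, round(score, 3)))
    return hits
```

Unlike binary plagiarism detection, keeping the full list of scored pairs lets an analyst study the types and patterns of reuse, including near-matches that are syntactically but not substantively distinct.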
  4. Automated event detection from news corpora is a crucial task for mining fast-evolving structured knowledge. Because real-world events have different granularities, from top-level themes to key events and then to event mentions corresponding to concrete actions, there are generally two lines of research: (1) theme detection, which tries to identify from a news corpus major themes (e.g., “2019 Hong Kong Protests” versus “2020 U.S. Presidential Election”) that have very distinct semantics; and (2) action extraction, which aims to extract from a single document mention-level actions (e.g., “the police hit the left arm of the protester”) that are often too fine-grained for comprehending the real-world event. In this paper, we propose a new task, key event detection at the intermediate level, which aims to detect from a news corpus key events (e.g., HK Airport Protest on Aug. 12-14), each happening at a particular time/location and focusing on the same topic. This task can bridge event understanding and structuring and is inherently challenging because of (1) the thematic and temporal closeness of different key events and (2) the scarcity of labeled data due to the fast-evolving nature of news articles. To address these challenges, we develop an unsupervised key event detection framework, EvMine, that (1) extracts temporally frequent peak phrases using a novel ttf-itf score, (2) merges peak phrases into event-indicative feature sets by detecting communities in a peak phrase graph that captures document co-occurrences, semantic similarities, and temporal closeness signals, and (3) iteratively retrieves documents related to each key event by training a classifier with pseudo labels generated automatically from the event-indicative feature sets, refining the detected key events with the retrieved documents in each iteration. Extensive experiments and case studies show that EvMine outperforms all baseline methods and its ablations on two real-world news corpora.
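The ttf-itf score is named but not defined in this abstract. Purely as an illustration, the sketch below treats it as a tf-idf analogue over time windows: a phrase scores highly on a day when it is frequent that day but appears on few other days. The function name and weighting are assumptions, not EvMine's actual formula.

```python
import math
from collections import Counter, defaultdict

def ttf_itf(day_docs):
    """Hypothetical ttf-itf analogue. `day_docs` maps a date string to a
    list of tokenized documents; a phrase scores highly when frequent
    within one day (ttf) yet present on few distinct days (itf). This is
    only a tf-idf-style illustration, not EvMine's actual scoring."""
    day_counts = {day: Counter(tok for doc in docs for tok in doc)
                  for day, docs in day_docs.items()}
    days_with = defaultdict(int)  # number of days each phrase appears on
    for counts in day_counts.values():
        for phrase in counts:
            days_with[phrase] += 1
    n_days = len(day_counts)
    scores = {}
    for day, counts in day_counts.items():
        total = sum(counts.values()) or 1
        for phrase, c in counts.items():
            scores[(day, phrase)] = (c / total) * (
                math.log(n_days / days_with[phrase]) + 1.0)
    return scores
```

Phrases with high scores on a given day would then serve as candidate peak phrases for the graph-based community detection step.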
  5. When journalists cover a news story, they can cover it from multiple angles or perspectives. These perspectives are called “frames,” and the use of one frame or another may influence public perception and opinion of the issue at hand. We develop a web-based system for analyzing frames in multilingual text documents. We propose and guide users through a five-step, end-to-end computational framing analysis framework grounded in media framing theory in communication research. Users can apply the framework to multilingual text data, starting from the exploration of frames in their corpora and a review of previous framing literature (steps 1-3), then moving to frame classification (step 4) and prediction (step 5). The framework combines unsupervised and supervised machine learning and leverages a state-of-the-art (SoTA) multilingual language model, which can significantly enhance frame prediction performance while requiring only a small sample of manual annotations. Through the interactive website, anyone can perform the proposed computational framing analysis, making advanced computational analysis available to researchers without a programming background and bridging the digital divide within the communication research discipline in particular and the academic community in general. The system is available online at http://www.openframing.org, via an API at http://www.openframing.org:5000/docs/, or through our GitHub page at https://github.com/vibss2397/openFraming.
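openFraming itself is used through its website and API, so no code is required; for readers who want a programmatic feel for step 4 (frame classification), the sketch below approximates it with an off-the-shelf multilingual zero-shot classifier from the Hugging Face transformers library. The model name and candidate frame labels are illustrative assumptions, not the system's configuration.

```python
from transformers import pipeline

# Illustrative only: approximates a frame-classification step with a
# multilingual zero-shot NLI model; the model and the frame labels below
# are assumptions, not openFraming's actual setup.
classifier = pipeline("zero-shot-classification",
                      model="joeddav/xlm-roberta-large-xnli")

frames = ["economic", "morality", "security and defense",
          "health and safety"]
result = classifier("El gobierno anunció nuevos aranceles sobre el acero.",
                    candidate_labels=frames)
print(result["labels"][0], round(result["scores"][0], 3))
```

A supervised variant, closer to what the abstract describes, would instead fine-tune the multilingual model on the small set of manual frame annotations produced in the earlier steps.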