Title: textPrep: A Text Preprocessing Toolkit for Topic Modeling on Social Media Data
Award ID(s):
1934925 1934494
NSF-PAR ID:
10280456
Author(s) / Creator(s):
;
Date Published:
Journal Name:
Proceedings of the 10th International Conference on Data Science, Technology and Applications
Page Range / eLocation ID:
60 to 70
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Ruis, Andrew ; Lee, Seung B. (Ed.)
    When text datasets are very large, manually coding line by line becomes impractical. As a result, researchers sometimes try to use machine learning algorithms to automatically code text data. One of the most popular algorithms is topic modeling. For a given text dataset, a topic model provides probability distributions of words for a set of “topics” in the data, which researchers then use to interpret the meaning of the topics. A topic model also gives each document in the dataset a score for each topic, which can be used as a non-binary code for the proportion of the document that belongs to that topic. Unfortunately, it is often difficult to interpret what the topics mean in a defensible way, or to validate document topic proportion scores as meaningful codes. In this study, we examine how keywords from codes developed by human experts were distributed in topics generated from topic modeling. The results show that (1) the top keywords of a single topic often contain words from multiple human-generated codes; and conversely, (2) words from human-generated codes appear as high-probability keywords in multiple topics. These results explain why directly using topics from topic models as codes is problematic. However, they also imply that topic modeling makes it possible for researchers to discover codes from short word lists.
  2. null (Ed.)
  3. Protest event analysis is an important method for the study of collective action and social movements and typically draws on traditional media reports as the data source. We introduce collective action from social media (CASM)—a system that uses convolutional neural networks on image data and recurrent neural networks with long short-term memory on text data in a two-stage classifier to identify social media posts about offline collective action. We implement CASM on Chinese social media data and identify more than 100,000 collective action events from 2010 to 2017 (CASM-China). We evaluate the performance of CASM through cross-validation, out-of-sample validation, and comparisons with other protest data sets. We assess the effect of online censorship and find it does not substantially limit our identification of events. Compared to other protest data sets, CASM-China identifies relatively more rural, land-related protests and relatively few collective action events related to ethnic and religious conflict. 