Using Topic Modeling for Code Discovery in Large Scale Text Data

Cai, Zhiqiang; Siebert-Evenstone, Amanda; Eagan, Brendan R.; Shaffer, David W.

doi:https://doi.org/10.1007/978-3-030-67788-6_2

When text datasets are very large, manually coding line by line becomes impractical. As a result, researchers sometimes try to use machine learning algorithms to automatically code text data. One of the most popular algorithms is topic modeling. For a given text dataset, a topic model provides probability distributions of words for a set of “topics” in the data, which researchers then use to interpret the meaning of the topics. A topic model also gives each document in the dataset a score for each topic, which can be used as a non-binary code indicating what proportion of the document belongs to that topic. Unfortunately, it is often difficult to interpret what the topics mean in a defensible way, or to validate document topic proportion scores as meaningful codes. In this study, we examine how keywords from codes developed by human experts were distributed in topics generated from topic modeling. The results show that (1) the top keywords of a single topic often contain words from multiple human-generated codes; and conversely, (2) words from human-generated codes appear as high-probability keywords in multiple topics. These results explain why directly using topics from topic models as codes is problematic. However, they also imply that topic modeling makes it possible for researchers to discover codes from short word lists.
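The kind of comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' procedure: the topic names, words, probabilities, code names, and the helper functions `top_keywords`, `codes_per_topic`, and `topics_per_code` are all invented for illustration. The sketch assumes a topic model has already produced a word probability distribution for each topic, and it checks which human-generated codes contribute words to each topic's top keywords, and in how many topics each code's words appear.

```python
# Toy illustration only: the topics, words, probabilities, and code names
# below are invented for this sketch and do not come from the paper.

# Hypothetical topic model output: topic -> {word: probability}.
topic_word_probs = {
    "topic_0": {"budget": 0.30, "cost": 0.25, "safety": 0.20, "design": 0.15, "team": 0.10},
    "topic_1": {"design": 0.35, "safety": 0.30, "test": 0.20, "cost": 0.10, "team": 0.05},
}

# Hypothetical human-generated codes: code -> set of keywords.
code_keywords = {
    "FINANCE": {"budget", "cost"},
    "RISK": {"safety", "test"},
}


def top_keywords(word_probs, n):
    """Return the n highest-probability words of one topic."""
    ranked = sorted(word_probs.items(), key=lambda item: item[1], reverse=True)
    return [word for word, _ in ranked[:n]]


def codes_per_topic(topic_word_probs, code_keywords, n):
    """For each topic, list the codes whose keywords appear among its top n words."""
    result = {}
    for topic, word_probs in topic_word_probs.items():
        top = set(top_keywords(word_probs, n))
        result[topic] = sorted(
            code for code, words in code_keywords.items() if words & top
        )
    return result


def topics_per_code(topic_word_probs, code_keywords, n):
    """For each code, list the topics whose top n words include any of its keywords."""
    by_topic = codes_per_topic(topic_word_probs, code_keywords, n)
    return {
        code: sorted(topic for topic, codes in by_topic.items() if code in codes)
        for code in code_keywords
    }


if __name__ == "__main__":
    print(codes_per_topic(topic_word_probs, code_keywords, n=3))
    print(topics_per_code(topic_word_probs, code_keywords, n=3))
```

In this toy data, the top three keywords of `topic_0` draw on both invented codes, and the keywords of the `RISK` code appear among the top keywords of both topics, which mirrors the two patterns the abstract reports.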

Award ID(s): 1661036

PAR ID: 10248616

Author(s) / Creator(s): Cai, Zhiqiang; Siebert-Evenstone, Amanda; Eagan, Brendan R.; Shaffer, David W.

Editor(s): Ruis, Andrew; Lee, Seung B.

Date Published: 2021-01-01

Journal Name: Advances in Quantitative Ethnography: Second International Conference, ICQE 2020, Malibu, CA, USA, February 1-3, 2021, Proceedings

Page Range / eLocation ID: 18 - 31

Format(s): Medium: X

Sponsoring Org: National Science Foundation

Free Publicly Accessible Full Text
Accepted Manuscript 1.0
Conference Paper:
https://doi.org/10.1007/978-3-030-67788-6_2
