Ruis, Andrew; Lee, Seung B. (Eds.)
When text datasets are very large, manually coding line by line
becomes impractical. As a result, researchers sometimes try to use machine learning
algorithms to code text data automatically. One of the most popular approaches
is topic modeling. For a given text dataset, a topic model provides a probability distribution
over words for each of a set of “topics” in the data, which researchers then use
to interpret the meaning of the topics. A topic model also gives each document in
the dataset a score for each topic, which can be used as a non-binary code indicating
what proportion of each topic is present in the document. Unfortunately, it is often difficult to
interpret what the topics mean in a defensible way, or to validate document-topic
proportion scores as meaningful codes. In this study, we examine how keywords
from codes developed by human experts were distributed in topics generated from
topic modeling. The results show that (1) the top keywords of a single topic often contain
words from multiple human-generated codes; and conversely, (2) words from
a single human-generated code appear as high-probability keywords in multiple topics.
These results explain why directly using topics from topic models as codes is
problematic. However, they also imply that topic modeling makes it possible for
researchers to discover codes from short word lists.
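To make the two topic-model outputs described above concrete, the following is a minimal sketch that fits a small LDA topic model with scikit-learn and prints the per-topic word distributions and per-document topic proportions. The toy corpus, the number of topics, and the choice of scikit-learn's LatentDirichletAllocation are illustrative assumptions, not the pipeline used in the study.

    # Minimal sketch (not the authors' pipeline): fit a topic model and inspect
    # its two outputs -- per-topic word distributions and per-document topic
    # proportions. Corpus and parameters are purely illustrative.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    docs = [
        "students collaborate to design the rescue robot",
        "the team discusses sensor data and client requirements",
        "engineers model power constraints for the design",
    ]

    # Bag-of-words counts required by LDA.
    vectorizer = CountVectorizer(stop_words="english")
    counts = vectorizer.fit_transform(docs)

    # Fit an LDA topic model with a small, arbitrary number of topics.
    lda = LatentDirichletAllocation(n_components=2, random_state=0)
    doc_topic = lda.fit_transform(counts)  # each row sums to 1: topic proportions per document

    # Per-topic word distributions: normalize the fitted component weights.
    topic_word = lda.components_ / lda.components_.sum(axis=1, keepdims=True)

    vocab = vectorizer.get_feature_names_out()
    for k, dist in enumerate(topic_word):
        top = dist.argsort()[::-1][:5]  # indices of the 5 highest-probability words
        print(f"topic {k}:", [vocab[i] for i in top])

    print("document-topic proportions:\n", doc_topic.round(2))

In this sketch, the printed top-keyword lists correspond to the topic word distributions that researchers interpret, and the rows of doc_topic correspond to the non-binary document-topic proportion scores discussed in the abstract.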