<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Authorless Topic Models: Biasing Models Away from Known Structure</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>2018</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10092208</idno>
					<idno type="doi"></idno>
					<title level='j'>COLING</title>
<idno></idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Laure Thompson</author><author>David Mimno</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Most previous work in unsupervised semantic modeling in the presence of metadata has assumed that our goal is to make latent dimensions more correlated with metadata, but in practice the exact opposite is often true. Some users want topic models that highlight differences between, for example, authors, but others seek more subtle connections across authors. We introduce three metrics for identifying topics that are highly correlated with metadata, and demonstrate that this problem affects between 30 and 50% of the topics in models trained on two real-world collections, regardless of the size of the model. We find that we can predict which words cause this phenomenon and that by selectively subsampling these words we dramatically reduce topic-metadata correlation, improve topic stability, and maintain or even improve model quality]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Unsupervised semantic models are a popular and useful method for inferring low-dimensional representations of large text collections. Examples of such models include latent semantic analysis <ref type="bibr">(Deerwester et al., 1990)</ref> and word embeddings <ref type="bibr">(Bengio et al., 2003)</ref>, but for this work we will focus on statistical topic models <ref type="bibr">(Hofmann, 1999;</ref><ref type="bibr">Blei et al., 2002)</ref>, which are used to infer word distributions that correspond to recognizable themes. In practice, collections are often constructed by combining documents from multiple sources, which may have distinctive style and vocabulary. This heterogeneity of sources leads to a serious but rarely studied problem: the strongest, most prominent patterns in a collection may simply repeat the known structure of the corpus. Instead of finding informative, cross-cutting themes, models simply repeat the distinctive vocabulary of the individual sources. The model in this case is "correct" in that it has detected the strongest dimensions of variation, but it tells us nothing we did not already know.</p><p>As a motivating example, we focus on models trained on novels, where it is known that inferred topics are often simply names of characters and settings <ref type="bibr">(Jockers, 2013)</ref>. The words Harry, Ron, and Hermione look to the algorithm like the basis of an ideal topic because they occur very frequently together but not in other contexts. But this topic only tells us which books within a larger corpus are part of the Harry Potter series; themes like friendship, adolescence, and magic remain hidden. This phenomenon is not limited to fiction: we also include a case study of opinions from US state supreme courts. 
Unlike examples from fiction, Maine and Utah both exist in the same universe, but exhibit specific regional term use.</p><p>We begin by demonstrating that the problem of overly source-specific topics is both substantial and measurable. We present three metrics that provide related but distinct views of source specificity. These metrics are orthogonal to existing metrics of topic semantic quality: uselessly source-specific topics are often still highly coherent and meaningful. These metrics are also inversely related to commonly-used document classification evaluations. Learning 20 newsgroup-specific topics from 20 Newsgroups may be informative as an evaluation, but in practice users are rarely unaware of such structure.</p><p>Finally, we present a simple but effective method for reducing the prevalence of source-specific topics. This method relies on probabilistically subsampling words that correlate with known source metadata, and is related to subsampling methods that have been highly effective in word embeddings <ref type="bibr">(Mikolov et al., 2013;</ref><ref type="bibr">Levy et al., 2015)</ref>. The best of the proposed methods substantially reduces source-specific topics, increases topic differentiation without increasing model complexity, and improves topic stability.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Related Work</head><p>The common assumption of prior work on metadata-aware topic modeling has been that metadata provides valuable hints that can be used to improve topics. Several methods use document metadata to influence document-level topic distributions. The author-topic model <ref type="bibr">(Rosen-Zvi et al., 2004)</ref>, relational topic model <ref type="bibr">(Chang and Blei, 2009)</ref>, and labeled LDA <ref type="bibr">(Ramage et al., 2009)</ref> extend LDA by directly incorporating a particular type of metadata (e.g. author information, document links, user-generated tags) into the model. Others, like factorial LDA <ref type="bibr">(Paul and Dredze, 2012)</ref>, Dirichlet-multinomial regression topic models <ref type="bibr">(Mimno and McCallum, 2008)</ref>, and structural topic models <ref type="bibr">(Roberts et al., 2014)</ref> incorporate more general categories of metadata. All of these aim to increase dependence between topics and metadata. In contrast, our goal is to make topics independent of specified metadata.</p><p>Other research makes topic-word distributions sensitive to document-level metadata. The special words with background model <ref type="bibr">(Chemudugunta et al., 2006)</ref> incorporates document-specific word distributions into LDA, while cross-collection LDA <ref type="bibr">(Paul, 2009)</ref> incorporates collection level word distributions. The topic-aspect model <ref type="bibr">(Paul and Girju, 2010)</ref> extends LDA to include a mixture of aspects of documents such that aspects affect all topics similarly. Although these models may be able to sequester author-specific words, there is no reason to expect that those words will not also drag along general, cross-cutting words.</p><p>In this paper we focus on ways to explicitly identify words that bias topics towards a specific metadata tag and modify the input corpus for an algorithm to reduce their effect. 
Researchers have often dismissed this sort of data curation as unprincipled and heuristic "preprocessing." More recent work <ref type="bibr">(Denny and Spirling, 2016;</ref><ref type="bibr">Boyd-Graber et al., 2014)</ref> emphasizes that meta-algorithms for data preparation can greatly affect the intrinsic model quality and human interpretability of topic models.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Collections and Models</head><p>We collected two real-world corpora that combine text from multiple distinct sources: science fiction novels and U.S. state supreme court opinions.</p><p>Science Fiction (SCI-FI). We selected 1206 science fiction novels by 245 authors based on award nominations and curated book lists hosted on Worlds Without End. <ref type="foot">2</ref> We consider each author as a source, and treat collaborations as distinct sources. We augmented the corpus with other established authors to increase the diversity of author gender and ethnicity. The novels span from the early 1800s to the present day. Most of these works are currently protected by copyright, so rather than full text we obtained page-level word frequency statistics from the HathiTrust Research Center's Extracted Features Dataset <ref type="bibr">(Capitanu et al., 2016)</ref>. This data indicates, for example, that page 227 of Dune contains one instance of the word storm as a noun. Following previous work <ref type="bibr">(Jockers, 2013)</ref> we divide volume-length works into page-level segments, omitting headers and footers.</p><p>U.S. State Supreme Courts (COURTS). Each U.S. state has a supreme court that decides appeals of decisions made by lower state courts. In this collection each document is a court opinion, written by the court after the completion of a case, that summarizes the case and judgment. We treat each state court as a source, expecting that courts use geographically specific language (e.g. Colorado, Denver, Colo., Boulder) that is not relevant to the legal content of opinions. We examine court opinions for all 50 state supreme courts for cases filed from 2012 through 2016. <ref type="foot">3</ref></p><p>Data Preparation. We apply the same initial treatment to both corpora. Tokens are three or more letter characters with possible internal punctuation (excluding em- and en-dashes). Words are lower-cased. 
To deal with globally frequent terms, we remove words that appear in more than 25% of the documents in a corpus. To reduce the computational burden of a large vocabulary, we remove words occurring in fewer than five documents. We remove all documents with fewer than 20 tokens. This process removes 706 pages and 9192 court opinions from our starting science fiction and state courts corpora, respectively. We train LDA models using Mallet <ref type="bibr">(McCallum, 2002)</ref> with hyperparameter optimization occurring every 20 iterations after the first 50. We set the number of topics to be on the same order as the number of sources, so for SCI-FI we use K &#8712; {125, 250, 375} and for COURTS we use K &#8712; {25, 50, 75}.</p></div>
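<div xmlns="http://www.tei-c.org/ns/1.0"><p>The data preparation steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' pipeline: the function names, the regular expression, the reading of "three or more letter characters" as "at least three letters", and the order of the filtering steps (vocabulary pruning before short-document removal) are all assumptions.</p>
<eg><![CDATA[
```python
import re
from collections import Counter

# Assumed token pattern: runs of letters that may be joined by a single
# internal apostrophe, period, or hyphen (leading/trailing punctuation is dropped).
TOKEN_RE = re.compile(r"[^\W\d_]+(?:['.\-][^\W\d_]+)*")


def tokenize(text):
    """Lower-case the text and keep tokens containing at least three letters."""
    tokens = []
    for match in TOKEN_RE.finditer(text.lower()):
        token = match.group(0)
        if sum(ch.isalpha() for ch in token) >= 3:
            tokens.append(token)
    return tokens


def prepare(docs, max_doc_frac=0.25, min_docs=5, min_tokens=20):
    """Prune the vocabulary by document frequency, then drop short documents.

    docs is a list of raw strings; the result is a list of token lists.
    """
    tokenized = [tokenize(doc) for doc in docs]

    # Document frequency: number of documents in which each word type occurs.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    n_docs = len(tokenized)
    vocab = {
        word
        for word, df in doc_freq.items()
        if min_docs <= df <= max_doc_frac * n_docs
    }

    filtered = [[t for t in tokens if t in vocab] for tokens in tokenized]
    # Assumption: document length is measured after vocabulary pruning.
    return [tokens for tokens in filtered if len(tokens) >= min_tokens]
```
]]></eg>
<p>The default thresholds mirror the values stated in the text (25% of documents, five documents, 20 tokens); they are exposed as parameters because suitable values are likely to differ across collections.</p></div>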
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Evaluating Topic-Author Correlation</head><p>We introduce three ways to measure the source-specificity of topics. For concreteness we will use the terms "source" and "author" interchangeably, but a document's source could be any categorical variable. We want to identify topics that are used by relatively few authors, and more specifically topics whose "meaning" is unduly influenced by the contributions of relatively few authors.</p><p>Given a collection of D documents written by A authors such that each document d is written by a single author a, we train an LDA topic model with K topics. Then for each word token i in document d we have both a word type <formula notation="tex">w_{di}</formula> and a posterior distribution over its token-level topic assignment <formula notation="tex">z_{di}</formula>. For clarity of presentation we can assume a single topic assignment for each token and view the corpus as a data table with three columns: word type w, topic z, and author a. By summing over rows of this table we can define marginal count variables for authors N(a) and topics N(k) as well as joint count variables for the count of a word in a topic N(w, k), a topic in an author N(k, a), and a word in a topic in an author N(w, k, a). A maximum likelihood estimate of the probability of word w given topic k is <formula notation="tex">\Pr(w \mid k) = N(w, k) / N(k)</formula>. <ref type="foot">4</ref> We note that these statistics must be defined at the token level. As in <ref type="bibr">Mimno and Blei (2011)</ref> we are looking for violations of the assumption that Pr(w | k) = Pr(w | d, k). Gibbs sampling algorithms typically preserve token-level information in the form of sampling states, but EM-based algorithms often preserve only document-topic distributions <formula notation="tex">\theta_d</formula> and topic-word distributions <formula notation="tex">\phi_k</formula>. We can estimate the posterior distribution over topic assignments for each token in document d with word type w as <formula notation="tex">\Pr(z = k \mid w, d) \propto \theta_{dk} \, \phi_{kw}</formula>, and generate sparse representations by sampling from this distribution.</p><p>Author Entropy. 
We begin by measuring a topic's author diversity, that is, how evenly its tokens are spread across authors, using the conditional entropy of authors given a topic (Eq. 1). Topics whose tokens are largely concentrated within a few authors will have low entropy, while topics more evenly spread across many authors will have high entropy. With asymmetric hyperparameter optimization we find that the most frequent topics (large <formula notation="tex">\alpha_k</formula>) have high author entropy, but topics with high author entropy can have a wide range of frequencies: topics can be both rare and well-distributed.</p><p>While author entropy provides a general sense of author diversity, it does not take into account the expression of topics by authors. Content-based evaluation is especially important because many collections are not well balanced across authors. The fact that a topic is not balanced across authors does not necessarily imply that it is problematic. A novel about the voyages of a ship captain may contain a large proportion of words about sea travel and ships, while a novel that contains one minor character who is a ship captain may contain a small proportion of the same language, used in the same way. We therefore need to be able to distinguish two cases: first, a topic that is consistent across authors but that is used at different rates by different authors, and second, a topic that is not only used at different rates but has different contents across authors. 
In the first case we can accurately use a topic to "stand for" a particular concept of interest, while in the second case we would get a false impression of the contents of documents, because the expression of the topic in the minority authors differs from the topic as a whole.</p><p>To differentiate expected author imbalance from pathological cases, we calculate Jensen-Shannon divergence between a topic's word distribution as estimated from the full collection Pr(w|k) and two distributions that have been transformed to reduce the influence of the most prominent authors. If the topic has low author correlation then there will be little divergence between the original distribution and its transformation. This method mimics a technique for identifying "junk" topics by <ref type="bibr">AlSumait et al. (2009)</ref>.</p><p>Minus Major Author. The first transformed distribution M (Eq. 2) recalculates the probability of words based on all documents except those written by the majority author. If a topic is consistent across authors then the presence or absence of its largest author contribution (labeled a major ) should have little effect on the topic's word distribution. The larger the resulting divergence, the more influence the major author has over the topic. Unlike author entropy, this technique does not inherently favor balanced distributions of authors; a very author-imbalanced (low entropy) topic can still have a low minus major author divergence if the dominating author's contribution agrees with the remaining topic tokens.</p><p>Balanced Authors. The second transformed distribution B (Eq. 3) treats the contribution of each author equally, no matter how many words in that topic the author produces. The minus-major metric is most sensitive to the case where a single author dominates a topic, but does not handle the case where a small group of authors dominates. Using the balanced transformation we measure the similarity of each author contribution. 
The larger the resulting divergence between the original and transformed word distributions, the larger the variance in contributing author token usage.</p><p>We check the validity of our metrics by evaluating topic models trained on SCI-FI for a wide range of topic sizes (125-1000). As seen in Figure 1, all three measures produce bimodal distributions for all topic sizes, combining highly author-specific topics and more general cross-cutting ones. The proportion of cross-cutting topics remains fairly constant across topic sizes: for all of these models, over 50% of topics fall in the source-specific range. We emphasize that source-specific topics are not necessarily "bad". If the structure of the corpus were not known, these topics would provide a highly useful and coherent insight into that structure. But if, as is typical, the structure is known, more than half of the statistical capacity of these models is wasted learning distributions that simply reiterate known structure, regardless of the number of topics.</p><p>While all three measurements produce similarly shaped distributions, they do not always agree in detail. Table <ref type="table">2</ref> shows example topics that provide intuition for these differences. At the extremes, Topic A is a general, cross-cutting topic while Topic G is dramatically author-specific. While Topics A and B both score well on all three metrics, in Topic B the word paul seems out of place, but it is common enough in several authors that its word-level author entropy is not low. Topics E and G both score poorly in all three metrics, and both are highly specific to single authors (Isaac Asimov and Anne McCaffrey). But while G is clearly and exclusively names and settings, E contains the common terms robot, robots, and human, and could be confused for a general topic on artificial intelligence. The metrics are also enlightening when they disagree. 
Topic C has high author entropy, but only because it mixes highly author-specific words from several different authors. Since each author's contribution differs from the others it scores poorly on the two content-based metrics. Topic D is partially about Mars, but also contains author-specific character names from stories set on Mars. No single author dominates, but the contributions of each author look different. Topic F is so highly correlated with Ray Bradbury that its entropy is low and it looks different when his contribution is removed, but its words are sufficiently general that Bradbury's use of the topic is close to the other authors' (minimal) use.</p></div>
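<div xmlns="http://www.tei-c.org/ns/1.0"><p>The three metrics can be sketched from the token-level table of (word, topic, author) rows described at the start of this section. Eqs. 1-3 are not reproduced in this text, so the forms used below are assumptions inferred from the prose: author entropy as the entropy of N(k, a) / N(k), the minus-major distribution as (N(w, k) - N(w, k, a_major)) / (N(k) - N(k, a_major)), and the balanced distribution as the unweighted mean over contributing authors of N(w, k, a) / N(k, a). The function names and the choice of base-2 logarithms are likewise illustrative.</p>
<eg><![CDATA[
```python
import math
from collections import Counter, defaultdict


def normalize(counts):
    """Turn a mapping of non-negative counts into a probability distribution."""
    total = sum(counts.values())
    return {key: value / total for key, value in counts.items()}


def entropy(dist):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits between two sparse distributions."""
    support = set(p) | set(q)
    mix = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in support}

    def kl(a):
        return sum(a[w] * math.log2(a[w] / mix[w]) for w in a if a[w] > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)


def source_specificity(rows, k):
    """Compute the three metrics for topic k from (word, topic, author) rows."""
    by_author = defaultdict(Counter)  # author -> N(w, k, a)
    for word, topic, author in rows:
        if topic == k:
            by_author[author][word] += 1

    author_totals = Counter({a: sum(c.values()) for a, c in by_author.items()})
    full = Counter()  # N(w, k)
    for counts in by_author.values():
        full.update(counts)
    p_full = normalize(full)

    # Author entropy: how evenly the topic's tokens are spread over authors.
    author_entropy = entropy(normalize(author_totals))

    # Minus major author: drop the largest contributor and re-estimate.
    major = author_totals.most_common(1)[0][0]
    remainder = full - by_author[major]
    # Assumption: undefined (None) when a single author owns every token.
    minus_major = js_divergence(p_full, normalize(remainder)) if remainder else None

    # Balanced authors: every contributing author gets equal weight.
    balanced = Counter()
    for counts in by_author.values():
        for word, prob in normalize(counts).items():
            balanced[word] += prob / len(by_author)

    return {
        "author_entropy": author_entropy,
        "minus_major_js": minus_major,
        "balanced_js": js_divergence(p_full, dict(balanced)),
    }
```
]]></eg>
<p>Under these assumed definitions, a topic whose authors all use the same words at the same rates yields zero divergence on both content-based metrics, whatever its author entropy, which matches the distinction the section draws between imbalance of use and difference of content.</p></div>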
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Contextual Probabilistic Subsampling</head><p>In this section we present interventions that predict the effect of words and contexts, and modify an input corpus to reduce the number of overly author-specific topics in resulting models. We hypothesize that this problem is due to burstiness <ref type="bibr">(Doyle and Elkan, 2009)</ref>: words that are globally rare, but locally frequent. Dampening the author-specificity of individual word types may reduce their connection to document sources. We therefore evaluate context-specific subsampling prior to modeling, with parameters defined based on tail probabilities of word-specific parametric models.</p><p>In selecting this particular approach we follow three design principles that we believe maximize use in actual practice. First, we want interventions to be minimal and have the least possible disruption to current work processes. We therefore choose to focus on meta-algorithms for data preparation that are compatible with but independent from existing, widely implemented inference algorithms. Second, we want any user-specified parameter choices to be simple and intuitive. Although we find that entropy is a useful diagnostic metric, information theoretic metrics such as mutual information are difficult for non-experts to interpret correctly, and critical values can differ widely across collections and dimensionalities. Third, we want both the choice of interventions and the effects of interventions to be transparent to users. We initially considered methods such as adversarially trained autoencoders, but we find that directly subsampling words is much faster, simpler, and easier to explain.</p><p>Identifying Author Specific Terms. The simplest way to find author-specific terms is to find terms unique to an author. 
The SCI-FI collection contains an unusual number of author-specific coinages, but words used by many authors can still be highly correlated with a particular author. We therefore estimate parametric distributions for each term and compare author-specific term proportions to this <gap reason="missing"/></p><p><gap reason="missing"/> closest match is a general religion topic (god gods religion world religious ancient temple people faith these). In fact, the term witch never appears as a top-20 term for any topic within the 250-topic NONE models. These topics may appear for NONE when we increase the topic size to K = 1000, but at the cost of a much larger model and with no guarantee against intruding character names.</p><p>Subsampling produces cross-cutting topics. While our topics score well quantitatively, how humanly interpretable and useful are the resulting topics? Are they actually cross-cutting in nature? We address these questions by more closely examining topics generated by the CP-05 subsampling treatment. We can explore the collection by sorting authors and individual novels within topics.</p><p>The highest frequency topics from the NONE treatment are largely preserved by CP-05. These topics by their nature are very cross-cutting and filled with frequent, general words. Despite this extreme generality they can provide a way to analyze passages representing high-level discourse concepts such as inquiry (why asked ask answer question want questions should does because) and the description of events and time (during such most these course because happened effect period result).</p><p>The mid-frequency topics are more concrete in nature. We find a topic describing empire, politics, and history (empire world power people war new government history political under) which is associated with Doris Lessing's Canopus in Argos series, Isaac Asimov's Foundation series, and Kim Stanley Robinson's The Years of Rice and Salt. 
In line with the science fiction genre, these novels focus on expansive future and alternative histories. We also find a topic on language (language words english speak word understand spoke speech languages talk). The most prominent authors in the topic (Robert A. Heinlein, Robert Silverberg, and Poul Anderson) are among the five most prolific authors in SCI-FI, which suggests the generality of the topic. Notably, the most prominent volumes are by none of these authors: Babel-17 by Samuel R. Delany, Native Tongue by Suzette Haden Elgin, and Changing Planes by Ursula K. Le Guin. All three treat the social and political dimensions of language as a major plot point. These three works are fundamentally tied to language, confirming that this topic embodies a cross-cutting linguistic theme.</p><p>Looking more closely at the lower frequency robots topic (machine robot machines robots human mechanical metal brain men built), we find that it is both topically cohesive and cross-cutting. The five most-represented authors all have works heavily related to artificial intelligence: Isaac Asimov, Robert Silverberg, Stanis&#322;aw Lem, Clifford D. Simak, and Philip K. Dick. The most-represented volumes tell a similar story, with Men and Machines by Robert Silverberg, The Complete Robot by Isaac Asimov, and The Humanoids by Jack Williamson holding the top three ranks. Reassuringly, there are well-represented novels by less-represented authors, such as The Starchild Trilogy by Frederik Pohl and Jack Williamson. The low frequency of this topic is surprising given the number of robot-related novels in the collection, especially works by Isaac Asimov. This discrepancy revealed that an Asimov-specific topic (human being law might must such without may robot beings) has persisted. Many authors receive a non-negligible token representation, but Asimov's token count is still a factor of ten larger than that of the second most prominent author (Robert A. Heinlein).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7">Conclusion</head><p>We present a formal definition of the problem of overly source-specific topics, three evaluation metrics to measure the degree of source-specificity, and a simple text curation meta-algorithm that dramatically reduces the number of source-specific topics. This approach has immediate practical application for the many collections that combine multiple distinct sources, but it also has important theoretical implications.</p><p>We view this work as a preliminary step towards predictive theories of latent semantics, beyond purely descriptive models. Despite ample practical evidence that interventions such as stoplist curation can have significant effects, most previous work has focused on algorithms for identifying a single "optimal" low-dimensional semantic representation. Our results indicate that there are potentially many interventions in text collections that each have distinct but predictable effects on the results of algorithms. Just as biologists use multiple stains to view different aspects of microorganisms using the same microscope, users of text mining algorithms should be able to choose multiple distinct text treatments, each with its own predictable effects, to meet distinct user needs.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0"><p>Code and data are available at https://github.com/laurejt/authorless-tms.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1"><p>https://www.worldswithoutend.com/lists.asp</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2"><p>https://www.courtlistener.com</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_3"><p>We do not use Dirichlet smoothing for the purposes of this work for simplicity and to make more reliable comparisons across varying vocabulary sizes. Results using smoothing are similar.</p></note>
		</body>
		</text>
</TEI>
