<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Improving Neural Topic Models using Knowledge Distillation</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>November 2020</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10225167</idno>
					<idno type="doi">10.18653/v1/2020.emnlp-main.137</idno>
					<title level='j'>Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)</title>

					<author>Alexander Miserlis Hoyle</author><author>Pranav Goel</author><author>Philip Resnik</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Abstract: Topic models are often used to identify human-interpretable topics to help make sense of large document collections. We use knowledge distillation to combine the best attributes of probabilistic topic models and pretrained transformers. Our modular method can be straightforwardly applied with any neural topic model to improve topic quality, which we demonstrate using two models having disparate architectures, obtaining state-of-the-art topic coherence. We show that our adaptable framework not only improves performance in the aggregate over all estimated topics, as is commonly reported, but also in head-to-head comparisons of aligned topics.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>of data, then using its high-quality probability estimates over outputs to guide a smaller student model. Since the information contained in these estimates is useful (a picture of an ox will yield higher label probabilities for BUFFALO than APRICOT), the student needs less data to train and can generalize better.</p><p>We show how this principle can apply equally well to improve unsupervised topic modeling, which to our knowledge has not previously been attempted. While distillation usually involves two models of the same type, it can also apply to models of differing architectures. Our method is conceptually quite straightforward: we fine-tune a pretrained transformer <ref type="bibr">(Sanh et al., 2019)</ref> on a document reconstruction objective, where it acts in the capacity of an autoencoder. When a document is passed through this BERT autoencoder, it generates a distribution over words that includes unobserved but related terms. We then incorporate this distilled document representation into the loss function for topic model estimation <ref type="figure">(see Figure 1)</ref>. To connect this method to the more standard supervised knowledge distillation, observe that the unsupervised "task" for both an autoencoder and a topic model is the reconstruction of the original document, i.e. prediction of a distribution over the vocabulary. The BERT autoencoder, as "teacher", provides a dense prediction that is richly informed by training on a large corpus. The topic model, as "student", generates its own prediction of that distribution. 
We use the former to guide the latter, essentially as if predicting word distributions were a multi-class labeling problem.<ref type="foot">foot_0</ref> Our approach, which we call BERT-based Autoencoder as Teacher (BAT), obtains best-in-class results on the most commonly used measure of topic coherence, normalized pointwise mutual information (NPMI, <ref type="bibr">Aletras and Stevenson, 2013)</ref>, compared against recent state-of-the-art models that serve as our baselines.</p><p>In order to accomplish this, we adopt neural topic models (NTM, <ref type="bibr">Miao et al., 2016;</ref><ref type="bibr">Srivastava and Sutton, 2017;</ref><ref type="bibr">Card et al., 2018;</ref><ref type="bibr">Burkhardt and Kramer, 2019;</ref><ref type="bibr">Nan et al., 2019, inter alia)</ref>, which use various forms of black-box distribution matching <ref type="bibr">(Kingma and Welling, 2014;</ref><ref type="bibr">Tolstikhin et al., 2018)</ref>.<ref type="foot">2</ref> These now surpass traditional methods (e.g. <ref type="bibr">LDA, Blei, 2003, and variants)</ref> in topic coherence. In addition, it is easier to modify the generative model of a neural topic model than that of a classic probabilistic latent-variable model, where changes generally require investing effort in new variational inference procedures or samplers. In fact, because we leave the base NTM unmodified, our approach is flexible enough to easily accommodate any neural topic model, so long as it includes a word-level document reconstruction objective. 
We support this claim by demonstrating improvements on models based on both Variational <ref type="bibr">(Card et al., 2018)</ref> and Wasserstein <ref type="bibr">(Nan et al., 2019)</ref> auto-encoders.</p><p>To summarize our contributions:</p><p>&#8226; We introduce a novel coupling of the knowledge distillation technique with generative graphical models.</p><p>&#8226; We construct knowledge-distilled neural topic models that achieve better topic coherence than their counterparts without distillation on three standard English-language topicmodeling datasets.</p><p>&#8226; We demonstrate that our method is not only effective but modular, by improving topic coherence in a base state-of-the-art model by modifying only a few lines of code. <ref type="foot">3</ref>&#8226; In addition to showing overall improvement across topics, our method preserves the topic analysis of the base model and improves coherence on a topic-by-topic basis.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Methodology</head><head n="2.1">Background on topic models</head><p>Topic modeling is a well-established probabilistic method that aims to summarize large document corpora using a much smaller number of latent topics. The most prominent instantiation, LDA <ref type="bibr">(Blei, 2003)</ref>, treats each document as a mixture over K latent topics, θ_d, where each topic is a distribution over words β_k. By presenting topics as ranked word lists and documents in terms of their probable topics, topic models can provide legible and concise representations of both the entire corpus and individual documents.</p><p>In classical topic models like LDA, distributions over the latent variables are estimated with approximate inference algorithms tailored to the generative process. Changes to the model specification (for instance, the inclusion of a supervised label) require attendant changes in the inference method, which can prove onerous to derive. For some probabilistic models, this problem may be circumvented by the variational auto-encoder (VAE, <ref type="bibr">Kingma and Welling, 2014)</ref>, which introduces a recognition model that approximates the posterior with a neural network. As a result, neural topic models have capitalized on the VAE framework <ref type="bibr">(Srivastava and Sutton, 2017;</ref><ref type="bibr">Card et al., 2018;</ref><ref type="bibr">Burkhardt and Kramer, 2019, inter alia)</ref> and other deep generative models <ref type="bibr">(Wang et al., 2019;</ref><ref type="bibr">Nan et al., 2019)</ref>. In addition to their flexibility, the best models now yield more coherent topics than LDA.</p><p>Although our method (Section 2.3) is agnostic as to the choice of neural topic model, we borrow from <ref type="bibr">Card et al. (2018)</ref> for both formal exposition and our base implementation (Section 3). <ref type="bibr">Card et al. 
(2018)</ref> develop SCHOLAR, a generalization of the first successful VAE-based neural topic model (PRODLDA, <ref type="bibr">Srivastava and Sutton, 2017)</ref>. The generative story is broadly similar to that of LDA, although the uniform Dirichlet prior is replaced with a logistic normal (LN):<ref type="foot">foot_3</ref> </p><p>For each document d:</p><p>• Draw topic distribution <formula>\theta_d \sim \mathcal{LN}(\alpha_0)</formula></p><p>• For each word w_id in the document, draw <formula>w_{id} \sim \mathrm{Mult}(\eta_d)</formula></p><p>Following PRODLDA, B is a K × V matrix where each row corresponds to the kth topic-word probabilities in log-frequency space. The multinomial distribution over a document's words is parameterized by</p><p><formula n="1">\eta_d = \sigma\big(B^\top \theta_d + m\big)</formula></p><p>where m is a vector of fixed empirical background word frequencies and σ(·) is the softmax function.</p><p>We highlight that each document is treated as a bag of words, w_d^BOW. To perform inference on the model, VAE-based models like SCHOLAR approximate the true intractable posterior p(θ_d | ·) with a neural encoder network g(w_d) that parameterizes the variational distribution q(θ_d | g(·)) (here, a logistic normal with diagonal covariance). The Evidence Lower BOund (ELBO) is therefore</p><p><formula n="3">\mathcal{L}(w_d) = \mathbb{E}_{q(\theta_d \mid w_d)}\Big[\textstyle\sum_i \log \eta_{d, w_{id}}\Big] - \mathrm{KL}\big(q(\theta_d \mid g(w_d)) \,\|\, p(\theta_d \mid \alpha_0)\big)</formula></p><p>whose first term is the negative reconstruction error, −L_R, and which is optimized with stochastic gradient descent. The form of the reconstruction error L_R is a consequence of the independent multinomial draws.</p></div>
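The decoder side of this generative story reduces to a few lines of linear algebra: a softmax over a linear combination of topic-word weights, plus a multinomial log-likelihood. A minimal NumPy sketch (the function names and toy dimensions are ours, not the paper's):

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax.
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def word_distribution(theta, B, m):
    # eta_d = softmax(B^T theta_d + m): theta (K,), B (K, V), m (V,).
    return softmax(B.T @ theta + m)

def reconstruction_loss(w_bow, eta):
    # L_R: negative log-likelihood of the bag of words under
    # independent multinomial draws from eta.
    return -(w_bow * np.log(eta + 1e-12)).sum()

rng = np.random.default_rng(0)
K, V = 2, 4
theta = np.array([0.7, 0.3])   # document-topic proportions (simplex point)
B = rng.normal(size=(K, V))    # topic-word weights in log-frequency space
m = np.zeros(V)                # background log word frequencies
eta = word_distribution(theta, B, m)
loss = reconstruction_loss(np.array([3.0, 1.0, 0.0, 2.0]), eta)
```

In the actual model, B, m, and the encoder producing theta are learned jointly; here they are fixed toy values to show the shapes involved.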
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Background on knowledge distillation</head><p>It is instructive to think of Eq. ( <ref type="formula">1</ref>) as a latent logistic regression, intended to approximate the distribution over words in a document. Under this lens, the neural topic model outlined above can be cast as a multi-label classification problem. Indeed, it accords with the standard structure: there is a softmax over logits estimated by a neural network, coupled with a cross-entropy loss. However, because w BOW d is a sparse bag of words, the model is limited in its ability to generalize. During backpropagation (Eq. ( <ref type="formula">3</ref>)), the topic parameters will only update to account for observed terms, which can lead to overfitting and topics with suboptimal coherence.</p><p>In contrast, dense document representations can capture rich information that bag-of-words representations cannot.</p><p>These observations motivate our use of knowledge distillation (KD, <ref type="bibr">Hinton et al., 2015)</ref>. The authors argue that the knowledge learned by a large "cumbersome" classifier on extensive data-e.g., a deep neural network or an ensemble-is expressed in its probability estimates over classes, and not just contained in its parameters. Hence, these teacher estimates for an input may be repurposed as soft labels to train a smaller student model. In practice, the loss against the true labels is linearly interpolated with a loss against the teacher probabilities, Eq. (4). We discuss alternative ways to integrate outside information in Section 6.</p></div>
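The interpolated objective described above can be sketched as follows; the λ interpolation weight, temperature T, and the T² factor that keeps soft-target gradients on a comparable scale follow Hinton et al. (2015), while the function names are our own:

```python
import numpy as np

def softmax(x, T=1.0):
    # Temperature-scaled, numerically stable softmax.
    z = (x - x.max()) / T
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(target, logits, T=1.0):
    # -sum_v target_v * log softmax(logits / T)_v
    return -(target * np.log(softmax(logits, T) + 1e-12)).sum()

def distillation_loss(w_bow, student_logits, teacher_logits, lam=0.5, T=2.0):
    # Linear interpolation of the loss against the observed words with the
    # loss against the teacher's softened probabilities.
    hard = cross_entropy(w_bow / w_bow.sum(), student_logits)
    soft = cross_entropy(softmax(teacher_logits, T), student_logits, T)
    return (1.0 - lam) * hard + lam * T**2 * soft
```

With lam = 0 this reduces to the ordinary reconstruction loss; with lam = 1 the student trains purely against the teacher's soft labels.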
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3">Combining neural topic modeling with knowledge distillation</head><p>The knowledge distillation objective. To apply KD to a "base" neural topic model, we replace the reconstruction term L_R in Eq. ( <ref type="formula">3</ref>) with L_KD, as follows:</p><p><formula n="4">\mathcal{L}_{KD} = (1-\lambda)\,\mathcal{L}_R + \lambda\, T^2 \big( -\,(w^{BAT}_d)^\top \log f(\theta_d; T) \big), \qquad w^{BAT}_d = N_d\, \sigma\big(z^{BAT}_d / T\big)</formula></p><p>Here, z^BAT_d are the logits produced by the teacher network for a given input document d, meaning that w^BAT_d acts as a smoothed pseudo-document. T is the softmax temperature, which controls how diffuse the estimated probability mass is over the words (hence f(·; T) is Eq. ( <ref type="formula">1</ref>) with the corresponding scaling). This differs from the original KD in two ways: (a) it scales the estimated probabilities by the document length N_d, and (b) it uses a multi-label loss.</p><p>The teacher model. We generate the teacher logits z^BAT using the pretrained transformer DISTILBERT <ref type="bibr">(Sanh et al., 2019)</ref>, itself a distilled version of BERT <ref type="bibr">(Devlin et al., 2019)</ref>.<ref type="foot">foot_4</ref> BERT-like models are generally pretrained on large domain-general corpora with a language-modeling-like objective, yielding an ability to capture nuances of linguistic context more effectively than bag-of-words models <ref type="bibr">(Clark et al., 2019;</ref><ref type="bibr">Liu et al., 2019;</ref><ref type="bibr">Rogers et al., 2020)</ref>. Mirroring the NTM's formulation as a variational auto-encoder, we treat DISTILBERT as a deterministic auto-encoder, fine-tuning it with the document-reconstruction objective L_R on the same dataset. Thus, we use a BERT-based Autoencoder as our Teacher model, hence BAT.<ref type="foot">6</ref></p><p>Clipping the logit distribution. Depending on preprocessing, V may number in the tens of thousands of words. This leads to a long tail of probability mass assigned to unlikely terms, and breaks standard assumptions of sparsity. <ref type="bibr">Tang et al. 
(2020)</ref>, working in a classification setting, find that truncating the logits to the top-n classes and assigning uniform mass to the rest improves accuracy. We instead choose the top c · N_d logits, c ∈ R+, and assign zero probability to the remaining elements to enforce sparsity.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Experimental Setup</head></div>
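The clipping step from Section 2.3 can be sketched as below. The text specifies only that the surviving top c · N_d elements keep their mass and the rest get zero probability; renormalizing the softmax over the survivors is our illustrative reading of that:

```python
import numpy as np

def clip_teacher_probs(logits, doc_length, c=2.0):
    # Keep the top ceil(c * N_d) teacher logits; give zero probability
    # to all remaining vocabulary entries, then renormalize.
    n_keep = min(len(logits), int(np.ceil(c * doc_length)))
    top = np.argsort(logits)[-n_keep:]          # indices of the largest logits
    e = np.exp(logits[top] - logits[top].max()) # softmax over survivors only
    probs = np.zeros(len(logits))
    probs[top] = e / e.sum()
    return probs

# Toy vocabulary of 10 words, a 2-token document, c = 1.5:
p = clip_teacher_probs(np.arange(10.0), doc_length=2, c=1.5)
```

Only ceil(1.5 · 2) = 3 entries of `p` are nonzero, and they sum to one.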
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Data and Metrics</head><p>We validate our approach using three readily available datasets that vary widely in domain, corpus and vocabulary size, and document length: 20 Newsgroups (20NG, <ref type="bibr">Lang, 1995)</ref>,<ref type="foot">7</ref> Wikitext-103 (Wiki),<ref type="foot">8</ref> and IMDb movie reviews (IMDb).<ref type="foot">9</ref> We seek to discover a latent space of topics that is meaningful and useful to people <ref type="bibr">(Chang et al., 2009)</ref>. Accordingly, we evaluate topic coherence using normalized pointwise mutual information (NPMI), which is significantly correlated with human judgments of topic quality <ref type="bibr">(Aletras and Stevenson, 2013;</ref><ref type="bibr">Lau et al., 2014)</ref> and widely used to evaluate topic models.<ref type="foot">11</ref> We follow precedent and calculate (internal) NPMI using the top ten words in each topic, taking the mean across the NPMI scores for individual topics. Internal NPMI is estimated with reference co-occurrence counts from a held-out dataset from the same corpus, i.e., the dev or test split. While internal NPMI is the metric of choice for most prior work, we also provide external NPMI results using Gigaword 5 <ref type="bibr">(Parker et al., 2011)</ref>, following <ref type="bibr">Card et al. (2018)</ref>.</p><p><ref type="foot">7</ref> qwone.com/~jason/20Newsgroups <ref type="foot">8</ref> s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-v1.zip <ref type="foot">9</ref> ai.stanford.edu/~amaas/data/sentiment</p></div>
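NPMI over a topic's top words can be computed directly from document-level co-occurrence counts in a reference corpus. A self-contained sketch (our own implementation, not the authors' evaluation code):

```python
import math
from itertools import combinations

def topic_npmi(top_words, reference_docs, eps=1e-12):
    # Mean NPMI over word pairs in one topic's top words, using
    # document-level co-occurrence probabilities from a reference corpus.
    # NPMI(i, j) = log(p(i,j) / (p(i) p(j))) / -log p(i,j), in [-1, 1].
    n = len(reference_docs)
    docs = [set(d) for d in reference_docs]
    scores = []
    for wi, wj in combinations(top_words, 2):
        pi = sum(wi in d for d in docs) / n
        pj = sum(wj in d for d in docs) / n
        pij = sum(wi in d and wj in d for d in docs) / n
        if pij == 0.0:
            scores.append(-1.0)   # the pair never co-occurs: minimum score
        else:
            scores.append(math.log(pij / (pi * pj + eps)) / -math.log(pij + eps))
    return sum(scores) / len(scores)
```

Words that always co-occur score near 1; words that never co-occur score -1. The paper's evaluation averages this topic-level score over all K topics, using each topic's top ten words.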
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Experimental Baselines</head><p>We select three experimental baseline models that represent diverse styles of neural topic modeling.<ref type="foot">12</ref> Each achieves the highest NPMI on the majority of its respective datasets, as well as a considerable improvement over previous neural and non-neural topic models (such as <ref type="bibr">Srivastava and Sutton, 2017;</ref><ref type="bibr">Miao et al., 2016;</ref><ref type="bibr">Ding et al., 2018)</ref>. All our baselines are roughly contemporaneous with one another, and had yet to be compared in a head-to-head fashion prior to our work.</p><p>SCHOLAR.</p><p>W-LDA. This model builds on the Wasserstein auto-encoder <ref type="bibr">(Tolstikhin et al., 2018)</ref>, using a Dirichlet prior that is matched by minimizing Maximum Mean Discrepancy. They find the method leads to state-of-the-art coherence on several datasets and encourages topics to exhibit greater word diversity.</p><p>We demonstrate the modularity of our core innovation by combining our method with both SCHOLAR and W-LDA (Section 4).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Our Models and Settings</head><p>As discussed in Section 2.3, our approach relies on a "base" neural topic model and unnormalized probabilities over words estimated by a transformer as "teacher". We discuss each in turn.</p><p>Neural topic models augmented with knowledge distillation. We experiment with both SCHOLAR and W-LDA as base models. The former constitutes our primary model and point of comparison with baselines, while the latter is a proof-of-concept that attests to our method's modularity; we added knowledge distillation to W-LDA with only a few lines of code (Appendix F). We evaluate both at K = 50 and K = 200 topics.</p><p>We tune using NPMI, with reference co-occurrence counts taken from a held-out development set from the relevant corpus. For our baselines, we use the publicly released author implementations.<ref type="foot">13</ref> While we generally attempt to retain the original hyperparameter settings when available, we do perform an exhaustive grid search on the SCHOLAR baselines and SCHOLAR+BAT to ensure fairness in comparison (ranges, optimal values, and other details in Appendix E.1).</p><p>Our method also introduces additional hyperparameters: the weight for the KD loss, λ (Eq. ( <ref type="formula">4</ref>)); the softmax temperature T; and the proportion of the word-level teacher logits that we retain (relative to document length, see clipping in Section 2.3). For most dataset-K pairs, we find that we can improve topic quality under most settings, with a relatively small set of values for each hyperparameter leading to better results. In fact, following the extensive search on SCHOLAR+BAT, we found we could tune W-LDA within a few iterations.</p><p>Topic models rely on random sampling procedures, and to ensure that our results are robust, we report the average values across five runs (previously unreported by the authors of our baselines).</p><p>The DISTILBERT teacher. 
We fine-tune a modified version of DISTILBERT with the same document reconstruction objective as the NTM (L_R, Eq. ( <ref type="formula">3</ref>)) on the training data. Specifically, DISTILBERT maps a WordPiece-tokenized <ref type="bibr">(Wu et al., 2016)</ref> document d to an l-dimensional hidden vector with a transformer <ref type="bibr">(Vaswani et al., 2017)</ref>, then back to logits over V words (tokenized with the same scheme as the topic model). For long documents, we split into blocks of 512 tokens and mean-pool the transformer outputs. We use the pretrained model made available by the authors <ref type="bibr">(Wolf et al., 2019)</ref>. We train until perplexity converges on the same held-out dev set used in the topic modeling setting. Unsurprisingly, DISTILBERT achieves dramatically lower perplexity than all topic model baselines. Note that we need only train the model once per corpus, and can experiment with different NTM variations using the same z^BAT.</p><p><ref type="table">Table 2:</ref> The NPMI for our baselines (Section 3.2) compared with BAT (explained in Section 2.3) using SCHOLAR as our base neural architecture. We achieve better NPMI than all baselines across three datasets and K = 50, K = 200 topics. We use 5 random restarts and report the standard deviation.</p></div>
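The block-splitting and mean-pooling step for long documents can be sketched as follows; `encode_block` is a stand-in for the DistilBERT forward pass (any function mapping a block of at most 512 token ids to an l-dimensional vector):

```python
import numpy as np

def encode_long_document(token_ids, encode_block, block_size=512):
    # Split a tokenized document into consecutive blocks of `block_size`
    # tokens, encode each block, and mean-pool the resulting vectors.
    blocks = [token_ids[i:i + block_size]
              for i in range(0, len(token_ids), block_size)]
    return np.stack([encode_block(b) for b in blocks]).mean(axis=0)
```

For a 1034-token document this yields blocks of 512, 512, and 10 tokens, whose encodings are averaged into a single document vector.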
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Results and Discussion</head><p>Using the VAE-based SCHOLAR as the base model, topics discovered using BAT are more coherent, as measured via NPMI, than previous state-of-the-art baseline NTMs (Table <ref type="table">2</ref>), improving on the DVAE and W-LDA baselines, and the baseline of SCHOLAR without the KD augmentation. We establish the robustness of our approach's improvement by taking the mean across multiple runs with different random seeds, yielding consistent improvement over all baselines for all the datasets. We validate the approach using a smaller and larger number of topics, K = 50 and 200, respectively. In addition to its improved performance, BAT can apply straightforwardly to other models, because it makes very few assumptions about the base model, requiring only that it rely on a word-level reconstruction objective, which is true of the majority of neural topic models proposed to date. We illustrate this by using the Wasserstein auto-encoder (W-LDA) as a base NTM, showing in Table <ref type="table">3</ref> that BAT improves on the unaugmented model.<ref type="foot">14</ref> We report the dev set results (corresponding to the test set results in Tables <ref type="table">2</ref> and <ref type="table">3</ref>) in Appendix A; the same pattern of results is obtained for all the models.</p><p><ref type="foot">14</ref> We note that the W-LDA baseline did not tune well on 200 topics, further complicated by the model's extensive run time. As such, we focus on augmenting that model for 50 topics, consistent with the number of topics on which <ref type="bibr">Nan et al. (2019)</ref> report their results. We add preliminary results using BAT with DVAE in Appendix C.</p><p>Finally, we also compute NPMI using reference counts from an external corpus (Gigaword 5, <ref type="bibr">Parker et al., 2011)</ref> for SCHOLAR and SCHOLAR+BAT (Table <ref type="table">4</ref>). 
We find the same patterns generally hold: in all but one setting (Wiki, K = 50), BAT improves topic coherence relative to SCHOLAR. These external NPMI results suggest that our model avails itself of the distilled general language knowledge from pretrained BERT, and moreover that our fine-tuning procedure does not overfit to the training data.   </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Impact of BAT on Individual Topics</head><p>Following standard practice, we have established that our models discover more coherent topics on average when compared to others (Tables <ref type="table">2</ref> and <ref type="table">3</ref>). A limitation of these approaches is that they simply import general, non-corpus-specific word-level information. In contrast, representations from a pretrained transformer can benefit from both general language knowledge and corpus-dependent information, by way of the pretraining and fine-tuning regime. By regularizing toward representations conditioned on the document, the learned topics remain coherent relative to the topic model data. An additional key advantage of our method is that it involves only a slight change to the underlying topic model, rather than the specialized designs of the above methods.</p><p>Knowledge distillation. While the focus was originally on single-label image classification, KD has also been extended to the multi-label setting <ref type="bibr">(Liu et al., 2018b)</ref>. In NLP, KD has usually been applied in supervised settings <ref type="bibr">(Kim and Rush, 2016;</ref><ref type="bibr">Huang et al., 2018;</ref><ref type="bibr">Yang et al., 2020)</ref>, but also in some unsupervised tasks (usually using an unsupervised teacher for a supervised student) <ref type="bibr">(Hu et al., 2020;</ref><ref type="bibr">Sun et al., 2020)</ref>. <ref type="bibr">Xu et al. (2018)</ref> use word embeddings jointly learned with a topic model in a procedure they term distillation, but do not follow the method from <ref type="bibr">Hinton et al. (2015)</ref> that we employ (instead opting for joint learning). Recently, pretrained models like BERT have offered an attractive choice of teacher model, used successfully for a variety of tasks such as sentiment classification and paraphrasing <ref type="bibr">(Tang et al., 2019a,b)</ref>. 
Work in distillation often cites a reduction in computational cost as a goal (e.g., <ref type="bibr">Sanh et al., 2019)</ref>, although we are aware of at least one effort that is focused specifically on interpretability <ref type="bibr">(Liu et al., 2018a)</ref>.</p><p>Topic diversity. Coherence, commonly quantified automatically using NPMI, is the current standard for evaluating topic model quality. Recently, several authors <ref type="bibr">(Dieng et al., 2020;</ref><ref type="bibr">Burkhardt and Kramer, 2019;</ref><ref type="bibr">Nan et al., 2019)</ref> have proposed additional metrics focused on the diversity or uniqueness of topics (based on top words in topics). However, no one metric has yet achieved acceptance or consensus in the literature. Moreover, such measures fail to distinguish the case where two topics share the same set of top n words, and therefore come across as essentially identical, from the case where one topic's top n words are repeated individually across multiple other topics, indicating a weaker and more diffuse similarity to those topics. We discuss issues related to topic diversity in Appendix D.2.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7">Conclusions and Future Work</head><p>To our knowledge, we are the first to distill a "blackbox" neural network teacher to guide a probabilistic graphical model. We do this in order to combine the expressivity of probabilistic topic models with the precision of pretrained transformers. Our modular method sits atop any neural topic model (NTM) to improve topic quality, which we demonstrate using two NTMs of highly disparate architectures (VAEs and WAEs), obtaining state-of-the-art topic coherence across three datasets from different domains. Our adaptable framework does not just produce improvements in the aggregate (as is commonly reported): its effect can be interpreted more specifically as identifying the same space of topics generated by an existing model and, in most cases, improving the coherence of individual topics, thus highlighting the modular value of our approach.</p><p>In future work, we also hope to explore the effects of the pretraining corpus <ref type="bibr">(Gururangan et al., 2020)</ref> and teachers (besides BERT) on the generated topics. Another intriguing direction is exploring the connection between our methods and neural network interpretability. The use of knowledge distillation to facilitate interpretability has also been previously explored, for example, in <ref type="bibr">Liu et al. (2018a)</ref> to learn interpretable decision trees from neural networks. In our work, as the weight on the BERT autoencoder logits &#955; goes to one, the topic model begins to describe less the corpus and more the teacher. We believe mining this connection can open up further research avenues; for instance, by investigating the differences in such teacher-topics conditioned on the pre-training corpus. 
Finally, although we are motivated primarily by the widespread use of topic models for identifying interpretable topics <ref type="bibr">(Boyd-Graber et al., 2017, Ch. 3)</ref>, we plan to explore the ideas presented here further in the context of downstream applications like document classification.</p><p><ref type="table">Table 6:</ref> The development-set NPMI for our baselines (Section 3.2) compared with BAT (explained in Section 2.3) using SCHOLAR as our base neural architecture. We achieve better NPMI than all baselines across three datasets and K = 50, K = 200 topics. We use 5 random restarts and report the standard deviation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Appendix A Dev Set Results</head><p>We optimized our models on the dev set, froze the optimal models, and showed the results on the test set in Tables <ref type="table">2</ref> and <ref type="table">3</ref>. We show the corresponding dev set results for those models in Tables <ref type="table">6</ref> and <ref type="table">7</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><table><row role="label"><cell/><cell>20NG</cell><cell>Wiki</cell><cell>IMDb</cell></row><row><cell>W-LDA</cell><cell>0.294 (0.014)</cell><cell>0.500 (0.013)</cell><cell>0.136 (0.009)</cell></row><row><cell>+BAT</cell><cell>0.316 (0.010)</cell><cell>0.511 (0.016)</cell><cell>0.162 (0.003)</cell></row></table><p>Table <ref type="table">7</ref>: The mean development-set NPMI (std. dev.) across 5 runs for W-LDA and W-LDA+BAT for K = 50, showing improvement on all datasets. This demonstrates that our innovation is modular and can be used with base neural topic models that vary in architecture.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B Extrinsic Classification Results</head><p>The primary goal of our method is to improve the coherence of generated topics. It is natural, however, to ask about the impact of our method on downstream applications. We include here a preliminary exploration suggesting that the addition of BAT does not hurt performance in document classification.</p><p>In our setup, we seek to predict document labels y_d from the MAP estimate of a document's topic distribution, θ_d. Specifically, we classify the newsgroup to which a document was posted for the 20 newsgroups data (e.g., talk.politics.misc) and a binary sentiment label for the IMDb review data. We train a random forest classifier using default parameters from scikit-learn <ref type="bibr">(Pedregosa et al., 2011)</ref> and report the accuracies in Table <ref type="table">8</ref> (averaged across 5 runs).</p><p>Much like other work that is aimed at topic coherence rather than downstream use in supervised models <ref type="bibr">(Nan et al., 2019)</ref>, we find that our method has little impact on predictive performance. While it is possible that improvements may be obtained by specifically tuning models for classification, or by integrating BAT into model variations that combine lexical and topic representations (e.g. <ref type="bibr">Nguyen et al., 2013)</ref>, we leave such extensions to future work.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C Using BAT with DVAE</head><p>We further illustrate our method's modularity by applying BAT to our own reimplementation of DVAE <ref type="bibr">(Burkhardt and Kramer, 2019)</ref>.<ref type="foot">16</ref> In contrast to the authors' primary implementation, which estimates the model with rejection sampling variational inference (used in Section 4), we reimplemented DVAE approximating the Dirichlet gradient via pathwise derivatives <ref type="bibr">(Jankowiak and Obermeyer, 2018)</ref>, similar to Burkhardt and Kramer (2019)'s alternative model variant using implicit gradients.</p><p>Our reimplementation shows baseline behavior substantially similar to the authors' implementation. In the course of our experimentation, we noted a degeneracy in this model, in which high NPMI is achieved but at the cost of redundant topics. This failure mode is well-established, but as discussed in Appendix D.2, we find the measures proposed to diagnose topic diversity (including those proposed by <ref type="bibr">Burkhardt and Kramer, 2019;</ref><ref type="bibr">Nan et al., 2019)</ref> to be problematic. Rather than use these metrics, therefore, we took a coarse but simple approach and filtered out any models that yielded more than one pair of identical topics (defined as two topics with the same set of top-10 words), averaged across five runs. This filtering eliminated many hyperparameter settings, leading us to believe that DVAE is not robust to this problem.</p><p>Ultimately, we find that applying BAT to DVAE does not hurt, and also does not help appreciably (Table <ref type="table">9</ref>). In addition, when applying the above filtering criterion to our main SCHOLAR and SCHOLAR+BAT models, we still obtain the positive results reported in Table <ref type="table">6</ref>.<ref type="foot">17</ref></p></div>
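The redundancy filter can be sketched as follows (our reading of the criterion described above; `n_top` defaults to the paper's top-10 word definition):

```python
from itertools import combinations

def redundant_pairs(topics, n_top=10):
    # Number of topic pairs whose top-n word sets are identical, i.e.
    # the redundancy criterion used to filter out degenerate models.
    # `topics` is a list of ranked word lists, one per topic.
    top_sets = [frozenset(t[:n_top]) for t in topics]
    return sum(a == b for a, b in combinations(top_sets, 2))
```

A model passes the filter when `redundant_pairs` (averaged across runs) does not exceed one; note that comparing sets makes the check insensitive to word order within the top-n list.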
<div xmlns="http://www.tei-c.org/ns/1.0"><table><row role="label"><cell/><cell>20NG</cell><cell>Wiki</cell><cell>IMDb</cell></row><row><cell>DVAE</cell><cell>0.376 (0.004)</cell><cell>0.517 (0.006)</cell><cell>0.169 (0.007)</cell></row><row><cell>+BAT</cell><cell>0.401 (0.005)</cell><cell>0.515 (0.007)</cell><cell>0.169 (0.006)</cell></row></table></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D Methodological Notes</head><head>D.1 Using BERT in the encoder</head><p>In SCHOLAR, the encoder takes the following form:</p><p><formula>q(\theta_d \mid w_d) = \mathcal{LN}\big(\mu(W w^{BOW}_d),\; \mathrm{diag}\,\sigma(W w^{BOW}_d)\big)</formula></p><p>where the weight matrix W, along with the parameters of the neural networks μ(·) and σ(·), are our variational parameters. <ref type="bibr">Card et al. (2018)</ref> propose that pre-trained word2vec <ref type="bibr">(Mikolov et al., 2013)</ref> embeddings can replace W, meaning that the document representation made available to the encoder is an l-dimensional sum of word embeddings. <ref type="bibr">Card et al. (2018)</ref> argue that fixed embeddings act as an inductive prior which improves topic coherence. Likewise, we might want to encode the document representation from a BERT-like model and, in fact, this has been attempted with some success <ref type="bibr">(Bianchi et al., 2020)</ref>. The hypothesis is that a structure-dependent representation of the document can better parameterize its corresponding topic distribution.</p><p><ref type="foot">17</ref> For K = 50. The single-pair threshold proves too restrictive for the K = 200 case, where no hyperparameter settings pass the threshold. Increasing the tolerance to a maximum of 5 redundant pairs with K = 200 leads to a somewhat lower average NPMI overall, but the same directional improvement, i.e. SCHOLAR+BAT yields a significantly higher NPMI than SCHOLAR.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><table><row role="label"><cell>Setting</cell><cell>NPMI</cell></row><row><cell>Randomly updated embeds.</cell><cell>0.170 (0.007)</cell></row><row><cell>Fixed word2vec embeds.</cell><cell>0.172 (0.004)</cell></row><row><cell>Random 784-dim doc. rep. + w2v</cell><cell>0.175 (0.007)</cell></row><row><cell>Mean-pooled 784-dim BERT output + w2v</cell><cell>0.172 (0.002)</cell></row><row><cell>Random 5000-dim doc. rep. + w2v</cell><cell>0.178 (0.007)</cell></row><row><cell>5000-dim predicted probs. from BAT + w2v</cell><cell>0.180 (0.008)</cell></row></table><p>Table <ref type="table">10</ref>: Effect on topic coherence of passing various document representations to the SCHOLAR encoder (using the IMDb data). Each setting describes the document representation provided to the encoder, which is transformed by one 300-dimensional feed-forward layer followed by a second layer down to K dimensions. "+ w2v" indicates that we first concatenate with the sum of the 300-dimensional word2vec embeddings for the document. Note that these early findings are based on a different IMDb development set, a 20% split from the training data. They are thus not directly comparable to the results reported elsewhere in the text, which use a separate held-out development set.</p><p>We experimented with this method as well, using both the hidden BERT representation and the predicted probabilities, although we also include a fixed randomized baseline to maintain parameter parity. Results for IMDb are reported in Table <ref type="table">10</ref>, and we find at best a mild improvement over the baselines.<ref type="foot">18</ref> We suspect the reason for this tepid result is both that (a) in training, the effect of the estimated local document-topic proportions on the global topic-word distributions is diffuse and indirect; and (b) the compression of the representation into K dimensions causes too much of the high-level linguistic information to be lost. Nonetheless, owing to the slight benefit, we do pass the logits to the encoder in our SCHOLAR-based model. We avoid this change for the model based on W-LDA to underscore the modularity of our method.</p></div>
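A minimal sketch of how the "+ w2v" inputs in Table 10 can be assembled (function and argument names are ours; the encoder then applies the feed-forward layers described in the table caption):

```python
import numpy as np

def encoder_input(bow, w2v_emb, extra_rep=None):
    """Build the document representation passed to the encoder:
    an optional extra representation (e.g., teacher logits or a
    mean-pooled BERT output) concatenated with the sum of the
    document's word2vec embeddings."""
    doc_w2v = bow @ w2v_emb  # weighted sum of word embeddings
    if extra_rep is None:
        return doc_w2v
    # "+ w2v" settings: concatenate the extra representation
    # with the word2vec sum before the feed-forward layers.
    return np.concatenate([extra_rep, doc_w2v])
```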
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D.2 Topic Diversity</head><p>Burkhardt and Kramer (2019) found a degeneracy in some topic models, wherein a single topic is effectively repeated with slightly varying terms (e.g., several Dadaism topics). <ref type="bibr">Burkhardt and Kramer (2019)</ref> and others <ref type="bibr">(Nan et al., 2019;</ref><ref type="bibr">Dieng et al., 2020)</ref> have independently proposed related metrics to quantify the problem, but the literature has not converged on a solution. In contrast to NPMI, we are not aware of any work that assesses the validity of such metrics with respect to human judgements.</p><p>Moreover, all these proposals suffer from a common problem: because they are global measures of word overlap, they fail to account for how words are repeated across topics. For instance, Topic Uniqueness <ref type="bibr">(Nan et al., 2019)</ref> is identical regardless of whether all of a topic's top words are repeated in a single second topic, or individual top words from that topic are scattered across several other topics. In addition, the measures inappropriately penalize partially related topics.</p><p>They also penalize polysemy and, more generally, the contextual flexibility of word meanings. One of the key advantages of latent topics, compared to surface lexical summaries, is that the same word can contribute differently to an understanding of what different topics are about. As a real example from our experience, in modeling a set of documents related to paid family and medical leave, words like parent, mother, and father are prominent in one topic related to parental leave when a child is born (accompanying other terms like newborn and maternity leave) and also in another topic related to taking leave to care for family members, including elderly parents (accompanying other terms like elderly and aging). 
The fact that topic models permit a word like parent to be prominent in both of these clearly distinct topics, emphasizing two different aspects of the word relative to the collection as a whole (being a parent taking care of children, being a child taking care of parents), is a feature, not a bug. We consider the question of topic diversity an important direction for future work.</p></div>
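To make the overlap-insensitivity concrete, consider a simplified rendering of a Topic Uniqueness-style score (our own formulation for illustration, not the exact metric of Nan et al., 2019), applied to two toy repetition patterns it cannot distinguish:

```python
from collections import Counter

def topic_uniqueness(topics):
    """Mean, over topics, of the average inverse frequency of each
    top word across all topics' top-word lists (a simplified
    rendering of the Topic Uniqueness idea)."""
    counts = Counter(w for t in topics for w in t)
    per_topic = [sum(1.0 / counts[w] for w in t) / len(t) for t in topics]
    return sum(per_topic) / len(topics)

# Pattern 1: topic 1's top words are all repeated in one other topic.
concentrated = [["a", "b"], ["a", "b"], ["c", "d"], ["e", "f"]]
# Pattern 2: the same words are scattered across several topics.
scattered = [["a", "b"], ["a", "x"], ["b", "y"], ["e", "f"]]
# Each word occurs the same number of times globally in both
# patterns, so the score is identical despite the very different
# repetition structure.
```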
<div xmlns="http://www.tei-c.org/ns/1.0"><head>E Experimental Procedures</head><p>In this section, we first provide details of our hyperparameters and tuning procedures, then turn to our computing infrastructure and the rough runtime of the SCHOLAR model.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>E.1 Hyperparameter Tuning and Optimal Values</head><p>We used well-tuned baselines to establish thresholds for performance on NPMI (following the reported hyperparameters in <ref type="bibr">Card et al., 2018;</ref><ref type="bibr">Burkhardt and Kramer, 2019;</ref><ref type="bibr">Nan et al., 2019)</ref>. While developing our model, we performed a coarse-grained initial hyperparameter sweep to identify ranges that did not beat the threshold, then excluded those ranges from the full grid search. We report the hyperparameter ranges used in this search, along with their optimal values (as determined by development set NPMI), in Tables <ref type="table">11 to 15</ref>. These produced the final set of results (Tables <ref type="table">2, 3, 6</ref> and <ref type="table">7</ref>).</p><p>For the DISTILBERT training, we use the default hyperparameter settings for the bert-base-uncased model <ref type="bibr">(Wolf et al., 2019)</ref>. Our code is a modified version of the MM-IMDB multimodal sequence classification code from the same codebase as DISTILBERT (https:), and we use all default hyperparameter settings specified there. We train for 7500 steps for 20NG, and 17000 steps for Wiki and IMDb (corresponding to convergence on development-set perplexity).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>E.2 Computing Infrastructure and Runtime</head><p>For the full hyperparameter sweep, we used an Amazon Web Services ParallelCluster (<ref type="url">https://github.com/aws/aws-parallelcluster</ref>) with 40 nodes of g4dn.xlarge instances (Nvidia T4 GPUs with 16 GB RAM), which ran for about 5 days. For initial experimentation, we used a SLURM cluster with a mix of consumer-grade Nvidia GPUs (e.g., 1080, 2080).</p><p>In terms of runtime, SCHOLAR and our own SCHOLAR + BAT are equal, and the same holds for any of our baseline models augmented with BAT. The only overhead in overall runtime comes from first training the DISTILBERT encoder on the full dataset and then running inference to obtain its logits. Thus, users should keep in mind this initial step of training the teacher model, inferring its logits, and saving them; once that is done for a dataset, our method adds nothing to the runtime. We show the comparison between the full runtimes, including the initial step, in Fig. <ref type="figure">4</ref>.</p></div>
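The one-time teacher step can be sketched as a simple cache (a sketch under our own naming; `teacher` stands in for DistilBERT training plus inference):

```python
import os
import pickle

def cached_teacher_logits(docs, teacher, cache_path):
    """Compute the teacher's logits once per dataset and save them;
    every later topic-model run loads the cache and pays no
    teacher overhead."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    logits = [teacher(d) for d in docs]  # one-time inference cost
    with open(cache_path, "wb") as f:
        pickle.dump(logits, f)
    return logits
```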
<div xmlns="http://www.tei-c.org/ns/1.0"><head>F Changes to W-LDA</head><p>In Fig. <ref type="figure">5</ref>, we show the changes to the W-LDA model necessary to accommodate our method. Ignoring the code to load and clip the logits, which is itself only a minor change, we introduce about a dozen lines.</p><p>(Dataset: 20NG)</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0"><p>An interesting conceptual link here can be found in Latent Semantic Analysis (LSA, <ref type="bibr">Landauer and Dumais, 1997)</ref>, an early predecessor of today's topic models. The original discussion introducing LSA has a very autoencoder-like flavor, explicitly illustrating the deconstruction of a collection of sparsely represented documents and the reconstruction of a dense document-word matrix.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1"><p>As a standard example,<ref type="bibr">Srivastava and Sutton (2017)</ref> encode a document's bag-of-words with a neural network to parameterize the latent topic distribution, then sample from the distribution to reconstruct the BoW.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2"><p>See Appendix F. Our full implementation, including dataset preprocessing, is available at github.com/ahoho/ kd-topic-models.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_3"><p>This choice is because the reparameterization trick behind VAEs used to be limited to location-scale distributions, but recent developments (e.g.,<ref type="bibr">Figurnov et al., 2018)</ref> have lifted that restriction, as<ref type="bibr">Burkhardt and Kramer (2019)</ref> demonstrate with several Dirichlet-based NTMs using VAEs.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_4"><p>DISTILBERT's light weight accommodates longer documents, necessary for topic modeling. Even with this change, we divide very long documents into chunks, estimating logits for each chunk and taking the pointwise mean. More complex schemes (e.g., LSTMs; Hochreiter and Schmidhuber, 1997) yielded no benefit.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_5"><p>A reader familiar with variational NTMs may notice that we haven't mentioned an obvious means of incorporating representations from a pretrained transformer: encoding the document representation from a BERT-like model. This yields unimpressive results; see Appendix D.1.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="10" xml:id="foot_6"><p>The splits are used to estimate NPMI. Dev splits are used to select hyperparameters, and test splits are run after hyperparameters are selected and frozen.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="11" xml:id="foot_7"><p>We also obtain competitive results for document perplexity, which has also been used widely but correlates negatively with human coherence evaluations<ref type="bibr">(Chang et al., 2009)</ref>.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="12" xml:id="foot_8"><p>This use of "baseline" should not be confused with the "base" neural topic model augmented by knowledge distillation (Section 2.3).</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="13" xml:id="foot_9"><p>SCHOLAR: github.com/dallascard/scholar; W-LDA: github.com/awslabs/w-lda; DVAE: github.com/sophieburkhardt/dirichlet-vae-topic-models. For augmented models we start with our own reimplementations of the baseline approaches in a common codebase, validated by obtaining comparable results to the original authors on their datasets.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="16" xml:id="foot_10"><p>We appreciate a reviewer's suggestion that we add a +BAT comparison for DVAE.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="18" xml:id="foot_11"><p>We also fail to reproduce the findings of<ref type="bibr">Card et al. (2018)</ref>, showing no meaningful improvement in topic coherence with fixed word2vec embeddings. It appears that this is a consequence of their tuning for perplexity rather than NPMI.</p></note>
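The chunking scheme described in footnote 5 can be sketched as follows (names are ours; `teacher` stands in for a DistilBERT forward pass returning a vocabulary-sized logit vector):

```python
def chunked_teacher_logits(token_ids, teacher, max_len=512):
    """Split a long document into fixed-size chunks, run the teacher
    on each chunk, and take the pointwise mean of the per-chunk
    logit vectors."""
    chunks = [token_ids[i:i + max_len]
              for i in range(0, len(token_ids), max_len)]
    per_chunk = [teacher(c) for c in chunks]  # vocab-sized vectors
    dim = len(per_chunk[0])
    return [sum(v[j] for v in per_chunk) / len(per_chunk)
            for j in range(dim)]
```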
		</body>
		</text>
</TEI>
