Title: GraphBTM: Graph Enhanced Autoencoded Variational Inference for Biterm Topic Model
Discovering the latent topics within texts has been a fundamental task for many applications. However, conventional topic models suffer different problems in different settings. The Latent Dirichlet Allocation (LDA) may not work well for short texts due to the data sparsity (i.e., the sparse word co-occurrence patterns in short documents). The Biterm Topic Model (BTM) learns topics by modeling the word-pairs named biterms in the whole corpus. This assumption is very strong when documents are long with rich topic information and do not exhibit the transitivity of biterms. In this paper, we propose a novel way called GraphBTM to represent biterms as graphs and design Graph Convolutional Networks (GCNs) with residual connections to extract transitive features from biterms. To overcome the data sparsity of LDA and the strong assumption of BTM, we sample a fixed number of documents to form a mini-corpus as a training instance. We also propose a dataset called All News extracted from (Thompson, 2017), in which documents are much longer than 20 Newsgroups. We present an amortized variational inference method for GraphBTM. Our method generates more coherent topics compared with previous approaches. Experiments show that the sampling strategy improves performance by a large margin.
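Illustrative note: the mini-corpus sampling and biterm-graph ideas mentioned in the abstract can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the authors' implementation. It assumes a biterm is an unordered pair of distinct words co-occurring in the same document and that edge weights are raw co-occurrence counts; the abstract does not specify either detail. The function names `extract_biterms`, `sample_mini_corpus`, and `biterm_graph`, and the toy corpus, are hypothetical. The GCN encoder and the variational inference step are not shown.

```python
import random
from collections import Counter
from itertools import combinations


def extract_biterms(tokens):
    """Return unordered pairs of distinct words from one document.

    Assumption: any two distinct words in the same document form a biterm;
    repeated words are collapsed, and each pair is stored in sorted order.
    """
    vocab = sorted(set(tokens))
    return list(combinations(vocab, 2))


def sample_mini_corpus(corpus, size, rng):
    """Draw `size` documents without replacement to form one training instance."""
    return rng.sample(corpus, size)


def biterm_graph(documents):
    """Build a weighted biterm graph as a mapping from edge to count.

    Keys are (word_a, word_b) edges; values count how many of the given
    documents contain both words (an assumed weighting scheme).
    """
    edges = Counter()
    for tokens in documents:
        edges.update(extract_biterms(tokens))
    return edges


# Hypothetical toy corpus of tokenized documents.
corpus = [
    ["graph", "topic", "model"],
    ["topic", "model", "inference"],
    ["news", "article", "topic"],
    ["graph", "convolution", "network"],
]

rng = random.Random(0)  # fixed seed so the sample is reproducible
mini = sample_mini_corpus(corpus, size=2, rng=rng)
graph = biterm_graph(mini)         # graph for one sampled mini-corpus
full_graph = biterm_graph(corpus)  # graph over the whole corpus, BTM-style
```

Building one graph per sampled mini-corpus, rather than one graph per document or one for the whole corpus, sits between the two extremes the abstract describes: sparse per-document counts on one side and a single corpus-wide biterm set on the other.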
Award ID(s):
1747783
NSF-PAR ID:
10084512
Journal Name:
2018 Conference on Empirical Methods in Natural Language Processing (EMNLP 2018)
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Discovering the latent topics within texts has been a fundamental task for many applications. However, conventional topic models suffer different problems in different settings. The Latent Dirichlet Allocation (LDA) may not work well for short texts due to the data sparsity (i.e., the sparse word co-occurrence patterns in short documents). The Biterm Topic Model (BTM) learns topics by modeling the word-pairs named biterms in the whole corpus. This assumption is very strong when documents are long with rich topic information and do not exhibit the transitivity of biterms. In this paper, we propose a novel way called GraphBTM to represent biterms as graphs and design Graph Convolutional Networks (GCNs) with residual connections to extract transitive features from biterms. To overcome the data sparsity of LDA and the strong assumption of BTM, we sample a fixed number of documents to form a mini-corpus as a training instance. We also propose a dataset called All News extracted from (Thompson, 2017), in which documents are much longer than 20 Newsgroups. We present an amortized variational inference method for GraphBTM. Our method generates more coherent topics compared with previous approaches. Experiments show that the sampling strategy improves performance by a large margin.
  2. Textual data are increasingly common in test data as many assessments include constructed response (CR) items as indicators of participants' understanding. The development of techniques based on natural language processing has made it possible for researchers to rapidly analyse large sets of textual data. One family of statistical techniques for this purpose is probabilistic topic models. Topic modelling is a technique for detecting the latent topic structure in a collection of documents and has been widely used to analyse texts in a variety of areas. The detected topics can reveal primary themes in the documents, and the relative use of topics can be useful in investigating the variability of the documents. Supervised latent Dirichlet allocation (SLDA) is a popular topic model in that family that jointly models textual data and paired responses, such as participants' textual answers to CR items and their rubric‐based scores. SLDA assumes a homogeneous relationship between textual data and paired responses across all documents. This assumption, while useful for some purposes, may not hold when a population has subgroups with different relationships. In this study, we introduce a new supervised topic model that incorporates finite‐mixture modelling into the SLDA. This new model can detect latent groups of participants that have different relationships between their textual responses and associated scores. The model is illustrated with an example from an analysis of a set of textual responses and paired scores from a middle grades assessment of science inquiry knowledge. A simulation study is presented to investigate the performance of the proposed model under practical testing conditions.
  3. Topic modeling, a method for extracting the underlying themes from a collection of documents, is an increasingly important component of the design of intelligent systems enabling the sense-making of highly dynamic and diverse streams of text data related but not limited to scientific discovery. Traditional methods such as Dynamic Topic Modeling (DTM) do not lend themselves well to direct parallelization because of dependencies from one time step to another. In this paper, we introduce and empirically analyze Clustered Latent Dirichlet Allocation (CLDA), a method for extracting dynamic latent topics from a collection of documents. Our approach is based on data decomposition in which the data is partitioned into segments, followed by topic modeling on the individual segments. The resulting local models are then combined into a global solution using clustering. The decomposition and resulting parallelization lead to very fast runtime even on very large datasets. Our approach furthermore provides insight into how the composition of topics changes over time and can also be applied using other data partitioning strategies over any discrete features of the data, such as geographic features or classes of users. In this paper CLDA is applied successfully to seventeen years of NIPS conference papers (2,484 documents and 3,280,697 words), seventeen years of computer science journal abstracts (533,588 documents and 46,446,184 words), and to forty years of the PubMed corpus (4,025,976 documents and 386,847,695 words). On the PubMed corpus, we demonstrate the versatility of CLDA by segmenting the data by both time and journal. Our runtime on this corpus demonstrates an ability to function on very large-scale datasets.
  4. Machine learning techniques underlying Big Data analytics have the potential to benefit data-intensive communities in domain sciences such as bioinformatics and neuroscience. Today’s innovative advances in these domain communities are increasingly built upon multi-disciplinary knowledge discovery and cross-domain collaborations. Consequently, shortening the time to knowledge discovery is a challenge when investigating new methods, developing new tools, or integrating datasets. The challenge for a domain scientist lies particularly in obtaining guidance by querying massive amounts of information from a diverse text corpus comprising a wide-ranging set of topics. In this paper, we propose a novel “domain-specific topic model” (DSTM) that can drive conversational agents for users to discover latent knowledge patterns about relationships among research topics, tools and datasets from exemplar scientific domains. The goal of DSTM is to perform data mining to obtain meaningful guidance via a chatbot for domain scientists to choose the relevant tools or datasets pertinent to solving a computational and data-intensive research problem at hand. Our DSTM is a Bayesian hierarchical model that extends the Latent Dirichlet Allocation (LDA) model and uses a Markov chain Monte Carlo algorithm to infer latent patterns within a specific domain in an unsupervised manner. We apply our DSTM to large collections of data from bioinformatics and neuroscience domains that include hundreds of papers from reputed journal archives, as well as hundreds of tools and datasets. Through evaluation experiments with a perplexity metric, we show that our model has better generalization performance within a domain for discovering highly specific latent topics.
  5. Agent-based models (ABM) play a prominent role in guiding critical decision-making and supporting the development of effective policies for better urban resilience and response to the COVID-19 pandemic. However, many ABMs lack realistic representations of human mobility, a key process that leads to physical interaction and subsequent spread of disease. Therefore, we propose the application of Latent Dirichlet Allocation (LDA), a topic modeling technique, to foot-traffic data to develop a realistic model of human mobility in an ABM that simulates the spread of COVID-19. In our novel approach, LDA treats POIs as "words" and agent home census block groups (CBGs) as "documents" to extract "topics" of POIs that frequently appear together in CBG visits. These topics allow us to simulate agent mobility based on the LDA topic distribution of their home CBG. We compare the LDA-based mobility model with competitor approaches including a naive mobility model that assumes visits to POIs are random. We find that the naive mobility model is unable to facilitate the spread of COVID-19 at all. Using the LDA-informed mobility model, we simulate the spread of COVID-19 and test the effect of changes to the number of topics, various parameters, and public health interventions. By examining the simulated number of cases over time, we find that the number of topics does indeed impact disease spread dynamics, but only in terms of the outbreak's timing. Further analysis of simulation results is needed to better understand the impact of topics on simulated COVID-19 spread. This study contributes to strengthening human mobility representations in ABMs of disease spread.