GraphBTM: Graph Enhanced Autoencoded Variational Inference for Biterm Topic Model
Discovering the latent topics within texts has been a fundamental task for many applications. However, conventional topic models suffer from different problems in different settings. Latent Dirichlet Allocation (LDA) may not work well for short texts because of data sparsity (i.e., the sparse word co-occurrence patterns in short documents). The Biterm Topic Model (BTM) learns topics by modeling word pairs, called biterms, over the whole corpus. This assumption is very strong when documents are long with rich topic information and do not exhibit the transitivity of biterms. In this paper, we propose a novel way called GraphBTM to represent biterms as graphs and design Graph Convolutional Networks (GCNs) with residual connections to extract transitive features from biterms. To overcome the data sparsity of LDA and the strong assumption of BTM, we sample a fixed number of documents to form a mini-corpus as a training instance. We also introduce a dataset called All News, extracted from (Thompson, 2017), in which documents are much longer than those in 20 Newsgroups. We present an amortized variational inference method for GraphBTM. Our method generates more coherent topics compared with previous approaches. Experiments show that the sampling strategy improves performance …
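As a rough illustration of the mini-corpus idea described in the abstract, the sketch below samples a fixed number of documents and aggregates their biterms into a weighted edge list. It is not the authors' implementation: the function names (`extract_biterms`, `biterm_graph`, `sample_mini_corpus`) are hypothetical, and a biterm is assumed here to be any unordered pair of distinct words that co-occur in one document, with no context window and no GCN on top.

```python
import random
from collections import Counter
from itertools import combinations


def extract_biterms(tokens):
    """Return unordered pairs of distinct words from one document.

    Assumption: every pair of distinct words in the document is a biterm.
    """
    vocab = sorted(set(tokens))
    return list(combinations(vocab, 2))


def biterm_graph(mini_corpus):
    """Aggregate biterm counts over a mini-corpus into weighted edges."""
    edges = Counter()
    for doc in mini_corpus:
        edges.update(extract_biterms(doc))
    return edges


def sample_mini_corpus(corpus, size, rng):
    """Sample a fixed number of documents to form one training instance."""
    return rng.sample(corpus, size)


corpus = [
    ["apple", "banana", "apple"],
    ["banana", "cherry"],
    ["apple", "cherry", "banana"],
]
graph = biterm_graph(corpus)
mini = sample_mini_corpus(corpus, 2, random.Random(0))
```

Each key of `graph` is an edge between two words and each value is its weight, so the counter can be turned into an adjacency matrix for whatever graph model is applied afterwards.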
- Award ID(s): 1747783
- NSF-PAR ID: 10084511
- Journal Name: Conference on Empirical Methods in Natural Language Processing (EMNLP 2018)
- Sponsoring Org: National Science Foundation
More Like this
-
Topic modeling, a method for extracting the underlying themes from a collection of documents, is an increasingly important component of the design of intelligent systems that enable sense-making of highly dynamic and diverse streams of text data related, but not limited, to scientific discovery. Traditional methods such as Dynamic Topic Modeling (DTM) do not lend themselves well to direct parallelization because of dependencies from one time step to another. In this paper, we introduce and empirically analyze Clustered Latent Dirichlet Allocation (CLDA), a method for extracting dynamic latent topics from a collection of documents. Our approach is based on data decomposition, in which the data is partitioned into segments, followed by topic modeling on the individual segments. The resulting local models are then combined into a global solution using clustering. The decomposition and resulting parallelization lead to very fast runtime even on very large datasets. Our approach furthermore provides insight into how the composition of topics changes over time and can also be applied using other data partitioning strategies over any discrete features of the data, such as geographic features or classes of users. In this paper CLDA is applied successfully to seventeen years of NIPS conference papers (2,484 documents …
-
Machine learning techniques underlying Big Data analytics have the potential to benefit data-intensive communities in domain sciences such as bioinformatics and neuroscience. Today’s innovative advances in these domain communities are increasingly built upon multi-disciplinary knowledge discovery and cross-domain collaborations. Consequently, shortening the time to knowledge discovery is a challenge when investigating new methods, developing new tools, or integrating datasets. The challenge for a domain scientist lies particularly in obtaining guidance by querying massive amounts of information from a diverse text corpus comprising a wide-ranging set of topics. In this paper, we propose a novel “domain-specific topic model” (DSTM) that can drive conversational agents for users to discover latent knowledge patterns about relationships among research topics, tools and datasets from exemplar scientific domains. The goal of DSTM is to perform data mining to obtain meaningful guidance via a chatbot for domain scientists to choose the relevant tools or datasets pertinent to solving a computational and data-intensive research problem at hand. Our DSTM is a Bayesian hierarchical model that extends the Latent Dirichlet Allocation (LDA) model and uses a Markov chain Monte Carlo algorithm to infer latent patterns within a specific domain in an unsupervised manner. We apply our DSTM …
-
Agent-based models (ABMs) play a prominent role in guiding critical decision-making and supporting the development of effective policies for better urban resilience and response to the COVID-19 pandemic. However, many ABMs lack realistic representations of human mobility, a key process that leads to physical interaction and subsequent spread of disease. Therefore, we propose the application of Latent Dirichlet Allocation (LDA), a topic modeling technique, to foot-traffic data to develop a realistic model of human mobility in an ABM that simulates the spread of COVID-19. In our novel approach, LDA treats POIs as "words" and agent home census block groups (CBGs) as "documents" to extract "topics" of POIs that frequently appear together in CBG visits. These topics allow us to simulate agent mobility based on the LDA topic distribution of their home CBG. We compare the LDA-based mobility model with competitor approaches, including a naive mobility model that assumes visits to POIs are random. We find that the naive mobility model is unable to facilitate the spread of COVID-19 at all. Using the LDA-informed mobility model, we simulate the spread of COVID-19 and test the effect of changes to the number of topics, various parameters, and public health interventions. …
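The POI-as-word, CBG-as-document mapping in the abstract above can be sketched as follows. This is only an illustration under stated assumptions: the visit records, the function names (`build_documents`, `sample_visit`), and the topic tables are hypothetical, and the LDA fitting step itself is omitted, so the topic distributions are supplied by hand rather than learned.

```python
import random
from collections import Counter, defaultdict


def build_documents(visits):
    """Group (cbg, poi) visit records into one bag of POI "words" per CBG."""
    docs = defaultdict(Counter)
    for cbg, poi in visits:
        docs[cbg][poi] += 1
    return docs


def sample_visit(topic_dist, topic_poi, rng):
    """Draw a topic from a CBG's topic distribution, then a POI from that topic.

    topic_dist: list of topic probabilities for the agent's home CBG.
    topic_poi:  list of dicts mapping POI name to probability, one per topic.
    """
    topic = rng.choices(range(len(topic_dist)), weights=topic_dist)[0]
    pois, weights = zip(*topic_poi[topic].items())
    return rng.choices(pois, weights=weights)[0]


# Hypothetical visit records: (home CBG, visited POI).
visits = [("cbg1", "cafe"), ("cbg1", "gym"), ("cbg1", "cafe"), ("cbg2", "park")]
docs = build_documents(visits)

# Hand-written stand-ins for fitted LDA output (assumption, not learned here).
topic_poi = [{"cafe": 1.0}, {"park": 0.5, "gym": 0.5}]
poi = sample_visit([1.0, 0.0], topic_poi, random.Random(0))
```

In a full pipeline the per-CBG counters would be passed to an LDA implementation, and the fitted topic-POI and CBG-topic distributions would replace the hand-written tables.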
-
Dalalyan, Aynak (Ed.) Topic models have become popular tools for dimension reduction and exploratory analysis of text data, which consists of observed frequencies of a vocabulary of p words in n documents, stored in a p×n matrix. The main premise is that the mean of this data matrix can be factorized into a product of two non-negative matrices: a p×K word-topic matrix A and a K×n topic-document matrix W. This paper studies the estimation of A when A is possibly element-wise sparse and the number of topics K is unknown. In this under-explored context, we derive a new minimax lower bound for the estimation of such A and propose a new computationally efficient algorithm for its recovery. We derive a finite sample upper bound for our estimator, and show that it matches the minimax lower bound in many scenarios. Our estimate adapts to the unknown sparsity of A and our analysis is valid for any finite n, p, K and document lengths. Empirical results on both synthetic data and semi-synthetic data show that our proposed estimator is a strong competitor of the existing state-of-the-art algorithms for both non-sparse A and sparse A, and has superior performance in many scenarios of interest.
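The factorization premise in the abstract above can be made concrete with a small worked example. The numbers are invented for illustration, the helper names (`matmul`, `column_sums`) are hypothetical, and columns are assumed to be probability vectors; this shows only the model structure, not the estimator proposed in that paper.

```python
def matmul(A, W):
    """Multiply a p x K matrix A by a K x n matrix W (lists of rows)."""
    K, n = len(W), len(W[0])
    return [[sum(row[k] * W[k][j] for k in range(K)) for j in range(n)]
            for row in A]


def column_sums(M):
    """Sum each column of a matrix given as a list of rows."""
    return [sum(row[j] for row in M) for j in range(len(M[0]))]


# p = 3 words, K = 2 topics: each column of A is a word distribution for
# one topic. The zero entries make A element-wise sparse.
A = [[0.5, 0.0],
     [0.5, 0.2],
     [0.0, 0.8]]

# K = 2 topics, n = 2 documents: each column of W is a topic mixture.
W = [[1.0, 0.25],
     [0.0, 0.75]]

# Expected word frequencies per document: the mean of the p x n data matrix.
mean = matmul(A, W)
```

Because the columns of A and W each sum to one, every column of `mean` also sums to one, so each document's expected word frequencies form a valid distribution over the vocabulary.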