Optimal Estimation of Sparse Topic Models

Bing, X; Bunea, F; Wegkamp, M

Citation Details

Topic models have become popular tools for dimension reduction and exploratory analysis of text data which consists in observed frequencies of a vocabulary of p words in n documents, stored in a p×n matrix. The main premise is that the mean of this data matrix can be factorized into a product of two non-negative matrices: a p×K word-topic matrix A and a K×n topic-document matrix W. This paper studies the estimation of A that is possibly element-wise sparse, and the number of topics K is unknown. In this under-explored context, we derive a new minimax lower bound for the estimation of such A and propose a new computationally efficient algorithm for its recovery. We derive a finite sample upper bound for our estimator, and show that it matches the minimax lower bound in many scenarios. Our estimate adapts to the unknown sparsity of A and our analysis is valid for any finite n, p, K and document lengths. Empirical results on both synthetic data and semi-synthetic data show that our proposed estimator is a strong competitor of the existing state-of-the-art algorithms for both non-sparse A and sparse A, and has superior performance is many scenarios of interest. more »

Award ID(s):: 2015195

PAR ID:: 10274094

Author(s) / Creator(s):: Bing, X; Bunea, F; Wegkamp, M

Editor(s):: Dalalyan, Aynak

Date Published:: 2020-07-01

Journal Name:: Journal of machine learning research

Volume:: 21

Issue:: 177

ISSN:: 1532-4435

Page Range / eLocation ID:: 1 - 45

Format(s):: Medium: X

Sponsoring Org:: National Science Foundation

Free Publicly Accessible Full Text
Accepted Manuscript1.0
Journal Article:
The DOI is not currently available.

More Like this