Textual information, such as news articles, social media, and online forum discussions, often comes in a form of sequential text streams. Events happening in the real world trigger a set of articles talking about them or related events over a period of time. In the meanwhile, even one event is fading out, another related event could raise public attention. Hence, it is important to leverage the information about how topics influence each other over time to obtain a better understanding and modeling of document streams. In this paper, we explicitly model mutual influence among topics over time, with the purpose to better understand how events emerge, fade and inherit. We propose a temporal point process model, referred to as Correlated Temporal Topic Model (CoTT), to capture the temporal dynamics in a latent topic space. Our model allows for efficient online inference, scaling to continuous time document streams. Extensive experiments on real-world data reveal the effectiveness of our model in recovering meaningful temporal dependency structure among topics and documents. 
                        more » 
                        « less   
                    
                            
                            Sparseness-constrained nonnegative tensor factorization for detecting topics at different time scales
                        
                    
    
            Temporal text data, such as news articles or Twitter feeds, often comprises a mixture of long-lasting trends and transient topics. Effective topic modeling strategies should detect both types and clearly locate them in time. We first demonstrate that nonnegative CANDECOMP/PARAFAC decomposition (NCPD) can automatically identify topics of variable persistence. We then introduce sparseness-constrained NCPD (S-NCPD) and its online variant to control the duration of the detected topics more effectively and efficiently, along with theoretical analysis of the proposed algorithms. Through an extensive study on both semi-synthetic and real-world datasets, we find that our S-NCPD and its online variant can identify both short- and long-lasting temporal topics in a quantifiable and controlled manner, which traditional topic modeling methods are unable to achieve. Additionally, the online variant of S-NCPD shows a faster reduction in reconstruction error and results in more coherent topics compared to S-NCPD, thus achieving both computational efficiency and quality of the resulting topics. Our findings indicate that S-NCPD and its online variant are effective tools for detecting and controlling the duration of topics in temporal text data, providing valuable insights into both persistent and transient trends. 
        more » 
        « less   
        
    
    
                            - PAR ID:
- 10529237
- Publisher / Repository:
- Frontiers
- Date Published:
- Journal Name:
- Frontiers in Applied Mathematics and Statistics
- Volume:
- 10
- ISSN:
- 2297-4687
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
- 
            
- 
            Text analysis is an interesting research area in data science and has various applications, such as in artificial intelligence, biomedical research, and engineering. We review popular methods for text analysis, ranging from topic modeling to the recent neural language models. In particular, we review Topic-SCORE, a statistical approach to topic modeling, and discuss how to use it to analyze the Multi-Attribute Data Set on Statisticians (MADStat), a data set on statistical publications that we collected and cleaned. The application of Topic-SCORE and other methods to MADStat leads to interesting findings. For example, we identified 11 representative topics in statistics. For each journal, the evolution of topic weights over time can be visualized, and these results are used to analyze the trends in statistical research. In particular, we propose a new statistical model for ranking the citation impacts of 11 topics, and we also build a cross-topic citation graph to illustrate how research results on different topics spread to one another. The results on MADStat provide a data-driven picture of the statistical research from 1975 to 2015, from a text analysis perspective.more » « less
- 
            Existing topic modeling and text segmentation methodologies generally require large datasets for training, limiting their capabilities when only small collections of text are available. In this work, we reexamine the inter-related problems of “topic identification” and “text segmentation” for sparse document learning, when there is a single new text of interest. In developing a methodology to handle single documents, we face two major challenges. First is sparse information : with access to only one document, we cannot train traditional topic models or deep learning algorithms. Second is significant noise : a considerable portion of words in any single document will produce only noise and not help discern topics or segments. To tackle these issues, we design an unsupervised, computationally efficient methodology called Biclustering Approach to Topic modeling and Segmentation (BATS). BATS leverages three key ideas to simultaneously identify topics and segment text: (i) a new mechanism that uses word order information to reduce sample complexity, (ii) a statistically sound graph-based biclustering technique that identifies latent structures of words and sentences, and (iii) a collection of effective heuristics that remove noise words and award important words to further improve performance. Experiments on six datasets show that our approach outperforms several state-of-the-art baselines when considering topic coherence, topic diversity, segmentation, and runtime comparison metrics.more » « less
- 
            Recent policy initiatives have acknowledged the importance of disaggregating data pertaining to diverse Asian ethnic communities to gain a more comprehensive understanding of their current status and to improve their overall well-being. However, research on anti-Asian racism has thus far fallen short of properly incorporating data disaggregation practices. Our study addresses this gap by collecting 12-month-long data from X (formerly known as Twitter) that contain diverse sub-ethnic group representations within Asian communities. In this dataset, we break down anti-Asian toxic messages based on both temporal and ethnic factors and conduct a series of comparative analyses of toxic messages, targeting different ethnic groups. Using temporal persistence analysis, 𝑛-gram-based correspondence analysis, and topic modeling, this study provides compelling evidence that anti-Asian messages comprise various distinctive narratives. Certain messages targeting sub-ethnic Asian groups entail different topics that distinguish them from those targeting Asians in a generic manner or those aimed at major ethnic groups, such as Chinese and Indian. By introducing several techniques that facilitate comparisons of online anti-Asian hate towards diverse ethnic communities, this study highlights the importance of taking a nuanced and disaggregated approach for understanding racial hatred to formulate effective mitigation strategies.more » « less
- 
            This paper investigates topic modeling within a noisy domain. The goal is to generate topics that maximize topic coherence while introducing only a small amount of noise. The problem is motivated by the practical setting of short, noisy tweets, where it is important to generate topics containing a larger number of content words than noise words. For the most general version of this problem, we propose a new method, λ-CLIQ. It is a simple variant of the kclique percolation algorithm that employs for quasi-cliques during graph decomposition and percolation based on λ, a graph property variant. While the topics generated using our base algorithm are highly coherent, they are often contain too few words. To increase topic size, we add a post processing step that augments identified topic words using locally trained embeddings. We show that both λ-CLIQ and λ-CLIQ+ outperform the state of the art in terms of topic coherence on three distinct Twitter data sets.more » « less
 An official website of the United States government
An official website of the United States government 
				
			 
					 
					
 
                                    