Title: Does deep learning help topic extraction? A kernel k-means clustering method with word embedding
Topic extraction presents challenges for the bibliometric community, and its performance still depends on human intervention and on the area of application. This paper proposes a novel kernel k-means clustering method incorporated with a word embedding model to create a solution that effectively extracts topics from bibliometric data. The experimental results of a comparison of this method with four clustering baselines (i.e., k-means, fuzzy c-means, principal component analysis, and topic models) on two bibliometric datasets demonstrate its effectiveness across either a relatively broad range of disciplines or a given domain. An empirical study on bibliometric topic extraction from articles published by three top-tier bibliometric journals between 2000 and 2017, supported by expert knowledge-based evaluations, provides supplemental evidence of the method’s ability to extract topics. Additionally, this empirical analysis reveals insights into both overlapping and diverse research interests among the three journals that would benefit journal publishers, editorial boards, and research communities.
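As a rough illustration of the kernel k-means idea named in the abstract, the sketch below clusters a handful of vectors using only a kernel matrix. It is not the authors' implementation: the RBF kernel, the `gamma` value, the round-robin initial assignment, and the toy 2-D vectors standing in for word-embedding document vectors are all assumptions made for the example.

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    """Pairwise RBF kernel matrix (kernel choice is an assumption)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(d2, 0.0))

def kernel_kmeans(K, k, n_iter=50):
    """Cluster n items given only their n x n kernel matrix K."""
    n = K.shape[0]
    # Deterministic round-robin start (an assumption, for reproducibility).
    labels = np.arange(n) % k
    diag = np.diag(K)
    for _ in range(n_iter):
        dist = np.full((n, k), np.inf)
        for c in range(k):
            mask = labels == c
            m = mask.sum()
            if m == 0:
                continue  # an empty cluster stays empty
            # Squared feature-space distance to the centroid of cluster c:
            # K_ii - (2/m) * sum_j K_ij + (1/m^2) * sum_{j,l} K_jl
            dist[:, c] = (diag
                          - 2.0 * K[:, mask].sum(axis=1) / m
                          + K[np.ix_(mask, mask)].sum() / m ** 2)
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels

# Toy vectors standing in for document embeddings (hypothetical data).
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1], [5.1, 5.1], [4.9, 5.0]])
labels = kernel_kmeans(rbf_kernel(X, gamma=1.0), k=2)
```

Because the algorithm touches the data only through `K`, any positive semi-definite similarity between documents could be substituted for the RBF kernel used here.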
Award ID(s):
1759960
PAR ID:
10084143
Author(s) / Creator(s):
Date Published:
Journal Name:
Journal of informetrics
Volume:
12
ISSN:
1875-5879
Page Range / eLocation ID:
1099-1117
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Systematic reviews are a time-consuming yet effective approach to understanding research trends. While researchers have investigated how to speed up the process of screening studies for potential inclusion, few have focused on the extent to which algorithms can extract data in place of human coders. In this study, we explore to what extent analyses and algorithms can produce results similar to human data extraction during a scoping review (a type of systematic review aimed at understanding the nature of the field rather than the efficacy of an intervention) in the context of a never-before-analyzed sample of studies that were intended for a scoping review. Specifically, we tested five approaches: bibliometric analysis with VOSviewer; latent Dirichlet allocation (LDA) with bag of words; k-means clustering with TF-IDF, Sentence-BERT, or SPECTER; hierarchical clustering with Sentence-BERT; and BERTopic. Our results showed that topic modeling approaches (LDA/BERTopic) and k-means clustering identified specific, but often narrow research areas, leaving a substantial portion of the sample unclassified or in unclear topics. Meanwhile, bibliometric analysis and hierarchical clustering with SBERT were more informative for our purposes, identifying key author networks and categorizing studies into distinct themes as well as reflecting the relationships between themes, respectively. Overall, we highlight the capabilities and limitations of each method and discuss how these techniques can complement traditional human data extraction methods. We conclude that the analyses tested here likely cannot fully replace human data extraction in scoping reviews but serve as valuable supplements.
  2. Cluster analysis on time-series f0 data is an increasingly popular method in intonation research. There are a number of methodological decisions to take when applying cluster analysis. Crucially, these decisions may affect the clustering results, potentially also the conclusions of the research. This paper investigates the extent to which the choice for either K-means or hierarchical clustering, two of the most popular clustering methods, leads to grouping differences that are potentially relevant for intonation research. This is tested using a dataset of f0 measures taken from imitated intonation patterns in American English. The analysis concerns a generic correlation test between K-means and hierarchical clustering outcomes as well as a number of specific measures assessing partitioning quality and f0 contour differences. The results show that both cluster methods generally show very similar outcomes, although considerable differences for specific clusterings might occur. 
  3. Ground motion selection has become increasingly central to the assessment of earthquake resilience. The selection of ground motion records for use in nonlinear dynamic analysis significantly affects structural response. This, in turn, will impact the outcomes of earthquake resilience analysis. This paper presents a new ground motion clustering algorithm, which can be embedded in current ground motion selection methods to properly select representative ground motion records that a structure of interest will probabilistically experience. The proposed clustering-based ground motion selection method includes four main steps: 1) leveraging domain-specific knowledge to pre-select candidate ground motions; 2) using a convolutional autoencoder to learn low-dimensional underlying characteristics (i.e., latent features) of candidate ground motions’ response spectra; 3) performing k-means clustering to classify the learned latent features, which is equivalent to clustering the response spectra of candidate ground motions; and 4) embedding the clusters in the conditional spectra-based ground motion selection. The selected ground motions can represent a given hazard level well (by matching conditional spectra) and fully describe the complete set of candidate ground motions. Three case studies for modified, pulse-type, and non-pulse-type ground motions are designed to evaluate the performance of the proposed ground motion clustering algorithm (convolutional autoencoder + k-means). Considering the limited number of pre-selected candidate ground motions in the last two case studies, the response spectra simulation and transfer learning are used to improve the stability and reproducibility of the proposed ground motion clustering algorithm.
The results of the three case studies demonstrate that the convolutional autoencoder + k-means can 1) achieve 100% accuracy in classifying ground motion response spectra, 2) correctly determine the optimal number of clusters, and 3) outperform established clustering algorithms (i.e., autoencoder + k-means, time series k-means, spectral clustering, and k-means on ground motion influence factors). Using the proposed clustering-based ground motion selection method, an application is performed to select ground motions for a structure in San Francisco, California. The developed user-friendly codes are published for practical use. 
  4. Efforts to involve data science in policy analysis can be traced back decades, but transforming analytic findings into decisions is still far from straightforward. Data-driven decision-making requires understanding approaches, practices, and research results from many disciplines, which makes it interesting to investigate whether data science and policy analysis are moving in parallel or whether their pathways have intersected. Our investigation, from a bibliometric perspective, is driven by a comprehensive set of research questions, and we have designed an intelligent bibliometric framework that includes a series of traditional bibliometric approaches and a novel method of charting the evolutionary pathways of scientific innovation, which is used to identify predecessor-descendant relationships in technological topics. Our investigation reveals that data science and policy analysis have intersecting lines, and we foresee that a cross-disciplinary direction in which policy analysis interacts with data science has become an emergent area in both communities. However, equipped with advanced data analytic techniques, data scientists are moving faster and further than policy analysts. The empirical insights derived from our research should be beneficial to academic researchers and journal editors in related research communities, as well as policy-makers in research institutions and funding agencies.
  5. The K-subspaces (KSS) method is a generalization of the K-means method for subspace clustering. In this work, we present local convergence analysis and a recovery guarantee for KSS, assuming data are generated by the semi-random union of subspaces model, where N points are randomly sampled from K ≥ 2 overlapping subspaces. We show that if the initial assignment of the KSS method lies within a neighborhood of a true clustering, it converges at a superlinear rate and finds the correct clustering within Θ(log log N) iterations with high probability. Moreover, we propose a thresholding inner-product based spectral method for initialization and prove that it produces a point in this neighborhood. We also present numerical results of the studied method to support our theoretical developments.
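To make the K-subspaces alternation in item 5 concrete, here is a minimal sketch of the basic iteration: fit a subspace to each cluster, then reassign each point to the subspace that captures most of its norm. It is an illustration only, not the analyzed method's reference code; the function name `k_subspaces`, the caller-supplied `init_labels` (standing in for an initialization near the true clustering), and the fallback basis for an empty cluster are assumptions.

```python
import numpy as np

def k_subspaces(X, n_subspaces, dim, init_labels, n_iter=20):
    """Alternate between fitting linear subspaces and reassigning points.

    X is (N, D) with one point per row; each subspace passes through the origin.
    """
    D = X.shape[1]
    labels = np.asarray(init_labels).copy()
    bases = []
    for _ in range(n_iter):
        bases = []
        for k in range(n_subspaces):
            pts = X[labels == k]
            if len(pts) == 0:
                # Assumed fallback for an empty cluster: first `dim` coordinate axes.
                bases.append(np.eye(D)[:, :dim])
                continue
            # The top right singular vectors span the best-fit subspace.
            _, _, Vt = np.linalg.svd(pts, full_matrices=False)
            bases.append(Vt[:dim].T)
        # Reassign each point to the subspace with the largest projection norm.
        proj = np.stack([np.linalg.norm(X @ U, axis=1) for U in bases], axis=1)
        new_labels = proj.argmax(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, bases

# Two lines through the origin in the plane, with one point mislabeled at the start.
X = np.array([[1.0, 0.0], [2.0, 0.0], [-1.0, 0.0], [3.0, 0.0],
              [0.0, 1.0], [0.0, 2.0], [0.0, -1.0], [0.0, 3.0]])
labels, bases = k_subspaces(X, n_subspaces=2, dim=1,
                            init_labels=[0, 0, 0, 1, 1, 1, 1, 1])
```

In this toy case the mislabeled point is corrected after a single reassignment step, which mirrors, on a very small scale, the idea that a start close to the true clustering is enough for the iteration to recover it.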