skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: K-means and hierarchical clustering of f0 contours
Cluster analysis on time-series f0 data is an increasingly popular method in intonation research. There are a number of methodological decisions to take when applying cluster analysis. Crucially, these decisions may affect the clustering results, potentially also the conclusions of the research. This paper investigates the extent to which the choice for either K-means or hierarchical clustering, two of the most popular clustering methods, leads to grouping differences that are potentially relevant for intonation research. This is tested using a dataset of f0 measures taken from imitated intonation patterns in American English. The analysis concerns a generic correlation test between K-means and hierarchical clustering outcomes as well as a number of specific measures assessing partitioning quality and f0 contour differences. The results show that both cluster methods generally show very similar outcomes, although considerable differences for specific clusterings might occur.  more » « less
Award ID(s):
1944773
PAR ID:
10598516
Author(s) / Creator(s):
; ;
Publisher / Repository:
International Speech Communication Association
Date Published:
Page Range / eLocation ID:
1520 to 1524
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Hierarchical clustering is a fundamental tool in data mining, machine learning and statistics. Popular hierarchical clustering algorithms include top-down divisive approaches such as bisecting k-means, k-median, and k-center and bottom-up agglomerative approaches such as single- linkage, average-linkage, and centroid-linkage. Unfortunately, only a few scalable hierarchical clustering algorithms are known, mostly based on the single-linkage algorithm. So, as datasets increase in size every day, there is a pressing need to scale other popular methods. We introduce efficient distributed algorithms for bisecting k-means, k- median, and k-center as well as centroid-linkage. In particular, we first formalize a notion of closeness for a hierarchical clustering algorithm, and then we use this notion to design new scalable distributed methods with strong worst case bounds on the running time and the quality of the solutions. Finally, we show experimentally that the introduced algorithms are efficient and close to their sequential variants in practice. 
    more » « less
  2. Ground motion selection has become increasingly central to the assessment of earthquake resilience. The selection of ground motion records for use in nonlinear dynamic analysis significantly affects structural response. This, in turn, will impact the outcomes of earthquake resilience analysis. This paper presents a new ground motion clustering algorithm, which can be embedded in current ground motion selection methods to properly select representative ground motion records that a structure of interest will probabilistically experience. The proposed clustering-based ground motion selection method includes four main steps: 1) leveraging domain-specific knowledge to pre-select candidate ground motions; 2) using a convolutional autoencoder to learn low-dimensional underlying characteristics of candidate ground motions’ response spectra – i.e., latent features; 3) performing k-means clustering to classify the learned latent features, equivalent to cluster the response spectra of candidate ground motions; and 4) embedding the clusters in the conditional spectra-based ground motion selection. The selected ground motions can represent a given hazard level well (by matching conditional spectra) and fully describe the complete set of candidate ground motions. Three case studies for modified, pulse-type, and non-pulse-type ground motions are designed to evaluate the performance of the proposed ground motion clustering algorithm (convolutional autoencoder + k-means). Considering the limited number of pre-selected candidate ground motions in the last two case studies, the response spectra simulation and transfer learning are used to improve the stability and reproducibility of the proposed ground motion clustering algorithm. The results of the three case studies demonstrate that the convolutional autoencoder + k-means can 1) achieve 100% accuracy in classifying ground motion response spectra, 2) correctly determine the optimal number of clusters, and 3) outperform established clustering algorithms (i.e., autoencoder + k-means, time series k-means, spectral clustering, and k-means on ground motion influence factors). Using the proposed clustering-based ground motion selection method, an application is performed to select ground motions for a structure in San Francisco, California. The developed user-friendly codes are published for practical use. 
    more » « less
  3. Skarnitzl, R. (Ed.)
    We test a subset of intonational contrasts proposed in the Autosegmental-Metrical model for American English for evidence of contrast enhancement in phonologically and phonetically longer vs. shorter intervals. F0 trajectories were assessed from 32 speakers’ imitated productions of six tonally distinct tunes, e.g., HHH, HHL. Maximally three tune shapes emerge from clustering analyses of imitated f0 trajectories, each cluster comprising imitations of two phonetically similar but phonologically distinct tunes. We find enhancement of tune contrasts between the emergent clusters in measures of f0 differences (RMSD, end f0, center of gravity). There is no evidence of enhancement for phonetically similar tunes grouped within the same cluster, though fine-grained phonetic distinctions are detected for these “lost” tune contrasts, suggesting a reanalysis as within-category variation. 
    more » « less
  4. Systematic reviews are a time-consuming yet effective approach to understanding research trends. While researchers have investigated how to speed up the process of screening studies for potential inclusion, few have focused on to what extent we can use algorithms to extract data instead of human coders. In this study, we explore to what extent analyses and algorithms can produce results similar to human data extraction during a scoping review—a type of systematic review aimed at understanding the nature of the field rather than the efficacy of an intervention—in the context of a never before analyzed sample of studies that were intended for a scoping review. Specifically, we tested five approaches: bibliometric analysis with VOSviewer, latent Dirichlet allocation (LDA) with bag of words, k-means clustering with TF-IDF, Sentence-BERT, or SPECTER, hierarchical clustering with Sentence-BERT, and BERTopic. Our results showed that topic modeling approaches (LDA/BERTopic) and k-means clustering identified specific, but often narrow research areas, leaving a substantial portion of the sample unclassified or in unclear topics. Meanwhile, bibliometric analysis and hierarchical clustering with SBERT were more informative for our purposes, identifying key author networks and categorizing studies into distinct themes as well as reflecting the relationships between themes, respectively. Overall, we highlight the capabilities and limitations of each method and discuss how these techniques can complement traditional human data extraction methods. We conclude that the analyses tested here likely cannot fully replace human data extraction in scoping reviews but serve as valuable supplements. 
    more » « less
  5. The size and amount of data captured from numerous sources has created a situation where the large quantity of data challenges our ability to understand the meaning within the data. This has motivated studies for mechanized data analysis and in particular for the clustering, or partitioning, of data into related groups. In fact, the size of the data has grown to the point where it is now often necessary to stream the data through the system for online and high speed analysis. This paper explores the application of approximate methods for the stream clustering of high-dimensional data (feature sizes contains 100+ measures). In particular, the algorithm that has been developed, called streamingRPHash, combines Random Projection with Locality Sensitive Hashing and a count-min sketch to implement a high-performance method for the parallel and distributed clustering of streaming data in a MapReduce framework. streamingRPHash is able to perform clustering at a rate much faster than traditional clustering algorithms such as K-Means. streamingRPHash provides clustering results that are only slightly less accurate than K-Means, but in runtimes that are nearly half that required by K-Means. The performance advantage for streamingRPHash becomes even more significant as the dimensionality of the input data stream increases. Furthermore, the experimental results show that streamingRPHash has a near linear speedup relative to the number of CPU cores. This speedup efficiency is possible because the approximate methods used in streamingRPHash allow independent and largely unsynchronized analyses to be performed on each streamed data vectors. 
    more » « less