Title: A Scalable System for Executing and Scoring K-Means Clustering Techniques and Its Impact on Applications in Agriculture
We present Centaurus, a scalable, open-source clustering service for K-means clustering of correlated, multidimensional data. Centaurus provides users with automatic deployment via public or private cloud resources, model selection (using the Bayesian information criterion), and data visualisation. We apply Centaurus to a real-world agricultural analytics application and compare its results to the industry-standard clustering approach. The application uses soil electrical conductivity (EC) measurements, GPS coordinates, and elevation data from a field to produce a map of differing soil zones (so that management can be specialised for each). We use Centaurus and these datasets to empirically evaluate the impact of considering multiple K-means variants and large numbers of experiments. We show that Centaurus yields more consistent and useful clusterings than the competing approach for use in zone-based soil decision-support applications where a hard decision is required.
Award ID(s):
1703560
PAR ID:
10091230
Author(s) / Creator(s):
; ; ;
Date Published:
Journal Name:
International journal of big data intelligence
ISSN:
2053-1397
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. We provide a new bi-criteria O(log^2 k) competitive algorithm for explainable k-means clustering. Explainable k-means was recently introduced by Dasgupta, Frost, Moshkovitz, and Rashtchian (ICML 2020). It is described by an easy to interpret and understand (threshold) decision tree or diagram. The cost of the explainable k-means clustering equals the sum of the costs of its clusters, and the cost of each cluster equals the sum of squared distances from the points in the cluster to the center of that cluster. The best non-bi-criteria algorithm for explainable clustering is O(k) competitive, and this bound is tight. Our randomized bi-criteria algorithm constructs a threshold decision tree that partitions the data set into (1+δ)k clusters (where δ∈(0,1) is a parameter of the algorithm). The cost of this clustering is at most O(1/δ ⋅ log^2 k) times the cost of the optimal unconstrained k-means clustering. We show that this bound is almost optimal.
  2. We consider the problem of explainable k-medians and k-means introduced by Dasgupta, Frost, Moshkovitz, and Rashtchian (ICML 2020). In this problem, our goal is to find a threshold decision tree that partitions data into k clusters and minimizes the k-medians or k-means objective. The obtained clustering is easy to interpret because every decision node of a threshold tree splits data based on a single feature into two groups. We propose a new algorithm for this problem which is O(log k) competitive with k-medians with ℓ1 norm and O(k) competitive with k-means. This is an improvement over the previous guarantees of O(k) and O(k^2) by Dasgupta et al. (2020). We also provide a new algorithm which is O(log^{3/2} k) competitive for k-medians with ℓ2 norm. Our first algorithm is near-optimal: Dasgupta et al. (2020) showed a lower bound of Ω(log k) for k-medians; in this work, we prove a lower bound of Ω(k) for k-means. We also provide a lower bound of Ω(log k) for k-medians with ℓ2 norm.
  3. Ground motion selection has become increasingly central to the assessment of earthquake resilience. The selection of ground motion records for use in nonlinear dynamic analysis significantly affects structural response. This, in turn, will impact the outcomes of earthquake resilience analysis. This paper presents a new ground motion clustering algorithm, which can be embedded in current ground motion selection methods to properly select representative ground motion records that a structure of interest will probabilistically experience. The proposed clustering-based ground motion selection method includes four main steps: 1) leveraging domain-specific knowledge to pre-select candidate ground motions; 2) using a convolutional autoencoder to learn low-dimensional underlying characteristics of candidate ground motions’ response spectra, i.e., latent features; 3) performing k-means clustering to classify the learned latent features, equivalent to clustering the response spectra of candidate ground motions; and 4) embedding the clusters in the conditional spectra-based ground motion selection. The selected ground motions can represent a given hazard level well (by matching conditional spectra) and fully describe the complete set of candidate ground motions. Three case studies for modified, pulse-type, and non-pulse-type ground motions are designed to evaluate the performance of the proposed ground motion clustering algorithm (convolutional autoencoder + k-means). Considering the limited number of pre-selected candidate ground motions in the last two case studies, response spectra simulation and transfer learning are used to improve the stability and reproducibility of the proposed ground motion clustering algorithm.
     The results of the three case studies demonstrate that the convolutional autoencoder + k-means can 1) achieve 100% accuracy in classifying ground motion response spectra, 2) correctly determine the optimal number of clusters, and 3) outperform established clustering algorithms (i.e., autoencoder + k-means, time series k-means, spectral clustering, and k-means on ground motion influence factors). Using the proposed clustering-based ground motion selection method, an application is performed to select ground motions for a structure in San Francisco, California. The developed user-friendly codes are published for practical use.
  4. Event logs contain abundant information, such as activity names, time stamps, and activity executors. However, much existing trace clustering research has focused on using activity names to assist process scenario discovery. In addition, many trace clustering algorithms commonly used in the literature, such as the k-means clustering approach, require prior knowledge of the number of process scenarios existing in the log, which is sometimes not known a priori. This paper presents a two-phase approach that obtains timing information from event logs and uses that information to assist process scenario discovery without requiring any prior knowledge about process scenarios. We use five real-life event logs to compare the performance of the proposed two-phase approach for process scenario discovery with the commonly used k-means clustering approach in terms of the harmonic mean of the model's weighted average fitness and precision, i.e., the F1 score. The experimental data show that (1) the process scenario models obtained with the additional timing information have both higher fitness and higher precision scores than the models obtained without the timing information; (2) the two-phase approach not only removes the need for prior information related to k, but also yields an F1 score comparable to that of the k-means approach with the optimal k obtained through exhaustive search.
  5. A new method for intentional islanding of power grids is proposed, based on a data-driven and inherently geometric concept of data depth. The utility of the new depth-based islanding is illustrated in application to the Italian power grid. It is found that spectral clustering with data depths outperforms spectral clustering with k-means in terms of k-way expansion. Directions on how the k-depths can be extended to multilayer grids in a tensor representation are outlined. 
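The k-means cost described in item 1 of the list above (the total cost is the sum of per-cluster costs, and each cluster's cost is the sum of squared distances from its points to its center) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not code from any of the listed papers: the function names and sample points are hypothetical, and each cluster's center is assumed to be the mean of its points.

```python
from statistics import fmean


def cluster_cost(points):
    """Sum of squared Euclidean distances from each point to the cluster mean.

    Assumption: the cluster center is the coordinate-wise mean of its points.
    """
    dims = range(len(points[0]))
    center = [fmean(p[d] for p in points) for d in dims]
    return sum((p[d] - center[d]) ** 2 for p in points for d in dims)


def kmeans_cost(clusters):
    """Total cost of a clustering: the sum of the costs of its non-empty clusters."""
    return sum(cluster_cost(c) for c in clusters if c)


# Hypothetical 2-D data, split by one threshold on the first feature (x < 5),
# in the spirit of a single decision node of a threshold tree.
left = [(0.0, 0.0), (2.0, 0.0)]
right = [(10.0, 1.0), (10.0, 3.0)]
total = kmeans_cost([left, right])
```

With these sample points, each cluster contributes a cost of 2, so `total` is 4.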