Title: A Deterministic Self-Organizing Map Approach and its Application on Satellite Data based Cloud Type Classification
A self-organizing map (SOM) is a type of competitive artificial neural network that projects the high-dimensional input space of the training samples into a low-dimensional space while preserving topological relations. This makes SOMs well suited to organizing and visualizing complex data sets, and they have been widely used across numerous disciplines and applications. Despite this wide use, the SOM is hampered by its inherent randomness: it produces dissimilar SOM patterns even when trained on identical samples with the same parameters. This raises usability concerns for domain practitioners and deters potential users from exploring SOM-based applications more broadly. Motivated by this practical concern, we propose a deterministic approach as a supplement to the standard self-organizing map. In accordance with the theoretical design, experimental results with satellite cloud data demonstrate the effective and efficient organization and simplification capabilities of the proposed approach.
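The run-to-run variation described in the abstract can be illustrated with a minimal sketch of a conventional SOM. This is not the deterministic approach proposed in the paper, whose details are not given in this record. The function name `train_som`, the 1-D node grid, the decay schedules, and the sample data are all assumptions chosen for illustration. The sketch only shows where randomness typically enters standard training: the initial weights and the order in which samples are presented.

```python
import math
import random

# Hypothetical toy data: six 2-D points in two loose groups.
SAMPLES = [(0.1, 0.2), (0.15, 0.1), (0.2, 0.25),
           (0.8, 0.9), (0.85, 0.8), (0.9, 0.95)]


def train_som(samples, n_nodes=4, epochs=20, seed=0):
    """Train a tiny 1-D SOM and return the list of node weight vectors."""
    rng = random.Random(seed)
    dim = len(samples[0])
    # Random initialization: one source of run-to-run variation.
    weights = [[rng.random() for _ in range(dim)] for _ in range(n_nodes)]
    total_steps = epochs * len(samples)
    step = 0
    for _ in range(epochs):
        order = list(samples)
        rng.shuffle(order)  # random presentation order: a second source
        for x in order:
            frac = step / total_steps
            lr = 0.5 * (1.0 - frac)  # linearly decaying learning rate
            sigma = max(0.5, (n_nodes / 2) * (1.0 - frac))  # decaying radius
            # Best-matching unit: the node whose weights are closest to x.
            bmu = min(range(n_nodes), key=lambda j: math.dist(weights[j], x))
            for j in range(n_nodes):
                # Gaussian neighborhood measured on the 1-D node grid.
                h = math.exp(-((j - bmu) ** 2) / (2 * sigma ** 2))
                for d in range(dim):
                    weights[j][d] += lr * h * (x[d] - weights[j][d])
            step += 1
    return weights
```

Calling `train_som(SAMPLES, seed=1)` twice yields identical weights, while changing the seed generally yields a different map from the same data and parameters. Fixing a seed only makes one run repeatable; it does not remove the dependence on arbitrary random choices that the abstract identifies as the usability concern.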
Award ID(s):
1730250
NSF-PAR ID:
10110748
Author(s) / Creator(s):
; ; ; ;
Date Published:
Journal Name:
Proceedings of 2018 IEEE International Conference on Big Data (BigData 2018)
Page Range / eLocation ID:
2027 to 2034
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    There is demand for scalable algorithms capable of clustering and analyzing large time series data. The Kohonen self-organizing map (SOM) is an unsupervised artificial neural network for clustering, visualizing, and reducing the dimensionality of complex data. Like all clustering methods, it requires a measure of similarity between input data (in this work, time series). Dynamic time warping (DTW) is one such measure, and a top performer that accommodates distortions when aligning time series. Despite its popularity in clustering, DTW is limited in practice because its runtime complexity is quadratic in the length of the time series. To address this, we present a new self-organizing map for clustering TIME Series, called SOMTimeS, which uses DTW as the distance measure. The method has similar accuracy compared with other DTW-based clustering algorithms, yet scales better and runs faster. The computational performance stems from the pruning of unnecessary DTW computations during the SOM's training phase. For comparison, we implement a similar pruning strategy for K-means, and call the latter K-TimeS. SOMTimeS and K-TimeS pruned 43% and 50% of the total DTW computations, respectively. Pruning effectiveness, accuracy, execution time, and scalability are evaluated using 112 benchmark time series datasets from the UC Riverside classification archive. The results show, for similar accuracy, a 1.8× speed-up on average for SOMTimeS and K-TimeS, with rates that vary between 1× and 18× depending on the dataset. We also apply SOMTimeS to a healthcare study of patient-clinician serious illness conversations to demonstrate the algorithm's utility with complex, temporally sequenced natural language.

     
  2. We extend the self-organizing mapping algorithm to the problem of visualizing data on Grassmann manifolds. In this setting, a collection of k points in n-dimensions is represented by a k-dimensional subspace, e.g., via the singular value or QR-decompositions. Data assembled in this way is challenging to visualize, given that abstract points on the Grassmannian do not reside in Euclidean space. The extension of the SOM algorithm to this geometric setting only requires that distances between two points can be measured and that any given point can be moved towards a presented pattern. The similarity between two points on the Grassmannian is measured in terms of the principal angles between subspaces, e.g., the chordal distance. Further, we employ a formula for moving one subspace towards another along the shortest path, i.e., the geodesic between two points on the Grassmannian. This enables a faithful implementation of the SOM approach for visualizing data consisting of k-dimensional subspaces of n-dimensional Euclidean space. We illustrate the resulting algorithm on a hyperspectral imaging application.
  3. Abstract. In magnetospheric missions, burst-mode data sampling should be triggered in the presence of processes of scientific or operational interest. We present an unsupervised classification method for magnetospheric regions that could constitute the first step of a multistep method for the automatic identification of magnetospheric processes of interest. Our method is based on self-organizing maps (SOMs), and we test it preliminarily on data points from global magnetospheric simulations obtained with the OpenGGCM-CTIM-RCM code. The dimensionality of the data is reduced with principal component analysis before classification. The classification relies exclusively on local plasma properties at the selected data points, without information on their neighborhood or on their temporal evolution. We classify the SOM nodes into an automatically selected number of classes, and we obtain clusters that map to well-defined magnetospheric regions. We validate our classification results by plotting the classified data in the simulated space and by comparing with k-means classification. For the sake of result interpretability, we examine the SOM feature maps (magnetospheric variables are called features in the context of classification), and we use them to unlock information on the clusters. We repeat the classification experiments using different sets of features, we quantitatively compare different classification results, and we obtain insights on which magnetospheric variables make more effective features for unsupervised classification. 
  4. Abstract

    Biofilm formation is a major cause of hospital‐acquired infections. Research into biofilm‐resistant materials is therefore critical to reduce the frequency of these events. Polymer microarrays offer a high‐throughput approach to enable the efficient discovery of novel biofilm‐resistant polymers. Herein, bacterial attachment and surface chemistry are studied for a polymer microarray to improve the understanding of Pseudomonas aeruginosa biofilm formation on a diverse set of polymeric surfaces. The relationships between time‐of‐flight secondary ion mass spectrometry (ToF‐SIMS) data and biofilm formation are analyzed using linear multivariate analysis (partial least squares [PLS] regression) and a nonlinear self‐organizing map (SOM). The SOM models revealed several combinations of fragment ions that are positively or negatively associated with bacterial biofilm formation, which are not identified by PLS. With these insights, a second PLS model is calculated, in which interactions between key fragments (identified by the SOM) are explicitly considered. Inclusion of these terms improved the PLS model performance and shows that, without such terms, certain key fragment ions correlated with bacterial attachment may not be identified. The chemical insights provided by the combination of PLS regression and SOM will be useful for the design of materials that support negligible pathogen attachment.

     
  5. Abstract. Cosmological analyses of galaxy surveys rely on knowledge of the redshift distribution of their galaxy sample. This is usually derived from a spectroscopic and/or many-band photometric calibrator survey of a small patch of sky. The uncertainties in the redshift distribution of the calibrator sample include a contribution from shot noise, or Poisson sampling errors, but, given the small volume they probe, they are dominated by sample variance introduced by large-scale structures. Redshift uncertainties have been shown to constitute one of the leading contributions to systematic uncertainties in cosmological inferences from weak lensing and galaxy clustering, and hence they must be propagated through the analyses. In this work, we study the effects of sample variance on small-area redshift surveys, from theory to simulations to the COSMOS2015 data set. We present a three-step Dirichlet method of resampling a given survey-based redshift calibration distribution to enable the propagation of both shot noise and sample variance uncertainties. The method can accommodate different levels of prior confidence on different redshift sources. This method can be applied to any calibration sample with known redshifts and phenotypes (i.e. cells in a self-organizing map, or some other way of discretizing photometric space), and provides a simple way of propagating prior redshift uncertainties into cosmological analyses. As a worked example, we apply the full scheme to the COSMOS2015 data set, for which we also present a new, principled SOM algorithm designed to handle noisy photometric data. We make available a catalogue of the resulting resamplings of the COSMOS2015 galaxies.
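Item 1 above notes that the runtime of dynamic time warping (DTW) is quadratic in the length of the time series. A minimal sketch of the textbook dynamic-programming formulation makes that cost visible: the nested loops fill one table cell per pair of positions. This is the plain, unpruned DTW, not the pruning strategy used by SOMTimeS or K-TimeS. The function name `dtw_distance` and the absolute-difference local cost are assumptions for illustration.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two numeric sequences."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] holds the best cumulative cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):          # n rows ...
        for j in range(1, m + 1):      # ... times m columns: O(n * m) work
            d = abs(a[i - 1] - b[j - 1])  # assumed local cost
            # Extend the cheapest of the three allowed predecessor alignments.
            cost[i][j] = d + min(cost[i - 1][j],      # repeat b[j-1]
                                 cost[i][j - 1],      # repeat a[i-1]
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m]
```

For example, `dtw_distance([1, 2, 3], [1, 2, 2, 3])` is zero because the repeated value can be absorbed by warping, which is the kind of distortion tolerance the abstract refers to.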