Title: Unsupervised classification of simulated magnetospheric regions
Abstract. In magnetospheric missions, burst-mode data sampling should be triggered in the presence of processes of scientific or operational interest. We present an unsupervised classification method for magnetospheric regions that could constitute the first step of a multistep method for the automatic identification of magnetospheric processes of interest. Our method is based on self-organizing maps (SOMs), and we test it preliminarily on data points from global magnetospheric simulations obtained with the OpenGGCM-CTIM-RCM code. The dimensionality of the data is reduced with principal component analysis before classification. The classification relies exclusively on local plasma properties at the selected data points, without information on their neighborhood or on their temporal evolution. We classify the SOM nodes into an automatically selected number of classes, and we obtain clusters that map to well-defined magnetospheric regions. We validate our classification results by plotting the classified data in the simulated space and by comparing with k-means classification. For the sake of result interpretability, we examine the SOM feature maps (magnetospheric variables are called features in the context of classification), and we use them to unlock information on the clusters. We repeat the classification experiments using different sets of features, we quantitatively compare different classification results, and we obtain insights on which magnetospheric variables make more effective features for unsupervised classification.
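The pipeline outlined in the abstract (dimensionality reduction with PCA, SOM training, grouping of SOM nodes into classes, labelling each data point through its best-matching node) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: all function names, the grid size, the learning-rate and neighbourhood schedules, and the fixed number of classes `k` are hypothetical. The abstract does not say how the SOM nodes are grouped or how the number of classes is selected, so plain k-means with a fixed `k` is swapped in here as a stand-in.

```python
import numpy as np

def pca_reduce(X, n_components):
    """Standardize the features and project onto the leading principal components.
    Assumes no feature is constant (non-zero standard deviation)."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    # Rows of Vt are the principal axes, ordered by explained variance.
    _, _, Vt = np.linalg.svd(Xs, full_matrices=False)
    return Xs @ Vt[:n_components].T

def train_som(Z, rows, cols, n_iter=2000, lr0=0.5, seed=0):
    """Train a rectangular SOM with online updates; returns node weights (rows*cols, dim).
    Assumes len(Z) >= rows * cols so nodes can be initialized from distinct samples."""
    rng = np.random.default_rng(seed)
    sigma0 = max(rows, cols) / 2.0
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    W = Z[rng.choice(len(Z), rows * cols, replace=False)].copy()
    for t in range(n_iter):
        frac = t / n_iter
        lr = lr0 * (1.0 - frac)                  # linearly decaying learning rate
        sigma = sigma0 * (1.0 - frac) + 1e-3     # shrinking neighbourhood radius
        x = Z[rng.integers(len(Z))]
        bmu = np.argmin(((W - x) ** 2).sum(axis=1))   # best-matching unit
        d2 = ((grid - grid[bmu]) ** 2).sum(axis=1)    # squared grid distance to the BMU
        h = np.exp(-d2 / (2.0 * sigma ** 2))          # Gaussian neighbourhood function
        W += lr * h[:, None] * (x - W)
    return W

def kmeans(P, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means; returns one integer label per row of P."""
    rng = np.random.default_rng(seed)
    centers = P[rng.choice(len(P), k, replace=False)].copy()
    for _ in range(n_iter):
        d2 = ((P[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = np.argmin(d2, axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = P[labels == j].mean(axis=0)
    return labels

def classify(X, n_components=2, rows=6, cols=6, k=3):
    """PCA -> SOM -> cluster the SOM nodes -> label each point via its BMU."""
    Z = pca_reduce(X, n_components)
    W = train_som(Z, rows, cols)
    node_labels = kmeans(W, k)
    bmus = np.argmin(((Z[:, None, :] - W[None, :, :]) ** 2).sum(axis=2), axis=1)
    return node_labels[bmus]
```

Each row of `X` would hold the local plasma variables at one data point, matching the abstract's statement that no neighbourhood or temporal information is used. Clustering the trained node weights rather than the raw points is what separates this from applying k-means directly to the data.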
Award ID(s): 1919310
PAR ID: 10353541
Journal Name: Annales Geophysicae
Volume: 39
Issue: 5
ISSN: 1432-0576
Page Range / eLocation ID: 861 to 881
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this
  1. Luigi Martelli, Pier (Ed.)
    Abstract. Motivation: Recent technological advances produce a wealth of high-dimensional descriptions of biological processes, yet extracting meaningful insight and mechanistic understanding from these data remains challenging. For example, in developmental biology, the dynamics of differentiation can now be mapped quantitatively using single-cell RNA sequencing, yet it is difficult to infer molecular regulators of developmental transitions. Here, we show that discovering informative features in the data is crucial for statistical analysis as well as making experimental predictions. Results: We identify features based on their ability to discriminate between clusters of the data points. We define a class of problems in which linear separability of clusters is hidden in a low-dimensional space. We propose an unsupervised method to identify the subset of features that define a low-dimensional subspace in which clustering can be conducted. This is achieved by averaging over discriminators trained on an ensemble of proposed cluster configurations. We then apply our method to single-cell RNA-seq data from mouse gastrulation, and identify 27 key transcription factors (out of 409 total), 18 of which are known to define cell states through their expression levels. In this inferred subspace, we find clear signatures of known cell types that eluded classification prior to discovery of the correct low-dimensional subspace. Availability and implementation: https://github.com/smelton/SMD. Supplementary information: Supplementary data are available at Bioinformatics online.
  2. In many machine learning problems, one has to work with data of different types, including continuous, discrete, and categorical data. Further, it is often the case that many of these data are missing from the database. This paper proposes a Gaussian process framework that efficiently captures the information from mixed numerical and categorical data that effectively incorporates missing variables. First, we propose a generative model for the mixed-type data. The generative model exploits Gaussian processes with kernels constructed from the latent vectors. We also propose a method for inference of the unknowns, and in its implementation, we rely on a sparse spectrum approximation of the Gaussian processes and variational inference. We demonstrate the performance of the method for both supervised and unsupervised tasks. First, we investigate the imputation of missing variables in an unsupervised setting, and then we show the results of joint imputation and classification on IBM employee data. 
  3. Abstract. We explore unsupervised machine learning for galaxy morphology analyses using a combination of feature extraction with a vector-quantized variational autoencoder (VQ-VAE) and hierarchical clustering (HC). We propose a new methodology that includes: (1) consideration of the clustering performance simultaneously when learning features from images; (2) allowing for various distance thresholds within the HC algorithm; (3) using the galaxy orientation to determine the number of clusters. This set-up provides 27 clusters created with this unsupervised learning that we show are well separated based on galaxy shape and structure (e.g. Sérsic index, concentration, asymmetry, Gini coefficient). These resulting clusters also correlate well with physical properties such as the colour–magnitude diagram, and span the range of scaling relations such as mass versus size amongst the different machine-defined clusters. When we merge these multiple clusters into two large preliminary clusters to provide a binary classification, an accuracy of ~87 per cent is reached using an imbalanced data set, matching real galaxy distributions, which includes 22.7 per cent early-type galaxies and 77.3 per cent late-type galaxies. Comparing the given clusters with classic Hubble types (ellipticals, lenticulars, early spirals, late spirals, and irregulars), we show that there is an intrinsic vagueness in visual classification systems, in particular galaxies with transitional features such as lenticulars and early spirals. Based on this, the main result in this work is not how well our unsupervised method matches visual classifications and physical properties, but that the method provides an independent classification that may be more physically meaningful than any visually based ones.
  4. Outlier detection is critical in real-world applications. Because many outlier detection techniques exist and often return different results for the same data set, users must determine which technique is best suited to their task and tune its parameters. This is particularly challenging in the unsupervised setting, where no labels are available for the cross-validation needed for such method selection and parameter optimization. In this work, we propose AutoOD, which uses existing unsupervised detection techniques to automatically produce high-quality outliers without any human tuning. AutoOD's fundamentally new strategy unifies the merits of unsupervised outlier detection and supervised classification within one integrated solution. It automatically tests a diverse set of unsupervised outlier detectors on a target data set and extracts useful signals from their combined detection results to reliably capture key differences between outliers and inliers. It then uses these signals to produce a "custom outlier classifier" whose accuracy is comparable to that of supervised outlier classification models trained with ground-truth labels, without having access to those labels. On a diverse set of benchmark outlier detection datasets, AutoOD consistently outperforms the best unsupervised outlier detector selected from hundreds of detectors. It also outperforms other tuning-free approaches by 12 to 97 points (out of 100) in F-1 score.
  5. Abstract. Clustering has long been a popular unsupervised learning approach to identify groups of similar objects and discover patterns from unlabeled data in many applications. Yet, coming up with meaningful interpretations of the estimated clusters has often been challenging precisely due to their unsupervised nature. Meanwhile, in many real-world scenarios, there are some noisy supervising auxiliary variables, for instance, subjective diagnostic opinions, that are related to the observed heterogeneity of the unlabeled data. By leveraging information from both supervising auxiliary variables and unlabeled data, we seek to uncover more scientifically interpretable group structures that may be hidden by completely unsupervised analyses. In this work, we propose and develop a new statistical pattern discovery method named supervised convex clustering (SCC) that borrows strength from both information sources and guides towards finding more interpretable patterns via a joint convex fusion penalty. We develop several extensions of SCC to integrate different types of supervising auxiliary variables, to adjust for additional covariates, and to find biclusters. We demonstrate the practical advantages of SCC through simulations and a case study on Alzheimer's disease genomics. Specifically, we discover new candidate genes as well as new subtypes of Alzheimer's disease that can potentially lead to better understanding of the underlying genetic mechanisms responsible for the observed heterogeneity of cognitive decline in older adults.