Title: Supervised Dimensionality Reduction and Visualization using Centroid-Encoder
We propose a new tool for visualizing complex, and potentially large and high-dimensional, data sets called Centroid-Encoder (CE). The architecture of the Centroid-Encoder is similar to that of an autoencoder neural network, but with a modified target: the class centroid in the ambient space. As such, CE incorporates label information and performs supervised data visualization. CE is trained in the usual way on a training set, with parameters tuned on a validation set. The resulting CE visualization is evaluated on a sequestered test set, where the generalization of the model is assessed both visually and quantitatively. We present a detailed comparative analysis of the method on a wide variety of data sets against both supervised and unsupervised techniques, including NCA, non-linear NCA, t-distributed NCA, t-distributed MCML, supervised UMAP, supervised PCA, Colored Maximum Variance Unfolding, supervised Isomap, Parametric Embedding, supervised Neighbor Retrieval Visualizer, and Multiple Relational Embedding. An analysis of variance using PCA demonstrates that non-linear preprocessing of the data by the CE transformation captures more variance per dimension than PCA.
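The core idea, an autoencoder whose reconstruction target is the class centroid rather than the input itself, can be sketched in a few lines. Below is a minimal illustration on toy data; the network size, learning rate, and data are invented for the example, and the paper's actual architecture and training schedule differ:

```python
import numpy as np

rng = np.random.default_rng(0)
# toy data: three classes in 4-D, class means at 0, 1, 2 along every axis
X = np.vstack([rng.normal(m, 0.3, size=(30, 4)) for m in (0.0, 1.0, 2.0)])
y = np.repeat([0, 1, 2], 30)

# centroid targets: each sample's target is its class mean in the ambient space
centroids = np.stack([X[y == c].mean(axis=0) for c in range(3)])
T = centroids[y]

# one-hidden-layer encoder/decoder (4 -> 2 -> 4) trained by plain gradient
# descent on the centroid-encoder loss  mean ||f(x) - centroid(class(x))||^2
d_in, d_z = X.shape[1], 2
W1 = rng.normal(0, 0.3, (d_in, d_z)); b1 = np.zeros(d_z)
W2 = rng.normal(0, 0.3, (d_z, d_in)); b2 = np.zeros(d_in)
lr, n = 0.05, len(X)
for _ in range(5000):
    Z = np.tanh(X @ W1 + b1)        # 2-D bottleneck: the visualization layer
    Xhat = Z @ W2 + b2              # reconstruction aimed at the class centroid
    err = Xhat - T
    gW2, gb2 = Z.T @ err / n, err.mean(axis=0)
    dZ = (err @ W2.T) * (1.0 - Z ** 2)
    gW1, gb1 = X.T @ dZ / n, dZ.mean(axis=0)
    W1 -= lr * gW1; b1 -= lr * gb1; W2 -= lr * gW2; b2 -= lr * gb2

embedding = np.tanh(X @ W1 + b1)    # scatter-plot these 2-D points, colored by y
```

Because all points of a class share one target, the bottleneck is pushed to collapse each class around its centroid while label-distinct classes separate, which is what makes the resulting 2-D scatter a supervised visualization.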
Award ID(s): 1830676
PAR ID: 10400710
Author(s) / Creator(s): ;
Date Published:
Journal Name: Journal of machine learning research
Volume: 23
Issue: 20
ISSN: 1532-4435
Page Range / eLocation ID: 1-34
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this
  1. Abstract The t-distributed stochastic neighbor embedding (t-SNE) method is one of the leading techniques for data visualization and clustering. This method finds lower-dimensional embedding of data points while minimizing distortions in distances between neighboring data points. By construction, t-SNE discards information about large-scale structure of the data. We show that adding a global cost function to the t-SNE cost function makes it possible to cluster the data while preserving global intercluster data structure. We test the new global t-SNE (g-SNE) method on one synthetic and two real data sets on flower shapes and human brain cells. We find that significant and meaningful global structure exists in both the plant and human brain data sets. In all cases, g-SNE outperforms t-SNE and UMAP in preserving the global structure. Topological analysis of the clustering result makes it possible to find an appropriate trade-off of data distribution across scales. We find differences in how data are distributed across scales between the two subjects that were part of the human brain data set. Thus, by striving to produce both accurate clustering and positioning between clusters, the g-SNE method can identify new aspects of data organization across scales. 
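The abstract does not spell out the exact form of g-SNE's global term, but its role can be illustrated with a hypothetical stand-in: a penalty that compares inter-cluster centroid distances in the original and embedded spaces, which could be added to the local t-SNE cost. This is a sketch under that assumption, not the paper's formulation:

```python
import numpy as np

def global_cost(X_high, X_low, labels):
    """Mismatch between inter-cluster centroid distances in the original
    (high-dimensional) space and in the embedding. Hypothetical stand-in
    for the global term that g-SNE adds to the t-SNE cost."""
    classes = np.unique(labels)
    C_hi = np.stack([X_high[labels == c].mean(axis=0) for c in classes])
    C_lo = np.stack([X_low[labels == c].mean(axis=0) for c in classes])
    D_hi = np.linalg.norm(C_hi[:, None] - C_hi[None], axis=-1)
    D_lo = np.linalg.norm(C_lo[:, None] - C_lo[None], axis=-1)
    # normalize both distance matrices so the penalty is scale-invariant
    D_hi = D_hi / D_hi.max()
    D_lo = D_lo / D_lo.max()
    return np.mean((D_hi - D_lo) ** 2)
```

Minimizing a weighted sum of the usual t-SNE KL divergence and a term like this would trade off local neighborhood fidelity against the preservation of intercluster structure, the trade-off across scales the abstract describes.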
  2. Mathelier, Anthony (Ed.)
    Abstract Motivation The rapid development of scRNA-seq technologies enables us to explore the transcriptome at the cell level on a large scale. Recently, various computational methods have been developed to analyze scRNA-seq data, such as clustering and visualization. However, current visualization methods, including t-SNE and UMAP, are challenged by the limited accuracy with which they render the geometric relationships of populations with distinct functional states. Most visualization methods are unsupervised, leaving out information from the clustering results or given labels. This leads to inaccurate depiction of the distances between bona fide functional states. In particular, UMAP and t-SNE are not optimal for preserving the global geometric structure: they may produce the contradiction that clusters which appear close in the embedding are in fact far apart in the original dimensions. Moreover, UMAP and t-SNE cannot track the variance of clusters; in their embeddings, the apparent variance of a cluster reflects not only the true variance but also the sample size. Results We present supCPM, a robust supervised visualization method, which separates different clusters, preserves the global structure and tracks the cluster variance. Compared with six visualization methods on synthetic and real datasets, supCPM outperforms the other methods in preserving the global geometric structure and the data variance. Overall, supCPM provides an enhanced visualization pipeline to assist the interpretation of functional transition and accurately depict population segregation. Availability and implementation The R package and source code are available at https://zenodo.org/record/5975977#.YgqR1PXMJjM. Supplementary information Supplementary data are available at Bioinformatics online.
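The global-structure failure described above can be quantified with a simple diagnostic (not part of supCPM itself, just a generic check): the rank correlation between pairwise distances in the original data and in the embedding. A value near 1 means the global geometry survived; a low value flags clusters that look close in the embedding but are far apart in the original space.

```python
import numpy as np

def distance_rank_corr(X, Y):
    """Spearman-style rank correlation between the pairwise distances of the
    original data X and of an embedding Y (rows correspond one-to-one)."""
    def pdists(A):
        iu = np.triu_indices(len(A), k=1)          # upper triangle, no diagonal
        D = np.linalg.norm(A[:, None] - A[None], axis=-1)
        return D[iu]
    def ranks(v):
        r = np.empty_like(v)
        r[np.argsort(v)] = np.arange(len(v))       # rank of each distance
        return r
    a, b = ranks(pdists(X)), ranks(pdists(Y))
    return np.corrcoef(a, b)[0, 1]
```

For continuous data without tied distances this matches the usual Spearman coefficient; with many ties a tie-aware ranking would be needed.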
  3. Abstract Visible hyperspectral imaging (HSI) is a fast and non-invasive imaging method that has been adopted by the field of conservation science to study painted surfaces. By collecting reflectance spectra from a 2D surface, the resulting 3D hyperspectral data cube contains millions of recorded spectra. While processing such large amounts of spectra poses an analytical and computational challenge, it also opens new opportunities to apply powerful methods of multivariate analysis for data evaluation. With the intent of expanding current data treatment of hyperspectral datasets, an innovative approach for data reduction and visualization is presented in this article. It uses a statistical embedding method known as t-distributed stochastic neighbor embedding (t-SNE) to provide a non-linear representation of spectral features in a 2D space. The efficiency of the proposed method for painted surfaces from cultural heritage is established through the study of laboratory-prepared paint mock-ups and a medieval French illuminated manuscript.
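A typical preprocessing step for such data is flattening the (height, width, bands) cube so that every pixel becomes one spectrum, the row format an embedding method expects. A minimal numpy sketch with invented dimensions, using a PCA projection as a fast stand-in for the t-SNE step:

```python
import numpy as np

rng = np.random.default_rng(1)
H, W, B = 16, 16, 64                 # toy cube: 16x16 pixels, 64 spectral bands
cube = rng.random((H, W, B))

# flatten the spatial axes: every pixel becomes one reflectance spectrum
spectra = cube.reshape(H * W, B)     # shape (256, 64) -- the input to t-SNE

# 2-D projection via PCA (t-SNE would replace this step in the real pipeline)
centered = spectra - spectra.mean(axis=0)
_, _, Vt = np.linalg.svd(centered, full_matrices=False)
embedding2d = centered @ Vt[:2].T    # (256, 2): one point per pixel

# map the embedding back onto image coordinates for visual inspection
emb_image = embedding2d.reshape(H, W, 2)
```

Reshaping the 2-D embedding back to the image grid is what lets the reduced representation be displayed as a false-color map of the painted surface.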
  4. Principal component analysis (PCA) is an efficient tool to optimize multiparameter tests of general relativity (GR), wherein one looks for simultaneous deviations in multiple post-Newtonian phasing coefficients. This is accomplished by introducing non-GR deformation parameters in the phase evolution of the gravitational-wave templates used in the analysis. A PCA is performed to construct the "best-measured" linear combinations of the deformation parameters. This helps to set stringent limits on deviations from GR and to more readily detect possible beyond-GR physics. In this paper, we study the effectiveness of this method with the proposed next-generation gravitational-wave detectors, Cosmic Explorer (CE) and Einstein Telescope (ET). For compact binaries at a luminosity distance of 500 Mpc and a detector-frame total mass in the range 20–200 M⊙, CE can measure the most dominant linear combination with a 1-σ uncertainty of ∼0.1% and the next two subdominant linear combinations with a 1-σ uncertainty of ≤ 10%. For a specific range of masses, constraints from ET are better than those from CE by a factor of a few. This improvement stems from ET's better low-frequency sensitivity (between 1 and 7 Hz) compared to CE. In addition, we explain the sensitivity of the PCA parameters to the different post-Newtonian deformation parameters and discuss their variation with total mass. We also discuss a criterion for quantifying the number of most dominant linear combinations that capture the information in the signal up to a threshold.
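The "best-measured combination" construction amounts to an eigendecomposition of the Fisher information matrix of the deformation parameters: eigenvectors with the largest eigenvalues are the linear combinations with the smallest 1-σ uncertainties. A sketch with an invented, illustrative Fisher matrix (not from any real detector analysis):

```python
import numpy as np

# toy Fisher information matrix for three post-Newtonian deformation
# parameters (illustrative positive-definite numbers only)
F = np.array([[400.0, 120.0, 40.0],
              [120.0,  90.0, 25.0],
              [ 40.0,  25.0, 10.0]])

# eigenvectors of F define linear combinations of the deformation parameters;
# a larger eigenvalue means a better-measured (smaller-uncertainty) combination
evals, evecs = np.linalg.eigh(F)
order = np.argsort(evals)[::-1]            # dominant combination first
evals, evecs = evals[order], evecs[:, order]

sigmas = 1.0 / np.sqrt(evals)              # 1-sigma uncertainty per combination
```

Ranking the `sigmas` is also the natural basis for the paper's cutoff criterion: keep only the leading combinations whose uncertainties fall below a chosen threshold.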
  5.
    This paper proposes a scalable multilevel framework for the spectral embedding of large undirected graphs. The proposed method first computes much smaller yet sparse graphs while preserving the key spectral (structural) properties of the original graph, by exploiting a nearly-linear time spectral graph coarsening approach. Then, the resultant spectrally-coarsened graphs are leveraged for the development of much faster algorithms for multilevel spectral graph embedding (clustering) as well as visualization of large data sets. We conducted extensive experiments using a variety of large graphs and datasets and obtained very promising results. For instance, we are able to coarsen the "coPapersCiteseer" graph with 0.43 million nodes and 16 million edges into a much smaller graph with only 13K (32X fewer) nodes and 17K (950X fewer) edges in about 16 seconds; the spectrally-coarsened graphs allow us to achieve up to 1,100X speedup for multilevel spectral graph embedding (clustering) and up to 60X speedup for t-SNE visualization of large data sets. 
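The spectral embedding at the heart of this pipeline can be shown on a tiny graph (the multilevel coarsening itself is omitted): embed nodes using the eigenvectors of the graph Laplacian for the smallest nonzero eigenvalues. A minimal numpy sketch on an invented two-clique graph:

```python
import numpy as np

# small undirected graph: two 4-node cliques joined by a single bridge edge
A = np.zeros((8, 8))
for block in (range(0, 4), range(4, 8)):
    for i in block:
        for j in block:
            if i != j:
                A[i, j] = 1.0
A[3, 4] = A[4, 3] = 1.0                 # the bridge between the cliques

D = np.diag(A.sum(axis=1))
L = D - A                               # combinatorial graph Laplacian

# spectral embedding: eigenvectors for the smallest nonzero eigenvalues of L
evals, evecs = np.linalg.eigh(L)        # eigh returns ascending eigenvalues
embedding = evecs[:, 1:3]               # 2-D coordinates, skipping the constant mode

# the Fiedler vector (first embedding column) splits the graph at the bridge
signs = np.sign(embedding[:, 0])
```

The same eigenvector coordinates serve both uses named above: thresholding them gives spectral clustering, and plotting them gives the visualization; coarsening just makes the eigenproblem small enough to solve quickly.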