This content will become publicly available on April 25, 2025

Title: ENS-t-SNE: Embedding Neighborhoods Simultaneously t-SNE
When visualizing a high-dimensional dataset, dimension reduction techniques are commonly employed to provide a single two-dimensional view of the data. We describe ENS-t-SNE, an algorithm for Embedding Neighborhoods Simultaneously that generalizes the t-Stochastic Neighborhood Embedding approach. By using different viewpoints in ENS-t-SNE’s 3D embedding, one can visualize different types of clusters within the same high-dimensional dataset. This enables the viewer to see and keep track of the different types of clusters, which is harder to do with multiple 2D embeddings, where corresponding points cannot be easily identified. We illustrate the utility of ENS-t-SNE with real-world applications and provide an extensive quantitative evaluation with datasets of different types and sizes.
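To make the idea of "different viewpoints" on one 3D embedding concrete, here is a minimal sketch in Python. It is not the ENS-t-SNE algorithm and is not taken from the paper: it only illustrates how a fixed set of 3D points can show different groupings when projected orthogonally along different viewing directions. The helper names (`view_basis`, `project`) and the toy points are assumptions made for illustration.

```python
import math

def cross(a, b):
    # Cross product of two 3D vectors.
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dot(a, b):
    # Dot product of two 3D vectors.
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def normalize(a):
    # Scale a 3D vector to unit length.
    n = math.sqrt(dot(a, a))
    return (a[0] / n, a[1] / n, a[2] / n)

def view_basis(direction):
    # Hypothetical helper: two orthonormal vectors spanning the plane
    # perpendicular to the viewing direction.
    d = normalize(direction)
    # Pick a helper axis that is not parallel to d.
    helper = (1.0, 0.0, 0.0) if abs(d[0]) < 0.9 else (0.0, 1.0, 0.0)
    u = normalize(cross(d, helper))
    v = cross(d, u)  # unit length because d and u are orthonormal
    return u, v

def project(points, direction):
    # Orthogonal projection of 3D points onto the viewing plane.
    u, v = view_basis(direction)
    return [(dot(p, u), dot(p, v)) for p in points]

# Toy 3D "embedding" (made-up data): points differ in x, in z, or in both.
points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]

view_along_z = project(points, (0.0, 0.0, 1.0))  # groups points by x
view_along_x = project(points, (1.0, 0.0, 0.0))  # groups points by z
```

Looking along the z-axis, points 0 and 2 coincide and points 1 and 3 coincide, so the view groups the points by their x value. Looking along the x-axis, the same four points group by their z value instead. Because every view comes from the same 3D point set, each point keeps its identity across views, which is the property the abstract contrasts with separate 2D embeddings.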
Award ID(s):
2212130
NSF-PAR ID:
10493388
Author(s) / Creator(s):
; ; ; ;
Publisher / Repository:
IEEE
Date Published:
Journal Name:
17th IEEE Pacific Visualization Symposium (PACIFICVIS)
Format(s):
Medium: X
Location:
Tokyo, Japan
Sponsoring Org:
National Science Foundation
More Like this
  1. The t-distributed stochastic neighbor embedding (t-SNE) method is one of the leading techniques for data visualization and clustering. This method finds a lower-dimensional embedding of data points while minimizing distortions in distances between neighboring data points. By construction, t-SNE discards information about the large-scale structure of the data. We show that adding a global cost function to the t-SNE cost function makes it possible to cluster the data while preserving global intercluster data structure. We test the new global t-SNE (g-SNE) method on one synthetic and two real data sets on flower shapes and human brain cells. We find that significant and meaningful global structure exists in both the plant and human brain data sets. In all cases, g-SNE outperforms t-SNE and UMAP in preserving the global structure. Topological analysis of the clustering result makes it possible to find an appropriate trade-off of data distribution across scales. We find differences in how data are distributed across scales between the two subjects that were part of the human brain data set. Thus, by striving to produce both accurate clustering and positioning between clusters, the g-SNE method can identify new aspects of data organization across scales.
  2. The t-Distributed Stochastic Neighbor Embedding (t-SNE) is known to be a successful method for visualizing high-dimensional data, which has made it very popular in the machine-learning and data analysis community, especially recently. However, there are two glaring unaddressed problems: (a) Existing GPU-accelerated implementations of t-SNE do not account for the poor data locality present in the computation. This results in sparse matrix computations being a bottleneck during execution, especially for large data sets. (b) The literature lacks an effective stopping criterion. In this paper, we report an improved GPU implementation that uses sparse matrix re-ordering to improve t-SNE's memory access pattern and a novel termination criterion that is better suited for visualization purposes. The proposed methods result in up to 4.63x end-to-end speedup and provide a practical stopping metric, potentially preventing the algorithm from terminating prematurely or running for an excessive number of iterations. These developments enable high-quality visualizations and accurate analyses of complex large data sets containing up to 10 million data points and requiring thousands of iterations for convergence.
  3. Based on high-quality Apache Point Observatory Galactic Evolution Experiment (APOGEE) DR17 and Gaia DR3 data for 1742 red giant stars within 5 kpc of the Sun and not rotating with the Galactic disk (V_ϕ < 100 km s⁻¹), we used the nonlinear technique of unsupervised analysis t-Distributed Stochastic Neighbor Embedding (t-SNE) to detect coherent structures in the space of ten chemical-abundance ratios: [Fe/H], [O/Fe], [Mg/Fe], [Si/Fe], [Ca/Fe], [C/Fe], [N/Fe], [Al/Fe], [Mn/Fe], and [Ni/Fe]. Additionally, we obtained orbital parameters for each star using the nonaxisymmetric gravitational potential GravPot16. Seven structures are detected, including Splash, Gaia-Sausage-Enceladus (GSE), the high-α heated-disk population, N-C-O peculiar stars, and inner disk-like stars, plus two other groups that did not match anything previously reported in the literature, here named Galileo 5 and Galileo 6 (G5 and G6). These two groups overlap with Splash in [Fe/H], with G5 having a lower metallicity than G6, and they are both between GSE and Splash in the [Mg/Mn] versus [Al/Fe] plane, with G5 being in the α-rich in situ locus and G6 on the border of the α-poor in situ one. Nonetheless, their low [Ni/Fe] hints at a possible ex situ origin. Their orbital energy distributions are between Splash and GSE, with G5 being slightly more energetic than G6. We verified the robustness of all the obtained groups by exploring a large range of t-SNE parameters, applying it to various subsets of data, and also measuring the effect of abundance errors through Monte Carlo tests.
  4. Visible hyperspectral imaging (HSI) is a fast and non-invasive imaging method that has been adapted by the field of conservation science to study painted surfaces. Reflectance spectra are collected across a 2D surface, and the resulting 3D hyperspectral data cube contains millions of recorded spectra. While processing such large amounts of spectra poses an analytical and computational challenge, it also opens new opportunities to apply powerful methods of multivariate analysis for data evaluation. With the intent of expanding current data treatment of hyperspectral datasets, an innovative approach for data reduction and visualization is presented in this article. It uses a statistical embedding method known as t-distributed stochastic neighbor embedding (t-SNE) to provide a non-linear representation of spectral features in a lower 2D space. The efficiency of the proposed method for painted surfaces from cultural heritage is established through the study of laboratory-prepared paint mock-ups and a medieval French illuminated manuscript.
  5. Motivation

    The rapid development of scRNA-seq technologies enables us to explore the transcriptome at the cell level on a large scale. Recently, various computational methods have been developed to analyze scRNA-seq data, for tasks such as clustering and visualization. However, current visualization methods, including t-SNE and UMAP, are limited in how accurately they render the geometric relationship of populations with distinct functional states. Most visualization methods are unsupervised, leaving out information from the clustering results or given labels. This leads to an inaccurate depiction of the distances between the bona fide functional states. In particular, UMAP and t-SNE are not optimal for preserving the global geometric structure. They may produce a contradiction in which clusters that are close in the embedded dimensions are in fact farther apart in the original dimensions. In addition, UMAP and t-SNE cannot track the variance of clusters. In t-SNE and UMAP embeddings, the variance of a cluster is not only associated with the true variance but is also proportional to the sample size.

    Results

    We present supCPM, a robust supervised visualization method, which separates different clusters, preserves the global structure and tracks the cluster variance. Compared with six visualization methods using synthetic and real datasets, supCPM shows improved performance over the other methods in preserving the global geometric structure and data variance. Overall, supCPM provides an enhanced visualization pipeline to assist the interpretation of functional transition and accurately depict population segregation.

    Availability and implementation

    The R package and source code are available at https://zenodo.org/record/5975977#.YgqR1PXMJjM.

    Supplementary information

    Supplementary data are available at Bioinformatics online.
