skip to main content


Title: Optimizing Data Locality and Termination Criterion for t-SNE
The t-Distributed Stochastic Neighbor Embedding (t-SNE) is known to be a successful method at visualizing high-dimensional data, making it very popular in the machine-learning and data analysis community, especially recently. However, there are two glaring unaddressed problems: (a) Existing GPU accelerated implementations of t-SNE do not account for the poor data locality present in the computation. This results in sparse matrix computations being a bottleneck during execution, especially for large data sets. (b) Another problem is the lack of an effective stopping criterion in the literature. In this paper, we report an improved GPU implementation that uses sparse matrix re-ordering to improve t-SNE's memory access pattern and a novel termination criterion that is better suited for visualization purposes. The proposed methods result in up to 4.63 x end-to-end speedup and provide a practical stopping metric, potentially preventing the algorithm from terminating prematurely or running for an excessive amount of iterations. These developments enable high-quality visualizations and accurate analyses of complex large data sets containing up to 10 million data points and requiring thousands of iterations for convergence.  more » « less
Award ID(s):
1822932
NSF-PAR ID:
10303753
Author(s) / Creator(s):
; ;
Date Published:
Journal Name:
2021 International Joint Conference on Neural Networks (IJCNN)
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Graphics Processing Units (GPUs) have rapidly evolved to enable energy-efficient data-parallel computing for a broad range of scientific areas. While GPUs achieve exascale performance at a stringent power budget, they are also susceptible to soft errors, often caused by high-energy particle strikes, that can significantly affect the application output quality. Understanding the resilience of general purpose GPU applications is the purpose of this study. To this end, it is imperative to explore the range of application output by injecting faults at all the potential fault sites. This problem is especially challenging because unlike CPU applications, which are mostly single-threaded, GPGPU applications can contain hundreds to thousands of threads, resulting in a tremendously large fault site space - in the order of billions even for some simple applications. In this paper, we present a systematic way to progressively prune the fault site space aiming to dramatically reduce the number of fault injections such that assessment for GPGPU application error resilience can be practical. The key insight behind our proposed methodology stems from the fact that GPGPU applications spawn a lot of threads, however, many of them execute the same set of instructions. Therefore, several fault sites are redundant and can be pruned by a careful analysis of faults across threads and instructions. We identify important features across a set of 10 applications (16 kernels) from Rodinia and Polybench suites and conclude that threads can be first classified based on the number of the dynamic instructions they execute. We achieve significant fault site reduction by analyzing only a small subset of threads that are representative of the dynamic instruction behavior (and therefore error resilience behavior) of the GPGPU applications. Further pruning is achieved by identifying and analyzing: a) the dynamic instruction commonalities (and differences) across code blocks within this representative set of threads, b) a subset of loop iterations within the representative threads, and c) a subset of destination register bit positions. The above steps result in a tremendous reduction of fault sites by up to seven orders of magnitude. Yet, this reduced fault site space accurately captures the error resilience profile of GPGPU applications. 
    more » « less
  2. null (Ed.)
    This paper proposes a scalable multilevel framework for the spectral embedding of large undirected graphs. The proposed method first computes much smaller yet sparse graphs while preserving the key spectral (structural) properties of the original graph, by exploiting a nearly-linear time spectral graph coarsening approach. Then, the resultant spectrally-coarsened graphs are leveraged for the development of much faster algorithms for multilevel spectral graph embedding (clustering) as well as visualization of large data sets. We conducted extensive experiments using a variety of large graphs and datasets and obtained very promising results. For instance, we are able to coarsen the "coPapersCiteseer" graph with 0.43 million nodes and 16 million edges into a much smaller graph with only 13K (32X fewer) nodes and 17K (950X fewer) edges in about 16 seconds; the spectrally-coarsened graphs allow us to achieve up to 1,100X speedup for multilevel spectral graph embedding (clustering) and up to 60X speedup for t-SNE visualization of large data sets. 
    more » « less
  3. null (Ed.)
    Large Convolutional Neural Networks (CNNs) are often pruned and compressed to reduce the amount of parameters and memory requirement. However, the resulting irregularity in the sparse data makes it difficult for FPGA accelerators that contains systolic arrays of Multiply-and-Accumulate (MAC) units, such as Intel’s FPGA-based Deep Learning Accelerator (DLA), to achieve their maximum potential. Moreover, FPGAs with low-bandwidth off-chip memory could not satisfy the memory bandwidth requirement for sparse matrix computation. In this paper, we present 1) a sparse matrix packing technique that condenses sparse inputs and filters before feeding them into the systolic array of MAC units in the Intel DLA, and 2) a customization of the Intel DLA which allows the FPGA to efficiently utilize a high bandwidth memory (HBM2) integrated in the same package. For end-to-end inference with randomly pruned ResNet-50/MobileNet CNN models, our experiments demonstrate 2.7x/3x performance improvement compared to an FPGA with DDR4, 2.2x/2.1x speedup against a server-class Intel SkyLake CPU, and comparable performance with 1.7x/2x power efficiency gain as compared to an NVidia V100 GPU. 
    more » « less
  4. Persistent homology (PH) matrix reduction is an important tool for data analytics in many application areas. Due to its highly irregular execution patterns in computation, it is challenging to gain high efficiency in parallel processing for increasingly large data sets. In this paper, we introduce HYPHA, a HYbrid Persistent Homology matrix reduction Accelerator, to make parallel processing highly efficient on both GPU and multicore. The essential foundation of our algorithm design and implementation is the separation of SIMT and MIMD parallelisms in PH matrix reduction computation. With such a separation, we are able to perform massive parallel scanning operations on GPU in a super-fast manner, which also collects rich information from an input boundary matrix for further parallel reduction operations on multicore with high efficiency. The HYPHA framework may provide a general purpose guidance to high performance computing on multiple hardware accelerators. To our best knowledge, HYPHA achieves the highest performance in PH matrix reduction execution. Our experiments show speedups of up to 116x against the standard PH algorithm. Compared to the state-of-the-art parallel PH software packages, such as PHAT and DIPHA, HYPHA outperforms their fastest PH matrix reduction algorithms by factor up to 2.3x. 
    more » « less
  5. With the rise of machine learning and data analytics, the ability to process large and diverse sets of data efficiently has become crucial. Research has shown that data transformation is a key performance bottleneck for applications across a variety of domains, from data analytics to scientific computing. Custom hardware accelerators and GPU implementations targeting specific data transformation tasks can alleviate the problem, but suffer from narrow applicability and lack of generality.To tackle this problem, we propose a GPU-accelerated data transformation engine grounded on pushdown transducers. We define an extended pushdown transducer abstraction (effPDT) that allows expressing a wide range of data transformations in a memory-efficient fashion, and is thus amenable for GPU deployment. The effPDT execution engine utilizes a data streaming model that reduces the application’s memory requirements significantly, facilitating deployment on high- and low-end systems. We showcase our GPU-accelerated engine on a diverse set of transformation tasks covering data encoding/decoding, parsing and querying of structured data, and matrix transformation, and we evaluate it against publicly available CPU and GPU library implementations of the considered data transformation tasks. To understand the benefits of the effPDT abstraction, we extend our data transformation engine to also support finite state transducers (FSTs), we map the considered data transformation tasks on FSTs, and we compare the performance and resource requirements of the FST-based and the effPDT-based implementations. 
    more » « less