Alternating Least Squares (ALS) is a classic matrix-factorization algorithm widely used in recommendation systems. Existing efforts focus on parallelizing ALS on multi-/many-core platforms to handle large datasets. Recently, an optimized ALS variant called eALS was proposed, which yields significantly lower time complexity and higher recommendation accuracy than ALS. However, it is challenging to parallelize eALS on modern parallel architectures (e.g., CPUs and GPUs) because: 1) eALS's data dependences prevent fine-grained parallel execution, so eALS cannot fully utilize the GPU's massive parallelism; 2) the sparsity of the input data causes poor data locality and unbalanced workloads; and 3) its large memory footprint cannot fit into the GPU's limited on-device memory, particularly for real-world large datasets. This paper proposes HEALS, the first efficient CPU/GPU heterogeneous recommendation system based on fast eALS, built from a set of system optimizations. HEALS employs newly designed, architecture-adaptive data formats to achieve load balance and good data locality on both the CPU and the GPU. HEALS also presents a CPU/GPU collaboration model that exploits both task parallelism and data parallelism, and it optimizes this model with data-communication overlapping and dynamic workload partitioning between the CPU and the GPU. Moreover, HEALS is further enhanced by various parallel techniques (e.g., loop unrolling, vectorization, and GPU parallel reduction). Evaluation results show that HEALS outperforms other state-of-the-art baselines in both performance and recommendation quality. In particular, HEALS achieves up to 5.75x better performance than a state-of-the-art ALS GPU library. This work also demonstrates the possibility of conducting fast recommendation on large datasets with constrained (or relaxed) hardware resources, e.g., a single CPU/GPU node.
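HEALS itself is not shown at the code level here, but one of the building blocks the abstract names, GPU parallel reduction, is easy to illustrate. Below is a minimal, self-contained CUDA sketch of a shared-memory block reduction (the kind of primitive an eALS-style update loop might use to accumulate partial sums). It is our own illustration under assumed names, not HEALS code.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Shared-memory parallel reduction: each block sums a chunk of the input and
// writes one partial sum; the host (or a second kernel) finishes the job.
__global__ void blockReduceSum(const float* in, float* partial, int n) {
    extern __shared__ float sdata[];
    const int tid = threadIdx.x;
    const int stride = blockDim.x;
    const int i = blockIdx.x * stride * 2 + tid;

    // Each thread loads up to two elements before the in-block tree reduction.
    float v = 0.0f;
    if (i < n)          v += in[i];
    if (i + stride < n) v += in[i + stride];
    sdata[tid] = v;
    __syncthreads();

    for (int s = stride / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) partial[blockIdx.x] = sdata[0];
}

int main() {
    const int n = 1 << 20, threads = 256;
    const int blocks = (n + threads * 2 - 1) / (threads * 2);
    std::vector<float> h(n, 1.0f);

    float *d_in, *d_part;
    cudaMalloc((void**)&d_in, n * sizeof(float));
    cudaMalloc((void**)&d_part, blocks * sizeof(float));
    cudaMemcpy(d_in, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    blockReduceSum<<<blocks, threads, threads * sizeof(float)>>>(d_in, d_part, n);

    std::vector<float> part(blocks);
    cudaMemcpy(part.data(), d_part, blocks * sizeof(float), cudaMemcpyDeviceToHost);
    float sum = 0.0f;
    for (float p : part) sum += p;          // tiny tail reduction on the CPU
    printf("sum = %.0f (expected %d)\n", sum, n);

    cudaFree(d_in); cudaFree(d_part);
    return 0;
}
```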
GPU-SFFT: A GPU based parallel algorithm for computing the Sparse Fast Fourier Transform (SFFT) of k-sparse signals
The Sparse Fast Fourier Transform (MIT-SFFT) is an algorithm that computes the discrete Fourier transform of a signal in sublinear time, i.e., with runtime proportional to the sparsity level k, where k is the number of non-zero coefficients of the signal in the frequency domain. In this paper, we propose a highly scalable GPU-based parallel algorithm called GPU-SFFT for computing the SFFT of k-sparse signals. Our implementation of GPU-SFFT is based on parallel optimizations that lead to enormous speedups. These include carefully crafting parallel regions in the sequential MIT-SFFT code to exploit parallelism and minimizing data movement between the CPU and the GPU. This allows us to exploit the extreme parallelism of CPU-GPU architectures and to maximize the number of concurrent threads executing instructions. Our experiments show that these CPU-GPU specific optimizations lead to an enormous decrease in the run time needed to compute the SFFT. Further, we show that GPU-SFFT is 38x faster than MIT-SFFT and 5x faster than cuFFT, the NVIDIA CUDA Fast Fourier Transform (FFT) library. The source code for GPU-SFFT is available at https://github.com/pcdslab.
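The GPU-SFFT source is linked above; the sketch below is not taken from it. It only illustrates, with our own names and a placeholder kernel, the kind of CPU-GPU data-movement discipline the abstract emphasizes: pinned host memory plus CUDA streams, so that host-device copies overlap with kernel execution instead of serializing behind it.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder for real per-chunk work (e.g., one stage of a transform).
__global__ void scale(float* d, int n, float a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= a;
}

int main() {
    const int n = 1 << 22, chunks = 4, chunk = n / chunks;
    float *h, *d;
    cudaMallocHost((void**)&h, n * sizeof(float));   // pinned memory: required for truly async copies
    cudaMalloc((void**)&d, n * sizeof(float));
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    cudaStream_t s[chunks];
    for (int c = 0; c < chunks; ++c) cudaStreamCreate(&s[c]);

    // Each chunk's copy-in, kernel, and copy-out go into its own stream, so the
    // hardware can overlap chunk c's transfers with chunk c-1's compute.
    for (int c = 0; c < chunks; ++c) {
        size_t off = (size_t)c * chunk, bytes = chunk * sizeof(float);
        cudaMemcpyAsync(d + off, h + off, bytes, cudaMemcpyHostToDevice, s[c]);
        scale<<<(chunk + 255) / 256, 256, 0, s[c]>>>(d + off, chunk, 2.0f);
        cudaMemcpyAsync(h + off, d + off, bytes, cudaMemcpyDeviceToHost, s[c]);
    }
    cudaDeviceSynchronize();
    printf("h[0] = %.1f, h[n-1] = %.1f\n", h[0], h[n - 1]);   // both 2.0 if the pipeline ran

    for (int c = 0; c < chunks; ++c) cudaStreamDestroy(s[c]);
    cudaFreeHost(h); cudaFree(d);
    return 0;
}
```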
- Award ID(s): 1925960
- PAR ID: 10140312
- Date Published:
- Journal Name: Proceedings of IEEE International Conference on Big Data (Big Data)
- Page Range / eLocation ID: 3303 to 3311
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
- Multi-pattern matching is widely used in modern software for applications requiring high throughput, such as protein search, network traffic inspection, and virus or spam detection. Graphics Processing Units (GPUs) excel at executing massively parallel workloads. Regular expression (regex) matching is typically performed by simulating the execution of deterministic finite automata (DFAs) or nondeterministic finite automata (NFAs). Natural implementations of these automata-simulation algorithms on GPUs are highly inefficient because they give rise to irregular memory access patterns. This paper presents HybridSA, a heterogeneous CPU-GPU parallel engine for multi-pattern matching. HybridSA uses bit parallelism to efficiently simulate NFAs on GPUs, thus reducing the number of memory accesses and increasing the throughput. Our bit-parallel algorithms extend the classical shift-and algorithm for string matching to a large class of regular expressions and reduce automata simulation to a small number of bitwise operations. We have developed a compiler to translate regular expressions into bit masks, perform optimizations, and choose the best algorithms to run on the GPU. The majority of the regular expressions are accelerated on the GPU, while the patterns that exhibit random memory accesses are executed on the CPU in parallel. We evaluate HybridSA against state-of-the-art CPU and GPU engines, as well as a hybrid combination of the two. HybridSA achieves between 4 and 60 times higher throughput than the state-of-the-art CPU engine and between 4 and 233 times higher throughput than the state-of-the-art GPU engine across a collection of real-world benchmarks. (A minimal sketch of the classical shift-and primitive appears after this list.)
- Studying dynamic functional connectivity (DFC) using fMRI data of the brain gives neuroscientists much richer information than studying the brain as a static entity. Mining the dynamic connectivity graphs from these brain studies can be used to classify diseased versus healthy brains. However, constructing and mining dynamic functional connectivity graphs of the brain can be time consuming due to the size of fMRI data. In this paper, we propose a highly scalable GPU-based parallel algorithm called GPU-DFC for computing the dynamic functional connectivity of fMRI data at both the region and voxel level. Our algorithm sparsifies the correlation matrices and stores them in CSR format. Further reduction in the size of the correlation matrices is achieved by parallel decomposition techniques. Our GPU-DFC algorithm achieves a 2x speed-up for computing dynamic correlations compared to state-of-the-art GPU-based techniques and more than a 40x speed-up compared to a sequential CPU version. In terms of storage, our proposed matrix decomposition technique reduces the size of the correlation matrices by more than 100x. Values reconstructed from the decomposed matrices are comparable to the correlations computed from the original data. The implementation is available under the GPL license on our lab's GitHub portal (https://github.com/pcdslab/GPU-DFC). (A sketch of thresholding a correlation matrix into CSR appears after this list.)
- In this paper we consider the following sparse recovery problem. We have query access to a vector x ∈ ℝ^N such that x̂ = Fx is k-sparse (or nearly k-sparse) for some orthogonal transform F. The goal is to output an approximation (in an ℓ2 sense) to x̂ in sublinear time. This problem has been well studied in the special case that F is the Discrete Fourier Transform (DFT), and a long line of work has resulted in sparse Fast Fourier Transforms that run in time O(k · polylog N). However, for transforms F other than the DFT (or closely related transforms like the Discrete Cosine Transform), the question is much less settled. In this paper we give sublinear-time algorithms, running in time poly(k log N), for solving the sparse recovery problem for orthogonal transforms F that arise from orthogonal polynomials. More precisely, our algorithm works for any F that is an orthogonal polynomial transform derived from Jacobi polynomials. The Jacobi polynomials are a large class of classical orthogonal polynomials (and include Chebyshev and Legendre polynomials as special cases), and show up extensively in applications like numerical analysis and signal processing. One caveat of our work is that we require an assumption on the sparsity structure of the sparse vector, although we note that vectors with random support have this property with high probability. Our approach is to give a very general reduction from the k-sparse recovery problem to the 1-sparse recovery problem that holds for any flat orthogonal polynomial transform; we then solve this 1-sparse recovery problem for transforms derived from Jacobi polynomials. Frequently, sparse FFT algorithms are described as implementing such a reduction; however, the technical details of those works are quite specific to the Fourier transform, and moreover their actual implementations do not use the 1-sparse algorithm as a black box. In this work we give a reduction that works for a broad class of orthogonal polynomial families and that uses any 1-sparse recovery algorithm as a black box. (The Fourier special case of the 1-sparse step is sketched after this list.)
- We present an optimized Floyd-Warshall (FW) algorithm that computes the all-pairs shortest paths (APSP) on GPU-accelerated clusters. The FW algorithm, due to its structural similarity to matrix multiplication, is well suited for highly parallel GPU architectures. To achieve high parallel efficiency, we address two key algorithmic challenges: reducing high communication overhead and coping with limited GPU memory. To reduce communication costs, we redesign the parallel algorithm to (a) expose more parallelism, (b) aggressively overlap communication and computation with pipelined and asynchronous scheduling of operations, and (c) use tailored MPI collectives. To cope with limited GPU memory, we employ an offload model, where the data resides on the host and is transferred to the GPU on demand. The proposed optimizations are supported with detailed performance models for tuning. Our optimized parallel FW implementation is up to 5x faster than a strong baseline and achieves 8.1 PetaFLOPS on 256 nodes of the Summit supercomputer at Oak Ridge National Laboratory. This performance represents 70% of the theoretical peak and 80% parallel efficiency. The offload algorithm can handle 2.5x larger graphs with a 20% increase in overall running time. (A minimal GPU FW relaxation kernel is sketched after this list.)
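For the HybridSA entry above, the core bit-parallel primitive is the classical shift-and algorithm. The CUDA sketch below shows plain shift-and for a single literal pattern (up to 64 characters), with each thread scanning an overlapping chunk of the text; it is our own illustration with assumed names, whereas HybridSA extends the idea to a large class of regular expressions and adds a compiler and a CPU fallback.

```cuda
#include <cstdio>
#include <string>
#include <cuda_runtime.h>

// Bit-parallel shift-and: B[c] has bit j set iff pattern[j] == c.
// Each thread scans one chunk of the text, first warming up over the previous
// m-1 characters so its automaton state is correct at its first owned position.
__global__ void shiftAndCount(const char* text, int n, const unsigned long long* B,
                              int m, int chunk, int* count) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    int start = t * chunk;
    if (start >= n) return;
    int end = min(start + chunk, n);
    int from = max(0, start - (m - 1));
    unsigned long long D = 0ULL, accept = 1ULL << (m - 1);
    int local = 0;
    for (int i = from; i < end; ++i) {
        D = ((D << 1) | 1ULL) & B[(unsigned char)text[i]];
        if ((D & accept) && i >= start) ++local;   // count matches ending in [start, end)
    }
    if (local) atomicAdd(count, local);
}

int main() {
    std::string text = "abracadabra abracadabra", pat = "abra";
    int n = (int)text.size(), m = (int)pat.size();   // m must be <= 64 for a 64-bit state
    unsigned long long B[256] = {0};
    for (int j = 0; j < m; ++j) B[(unsigned char)pat[j]] |= 1ULL << j;

    char* dText; unsigned long long* dB; int* dCount;
    cudaMalloc((void**)&dText, n);
    cudaMalloc((void**)&dB, sizeof(B));
    cudaMalloc((void**)&dCount, sizeof(int));
    cudaMemcpy(dText, text.data(), n, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B, sizeof(B), cudaMemcpyHostToDevice);
    cudaMemset(dCount, 0, sizeof(int));

    const int chunk = 8, threads = 128;
    int workers = (n + chunk - 1) / chunk;
    shiftAndCount<<<(workers + threads - 1) / threads, threads>>>(dText, n, dB, m, chunk, dCount);

    int count = 0;
    cudaMemcpy(&count, dCount, sizeof(int), cudaMemcpyDeviceToHost);
    printf("matches of \"%s\": %d\n", pat.c_str(), count);   // expect 4
    cudaFree(dText); cudaFree(dB); cudaFree(dCount);
    return 0;
}
```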
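For the GPU-DFC entry, the "sparsify the correlation matrix and store it in CSR format" step can be illustrated as follows. This is a hedged sketch with synthetic data and our own kernel names: one pass counts per-row entries above a magnitude threshold, the host turns the counts into CSR row pointers with a prefix sum, and a second pass scatters column indices and values.

```cuda
#include <cstdio>
#include <vector>
#include <cmath>
#include <cuda_runtime.h>

// Pass 1: one thread per row counts correlations whose magnitude exceeds the threshold.
__global__ void countRow(const float* C, int n, float thr, int* rowCount) {
    int r = blockIdx.x * blockDim.x + threadIdx.x;
    if (r >= n) return;
    int c = 0;
    for (int j = 0; j < n; ++j)
        if (fabsf(C[r * n + j]) >= thr) ++c;
    rowCount[r] = c;
}

// Pass 2: one thread per row writes its surviving entries at the row's CSR offset.
__global__ void fillRow(const float* C, int n, float thr, const int* rowPtr,
                        int* colIdx, float* vals) {
    int r = blockIdx.x * blockDim.x + threadIdx.x;
    if (r >= n) return;
    int k = rowPtr[r];
    for (int j = 0; j < n; ++j) {
        float v = C[r * n + j];
        if (fabsf(v) >= thr) { colIdx[k] = j; vals[k] = v; ++k; }
    }
}

int main() {
    const int n = 512;
    const float thr = 0.8f;
    std::vector<float> C((size_t)n * n);
    for (int i = 0; i < n; ++i)                        // synthetic stand-in for a correlation matrix
        for (int j = 0; j < n; ++j)
            C[(size_t)i * n + j] = (i == j) ? 1.0f : 0.9f * cosf(0.1f * (i - j));

    float* dC; int* dCount;
    cudaMalloc((void**)&dC, C.size() * sizeof(float));
    cudaMalloc((void**)&dCount, n * sizeof(int));
    cudaMemcpy(dC, C.data(), C.size() * sizeof(float), cudaMemcpyHostToDevice);

    countRow<<<(n + 255) / 256, 256>>>(dC, n, thr, dCount);
    std::vector<int> cnt(n), rowPtr(n + 1, 0);
    cudaMemcpy(cnt.data(), dCount, n * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; ++i) rowPtr[i + 1] = rowPtr[i] + cnt[i];   // exclusive scan on the host
    int nnz = rowPtr[n];

    int *dRowPtr, *dColIdx; float* dVals;
    cudaMalloc((void**)&dRowPtr, (n + 1) * sizeof(int));
    cudaMalloc((void**)&dColIdx, nnz * sizeof(int));
    cudaMalloc((void**)&dVals, nnz * sizeof(float));
    cudaMemcpy(dRowPtr, rowPtr.data(), (n + 1) * sizeof(int), cudaMemcpyHostToDevice);
    fillRow<<<(n + 255) / 256, 256>>>(dC, n, thr, dRowPtr, dColIdx, dVals);
    cudaDeviceSynchronize();

    printf("kept %d of %d entries (%.1f%%)\n", nnz, n * n, 100.0 * nnz / (n * n));
    cudaFree(dC); cudaFree(dCount); cudaFree(dRowPtr); cudaFree(dColIdx); cudaFree(dVals);
    return 0;
}
```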
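For the orthogonal-polynomial sparse-recovery entry, the reduction bottoms out in a 1-sparse recovery step. For intuition only, here is the classical Fourier special case (the paper's actual contribution is the analogous, and much harder, step for Jacobi polynomial transforms): if x̂ = Fx has a single nonzero entry x̂_{f0} = v, the inverse DFT gives a pure tone in the time domain, so two time-domain queries already determine the spectrum.

```latex
% 1-sparse recovery, Fourier special case (intuition only).
\[
  x_t = \frac{v}{N}\, e^{2\pi i f_0 t / N}
  \;\Longrightarrow\;
  \frac{x_1}{x_0} = e^{2\pi i f_0 / N},
  \qquad
  f_0 = \frac{N}{2\pi}\,\arg\!\left(\frac{x_1}{x_0}\right) \bmod N,
  \qquad
  v = N\,x_0 .
\]
```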
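For the Floyd-Warshall entry, the reason FW maps well onto GPUs is that, for a fixed pivot k, every (i, j) relaxation is independent. The sketch below is the naive single-GPU version on a toy path graph; the blocking, communication overlap, and host-side offloading that the paper actually contributes are not shown, and all names here are ours.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define INF 1e9f

// One Floyd-Warshall relaxation step for a fixed pivot k: every (i, j) pair is
// independent, so it maps directly onto a 2-D grid of GPU threads.
__global__ void fwRelax(float* dist, int n, int k) {
    int i = blockIdx.y * blockDim.y + threadIdx.y;
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n || j >= n) return;
    float via = dist[i * n + k] + dist[k * n + j];
    if (via < dist[i * n + j]) dist[i * n + j] = via;
}

int main() {
    const int n = 1024;
    std::vector<float> h((size_t)n * n, INF);
    for (int i = 0; i < n; ++i) h[(size_t)i * n + i] = 0.0f;
    for (int i = 0; i + 1 < n; ++i) h[(size_t)i * n + i + 1] = 1.0f;   // toy path graph 0 -> 1 -> ... -> n-1

    float* d;
    cudaMalloc((void**)&d, h.size() * sizeof(float));
    cudaMemcpy(d, h.data(), h.size() * sizeof(float), cudaMemcpyHostToDevice);

    dim3 block(16, 16), grid((n + 15) / 16, (n + 15) / 16);
    for (int k = 0; k < n; ++k)           // the pivot loop stays sequential
        fwRelax<<<grid, block>>>(d, n, k);

    cudaMemcpy(h.data(), d, h.size() * sizeof(float), cudaMemcpyDeviceToHost);
    printf("dist(0, n-1) = %.0f (expected %d)\n", h[n - 1], n - 1);
    cudaFree(d);
    return 0;
}
```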