Title: Distributed-memory multi-GPU block-sparse tensor contraction for electronic structure
Many domains of scientific simulation (chemistry, condensed matter physics, data science) increasingly eschew dense tensors for block-sparse tensors, sometimes with additional structure (recursive hierarchy, rank sparsity, etc.). Distributed-memory parallel computation with block-sparse tensorial data is paramount to minimize the time-to-solution (e.g., to study dynamical problems or for real-time analysis) and to accommodate problems of realistic size that are too large to fit into the host/device memory of a single node equipped with accelerators. Unfortunately, computation with such irregular data structures is a poor match to the dominant imperative, bulk-synchronous parallel programming model. In this paper, we focus on the critical element of block-sparse tensor algebra, namely binary tensor contraction, and report on an efficient and scalable implementation using the task-focused PaRSEC runtime. High performance of the block-sparse tensor contraction on the Summit supercomputer is demonstrated for synthetic data as well as for real data involved in electronic structure simulations of unprecedented size.
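As a rough illustration of the block-sparse contraction pattern the abstract refers to (and not the PaRSEC-based, multi-GPU implementation itself), the following Python sketch stores only nonzero blocks and performs one dense GEMM per contributing block pair; the dictionary layout and function name are hypothetical.

```python
# Minimal sketch of a block-sparse contraction C[i,j] += sum_k A[i,k] @ B[k,j],
# where A and B store only their nonzero blocks keyed by block indices.
# This illustrates the data layout and block-level GEMM pattern only; it is not
# the task-based, distributed implementation described in the paper.
import numpy as np

def block_sparse_gemm(A_blocks, B_blocks):
    """A_blocks: {(i, k): ndarray}, B_blocks: {(k, j): ndarray} -> {(i, j): ndarray}."""
    C_blocks = {}
    for (i, k), a in A_blocks.items():
        for (k2, j), b in B_blocks.items():
            if k != k2:          # only matching inner block indices contribute
                continue
            c = a @ b            # dense GEMM on the surviving block pair
            if (i, j) in C_blocks:
                C_blocks[(i, j)] += c
            else:
                C_blocks[(i, j)] = c
    return C_blocks

# Tiny example: a 2x2 grid of 4x4 blocks with half of the blocks absent.
rng = np.random.default_rng(0)
A = {(0, 0): rng.standard_normal((4, 4)), (1, 1): rng.standard_normal((4, 4))}
B = {(0, 1): rng.standard_normal((4, 4)), (1, 0): rng.standard_normal((4, 4))}
C = block_sparse_gemm(A, B)      # produces only blocks (0, 1) and (1, 0)
```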
Award ID(s): 1931387, 1931384, 1931347
PAR ID: 10301393
Author(s) / Creator(s): ; ; ; ; ; ;
Date Published:
Journal Name: 2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
Page Range / eLocation ID: 537 to 546
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this
  1. Tensor contractions are ubiquitous in computational chemistry and physics, where tensors generally represent states or operators and contractions express the algebra of these quantities. In this context, the states and operators often preserve physical conservation laws, which are manifested as group symmetries in the tensors. These group symmetries imply that each tensor has block sparsity and can be stored in a reduced form. For nontrivial contractions, the memory footprint and cost are lowered, respectively, by a linear and a quadratic factor in the number of symmetry sectors. State-of-the-art tensor contraction software libraries exploit this opportunity by iterating over blocks or using general block-sparse tensor representations. Both approaches entail overhead in performance and code complexity. With intuition aided by tensor diagrams, we present a technique, irreducible representation alignment, which enables efficient handling of Abelian group symmetries via only dense tensors, by using contraction-specific reduced forms. This technique yields a general algorithm for arbitrary group symmetric contractions, which we implement in Python and apply to a variety of representative contractions from quantum chemistry and tensor network methods. As a consequence of relying on only dense tensor contractions, we can easily make use of efficient batched matrix multiplication via Intel’s MKL and distributed tensor contraction via the Cyclops library, achieving good efficiency and parallel scalability on up to 4096 Knights Landing cores of a supercomputer.
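As a simplified, hypothetical illustration of how an Abelian symmetry induces the block sparsity described above (not the irreducible-representation-alignment algorithm itself), the sketch below contracts two charge-conserving rank-2 tensors sector by sector, so the contraction reduces to a batch of small dense GEMMs.

```python
# Toy version of Abelian-symmetry block sparsity: a rank-2 tensor that conserves
# a charge has nonzero blocks only where the row and column symmetry sectors
# match, so its contraction reduces to one dense GEMM per sector.
import numpy as np

n_sectors, dim = 3, 4
rng = np.random.default_rng(1)

# One dense block per symmetry sector for each operator (block-diagonal structure).
A = {q: rng.standard_normal((dim, dim)) for q in range(n_sectors)}
B = {q: rng.standard_normal((dim, dim)) for q in range(n_sectors)}

# Contraction: only sector-matching blocks survive, giving a batch of small GEMMs.
C = {q: A[q] @ B[q] for q in range(n_sectors)}

# Cost drops from (n_sectors * dim)^3 dense flops to n_sectors * dim^3,
# i.e., by a quadratic factor in the number of sectors, as quoted above.
```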
  2. This article presents a code generator for sparse tensor contraction computations. It leverages a mathematical representation of loop nest computations in the sparse polyhedral framework (SPF), which extends the polyhedral model to support non-affine computations, such as those that arise in sparse tensors. SPF is extended to perform layout specification, optimization, and code generation of sparse tensor code: (1) We develop a polyhedral layout specification that decouples iteration spaces for layout and computation; and (2) we develop efficient co-iteration of sparse tensors by combining polyhedra scanning over the layout of one sparse tensor with the synthesis of code to find corresponding elements in other tensors through an SMT solver. We compare the generated code with that produced by a state-of-the-art tensor compiler, TACO. We achieve on average 1.63× faster parallel performance than TACO on sparse-sparse co-iteration and describe how to improve that to 2.72× average speedup by switching the find algorithms. We also demonstrate that decoupling iteration spaces of layout and computation enables additional layout and computation combinations to be supported. 
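The generated code described here co-iterates one tensor's layout while synthesizing a "find" into the other; as a hand-written, assumption-laden stand-in for that pattern, the sketch below co-iterates two sorted coordinate lists with a two-pointer merge to compute a sparse elementwise product.

```python
# Hand-written sketch of sparse-sparse co-iteration for the elementwise product
# of two sparse vectors stored in sorted coordinate form. The article's generated
# code scans one tensor's layout and synthesizes a "find" into the other; here a
# simple two-pointer merge plays both roles.
def sparse_multiply(idx_a, val_a, idx_b, val_b):
    """idx_* are sorted index lists; returns (indices, values) of a .* b."""
    out_idx, out_val = [], []
    ia = ib = 0
    while ia < len(idx_a) and ib < len(idx_b):
        if idx_a[ia] == idx_b[ib]:            # both tensors have this coordinate
            out_idx.append(idx_a[ia])
            out_val.append(val_a[ia] * val_b[ib])
            ia += 1
            ib += 1
        elif idx_a[ia] < idx_b[ib]:           # advance whichever index lags behind
            ia += 1
        else:
            ib += 1
    return out_idx, out_val

# Example: only coordinate 4 is nonzero in both inputs.
print(sparse_multiply([1, 4, 7], [2.0, 3.0, 5.0], [0, 4, 9], [1.0, 10.0, 2.0]))
# -> ([4], [30.0])
```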
  3. We consider the problem of low-rank approximation of massive dense nonnegative tensor data, for example, to discover latent patterns in video and imaging applications. As the size of data sets grows, single workstations are hitting bottlenecks in both computation time and available memory. We propose a distributed-memory parallel computing solution to handle massive data sets, loading the input data across the memories of multiple nodes, and performing efficient and scalable parallel algorithms to compute the low-rank approximation. We present a software package called Parallel Low-rank Approximation with Nonnegativity Constraints, which implements our solution and allows for extension in terms of data (dense or sparse, matrices or tensors of any order), algorithm (e.g., from multiplicative updating techniques to alternating direction method of multipliers), and architecture (we exploit GPUs to accelerate the computation in this work). We describe our parallel distributions and algorithms, which are careful to avoid unnecessary communication and computation, show how to extend the software to include new algorithms and/or constraints, and report efficiency and scalability results for both synthetic and real-world data sets.
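For orientation only, here is a single-node sketch of the multiplicative-update rule for nonnegative matrix factorization, the simplest member of the algorithm family mentioned above; it omits the distributed data layout, sparse and higher-order tensor cases, and GPU acceleration the package actually provides, and the function name is hypothetical.

```python
# Single-node multiplicative-update NMF: approximate a nonnegative matrix X as
# W @ H with W >= 0 and H >= 0. The distributed package spreads the data and
# these updates across nodes and GPUs; this toy version does not attempt that.
import numpy as np

def nmf_mu(X, rank, iters=200, eps=1e-9, seed=0):
    """Return nonnegative factors W (m x rank) and H (rank x n) for X (m x n)."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W = rng.random((m, rank))
    H = rng.random((rank, n))
    for _ in range(iters):
        H *= (W.T @ X) / (W.T @ W @ H + eps)   # update H; ratios keep it nonnegative
        W *= (X @ H.T) / (W @ H @ H.T + eps)   # update W; ratios keep it nonnegative
    return W, H

X = np.random.default_rng(1).random((60, 40))
W, H = nmf_mu(X, rank=5)
print(np.linalg.norm(X - W @ H) / np.linalg.norm(X))  # relative residual
```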
  4. Sparse tensor networks represent contractions over multiple sparse tensors. Tensor contractions are higher-order analogs of matrix multiplication. Tensor networks arise commonly in many domains of scientific computing and data science. Such networks are typically computed using a tree of binary contractions. Several critical inter-dependent aspects must be considered in the generation of efficient code for a contraction tree, including sparse tensor layout mode order, loop fusion to reduce intermediate tensors, and the mutual dependence of loop order, mode order, and contraction order. We propose CoNST, a novel approach that considers these factors in an integrated manner using a single formulation. Our approach creates a constraint system that encodes these decisions and their interdependence, while aiming to produce reduced-order intermediate tensors via fusion. The constraint system is solved by the Z3 SMT solver and the result is used to create the desired fused loop structure and tensor mode layouts for the entire contraction tree. This structure is lowered to the IR of the TACO compiler, which is then used to generate executable code. Our experimental evaluation demonstrates significant performance improvements over current state-of-the-art sparse tensor compiler/library alternatives. 
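As a minimal illustration of a contraction tree (and nothing more; the constraint-based fusion, loop-order, and mode-layout choices described above happen at code-generation time), the sketch below evaluates a three-tensor network as two binary contractions with an explicit intermediate, using dense NumPy tensors for clarity.

```python
# Toy evaluation of a three-tensor network as a tree of binary contractions;
# each pairwise step materializes an intermediate tensor. CoNST's contribution
# (choosing fused loop structure and sparse mode layouts via an SMT solver)
# is not modeled here.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 6))       # A[i, j]
B = rng.standard_normal((6, 5))       # B[j, k]
C = rng.standard_normal((5, 7))       # C[k, l]

# Contraction tree ((A, B), C): two binary contractions, one intermediate.
T = np.einsum('ij,jk->ik', A, B)      # intermediate T[i, k]
D = np.einsum('ik,kl->il', T, C)      # final result D[i, l]

# Same result as the single ternary contraction, evaluated pairwise.
assert np.allclose(D, np.einsum('ij,jk,kl->il', A, B, C))
```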
  5. The matricized-tensor times Khatri-Rao product (MTTKRP) is the computational bottleneck for algorithms computing CP decompositions of tensors. In this work, we develop shared-memory parallel algorithms for MTTKRP involving dense tensors. The algorithms cast nearly all of the computation as matrix operations in order to use optimized BLAS subroutines, and they avoid reordering tensor entries in memory. We use our parallel implementation to compute a CP decomposition of a neuroimaging data set and achieve a speedup of up to 7.4X over existing parallel software. 
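A dense reference formulation of the MTTKRP for the first mode of a 3-way tensor is sketched below under assumed shapes; the algorithms in the abstract cast the computation as matrix operations for BLAS and avoid reordering tensor entries in memory, which this direct version does not attempt.

```python
# Dense reference sketch of the mode-0 MTTKRP for a 3-way tensor:
# M[i, r] = sum_{j,k} X[i, j, k] * B[j, r] * C[k, r],
# i.e., the mode-0 matricization of X times the Khatri-Rao product of C and B.
# It explicitly materializes the Khatri-Rao product and reorders X, which the
# paper's algorithms avoid; use it only as a correctness reference.
import numpy as np

def mttkrp_mode0(X, B, C):
    """X: (I, J, K) tensor; B: (J, R); C: (K, R); returns (I, R)."""
    I, J, K = X.shape
    R = B.shape[1]
    # Khatri-Rao (columnwise Kronecker) product of C and B: shape (K*J, R).
    KR = np.einsum('kr,jr->kjr', C, B).reshape(K * J, R)
    # Mode-0 matricization of X with a matching (k, j) column ordering.
    X0 = X.transpose(0, 2, 1).reshape(I, K * J)
    return X0 @ KR

X = np.random.default_rng(2).random((4, 5, 6))
B = np.random.default_rng(3).random((5, 3))
C = np.random.default_rng(4).random((6, 3))
M = mttkrp_mode0(X, B, C)
# Cross-check against a direct einsum of the same contraction.
assert np.allclose(M, np.einsum('ijk,jr,kr->ir', X, B, C))
```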