

Search for: All records

Award ID contains: 1643056

Note: When clicking on a Digital Object Identifier (DOI) link, you will be taken to an external site maintained by the publisher. Some full-text articles may not yet be available free of charge during the embargo period.

Some links on this page may take you to non-federal websites whose policies may differ from those of this site.

  1. Algebraic Multigrid (AMG) is an extremely popular solver and preconditioner for linear systems whose matrices arise from the discretization of elliptic operators. However, its performance and scalability for large systems obtained from unstructured discretizations appear less consistent than those of geometric multigrid (GMG). To a large extent, this is due to the loss of sparsity on the coarser grids and the resulting increase in the cost, and loss of scalability, of the matrix-vector multiplication. While there have been attempts to address this concern by designing sparsification algorithms, these affect the overall convergence. In this work, we focus on designing a specialized matrix-vector multiplication (matvec) that achieves high performance and scalability across a wide range of sparsity levels. We evaluate distributed- and shared-memory implementations of our matvec operator, demonstrate the improvements to its scalability and performance within an AMG hierarchy, and finally compare it with PETSc.
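The specialized operator itself is not spelled out in the abstract above, but the kernel it generalizes is the standard CSR sparse matrix-vector product, whose per-row cost grows as coarse AMG levels lose sparsity. The sketch below shows that baseline kernel in Python/NumPy with illustrative (not measured) sizes and densities; it is not the paper's matvec.

```python
# Baseline CSR sparse matrix-vector product: the kernel whose cost grows as
# coarse AMG levels lose sparsity. This is NOT the specialized operator from
# the paper, only the standard reference implementation.
import numpy as np
from scipy.sparse import random as sparse_random

def csr_matvec(indptr, indices, data, x):
    """y = A @ x for a CSR matrix given by (indptr, indices, data)."""
    n = len(indptr) - 1
    y = np.zeros(n)
    for i in range(n):                                # one row per output entry
        for k in range(indptr[i], indptr[i + 1]):
            y[i] += data[k] * x[indices[k]]
    return y

# Coarser levels of an AMG hierarchy are smaller but denser; the sizes and
# densities below are illustrative, not measured values.
for n, density in [(10000, 0.0005), (1000, 0.01), (100, 0.2)]:
    A = sparse_random(n, n, density=density, format="csr", random_state=0)
    x = np.ones(n)
    y = csr_matvec(A.indptr, A.indices, A.data, x)
    assert np.allclose(y, A @ x)
    print(f"n={n:6d}  avg nnz/row = {A.nnz / n:.1f}")
```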
  2. Virtually all real-valued computations are carried out using floating-point data types and operations. With increasing emphasis on overall computational efficiency, compilers are increasingly attempting to optimize floating-point expressions. Practical reasoning about the correctness of these optimizations requires error analysis procedures that are rigorous (ideally, able to generate proof certificates), can handle a wide variety of operators (e.g., transcendentals), and can handle all common program constructs (e.g., conditionals and loops). Unfortunately, none of today's approaches achieves this combination. This position paper summarizes recent progress in the community on this topic. It then showcases the component techniques within our own rigorous floating-point precision tuning framework, FPTuner, essentially offering a collection of “grab and go” tools that others can benefit from. Finally, we present FPTuner's limitations and describe how we can exploit contemporaneous research to improve it.
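As background to the kind of reasoning the abstract above refers to, the sketch below empirically measures the rounding error of a double-precision polynomial evaluation against an exact rational reference. It only illustrates the problem; certified tools such as FPTuner derive guaranteed bounds rather than sampling them, and the polynomial and input here are made up for the example.

```python
# Empirically measuring the rounding error of a double-precision evaluation
# against an exact rational reference. Rigorous tools (e.g., FPTuner) bound
# this error a priori instead of sampling it; the polynomial and the input
# below are made-up examples chosen to trigger cancellation.
from fractions import Fraction

def horner_float(coeffs, x):
    """Evaluate a polynomial (highest-degree coefficient first) in double precision."""
    acc = 0.0
    for c in coeffs:
        acc = acc * x + c
    return acc

def horner_exact(coeffs, x):
    """The same Horner recurrence carried out exactly with rationals."""
    acc = Fraction(0)
    xq = Fraction(x)
    for c in coeffs:
        acc = acc * xq + Fraction(c)
    return acc

coeffs = [1.0, -6.0, 12.0, -8.0]     # (x - 2)^3, which nearly cancels near x = 2
x = 2.000000477
approx = horner_float(coeffs, x)
exact = horner_exact(coeffs, x)
print("double precision :", approx)
print("absolute error   :", float(Fraction(approx) - exact))
# The standard model bounds each operation fl(a op b) = (a op b)(1 + d) with
# |d| <= 2^-53; certified analyses compose such bounds over whole expressions,
# including transcendentals, conditionals, and loops.
```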
  3. We present a scalable distributed-memory library for generating, and computing with, structured dense matrices, such as those produced by boundary integral equation formulations. Such matrices are dense, but have special structure that can be exploited to obtain efficient storage and matrix-vector product evaluations and, consequently, the fast solution of linear systems. At the core of the methods we use is the observation that off-diagonal blocks of such matrices have low numerical rank, and that this property can be exploited in a multi-level fashion. In this work we focus on the Hierarchically Semi-Separable (HSS) representation. We present algorithms for building and using HSS representations that are parallelized using MPI and CUDA to leverage state-of-the-art heterogeneous clusters. The efficiency of our methods and implementation is demonstrated on large dense matrices obtained from a boundary integral equation formulation of the Laplace equation with Dirichlet boundary conditions. We demonstrate excellent (linear) scalability on up to 128 GPUs across 128 nodes. Our codes will lay the foundation for fast direct solvers for elliptic problems.
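A quick way to see the property the abstract above exploits is to build a small kernel matrix and check the numerical rank of an off-diagonal block. The sketch below uses a toy 1-D log kernel as a stand-in for a Laplace single-layer kernel and truncates its SVD; the kernel, sizes, and tolerance are illustrative choices, not the paper's setup.

```python
# Checking the property HSS representations exploit: off-diagonal blocks of
# kernel matrices from boundary integral formulations have rapidly decaying
# singular values. Toy 1-D log kernel; not the paper's solver or data.
import numpy as np

n = 512
pts = np.linspace(0.0, 1.0, n)
X, Y = np.meshgrid(pts, pts, indexing="ij")
with np.errstate(divide="ignore"):
    A = np.log(np.abs(X - Y))          # stand-in for a Laplace single-layer kernel
np.fill_diagonal(A, 0.0)               # crude fix of the diagonal singularity

B = A[: n // 2, n // 2 :]              # an off-diagonal block
U, s, Vt = np.linalg.svd(B, full_matrices=False)

tol = 1e-8                             # relative truncation tolerance (illustrative)
rank = int(np.sum(s > tol * s[0]))     # numerical rank of the block
Bk = (U[:, :rank] * s[:rank]) @ Vt[:rank]
rel_err = np.linalg.norm(B - Bk) / np.linalg.norm(B)
print(f"block {B.shape}: numerical rank {rank}, relative error {rel_err:.2e}")
```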
  4. Load balancing and partitioning are critical for parallel computations. Popular partitioning strategies based on space-filling curves focus on dividing the work equally, and the partitions produced are independent of the architecture and the application. Given the ever-increasing relative cost of data movement and the increasing heterogeneity of our architectures, it is no longer sufficient to consider only an equal partitioning of work; minimizing communication costs is equally, if not more, important. Our hypothesis is that an unequal partitioning that significantly reduces communication costs can scale and perform better than conventional equal-work partitioning schemes, with the tradeoff depending on the architecture as well as the application. We validate our hypothesis in the context of a finite-element computation using adaptive mesh refinement. Our central contribution is a new partitioning scheme that minimizes the overall runtime of subsequent computations by performing architecture- and application-aware non-uniform work assignment, decreasing time-to-solution primarily by minimizing data movement. We evaluate our algorithm by comparing it against standard space-filling-curve-based partitioning algorithms and measuring time-to-solution as well as energy-to-solution for finite-element computations on adaptively refined meshes. We demonstrate excellent scalability of our new partitioning algorithm up to 262,144 cores on ORNL's Titan and show that the proposed scheme reduces overall energy as well as time-to-solution for application codes by up to 22.0%.
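For reference, the conventional baseline the abstract above compares against is an equal-count split of elements ordered along a space-filling curve. The sketch below implements a plain Morton (Z-order) partitioning of random 2-D points; the architecture- and communication-aware weighting proposed in the paper is not reproduced here, and all sizes are illustrative.

```python
# Conventional equal-work partitioning along a Morton (Z-order) space-filling
# curve -- the baseline scheme, not the architecture/communication-aware
# partitioner proposed in the paper. All sizes are illustrative.
import numpy as np

def morton2d(ix, iy, bits=16):
    """Interleave the bits of integer coordinates (ix, iy) into a Z-order key."""
    key = 0
    for b in range(bits):
        key |= ((int(ix) >> b) & 1) << (2 * b)
        key |= ((int(iy) >> b) & 1) << (2 * b + 1)
    return key

rng = np.random.default_rng(0)
pts = rng.random((1000, 2))                    # element centroids in the unit square
grid = (pts * (1 << 16)).astype(np.int64)      # quantize to a 2^16 x 2^16 grid
keys = np.array([morton2d(x, y) for x, y in grid])

order = np.argsort(keys)                       # order elements along the curve
nparts = 8
parts = np.array_split(order, nparts)          # equal-count ("equal-work") split
print([len(p) for p in parts])                 # each partition gets ~1000/8 elements
```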