GPU memory corruption and in particular double-bit errors (DBEs) remain one of the least understood aspects of HPC system reliability. Albeit rare, their occurrences always lead to job termination and can potentially cost thousands of node-hours, either from wasted computations or as the overhead from regular checkpointing needed to minimize the losses. As supercomputers and their components simultaneously grow in scale, density, failure rates, and environmental footprint, the efficiency of HPC operations becomes both an imperative and a challenge.
We examine DBEs using system telemetry data and logs collected from the Summit supercomputer, equipped with 27,648 Tesla V100 GPUs with 2nd-generation high-bandwidth memory (HBM2). Using exploratory data analysis and statistical learning, we extract several insights about memory reliability in such GPUs. We find that GPUs with prior DBE occurrences are prone to experience them again due to otherwise harmless factors, correlate this phenomenon with GPU placement, and suggest manufacturing variability as a factor. On the general population of GPUs, we link DBEs to short- and long-term high power consumption modes while finding no significant correlation with higher temperatures. We also show that the workload type can be a factor in memory’s propensity to corruption.
Least Squares on GPUs in Multiple Double Precision
This paper describes the application of the code generated by the CAMPARY software to accelerate solving linear systems in the least squares sense on Graphics Processing Units (GPUs), in double double, quad double, and octo double precision. The goal is to use accelerators to offset the cost overhead caused by multiple double precision arithmetic. For the blocked Householder QR and the back substitution, of interest are those dimensions at which teraflop performance is attained. The other question of interest is the cost overhead factor that appears each time the precision is doubled.
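As a rough illustration of what multiple double arithmetic involves (this is not CAMPARY's generated code; the dd type and function names below are ours), a double double value can be stored as an unevaluated sum of two doubles, and addition can be built from error-free transformations such as Knuth's two-sum. A minimal CUDA/C++ sketch of the simplest variant:

    #include <cstdio>

    // A double double value stored as an unevaluated sum hi + lo, with |lo| much smaller than |hi|.
    struct dd { double hi, lo; };

    // Error-free transformation (Knuth's two-sum): s + e equals a + b exactly.
    __host__ __device__ inline void two_sum(double a, double b, double &s, double &e) {
        s = a + b;
        double bb = s - a;
        e = (a - (s - bb)) + (b - bb);
    }

    // Double double addition, simplest variant: add the high parts exactly,
    // fold the low parts into the error term, then renormalize.
    __host__ __device__ inline dd dd_add(dd a, dd b) {
        double s, e;
        two_sum(a.hi, b.hi, s, e);
        e += a.lo + b.lo;
        dd r;
        r.hi = s + e;            // quick two-sum renormalization
        r.lo = e - (r.hi - s);
        return r;
    }

    int main() {
        dd x{1.0, 1e-20}, y{2.0, -3e-21};
        dd z = dd_add(x, y);
        printf("hi = %.17g  lo = %.17g\n", z.hi, z.lo);
        return 0;
    }

Production libraries use more carefully compensated variants, and the analogous operations on quad double and octo double operands require correspondingly more double operations, which is where the predicted overhead factors come from.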
Experimental results are reported on five different NVIDIA GPUs, with a particular focus on the P100 and the V100, both capable of teraflop performance. Thanks to the high Compute to Global Memory Access (CGMA) ratios of multiple double arithmetic, teraflop performance is already attained by the double double QR on 1,024-by-1,024 matrices, on both the P100 and the V100. For the back substitution, the dimension of the upper triangular system must be as high as 17,920 to reach one teraflop on the V100 in quad double precision, and that only when the time spent in the kernels alone is counted. The lower performance of the back substitution at small dimensions does not prevent teraflop performance of the solver at dimension 1,024, as the time for the QR decomposition dominates.
In doubling the precision from double double to quad double and from quad double to octo double, the observed cost overhead factors are lower than the factors predicted by the arithmetical operation counts. This observation correlates with the increased performance for increased precision, which can again be explained by the high CGMA ratios.
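The CGMA argument can be made concrete with a small, hedged example (again not CAMPARY's code; the dd type matches the sketch above). With a fused multiply-add available, a double double product costs on the order of nine double operations while reading only twice as many bytes as an ordinary double product, so the flops-per-byte ratio rises severalfold, and each further doubling of precision grows the operation count much faster than the memory traffic.

    #include <math.h>

    struct dd { double hi, lo; };   // same representation as in the sketch above

    // With FMA, this product costs about nine double operations but reads only
    // 32 bytes of input (two dd operands), versus one operation per 16 bytes for
    // an ordinary double product: a severalfold higher compute-to-memory ratio.
    __host__ __device__ inline dd dd_mul(dd a, dd b) {
        double p = a.hi * b.hi;
        double e = fma(a.hi, b.hi, -p);     // exact rounding error of the product
        e += a.hi * b.lo + a.lo * b.hi;     // cross terms; lo*lo is negligible
        dd r;
        r.hi = p + e;                        // renormalize
        r.lo = e - (r.hi - p);
        return r;
    }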
- Award ID(s): 1854513
- PAR ID: 10463308
- Date Published:
- Journal Name: 2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
- This paper presents a configurable binary design library including fundamental arithmetic circuits like full-adder, full-subtractor, binary multiplier, shifter, and more. The Chisel Hardware Construction Language (HCL) is employed to build the parameterizable designs at different precisions, including half-word, word, double-word, and quad-word. Chisel HCL is an open-source embedded domain-specific language that inherits the object-oriented and functional programming aspects of Scala for constructing hardware. Experimental results show that our proposed work achieves the same accuracy as the Verilog HDL implementations. The hardware cost in terms of slice count, power consumption, and maximum clock frequency is further estimated. Compared with traditional design intellectual properties (IPs) provided by IP vendors, our proposed work is configurable and expandable to other arithmetic implementations and projects.
- In general, the performance of parallel graph processing is determined by three pairs of critical parameters: synchronous or asynchronous execution mode (Sync or Async), push or pull communication mechanism (Push or Pull), and data-driven or topology-driven traversal scheme (DD or TD). This increases the complexity and sophistication of programming and system implementation on GPUs. Existing graph-processing frameworks mainly use a single combination for the entire execution of a given application, but we have observed their variable and suboptimal performance. In this paper, we present SEP-Graph, a highly efficient software framework for graph processing on GPUs. The hybrid execution mode is automatically switched among the three pairs of parameters, with the objective of achieving the shortest execution time in each iteration. We also apply a set of optimizations to SEP-Graph, considering the characteristics of graph algorithms and underlying GPU architectures. We show the effectiveness of SEP-Graph through an intensive and comparative performance evaluation on NVIDIA 1080, P100, and V100 GPUs. Compared with the existing and representative GPU graph-processing frameworks Groute and Gunrock, SEP-Graph can reduce execution time by up to 45.8 times and 39.4 times, respectively. (A minimal sketch of the push/pull distinction appears after this list.)
- Ranzato, M.; Beygelzimer, A.; Dauphin, Y.; Liang, P. S.; Wortman Vaughan, J. (Eds.) Hyperbolic space is particularly useful for embedding data with hierarchical structure; however, representing hyperbolic space with ordinary floating-point numbers greatly affects performance due to its ineluctable numerical errors. Simply increasing the precision of floats fails to solve the problem and incurs a high computation cost for simulating greater-than-double-precision floats on hardware such as GPUs, which does not support them. In this paper, we propose a simple, feasible-on-GPUs, and easy-to-understand solution for numerically accurate learning on hyperbolic space. We do this with a new approach to representing hyperbolic space using multi-component floating-point (MCF) arithmetic in the Poincaré upper-half-space model. Theoretically and experimentally we show that our model has small numerical error, and on embedding tasks across various datasets, models represented by multi-component floating-points gain more capacity and run significantly faster on GPUs than prior work.
- We present a high-performance GPU kernel with a substantial speedup over vendor libraries for very small matrix computations. In addition, we discuss most of the challenges that hinder the design of efficient GPU kernels for small matrix algorithms. We propose relevant algorithm analysis to harness the full power of a GPU, and strategies for predicting the performance, before introducing a proper implementation. We develop a theoretical analysis and a methodology for high-performance linear solvers for very small matrices. As test cases, we take the Cholesky and LU factorizations and show how the proposed methodology enables us to achieve a performance close to the theoretical upper bound of the hardware. This work investigates and proposes novel algorithms for designing highly optimized GPU kernels for solving batches of hundreds of thousands of small-size Cholesky and LU factorizations. Our focus on efficient batched Cholesky and batched LU kernels is motivated by the increasing need for these kernels in scientific simulations (e.g., astrophysics applications). Techniques for optimal memory traffic, register blocking, and tunable concurrency are incorporated in our proposed design. The proposed GPU kernels achieve performance speedups versus CUBLAS of up to 6× for the factorizations, using double precision arithmetic on an NVIDIA Pascal P100 GPU. (A baseline sketch of a one-thread-per-matrix batched factorization appears after this list.)
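To make the Push/Pull distinction in the SEP-Graph item above concrete, the sketch below is a generic level-synchronous, push-style BFS step on a CSR graph; it is not SEP-Graph's actual code and all names are illustrative. A pull-style step would invert the loop: each unvisited vertex scans its own edge list and adopts the label of any neighbour already on the frontier.

    #include <climits>

    // Push-style, topology-driven BFS step over a CSR graph: every vertex on the
    // current frontier writes level+1 into its unvisited neighbours.
    __global__ void bfs_push_step(const int *row_ptr, const int *col_idx,
                                  int *depth, int level, int n, int *changed) {
        int v = blockIdx.x * blockDim.x + threadIdx.x;
        if (v >= n || depth[v] != level) return;          // not on the frontier
        for (int e = row_ptr[v]; e < row_ptr[v + 1]; ++e) {
            int u = col_idx[e];
            if (atomicCAS(&depth[u], INT_MAX, level + 1) == INT_MAX)
                *changed = 1;                             // a vertex was discovered
        }
    }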
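For the batched factorization item above, a baseline mapping assigns one GPU thread to one tiny matrix. The kernel below is only an illustrative sketch under that assumption (row-major storage, symmetric positive definite input, names are ours); it is not the register-blocked, tuned kernels the paper describes.

    #include <math.h>

    // One thread factors one N-by-N matrix in place (lower triangular result).
    // Real batched kernels add register blocking and intra-matrix parallelism;
    // this shows only the baseline one-thread-per-matrix mapping.
    template <int N>
    __global__ void batched_cholesky(double *A, int batch) {
        int m = blockIdx.x * blockDim.x + threadIdx.x;
        if (m >= batch) return;
        double *a = A + (size_t)m * N * N;
        for (int j = 0; j < N; ++j) {
            double s = a[j * N + j];
            for (int k = 0; k < j; ++k) s -= a[j * N + k] * a[j * N + k];
            a[j * N + j] = sqrt(s);
            for (int i = j + 1; i < N; ++i) {
                double t = a[i * N + j];
                for (int k = 0; k < j; ++k) t -= a[i * N + k] * a[j * N + k];
                a[i * N + j] = t / a[j * N + j];
            }
        }
    }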