<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>ghZCCL: Advancing GPU-aware Collective Communications with Homomorphic Compression</title></titleStmt>
			<publicationStmt>
				<publisher>ACM</publisher>
				<date>06/08/2025</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10609574</idno>
					<idno type="doi"></idno>
					
					<author>Jiajun Huang</author><author>Sheng Di</author><author>Yafan Huang</author><author>Zizhong Chen</author><author>Franck Cappello</author><author>Yanfei Guo</author><author>Rajeev Thakur</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[To relieve saturated networks, researchers have proposed various bandwidth-optimized collective algorithms that decrease the overall collective communication volume [3,37,39]. However, these algorithms have already approached their limits in enhancing collective performance and leave little room for further optimization. With the recent development of GPU-based ultra-fast error-bounded lossy compression [24,25,46], it is now possible to use error-bounded lossy compression to significantly reduce message sizes and accelerate collective communications while preserving high data quality.
Limitations of Existing Works and Goal. Recently, researchers have proposed several compression-accelerated collective communication libraries that achieve sound speedups over previous approaches without compression support while maintaining high data accuracy, as shown in Table 1. However, the existing state-of-the-art communication libraries all have certain limitations. HPE Cray-MPI [17] is a GPU-aware MPI that reaches high performance on Cray systems (two of the three US exascale supercomputers communicate with Cray MPI [41]). Although its collectives are GPU-aware, they still rely on intermediate CPU buffers, which leads to suboptimal performance [42]. NCCL [12] is another high-performance collective communications library for systems with NVIDIA GPUs. It is GPU-centric but lacks compression support, which significantly limits its collective performance. The state-of-the-art C-Coll [21,22] and gZCCL [19] use error-bounded lossy compression to accelerate collective communications on CPU and GPU clusters, respectively. However, they are subject to a time-consuming Decompression-Operation-Compression (DOC) workflow, in which each process has to decompress the compressed data before applying operations and then recompress the operated data.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Table <ref type="table">1</ref>: Key designs of state-of-the-art collective communications libraries. Homomorphic capability means that GPUs can directly operate on compressed data.</p><p>An ideal GPU-aware, compression-accelerated collective communications library should meet the following criteria:</p><p>&#8226; GPU-centric design that avoids host-device data transfers and CPU computation overheads.</p><p>&#8226; Compression support with accuracy control to achieve performance improvements with high data quality.</p><p>&#8226; Co-designed compression to maximize both throughput and compression ratio.</p><p>&#8226; Direct GPU operations on compressed data during intensive communications, eliminating the need for expensive DOC workflows.</p><p>To propose such an ideal solution, several new challenges must be addressed:</p><p>&#8226; Homomorphic compression: How can GPUs compute with compressed data during communication, removing the need for costly DOC workflows? Currently, no existing GPU compressor offers this functionality.</p><p>&#8226; Performance vs. quality: How can we ensure that a GPU homomorphic workflow delivers high compression performance without sacrificing quality?</p><p>&#8226; Co-design with collective communications: How can we co-design this new compression workflow with collective communications to achieve the best overall performance for GPU-centric communication?</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.3">Our solution: ghZCCL</head><p>To address the limitations of existing works and the new challenges described above, we propose ghZCCL, a GPU-aware homomorphic compression-accelerated collective communications library that allows GPUs to directly compute and communicate with compressed data. To the best of our knowledge, ghZCCL is the first GPU-aware homomorphic compression-communication co-design. Specifically, ghZCCL has three key designs: (1) Novel workflow: a pioneering GPU homomorphic compression pipeline. This ultra-fast lossless homomorphic compression pipeline avoids complete decompression and recompression while maintaining the same data quality as the original DOC workflow, allowing ghZCCL's homomorphic compressor, ghZ, to achieve extremely high compression throughput. (2) Throughput optimization: a fused lightweight compression kernel. It conducts partial decompression, operation, and partial recompression in a single kernel, which significantly decreases kernel launch overheads and increases memory access efficiency. (3) Communication co-design: a GPU-centric homomorphic compression-communication co-design. This co-design supports different collective computation operations and improves collective communication efficiency on modern GPU clusters. We evaluate ghZCCL with various application datasets on up to 512 NVIDIA A100 GPUs and present some key findings below:</p><p>&#8226; ghZ achieves an average DOC-handling throughput of 531.91&#8211;942.44 GB/s on an NVIDIA A100 GPU, which is 3.47&#8211;3.89&#215; faster than the current fastest lossy compressor, cuSZp2.</p><p>undergo a Fused Global Synchronization (1), which retrieves the offsets of the two groups of compressed data blocks with a fused device-level parallel prefix sum. 
Then, based on the two sets of indexes obtained, ghZ decodes the two compressed byte arrays using a Fused Lossless Decoding &amp; Computation (2) that transforms and combines the byte arrays into a single operated integer array. Next, the GPU threads synchronize with each other in a Resynchronization (3) to obtain the new compressed byte offsets. Finally, the integer array is encoded into a compressed byte array through a Lossless Encoding (4).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Design Takeaway 2:</head><p>The GPU homomorphic compression workflow of ghZ surpasses the DOC workflow of cuSZp2. By cutting kernel launches from four to one and processing stages from ten to four, it significantly reduces DOC-handling latency and increases throughput.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">ghZCCL Co-design Overview</head><p>Figure <ref type="figure">4</ref> presents a high-level overview of our compression-communication co-design. To fully exploit the unique benefits of GPU homomorphic compression in collective communications, we propose a GPU-centric homomorphic compression-accelerated collective communication framework specifically designed for computation-intensive collective operations. First, we develop Co-designed Algorithms (5) to improve performance and GPU utilization across diverse input data sizes and GPU counts for GPU-aware homomorphic compression-accelerated collective communications. These algorithms outperform gZCCL's collective algorithms, which rely on the traditional DOC workflow. Next, we further enhance the efficiency of these algorithms with a co-designed In-place ghZ (6), which minimizes GPU memory usage and data copying overhead. Additional key optimizations include Adaptive Vectorized Memory Access (7), which allows both ghZ and cuSZp2 to adaptively access GPU main memory in a vectorized manner during intensive collective communications, and Multi-stream Compression (8), which overlaps compression kernels to significantly reduce runtime.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">ghZCCL: KEY DESIGNS</head><p>In this section, we present the eight key designs of ghZCCL.</p><p>For clarity, the step numbers (e.g., (1)) in the text continue to refer to Figures <ref type="figure">3</ref> and <ref type="figure">4</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">ghZ: Design Details</head><p>We now explore the design of ghZ in detail. Components (1)&#8211;(4) correspond to those labeled (1)&#8211;(4) in Figure <ref type="figure">3</ref>.</p><p>Figure 4: High-level overview of ghZCCL. Panel labels: GPU-aware Homomorphic Compression-accelerated Collectives (ghZCCL Co-design); Fundamental Optimization: Co-designed Algorithms (5), across data sizes and GPU counts; Further Optimizations: In-place ghZ (6), Adaptive Vectorized Mem. Access (7), Multi-stream Compression (8).</p><p>5.1.1 Fused Global Synchronization (1). To directly operate on compressed data inputs, the first step is to retrieve the offsets of each compressed data block through a fused synchronization strategy, enabling further processing. To introduce this synchronization approach, we first explain the high-level compression workflow. The original data is divided into small blocks (e.g., 32 floating-point data points per block), with each thread compressing a single block per iteration. To ensure memory coalescing, threads within the same warp process neighboring blocks. This cycle repeats 32 times, so each warp compresses a total of 32 &#215; 32 = 1024 blocks. Since compressed data blocks may have varying sizes, the exact location of each block within the compressed data cannot be predetermined. Consequently, threads must synchronize to communicate and calculate offset information.</p><p>In ghZ, we employ a Fused Global Synchronization to determine the specific locations of compressed blocks, eliminating the need for an independent synchronization per input, as required in the DOC workflow. First, each GPU thread calculates the total compressed data offsets for two sets of 32 data blocks using fixed-rate information, which specifies the number of bits used to encode each data point within a block. 
After determining the thread-level offsets, an inclusive warp-level prefix sum is applied to calculate, for each of the two compressed byte arrays, the total compressed data size of the warp. This step is optimized using __shfl_up_sync, enabling efficient in-warp communication. Subsequently, the last thread of each warp writes the total compressed data sizes to temporary global memory buffers. An exclusive global-level prefix sum is then performed to compute the global compressed data offsets of each warp in the byte arrays. This process uses the decoupled look-back technique described in <ref type="bibr">[13,</ref><ref type="bibr">24]</ref>. By employing Fused Global Synchronization, we significantly reduce synchronization latency compared to the DOC workflow.</p></div>
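<div xmlns="http://www.tei-c.org/ns/1.0"><p>The two-level offset computation described above can be illustrated with a minimal, sequential Python sketch. It is not the ghZ kernel: a plain loop stands in for __shfl_up_sync and for the decoupled look-back scan, and the function names and the per-thread size inputs are hypothetical. It only shows how an inclusive warp-level scan plus an exclusive scan over warp totals yields each thread's start offset for both compressed inputs.</p>
<p><![CDATA[
```python
from itertools import accumulate

WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


def warp_inclusive_scan(values):
    """Inclusive prefix sum over one warp (sequential stand-in for __shfl_up_sync)."""
    return list(accumulate(values))


def fused_offsets(sizes_a, sizes_b):
    """Return per-thread start offsets for two compressed inputs.

    sizes_a / sizes_b are hypothetical per-thread compressed sizes in bytes.
    A fused kernel would handle both inputs in one synchronization; this
    sketch simply loops over the two inputs.
    """
    assert len(sizes_a) == len(sizes_b)
    results = []
    for sizes in (sizes_a, sizes_b):
        warps = [sizes[i:i + WARP_SIZE] for i in range(0, len(sizes), WARP_SIZE)]
        scans = [warp_inclusive_scan(w) for w in warps]
        # The last thread of each warp holds the warp's total compressed size.
        warp_totals = [s[-1] for s in scans]
        # Exclusive scan over warp totals gives each warp's global base offset.
        warp_bases = [0] + list(accumulate(warp_totals))[:-1]
        offsets = []
        for base, warp, scan in zip(warp_bases, warps, scans):
            # Exclusive per-thread offset: inclusive scan minus own size, plus base.
            offsets.extend(base + inc - size for inc, size in zip(scan, warp))
        results.append(offsets)
    return results[0], results[1]
```
]]></p>
<p>For example, with two warps of threads that each produce 4 bytes, thread i starts at byte 4i; with variable sizes, each thread starts at the sum of the sizes of all preceding threads.</p></div>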
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.2">Fused Lossless Decoding &amp; Computation (2)</head><p>In ghZ, we introduce a Fused Lossless Decoding &amp; Computation approach that partially decodes compressed byte arrays and performs calculations directly on integer data. In contrast, the traditional DOC workflow requires fully decompressing the compressed data into floating-point arrays and launching an additional GPU kernel to process the two floating-point data inputs, as illustrated in Figure <ref type="figure">3</ref>. For each data block, the GPU thread first performs warp-level communication to determine the compressed byte offsets of the two compressed data blocks. This is achieved using the previously obtained fixed-rate information and the __shfl_up_sync primitive. If both fixed rates are zero, the two blocks are skipped to save processing time. If at least one block has a non-zero fixed rate, ghZ retrieves the sign flags (e.g., 32 sign bits per block) of the two compressed data blocks in a vectorized manner.</p><p>Next, for each compressed block, ghZ's bit-shuffle-based fixed-length decoder partially decompresses the corresponding compressed bytes into two 32-element integer arrays. It then computes directly on the integer arrays to generate a new sign-bit array, an operated integer array, and an updated fixed rate, all within a single loop optimized with loop unrolling. The fixed-rate information is subsequently stored in the operated compressed data, and __shfl_sync is employed to update the compressed data offsets across different iterations of data blocks. This Fused Lossless Decoding &amp; Computation method delivers significantly higher throughput than the DOC workflow, which always requires fully decompressing the two compressed data arrays and operating on floating-point data.</p></div>
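<div xmlns="http://www.tei-c.org/ns/1.0"><p>To make the idea of computing on integer data concrete, the following Python sketch adds two blocks in the integer domain and derives the new sign flags and fixed rate. It assumes a linear quantization of the form q = round(x / (2 eb)), which is an assumption made for illustration and is not spelled out in this text; all function names are hypothetical, and the sketch ignores the bit-shuffle layout and GPU parallelism.</p>
<p><![CDATA[
```python
def bits_needed(values):
    """Fixed rate: bits required for the largest magnitude (0 for an all-zero block)."""
    return max(abs(v) for v in values).bit_length() if values else 0


def homomorphic_add(block_a, block_b):
    """Add two blocks of quantization integers without going back to floats.

    Returns (sign_flags, magnitudes, fixed_rate) for the operated block.
    """
    rate_a, rate_b = bits_needed(block_a), bits_needed(block_b)
    if rate_a == 0 and rate_b == 0:
        # Both inputs are zero blocks: skip decoding and computation.
        return [0] * len(block_a), [0] * len(block_a), 0
    summed = [a + b for a, b in zip(block_a, block_b)]
    signs = [1 if v < 0 else 0 for v in summed]
    mags = [abs(v) for v in summed]
    return signs, mags, bits_needed(summed)


def quantize(data, eb):
    """Assumed linear quantization with error bound eb (illustrative only)."""
    return [round(x / (2 * eb)) for x in data]


def dequantize(ints, eb):
    """Inverse of the assumed quantization."""
    return [v * 2 * eb for v in ints]
```
]]></p>
<p>Under this assumption, dequantizing the integer sum gives the same values as dequantizing both inputs and adding them, which is why the integer-domain operation introduces no additional error.</p></div>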
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.3">Resynchronization and Lossless Encoding (3) &amp; (4)</head><p>After obtaining the operated integer array, ghZ performs a Resynchronization step to determine the compressed byte offsets of the newly operated data blocks and applies a lossless encoding to encode the integer blocks into compressed bytes. This process is significantly more lightweight than the full recompression required by the traditional DOC workflow, as shown in Figure <ref type="figure">3</ref>. The Resynchronization process involves a warp-level synchronization followed by a global-level synchronization, akin to the Fused Global Synchronization in ghZ. During the Lossless Encoding phase, the previously obtained fixed-rate information of each block is used to store only the fixed number of bits required for each element in the block (e.g., 4 bits per element). This approach significantly reduces storage space compared to storing the full 32 bits for each element.</p><p>Design Takeaway 3: The lightweight design of ghZ optimizes memory access and reduces computational costs, significantly outperforming the traditional DOC workflow in compression efficiency.</p></div>
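<div xmlns="http://www.tei-c.org/ns/1.0"><p>The storage saving of fixed-length encoding can be sketched in a few lines of Python. This sketch uses plain little-endian bit packing rather than the bit-shuffle layout used by ghZ, and the function names are hypothetical; it only illustrates that a block of 32 magnitudes stored at a fixed rate of 4 bits occupies 16 bytes instead of 128.</p>
<p><![CDATA[
```python
def encode_block(mags, rate):
    """Pack each non-negative magnitude into `rate` bits (plain bit packing)."""
    assert all(0 <= m < (1 << rate) or (rate == 0 and m == 0) for m in mags)
    acc = 0
    for i, m in enumerate(mags):
        acc |= m << (i * rate)
    nbytes = (len(mags) * rate + 7) // 8  # round up to whole bytes
    return acc.to_bytes(nbytes, "little")


def decode_block(payload, rate, count):
    """Recover `count` magnitudes stored at `rate` bits each."""
    acc = int.from_bytes(payload, "little")
    mask = (1 << rate) - 1
    return [(acc >> (i * rate)) & mask for i in range(count)]
```
]]></p>
<p>A zero block (fixed rate 0) encodes to an empty payload, which matches the zero-block skipping described in the text: only the fixed rate needs to be stored for such a block.</p></div>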
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">ghZCCL: Co-design Details</head><p>In this section, we delve into the co-design details of ghZCCL, with components (5)&#8211;(8) corresponding to those in Figure <ref type="figure">4</ref>.</p><p>Co-designed Algorithms (5). To effectively leverage homomorphic compression in GPU-aware collective communications, the foundational step is to co-design the collective communication algorithms. The state-of-the-art GPU-aware compression-accelerated collective framework, gZCCL <ref type="bibr">[19]</ref>, was specifically designed for the DOC workflow and is not compatible with homomorphic compression. To address this limitation, we propose different co-designed algorithms (e.g., ring-based and recursive_doubling-based Allreduce) to optimize collective performance across varying data sizes and GPU counts. In this subsection, we use the recursive_doubling-based Allreduce as an exemplar; the same methodology can be applied to other algorithms.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.1">GPU-aware homomorphic compression-accelerated collective algorithms</head><p>While the ring-based Allreduce is widely employed for processing large messages in leading collective communication libraries such as MPICH <ref type="bibr">[30]</ref> and NCCL <ref type="bibr">[12]</ref>, it can encounter GPU underutilization when the GPU count is large. This inefficiency arises because each GPU processes only <formula notation="TeX">D/N</formula> data per compression task, where <formula notation="TeX">D</formula> is the input data size and <formula notation="TeX">N</formula> is the number of GPUs <ref type="bibr">[19]</ref>. To address this scalability challenge, we propose the GPU Homomorphic Compression-Accelerated Recursive_Doubling-Based Allreduce Algorithm. In Figure <ref type="figure">5</ref>, we compare the high-level design of ghZCCL with gZCCL in the recursive_doubling-based Allreduce algorithm for four GPUs. In gZCCL, each GPU first compresses its original data and sends the compressed data to the target GPU. Upon receiving the data, the target GPU decompresses it to reconstruct the original data, then launches a reduction kernel to operate on the two original data inputs. After obtaining the reduced result, the data is recompressed using another compression kernel. This DOC workflow repeats <formula notation="TeX">\log N - 1</formula> times, where <formula notation="TeX">N</formula> is the number of GPUs. In the final round, the last received data is decompressed and combined with the previously reduced output. This round does not involve recompression since it produces the final reduced output. 
If the compression cost of the original data is <formula notation="TeX">\mathit{CPR}</formula>, the decompression cost is <formula notation="TeX">\mathit{DPR}</formula>, and the operation cost is <formula notation="TeX">\mathit{OPR}</formula>, the total computational cost of gZCCL's recursive_doubling-based Allreduce algorithm is: <formula notation="TeX">T^{AR}_{gZCCL} = \log N \times (\mathit{DPR} + \mathit{OPR} + \mathit{CPR})</formula>. In contrast, our ghZCCL co-design, built around our GPU homomorphic compressor, ghZ, eliminates the costly DOC workflow used by gZCCL. In ghZCCL, each GPU first concurrently compresses its original data, then exchanges the compressed data with other GPUs. Following the communication step, each GPU directly operates on the compressed data using our GPU homomorphic compressor, bypassing the need to decompress it. The resulting newly operated compressed data is then transmitted among GPUs. This communication and homomorphic compression process repeats for <formula notation="TeX">\log N - 1</formula> rounds. After the intensive communications, during the final round, the last received compressed data is directly computed with the previously reduced compressed output using the GPU homomorphic compressor. Finally, the reduced compressed data is decompressed to retrieve the original reduced output, completing the algorithm. The total cost of this algorithm is: <formula notation="TeX">T^{AR}_{GHCL} = \mathit{CPR} + (\log N - 1) \times \mathit{HPR} + \mathit{HPR} + \mathit{DPR} = \mathit{CPR} + \log N \times \mathit{HPR} + \mathit{DPR}</formula>, where <formula notation="TeX">\mathit{HPR}</formula> represents the homomorphic processing cost. 
The cost difference between gZCCL and ghZCCL is: <formula notation="TeX">T^{AR}_{gZCCL} - T^{AR}_{GHCL} = \log N \times (\mathit{DPR} + \mathit{OPR} + \mathit{CPR} - \mathit{HPR}) - \mathit{CPR} - \mathit{DPR}</formula>. This difference is positive whenever <formula notation="TeX">\log N \times (\mathit{DPR} + \mathit{OPR} + \mathit{CPR} - \mathit{HPR})</formula> exceeds the one-time cost <formula notation="TeX">\mathit{CPR} + \mathit{DPR}</formula>, and it grows with <formula notation="TeX">\log N</formula>. Since the traditional DOC cost <formula notation="TeX">(\mathit{DPR} + \mathit{OPR} + \mathit{CPR})</formula> is significantly higher than <formula notation="TeX">\mathit{HPR}</formula>, we conclude that ghZCCL achieves much higher collective performance than gZCCL. This example uses four GPUs/processes.</p></div>
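<div xmlns="http://www.tei-c.org/ns/1.0"><p>The cost model above can be checked numerically with a short Python sketch. It assumes a power-of-two GPU count and a base-2 logarithm (the usual round count of recursive doubling; the text writes only log N), and the per-stage costs used in the usage note are arbitrary illustrative units, not measurements.</p>
<p><![CDATA[
```python
import math


def cost_gzccl(n_gpus, cpr, dpr, opr):
    """Closed-form DOC cost: log N x (DPR + OPR + CPR)."""
    return math.log2(n_gpus) * (dpr + opr + cpr)


def cost_ghzccl(n_gpus, cpr, dpr, hpr):
    """Closed-form homomorphic cost: CPR + log N x HPR + DPR."""
    return cpr + math.log2(n_gpus) * hpr + dpr


def cost_gzccl_by_rounds(n_gpus, cpr, dpr, opr):
    """Round-by-round accounting of the DOC workflow described in the text."""
    rounds = int(math.log2(n_gpus))
    total = cpr                                # initial compression
    total += (rounds - 1) * (dpr + opr + cpr)  # full DOC rounds
    total += dpr + opr                         # final round: no recompression
    return total


def cost_gap(n_gpus, cpr, dpr, opr, hpr):
    """log N x (DPR + OPR + CPR - HPR) - CPR - DPR."""
    return math.log2(n_gpus) * (dpr + opr + cpr - hpr) - cpr - dpr
```
]]></p>
<p>With hypothetical costs CPR = 1.0, DPR = 1.0, OPR = 0.5, and HPR = 0.7, four GPUs give 5.0 for the DOC workflow versus 3.4 for the homomorphic workflow, and 64 GPUs give 15.0 versus 6.2, illustrating how the gap widens with the number of rounds.</p></div>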
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.2">In-place ghZ and Adaptive Vectorized Memory Access (6) &amp; (7)</head><p>In this section, we detail the designs of the In-place ghZ and Adaptive Vectorized Memory Access, both specifically co-designed to meet the needs of GPU-aware collective communications. Initially, ghZ processes two compressed byte arrays as inputs and directly operates on them to produce an operated compressed byte array, which is stored in an additional GPU buffer. While this design already outperforms the DOC workflow, it leads to suboptimal performance and memory management in collective communication scenarios, particularly on GPUs where memory resources are constrained. In ghZCCL, after a GPU receives compressed data from another GPU, the data is stored in a temporary GPU buffer called tmp_buf. This buffer, along with another input buffer, outputBytes (which stores the previously reduced compressed data), is fed into the GPU homomorphic compression kernel. With the original ghZ, an additional GPU buffer would be required to store the newly operated compressed output. Subsequently, the data would need to be copied back to outputBytes for either local storage or further communication. This process increases both the memory footprint and the runtime. To address this inefficiency, we co-designed an in-place ghZ that writes the homomorphically compressed data directly into one of the input GPU buffers during the compression process. This approach reduces memory usage and improves runtime efficiency, enhancing performance for GPU-aware collectives.</p><p>We also propose Adaptive Vectorized Memory Access, which enables vectorized memory access for compression tasks (both normal compression and homomorphic compression) during collective communications to better exploit the GPU global memory bandwidth. 
In the collective communication scenario, it is common to divide input data into smaller chunks for communication. For example, the ring-based Reduce_scatter divides the input data of size D into N chunks, where N is the number of processes/GPUs. Each chunk is then compressed and communicated during the intensive communications. However, using vectorized memory access during these compression tasks can cause misaligned memory accesses, because the start index of each chunk in the GPU receive buffer may not be a multiple of 4. Figure <ref type="figure">6</ref> illustrates the workflow of Adaptive Vectorized Memory Access during the normal compression process. Prior to the compression task, the data undergoes a three-step preprocessing procedure to prepare it for efficient access from GPU global memory: (1) The starting address and length of the data chunk designated for compression and communication are analyzed to determine its suitability for vectorized processing. (2) If the data chunk is not vectorizable, the remainders at the starting and ending locations of the chunk are identified and retrieved for scalar processing. (3) If the data chunk is vectorizable, the remainder handling is bypassed, and the data is directly accessed and processed in a vectorized manner before proceeding to cuSZp compression. During the homomorphic compression process, these three steps are executed in the same sequence before the Fused Lossless Decoding &amp; Computation phase and after the Lossless Encoding phase. 
Figure <ref type="figure">7</ref> presents a running example of a data</p><p>Figure 6 (diagram labels): Checking Memory Access; Remainder Processing (Non-Vectorizable); Vectorized Processing (Vectorizable); Normal Compression.</p><table><head>Table <ref type="table">3</ref>: Information of evaluated compression and collective communication solutions. We highlight our solutions and baselines with blue and red colors, respectively.</head><row role="label"><cell>Solution</cell><cell>Description</cell></row><row><cell>cuSZp2</cell><cell>The fastest GPU error-bounded lossy compressor <ref type="bibr">[24]</ref></cell></row><row><cell>ghZ</cell><cell>The proposed first-ever GPU homomorphic compressor</cell></row><row><cell>Cray MPI</cell><cell>The state-of-the-art MPI library used in 2 of 3 exascale supercomputers in the world <ref type="bibr">[17,</ref><ref type="bibr">41]</ref></cell></row><row><cell>NCCL</cell><cell>The fastest collective communications library for NVIDIA GPUs <ref type="bibr">[12]</ref></cell></row><row><cell>gZCCL</cell><cell>The state-of-the-art compression-accelerated collective communications library for GPUs <ref type="bibr">[19]</ref></cell></row><row><cell>ghZCCL</cell><cell>The proposed first-ever GPU-aware homomorphic compression-accelerated collective communications library</cell></row></table></div>
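<div xmlns="http://www.tei-c.org/ns/1.0"><p>The three preprocessing steps of Adaptive Vectorized Memory Access can be illustrated with a small Python sketch that splits a chunk into a scalar head, a vectorizable body, and a scalar tail. The vector width of 4 follows the "multiple of 4" condition in the text; the function name and the index-range representation are hypothetical, and the sketch only computes the ranges rather than performing GPU loads.</p>
<p><![CDATA[
```python
VECTOR_WIDTH = 4  # elements per vectorized access, per the "multiple of 4" rule


def split_for_vectorized_access(start, length, width=VECTOR_WIDTH):
    """Split [start, start + length) into (head, body, tail) index ranges.

    head and tail are remainders handled element by element; body starts on a
    multiple of `width` and has a length divisible by `width`.
    """
    end = start + length
    # Round the start up to the next multiple of width (capped at end).
    body_start = min(end, -(-start // width) * width)
    # Round the end down to a multiple of width (never before body_start).
    body_end = max(body_start, (end // width) * width)
    return (start, body_start), (body_start, body_end), (body_end, end)


def is_vectorizable(start, length, width=VECTOR_WIDTH):
    """A chunk is fully vectorizable when both remainders are empty."""
    head, _, tail = split_for_vectorized_access(start, length, width)
    return head[0] == head[1] and tail[0] == tail[1]
```
]]></p>
<p>For instance, a chunk starting at index 6 with 21 elements yields a two-element head (6 to 8), a vectorizable body (8 to 24), and a three-element tail (24 to 27), whereas a chunk starting at index 8 with 16 elements needs no remainder processing.</p></div>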
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.2">Evaluating GPU Homomorphic Compressor: ghZ</head><p>In this section, we conduct a comprehensive evaluation of ghZ, focusing on DOC-handling throughput, compression ratio, and compression quality.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.2.1">DOC-handling throughput of ghZ.</head><p>In Figure <ref type="figure">8</ref>, we evaluate the DOC-handling throughput of ghZ compared to the DOC workflow using the fastest error-bounded lossy compressor, cuSZp2. Across all application datasets, ghZ consistently outperforms cuSZp2, regardless of the relative error bound. On average, ghZ achieves DOC-handling throughputs ranging from 531.91 to 942.44 GB/s, while cuSZp2 achieves throughputs between 153.32 and 252.24 GB/s, making it 3.47&#8211;3.89&#215; slower than ghZ. Notably, ghZ demonstrates its highest throughput on the JetIn application dataset, reaching 1323.40 GB/s with a 1E-1 error bound and 1056.89 GB/s with a 1E-4 error bound. These values far exceed the 302.45 GB/s and 294.52 GB/s achieved by cuSZp2, corresponding to speedups of 4.38&#215; and 3.59&#215;, respectively. This superior performance can be attributed to the high sparsity of the JetIn dataset, which contains many zero data blocks (i.e., blocks containing only zero values). In ghZ, these zero blocks are skipped by directly setting the losslessly decoded integer values to zero, avoiding unnecessary retrievals from the compressed data inputs. This adaptive homomorphic compression strategy further boosts the speed of ghZ.</p><p>Additionally, we observe that both ghZ and cuSZp2 experience lower throughputs as the error bound decreases. This is because smaller error bounds make the data harder to compress, increasing computational and memory access costs, which in turn reduces performance. However, even with a 1E-4 error bound, ghZ maintains a significantly higher throughput than cuSZp2 with a 1E-1 error bound when processing the same application dataset. For example, when processing the NYX application dataset, ghZ achieves a throughput of 382.64 GB/s with a 1E-4 error bound, while cuSZp2 reaches only 238.10 GB/s with a far less restrictive 1E-1 error bound. 
This highlights that our ghZ can effectively tackle considerably more challenging compression scenarios while achieving higher compression performance compared to cuSZp2, even when cuSZp2 operates under much simpler compression conditions. This advantage greatly expands the potential use cases of ghZ in DOC and similar workflows.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.2.2">Compression ratio and quality of ghZ.</head><p>In Table <ref type="table">4</ref>, we evaluate the compression ratio and quality of ghZ across a range of application datasets. Since ghZ operates losslessly, any accuracy loss stems solely from the lossy compression already applied to the inputs. Consequently, the compression ratio and quality of ghZ are identical to those of the traditional DOC workflow with cuSZp2, as confirmed by our comprehensive experiments. Thus, we only report the values of ghZ in Table <ref type="table">4</ref>. This demonstrates that ghZ maintains the same compression ratio and quality as the traditional DOC workflow while significantly outperforming it in compression performance, as shown in Section 6.2.1.</p><p>The evaluation results reveal that ghZ exhibits varying compression ratios across different application datasets and relative error bounds. For a 1E-1 error bound, the average compression ratio ranges from 35.28 to 127.94 across datasets. For a more restrictive 1E-4 relative error bound, the average compression ratio ranges from 3.81 to 77.44. This trend indicates that smaller error bounds result in lower compression ratios because more data features must be preserved, making the data harder to compress.</p><p>Regarding compression quality, ghZ achieves excellent results across all application datasets. For example, on the JetIn dataset, the Peak Signal-to-Noise Ratio (PSNR) ranges from 66.58 to 101.61 for error bounds between 1E-1 and 1E-4, indicating high-quality compression. Additionally, smaller error bounds lead to higher compression quality because more data features are preserved. 
These findings confirm that ghZ delivers high compression throughput and impressive compression ratios without sacrificing data accuracy.</p><p>Evaluation Takeaway 1: On average, cuSZp2 achieves 153.32&#8211;252.24 GB/s, while ghZ reaches 531.91&#8211;942.44 GB/s, delivering a 3.47&#8211;3.89&#215; speedup without compromising compression ratio or accuracy, due to its lossless homomorphic compression design.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.3">Comparing ghZCCL with SOTA Collective Communications Libraries</head><p>After demonstrating the high compression performance of ghZCCL's ghZ, we now evaluate the performance of ghZCCL-accelerated collective communications on 64 NVIDIA A100 GPUs.</p><p>6.3.1 Reduce. In Figure <ref type="figure">9</ref>, we evaluate ghZCCL against multiple baselines using the Reduce operation, with speedups measured relative to Cray MPI. The results show that ghZCCL consistently outperforms all counterparts across all data sizes. Compared to the second-best solution, gZCCL, ghZCCL achieves a 1.34&#215; speedup at 600 MB. This improvement stems from ghZCCL's ability to significantly reduce DOC-related overheads by co-designing GPU homomorphic compression with collective communications. Furthermore, ghZCCL is up to 3.85&#215; and 188&#215; faster than NCCL and Cray MPI, respectively. The substantial performance gain over NCCL is attributed to ghZCCL's ability to reduce overall communication volume and mitigate network congestion through its ultra-fast homomorphic compression. The improvement is even more pronounced compared to Cray MPI, as its Reduce operation is not fully GPU-centric, leading to significant device-host data transfer and CPU computation overheads.</p><p>6.3.2 Allreduce. Figure 10 presents a performance comparison between ghZCCL and other state-of-the-art communication libraries using the Allreduce collective operation. Similar to the observations in Section 6.3.1, ghZCCL consistently outperforms all baselines, achieving up to 8.55&#215; performance improvement over Cray MPI. Compared to NCCL, ghZCCL achieves a 3.21&#215; speedup, primarily due to its significantly improved communication efficiency enabled by lightweight homomorphic compression, which reduces the amount of data transmitted. In contrast, NCCL lacks this capability and must communicate uncompressed data. 
Among all the solutions, gZCCL delivers the second-best performance; however, it still underperforms ghZCCL by up to 2.19&#215;. This performance gap arises from gZCCL's substantial decompression and recompression overheads, whereas ghZCCL directly operates on compressed data without decompression. Additionally, we observe that Cray MPI exhibits relatively better performance in Allreduce than in Reduce (Section 6.3.1). This is because Cray MPI is specifically optimized to make Allreduce GPU-centric, as it is the most widely used collective operation.</p><p>Figure 10: Performance evaluation of ghZCCL-accelerated Allreduce against SOTA baselines in different data sizes.</p><p>6.4 Evaluating the Scalability of ghZCCL. To further evaluate the performance of ghZCCL, Figure 11 analyzes its scalability on up to 512 NVIDIA A100 GPUs. The results demonstrate that ghZCCL maintains strong scalability across varying GPU counts, significantly outperforming baseline solutions. In Subfigure 11a, ghZCCL achieves up to 183&#215; speedup over Cray MPI and 5.81&#215; over NCCL. Additionally, it outperforms gZCCL by up to 1.34&#215;, exhibiting better scalability than the previously best compression-accelerated communication solution. A similar trend is observed in Subfigure 11b, where ghZCCL surpasses gZCCL by 2.29&#215; on 512 GPUs. Moreover, ghZCCL achieves even greater performance improvements over Cray MPI and NCCL, with up to 5.96&#215; and 4.88&#215; speedups, respectively. This superior performance is attributed to ghZCCL's ability to directly operate on and communicate with compressed data, which reduces communication overhead and compression cost and thereby improves overall runtime efficiency.</p><p>Evaluation Takeaway 2: Evaluated on up to 512 NVIDIA A100 GPUs, ghZCCL significantly outperforms state-of-the-art communication libraries, achieving speedups of up to 2.29&#215;, 5.81&#215;, and 188&#215; compared to gZCCL, NCCL, and Cray MPI, respectively.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.5">Image Stacking Performance and Accuracy Analysis</head><p>In this section, we use the image stacking application to evaluate both the performance and accuracy of the proposed ghZCCL. Image stacking is widely used in various scientific fields, including atmospheric science and geology, to generate high-resolution images by combining multiple individual images. This process involves an Allreduce operation. As highlighted by Gurhem in <ref type="bibr">[15]</ref>, researchers utilize MPI to merge these individual images into final composite images. Table <ref type="table">5</ref> shows that ghZCCL significantly outperforms both NCCL and gZCCL, achieving a 2.34&#215; speedup over NCCL, whereas gZCCL only reaches a 1.23&#215; speedup. Since Cray MPI consistently underperforms NCCL, we omit its results to save space. To gain deeper insight into the performance differences, we break down the runtime of gZCCL and ghZCCL. The results reveal that gZCCL's runtime is primarily dominated by the compression and operation time (Compr.+Oper.), accounting for 82.57% of the total execution time. This substantial DOC overhead is successfully mitigated in ghZCCL, reducing the proportion to 69.45% of ghZCCL's own runtime, or 36.37% relative to gZCCL's total runtime, since ghZCCL's total runtime is itself only about half of gZCCL's (1.23/2.34 by the speedups in Table 5). These improvements stem from our ultra-fast homomorphic compression-communication co-design, which enables GPUs to directly communicate and compute with compressed data, significantly reducing the overhead associated with traditional DOC workflows.</p>
<table><head>Table 5: Performance analysis of image stacking. The speedups are based on NCCL and the last three columns represent performance breakdown.</head>
<row role="label"><cell>Methods</cell><cell>Speedups</cell><cell>Compr.+Oper.</cell><cell>Comm.</cell><cell>Others</cell></row>
<row><cell>gZCCL</cell><cell>1.23</cell><cell>82.57%</cell><cell>14.80%</cell><cell>2.63%</cell></row>
<row><cell>ghZCCL</cell><cell>2.34</cell><cell>69.45%</cell><cell>28.41%</cell><cell>2.14%</cell></row>
<row><cell>NCCL</cell><cell>1</cell><cell cols="3">No breakdown because of complexity</cell></row>
</table>
<p>After demonstrating the high performance of ghZCCL, we further evaluate its numerical accuracy (PSNR and NRMSE) and visual quality. 
With an absolute error bound of 1E-4, ghZCCL achieves a PSNR of 73.60 dB and an NRMSE of 2.1E-4. Figure <ref type="figure">12</ref> presents a visual comparison of stacking images using ghZCCL and the original uncompressed NCCL method. The comparison reveals no visual differences between the two images, confirming that ghZCCL effectively preserves image quality. This combination of high numerical accuracy and visual quality shows that ghZCCL delivers its performance gains while maintaining high data quality.</p><p>Figure 12 panels: (a) NCCL (lossless); (b) ghZCCL.</p></div>
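<div xmlns="http://www.tei-c.org/ns/1.0"><p>For readers who want to reproduce the accuracy metrics, the following is a minimal sketch of how PSNR and NRMSE are commonly computed in error-bounded lossy compression work. It assumes both metrics are normalized by the value range of the reference data; the section does not spell out its exact formulas, so this normalization, the function names, and the sample values are assumptions.</p><p><![CDATA[
```python
import math


def nrmse(reference, approx):
    """Root-mean-square error divided by the value range of the reference."""
    value_range = max(reference) - min(reference)
    mse = sum((r - a) ** 2 for r, a in zip(reference, approx)) / len(reference)
    return math.sqrt(mse) / value_range


def psnr(reference, approx):
    """Peak signal-to-noise ratio in dB, using the value range as the peak."""
    return -20.0 * math.log10(nrmse(reference, approx))


def nrmse_from_psnr(psnr_db):
    """Invert the relation PSNR = -20 * log10(NRMSE)."""
    return 10.0 ** (-psnr_db / 20.0)


# Hypothetical data: one value deviates by 0.4 on a value range of 4.0.
reference = [0.0, 1.0, 2.0, 4.0]
approx = [0.0, 1.0, 2.0, 3.6]
sample_nrmse = nrmse(reference, approx)
sample_psnr = psnr(reference, approx)

# Under the assumed definitions, the PSNR reported in this section
# can be converted to the corresponding NRMSE.
implied_nrmse = nrmse_from_psnr(73.60)
```
]]></p><p>Under these assumed definitions the two metrics carry the same information, and a PSNR of 73.60 dB corresponds to an NRMSE of roughly 2.1E-4, in line with the values reported above.</p></div>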
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7">CONCLUSION AND FUTURE WORK</head><p>In this paper, we introduce ghZCCL, a novel GPU-aware homomorphic compression-accelerated collective communications library that enables direct GPU computation and communication on compressed data. Through evaluations on up to 512 NVIDIA A100 GPUs and 7 application datasets, ghZCCL significantly outperforms state-of-the-art compression and communication libraries, achieving speedups of up to 3.89&#215; over cuSZp2, 2.29&#215; over gZCCL, 5.81&#215; over NCCL, and 188&#215; over Cray MPI. Moving forward, we plan to extend ghZCCL to additional hardware platforms, including AI accelerators (e.g., Groq LPU, SambaNova RDU) and FPGAs, further broadening its impact on both compression and communication across diverse computing architectures.</p></div></body>
		</text>
</TEI>
