While parallelism remains the main source of performance, architectural implementations and programming models change with each new hardware generation, often leading to costly application re-engineering. Most tools for performance portability require manual and costly application porting to yet another programming model. We propose an alternative approach that automatically translates programs written in one programming model (CUDA) into another (CPU threads) based on Polygeist/MLIR. Our approach includes a representation of parallel constructs that allows conventional compiler transformations to apply transparently and without modification and enables parallelism-specific optimizations. We evaluate our framework by transpiling and optimizing the CUDA Rodinia benchmark suite for a multi-core CPU and achieve a 58% geomean speedup over handwritten OpenMP code. Further, we show how CUDA kernels from PyTorch can efficiently run and scale on the CPU-only Supercomputer Fugaku without user intervention. Our PyTorch compatibility layer making use of transpiled CUDA PyTorch kernels outperforms the PyTorch CPU native backend by 2.7×.
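To make the idea of mapping a CUDA launch onto CPU threads concrete, here is a minimal conceptual sketch in Python. It is not the Polygeist/MLIR implementation and shows none of its compiler transformations; the names `vector_add_kernel` and `launch_on_cpu` are hypothetical, invented for illustration. The assumption is that a grid of blocks can be spread across a pool of CPU threads while the threads of each block run as an ordinary serial loop, which is only valid for kernels without intra-block barriers.

```python
from concurrent.futures import ThreadPoolExecutor


def vector_add_kernel(block_idx, thread_idx, block_dim, a, b, out):
    """Hypothetical kernel body, executed once per (block, thread) pair."""
    # Global index, analogous to blockIdx.x * blockDim.x + threadIdx.x in CUDA.
    i = block_idx * block_dim + thread_idx
    # Bounds guard, as in a typical CUDA kernel whose grid overshoots the data.
    if i < len(out):
        out[i] = a[i] + b[i]


def launch_on_cpu(kernel, grid_dim, block_dim, *args, workers=4):
    """Emulate a 1-D kernel launch: blocks go to a thread pool,
    threads within a block become a serial loop (assumes no barriers)."""

    def run_block(block_idx):
        for thread_idx in range(block_dim):
            kernel(block_idx, thread_idx, block_dim, *args)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() drains the iterator so any exception raised in a block surfaces here.
        list(pool.map(run_block, range(grid_dim)))


if __name__ == "__main__":
    a = [1, 2, 3, 4, 5]
    b = [10, 20, 30, 40, 50]
    out = [0] * len(a)
    # 2 blocks of 4 threads cover 8 indices; the guard skips the extra 3.
    launch_on_cpu(vector_add_kernel, 2, 4, a, b, out)
    print(out)
```

Each output element is written by exactly one (block, thread) pair, so the blocks need no synchronization with each other. A kernel that synchronizes threads inside a block could not be serialized this naively and would need additional handling.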