Search for: All records

Award ID contains: 2016701


  1. CUDA is designed specifically for NVIDIA GPUs and is not compatible with non-NVIDIA devices. Enabling CUDA execution on alternative backends could greatly benefit the hardware community by fostering a more diverse software ecosystem.

    To address the need for portability, our objective is to develop a framework that meets key requirements, such as extensive coverage, comprehensive end-to-end support, superior performance, and hardware scalability. Existing solutions that translate CUDA source code into other high-level languages, however, fall short of these goals.

    In contrast to these source-to-source approaches, we present a novel framework, CuPBoP, which treats CUDA as a portable language in its own right. Compared to two commercial source-to-source solutions, CuPBoP offers broader coverage and superior performance for CUDA-to-CPU migration. Additionally, we evaluate the performance of CuPBoP against manually optimized CPU programs, highlighting the differences between CPU programs derived from CUDA and those that are manually optimized.

    Furthermore, we demonstrate the hardware scalability of CuPBoP by showcasing its successful migration of CUDA to AMD GPUs.

    To promote further research in this field, we have released CuPBoP as an open-source resource.

     
    Free, publicly-accessible full text available July 31, 2025
  2. The two largest barriers to adoption of FPGA platforms for HPC applications are the difficulty of programming FPGAs and the performance gap when compared to GPUs. To address the first barrier, new ecosystems like Intel oneAPI and Xilinx Vitis HLS aim to improve programmability for FPGA platforms. In terms of performance, FPGAs accept lower compute frequencies in exchange for more customized hardware acceleration and better power efficiency than GPUs. The performance of memory-bound applications on recent GPU platforms like NVIDIA’s H100 and AMD’s MI210 has also improved due to the inclusion of high-bandwidth memory (HBM), and newer FPGA platforms are starting to include HBM in addition to traditional DRAM. To understand the current state of the art and the performance differences between FPGAs and GPUs, we consider realized memory bandwidth for recent FPGA and GPU platforms. We utilize a custom STREAM benchmark to evaluate two Intel FPGA platforms, the Stratix 10 SX PAC and Bittware 520N-MX; two AMD/Xilinx FPGA platforms, the Alveo U250 and Alveo U280; and GPU platforms from NVIDIA and AMD. We also extract power measurements and estimate memory bandwidth per watt ((GB/s)/W) on these platforms to evaluate how FPGAs compare against GPU execution. While the GPUs far exceed the FPGAs in raw performance, the HBM-equipped FPGAs demonstrate a competitive performance-power balance for larger data sizes that can be easily implemented with oneAPI and Vitis HLS kernels. These findings suggest a potential sweet spot for this emerging FPGA ecosystem to serve bandwidth-limited applications in an energy-efficient fashion.
  3. Current state-of-the-art resource management systems leverage Machine Learning (ML) methods to enable the efficient use of heterogeneous memory hardware deployed across emerging computing platforms. While machine intelligence can be effectively used to learn and predict the complex data access patterns of modern analytics, applying ML across rapidly growing data sizes and memory footprints is prohibitively expensive for practical system-level integration. For this reason, recent solutions use existing lightweight historical information to predict the access behavior of the majority of the application pages, and train ML models over a small page subset. To maximize application performance improvements, the pages selected for machine learning-based management are identified with elaborate page selection methods. These methods involve the calculation of detailed performance estimates that depend on the configuration of the hybrid memory platform. This paper aims to reduce these operational overheads, which compound the already high cost of using machine intelligence in return for high performance and efficiency. To this end, we build Cronus, an image-based pipeline for selecting pages for ML-based management. We visualize memory access patterns and reveal spatial and temporal correlations among the selected pages that current methods fail to leverage. We then use the created images to detect patterns and select page groups for machine learning model deployment. Cronus drastically reduces the operational costs while preserving the effectiveness of the page selection and the achieved performance of machine-intelligent hybrid memory management. This work makes a case that visualization and computer vision methods can unlock new insights and reduce the operational complexity of emerging systems solutions.
  4. Disaggregated memory is being proposed as a way to provide efficient memory scaling for data-intensive applications. High-performance interconnect technologies, such as CXL, make disaggregated fabric-attached memory (FAM) a viable secondary tier of memory. Previous work on remote memory relies on extending kernel-level paging to utilize FAM as an additional storage tier after local memory. These approaches have the advantage of exposing remote memory in application-transparent ways that do not require code changes, but they incur large overheads due to the mismatch between the abstraction of a flat virtual address space and the reality of the tiered nature of FAM. In this paper, we present an alternative approach to remote memory based on application-specific objects. We design FAM-Graph, a semi-external graph processing system that leverages application-level properties, such as read-only edge data, to efficiently tier data between local and remote memory and to prefetch remote data for local computation. Using several graph algorithms and datasets, we demonstrate that FAM-Graph achieves end-to-end performance within a factor of 1–6× of Galois, the state-of-the-art shared-memory graph processing system, while using up to 20× less local memory. When Galois is used in conjunction with an OS-level FAM solution, we show that FAM-Graph achieves up to 9× better end-to-end performance when both systems are configured with the same amount of local memory.
  5. Emerging workloads benefit from massive memory capacities provided by hybrid memory platforms. Recent system-level hybrid memory management solutions integrate machine learning methods to better predict complex data access behaviors. Given the substantial associated learning overheads, such solutions train parallel recurrent neural networks to learn the access patterns at the granularity of a page for a carefully selected page subset. We observe that the size of this subset varies immensely across workload classes, sizes, and patterns. Increasing the granularity to the level of a page group will help reduce the aggregate learning overheads. Yet, unsupervised machine learning clustering methods are not practical to use in this context. Instead, this paper builds Coeus, a page grouping mechanism for machine learning-based hybrid memory management. Coeus is simple, robust, and efficient. Coeus leverages data reuse insights to fine-tune the granularity at which patterns are interpreted by the system. As a result, Coeus creates large clusters of pages that share the same access behavior, in a practical way. Coeus reduces the associated learning overheads by almost 3x. In addition, Coeus achieves 3x higher application performance than configurations of existing hybrid memory managers, through the combined effects of applying machine learning to more pages and performing management operations at a better granularity.
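
The bandwidth-per-watt metric ((GB/s)/W) mentioned in item 2 can be illustrated with a minimal Python sketch. It assumes a STREAM triad kernel, a[i] = b[i] + s * c[i], counted in the conventional STREAM way as two reads and one write per element; the abstract only says a custom STREAM benchmark was used, so the kernel choice, the function names, and the sample numbers below are all hypothetical and are not measurements from the paper.

```python
# Hypothetical helpers for a STREAM-style bandwidth-per-watt estimate.
# None of these names or numbers come from the paper.

def triad_bytes(n_elements: int, bytes_per_element: int = 8) -> int:
    """Bytes moved by one triad pass: two arrays read, one array written."""
    return 3 * n_elements * bytes_per_element


def bandwidth_gb_per_s(bytes_moved: int, seconds: float) -> float:
    """Realized bandwidth in GB/s (decimal gigabytes, 1 GB = 1e9 bytes)."""
    if seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return bytes_moved / seconds / 1e9


def bandwidth_per_watt(gb_per_s: float, watts: float) -> float:
    """Energy-efficiency metric in (GB/s)/W."""
    if watts <= 0:
        raise ValueError("power must be positive")
    return gb_per_s / watts


if __name__ == "__main__":
    # Made-up example inputs: array length, elapsed time, and board power.
    n = 2**27            # doubles per array
    elapsed = 0.010      # seconds for one triad pass
    power = 100.0        # average watts during the run
    bw = bandwidth_gb_per_s(triad_bytes(n), elapsed)
    print(f"{bw:.1f} GB/s, {bandwidth_per_watt(bw, power):.2f} (GB/s)/W")
```

Dividing by measured power is what lets a platform with lower raw bandwidth still compare favorably, which is the comparison item 2 draws between HBM-equipped FPGAs and GPUs.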