NSF PAR Search | NSF Public Access Repository

AGILE: Lightweight and Efficient Asynchronous GPU-SSD Integration

Yang, Zhuoping; Zhuang, Jinming; Chen, Xingzhen; Jones, Alex; Zhou, Peipei (November 2025, ACM)

GPUs are critical for compute-intensive applications, yet emerging workloads such as recommender systems, graph analytics, and data analytics often exceed GPU memory capacity. Existing solutions allow GPUs to use CPU DRAM or SSDs as external memory, and the GPU-centric approach enables GPU threads to directly issue NVMe requests, further avoiding CPU intervention. However, current GPU-centric approaches adopt synchronous I/O, forcing threads to stall during long communication delays. We propose AGILE, a lightweight asynchronous GPU-centric I/O library that eliminates deadlock risks and integrates a flexible HBM-based software cache. AGILE overlaps computation and I/O, improving performance by up to 1.88× across workloads with diverse computation-to-communication ratios. Compared to BaM on DLRM, AGILE achieves up to 1.75× speedup through efficient design and overlapping; on graph applications, AGILE reduces software cache overhead by up to 3.12× and NVMe I/O overhead by up to 2.85×; AGILE also lowers per-thread register usage by up to 1.32×.
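The contrast the abstract draws between synchronous I/O and overlapped computation and I/O can be illustrated on the CPU with a small prefetching loop. This is only a sketch of the general idea, not AGILE's API or implementation: `load_block`, `compute`, `run_synchronous`, and `run_overlapped` are hypothetical names, and a sleep stands in for the storage read.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def load_block(block_id):
    """Stand-in for a slow storage read (hypothetical; sleeps instead of doing I/O)."""
    time.sleep(0.01)
    return [block_id] * 4


def compute(block):
    """Stand-in for the compute phase that consumes one block."""
    return sum(block)


def run_synchronous(num_blocks):
    """Load, then compute, one block at a time: every load stalls the caller."""
    return [compute(load_block(i)) for i in range(num_blocks)]


def run_overlapped(num_blocks):
    """Issue the load for block i+1 before computing on block i."""
    results = []
    if num_blocks == 0:
        return results
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(load_block, 0)
        for i in range(num_blocks):
            block = pending.result()  # wait only for the block needed now
            if i + 1 < num_blocks:
                # Prefetch the next block before starting this block's compute.
                pending = pool.submit(load_block, i + 1)
            results.append(compute(block))  # runs while the prefetch is in flight
    return results
```

Both functions return the same results; the overlapped version hides load latency behind computation only to the extent that the compute phase is long enough to cover it.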
Free, publicly-accessible full text available November 16, 2026
ART: Customizing Accelerators for DNN-Enabled Real-Time Safety-Critical Systems

https://doi.org/10.1145/3716368.3735215

Ji, Shixin; Chen, Xingzhen; Zhuang, Jinming; Zhang, Wei; Yang, Zhuoping; Schultz, Sarah; Song, Yukai; Hu, Jingtong; Jones, Alex; Dong, Zheng; et al (June 2025, ACM)

Real-time systems are widely applied in areas such as autonomous vehicles, where safety is the key metric. However, on the FPGA platform, most prior accelerator frameworks do not discuss schedulability in such real-time safety-critical systems, leaving deadlines unmet, which can lead to catastrophic system failures. To address this, we propose the ART framework, a hardware-software co-design approach that transforms baseline accelerators into "real-time guaranteed" accelerators. On the software side, ART performs schedulability analysis and preemption point placement, optimizing task scheduling to meet deadlines and enhance throughput. On the hardware side, ART integrates the Global Earliest Deadline First (GEDF) scheduling algorithm, implements preemption, and conducts source code transformation to turn baseline HLS-based accelerators into designs targeted for real-time systems that are capable of saving and resuming tasks. ART also includes integration, debugging, and testing tools for full-system implementation. We demonstrate the methodology of ART on two kinds of popular accelerator models and evaluate it on the AMD Versal VCK190 platform, where ART meets schedulability requirements that baseline accelerators fail to meet. ART is lightweight, utilizing <0.5% of resources. With about 100 lines of user input, ART generates about 2.5k lines of accelerator code, making it a push-button solution.
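The GEDF rule named in the abstract can be sketched as a small discrete-time simulation: at every time step, the ready jobs with the earliest absolute deadlines occupy the available processors, and any other ready job waits. This is a generic textbook-style illustration, not ART's hardware scheduler; the `Job` fields, `simulate_gedf`, and the unit-time-step model are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    release: int    # time the job becomes ready
    deadline: int   # absolute deadline
    remaining: int  # execution time still needed, in time units


def simulate_gedf(jobs, num_processors, horizon):
    """Simulate preemptive global EDF in unit time steps.

    Returns (schedule, missed): schedule[t] lists the job names running in
    slot t; missed lists jobs that finished after their deadline or were
    still unfinished after a deadline that falls within the horizon.
    """
    schedule = []
    finish_time = {}
    for t in range(horizon):
        ready = [j for j in jobs if j.release <= t and j.remaining > 0]
        # GEDF rule: the num_processors ready jobs with the earliest
        # absolute deadlines run; any other ready job waits (is preempted).
        ready.sort(key=lambda j: (j.deadline, j.name))
        running = ready[:num_processors]
        for j in running:
            j.remaining -= 1
            if j.remaining == 0:
                finish_time[j.name] = t + 1
        schedule.append([j.name for j in running])
    missed = sorted(
        j.name for j in jobs
        if finish_time.get(j.name, horizon + 1) > j.deadline
    )
    return schedule, missed
```

For example, with two processors and jobs A (release 0, deadline 4, 2 units), B (release 0, deadline 2, 2 units), and C (release 1, deadline 3, 1 unit), job A is preempted in slot 1 when C arrives with an earlier deadline, and all three jobs still finish on time.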
Free, publicly-accessible full text available June 29, 2026
ARIES: An Agile MLIR-Based Compilation Flow for Reconfigurable Devices with AI Engines

https://doi.org/10.1145/3706628.3708870

Zhuang, Jinming; Xiang, Shaojie; Chen, Hongzheng; Zhang, Niansong; Yang, Zhuoping; Mao, Tony; Zhang, Zhiru; Zhou, Peipei (February 2025, ACM)

As AI continues to grow, modern applications are becoming more data- and compute-intensive, driving the development of specialized AI chips to meet these demands. One example is AMD's AI Engine (AIE), a dedicated hardware system that includes a 2D array of high-frequency very long instruction word (VLIW) vector processors to provide high computational throughput and reconfigurability. However, AIE's specialized architecture presents tremendous challenges in programming and compiler optimization. Existing AIE programming frameworks lack a clean abstraction to represent multi-level parallelism in AIE; programmers have to figure out the parallelism within a kernel, manually partition it, and assign sub-tasks to different AIE cores to exploit parallelism. These steps significantly lower programming productivity. Furthermore, some AIE architectures include FPGAs to provide extra flexibility, but there is no unified intermediate representation (IR) that captures these architectural differences. As a result, existing compilers can only optimize the AIE portions of the code, overlooking potential FPGA bottlenecks and leading to suboptimal performance. To address these limitations, we introduce ARIES, an agile multi-level intermediate representation (MLIR) based compilation flow for reconfigurable devices with AIEs. ARIES introduces a novel programming model that allows users to map kernels to separate AIE cores, exploiting task- and tile-level parallelism without restructuring code. It also includes a declarative scheduling interface to explore instruction-level parallelism within each core. At the IR level, we propose a unified MLIR-based representation for AIE architectures, with or without FPGA, facilitating holistic optimization and better portability across AIE device families.
For the General Matrix Multiply (GEMM) benchmark, ARIES achieves 4.92 TFLOPS, 15.86 TOPS, and 45.94 TOPS throughput under FP32, INT16, and INT8 data types on Versal VCK190, respectively. Compared with the state-of-the-art (SOTA) work CHARM for AIE, ARIES improves the throughput by 1.17x, 1.59x, and 1.47x correspondingly. For the ResNet residual layer, ARIES achieves up to 22.58x speedup compared with the optimized SOTA work Riallto on Ryzen-AI NPU. ARIES is open-sourced on GitHub: https://github.com/arc-research-lab/Aries.
Free, publicly-accessible full text available February 27, 2026
Towards Accelerator Customization in Real-time Safety-critical Systems

https://doi.org/10.1145/3706628.3708841

Ji, Shixin; Chen, Xingzhen; Zhang, Wei; Yang, Zhuoping; Zhuang, Jinming; Schultz, Sarah; Song, Yukai; Hu, Jingtong; Jones, Alex K; Dong, Zheng; et al (February 2025, ACM)

Free, publicly-accessible full text available February 27, 2026
Amortizing Embodied Carbon Across Generations

Ji, Shixin; Zhuang, Jinming; Yang, Zhuoping; Jones, Alex; Zhou, Peipei (November 2024, IEEE)

Data centers have been relying on renewable energy integration coupled with energy-efficient specialized processing units and accelerators to increase sustainability. Unfortunately, the carbon generated from manufacturing these systems is becoming increasingly relevant due to these energy decarbonization and efficiency improvements. Furthermore, it is less clear how to mitigate this aspect of embodied carbon. As workloads continue to evolve over each hardware generation, we explore the tradeoffs of fabricating new application-tuned hardware compared with more general solutions such as Field Programmable Gate Arrays (FPGAs). We also explore how REFRESH FPGAs can amortize embodied carbon investments from previous generations to meet the requirements of future generations' workloads.
Free, publicly-accessible full text available November 2, 2025
Reducing Smart Phone Environmental Footprints with In-Memory Processing

Yang, Zhuoping; Zhang, Wei; Ji, Shixin; Zhou, Peipei; Jones, Alex (October 2024, IEEE)

Smart phones have revolutionized the availability of computing to the consumer. Recently, smart phones have been aggressively integrating artificial intelligence (AI) capabilities into their devices. The custom-designed processors for the latest phones integrate incredibly capable and energy-efficient graphics processors (GPUs) and tensor processors (TPUs) to accommodate this emerging AI workload and on-device inference. Unfortunately, smart phones are far from sustainable and have a substantial carbon footprint that continues to be dominated by environmental impacts from their manufacture and far less so by the energy required to power their operation. In this paper we explore the possibility of reversing the trend of increasing the silicon dedicated to emerging application workloads in the phone. Instead, we consider how in-memory processing using the DRAM already present in the phone could be used in place of dedicated GPU/TPU devices for AI inference. We explore the potential savings in embodied carbon that could be possible with this tradeoff and provide some analysis of the potential of in-memory computing to compete with these accelerators. While it may not be possible to achieve the same throughput, we suggest that the responsiveness to the user may be sufficient using in-memory computing, while both the embodied and operational carbon footprints could be improved. Our approach can save circa 10–15 kg CO2e.
Full Text Available
EQ-ViT: Algorithm-Hardware Co-Design for End-to-End Acceleration of Real-Time Vision Transformer Inference on Versal ACAP Architecture

https://doi.org/10.1109/TCAD.2024.3443692

Dong, Peiyan; Zhuang, Jinming; Yang, Zhuoping; Ji, Shixin; Li, Yanyu; Xu, Dongkuan; Huang, Heng; Hu, Jingtong; Jones, Alex K; Shi, Yiyu; et al (November 2024, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems)

Free, publicly-accessible full text available November 1, 2025
EQ-ViT: Algorithm-Hardware Co-Design for End-to-End Acceleration of Real-Time Vision Transformer Inference on Versal ACAP Architecture

Dong, Peiyan; Zhuang, Jinming; Yang, Zhuoping; Ji, Shixin; Li, Yanyu; Xu, Dongkuan; Huang, Heng; Hu, Jingtong; Jones, Alex K; Shi, Yiyu; et al (October 2024, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems)

While Vision Transformers (ViTs) have shown consistent progress in computer vision, deploying them for real-time decision-making scenarios (< 1 ms) is challenging. Current computing platforms like CPUs, GPUs, or FPGA-based solutions struggle to meet this deterministic low-latency real-time requirement, even with quantized ViT models. Some approaches use pruning or sparsity to reduce model size and latency, but this often results in accuracy loss. To address the aforementioned constraints, in this work, we propose EQ-ViT, an end-to-end acceleration framework with novel algorithm and architecture co-design features to enable real-time ViT acceleration on the AMD Versal Adaptive Compute Acceleration Platform (ACAP). The contributions are four-fold. First, we perform in-depth kernel-level performance profiling and analysis and explain the bottlenecks for existing acceleration solutions on GPU, FPGA, and ACAP. Second, on the hardware level, we introduce a new spatial and heterogeneous accelerator architecture, the EQ-ViT architecture. This architecture leverages the heterogeneous features of ACAP, where both FPGA and artificial intelligence engines (AIEs) coexist on the same system-on-chip (SoC). Third, on the algorithm level, we create a comprehensive quantization-aware training strategy, the EQ-ViT algorithm. This strategy concurrently quantizes both weights and activations into 8-bit integers, aiming to improve accuracy rather than compromise it during quantization. Notably, the method also quantizes nonlinear functions for efficient hardware implementation. Fourth, we design the EQ-ViT automation framework to implement the EQ-ViT architecture for four different ViT applications on the AMD Versal ACAP VCK190 board, achieving an accuracy improvement of 2.4% and average speedups of 315.0x, 3.39x, 3.38x, 14.92x, 59.5x, and 13.1x over computing solutions of Intel Xeon 8375C vCPU, Nvidia A10G, A100, Jetson AGX Orin GPUs, and AMD ZCU102, U250 FPGAs.
The energy efficiency gains are 62.2x, 15.33x, 12.82x, 13.31x, 13.5x, and 21.9x.
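As background for the phrase "quantizes both weights and activations into 8-bit integers", the sketch below shows plain symmetric uniform quantization with a single shared scale. It is a generic illustration only, not the quantization-aware training strategy described in the abstract; `quantize_int8`, `dequantize_int8`, and the choice of the range [-127, 127] are assumptions made for the example.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] with one shared scale."""
    max_abs = max((abs(v) for v in values), default=0.0)
    # The largest magnitude maps to 127; an all-zero input keeps scale 1.0.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize_int8(quantized, scale):
    """Recover approximate floats from the integers and the scale."""
    return [q * scale for q in quantized]
```

Because values are rounded to the nearest integer step, each dequantized value differs from the original by at most about half of `scale`.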
Full Text Available
SCARIF: Towards Carbon Modeling of Cloud Servers with Accelerators

https://doi.org/10.1109/ISVLSI61997.2024.00095

Ji, Shixin; Yang, Zhuoping; Chen, Xingzhen; Cahoon, Stephen; Hu, Jingtong; Shi, Yiyu; Jones, Alex K; Zhou, Peipei (July 2024, IEEE)

Embodied carbon has been widely reported as a significant component in the full system lifecycle of various computing systems' greenhouse gas emissions. Many efforts have been undertaken to quantify the elements that comprise this embodied carbon, from tools that evaluate semiconductor manufacturing to those that can quantify different elements of the computing system from commercial and academic sources. However, these tools cannot easily reproduce results reported by server vendors' product carbon reports, and the accuracy can vary substantially due to various assumptions. Furthermore, attempts to determine greenhouse gas contributions using bottom-up methodologies often do not agree with system-level studies and are hard to reconcile. Nonetheless, given there is a need to consider all contributions to greenhouse gas emissions in datacenters, we propose SCARIF, the Server Carbon including Accelerator Reporter with Intelligence-based Formulation tool. SCARIF has three main contributions: (1) We first collect reported carbon cost data from server vendors and design statistical models to predict the embodied carbon cost so that users can get the embodied carbon cost for their server configurations. (2) We provide the embodied carbon cost if users configure servers with accelerators, including GPUs and FPGAs. (3) Using case studies, we show that certain design choices of data center management might flip based on the insights and observations from using SCARIF. Thus, SCARIF provides an opportunity for large-scale datacenter and hyperscaler design. We release SCARIF as an open-source tool at https://github.com/arc-research-lab/SCARIF.
Full Text Available
SSR: Spatial Sequential Hybrid Architecture for Latency Throughput Tradeoff in Transformer Acceleration

Zhuang, Jinming; Yang, Zhuoping; Ji, Shixin; Huang, Heng; Jones, Alex; Hu, Jingtong; Shi, Yiyu; Zhou, Peipei (June 2024, The 32nd ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA 2024))

Full Text Available
