NSF PAR Search | NSF Public Access Repository

FIER: Fine-Grained and Efficient KV Cache Retrieval for Long-context LLM Inference

Wang, Dongwei; Liu, Zijie; Wang, Song; Ren, Yuxin; Deng, Jianing; Hu, Jingtong; Chen, Tianlong; Yang, Huanrui (November 2025, Association for Computational Linguistics)

Free, publicly-accessible full text available November 4, 2026
MTrain: Enable Efficient CNN Training on Heterogeneous FPGA-Based Edge Servers

https://doi.org/10.1109/TCAD.2025.3541486

Tang, Yue; Jones, Alex K; Xiong, Jinjun; Zhou, Peipei; Hu, Jingtong (January 2025, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems)

FPGA-based edge servers are used in many applications in smart cities, hospitals, retail, etc. Equipped with heterogeneous FPGA-based accelerator cards, the servers can run multiple tasks, including efficient video preprocessing, machine learning algorithm acceleration, etc. These servers are required to run inference during the daytime and re-train the model at night to adapt to new environments, domains, or users. During re-training, the incoming data are conventionally transmitted to the cloud, and the updated machine learning models are then transferred back to the edge server. Such a process is inefficient and cannot protect users’ privacy, so it is desirable for the models to be trained directly on the edge servers. Deploying convolutional neural network (CNN) training on heterogeneous resource-constrained FPGAs is challenging since it needs to consider both the complex data dependency of the training process and the communication bottleneck among different FPGAs. Previous multi-accelerator training algorithms select optimal scheduling strategies for data parallelism, tensor parallelism, and pipeline parallelism. However, pipeline parallelism cannot deal with batch normalization (BN), which is an essential CNN operator, while purely applying data parallelism and tensor parallelism suffers from resource under-utilization and intensive communication costs. In this work, we propose MTrain, a novel multi-accelerator training scheduling strategy that transforms the training process into a multi-branch workflow, so that independent sub-operations of different branches are executed on different training accelerators in parallel for better utilization and reduced communication overhead. Experimental results show that we can achieve efficient CNN training on heterogeneous FPGA-based edge servers with 1.07x-2.21x speedup under 15 GB/s peer-to-peer bandwidth compared to the state-of-the-art work.
Full Text Available
CHEF: A Framework for Deploying Heterogeneous Models on Clusters with Heterogeneous FPGAs

Tang, Yue; Song, Yukai; Elango, Naveena; Priya, Sheena R; Jones, Alex K; Xiong, Jinjun; Zhou, Peipei; Hu, Jingtong (October 2024, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems)

Full Text Available
EQ-ViT: Algorithm-Hardware Co-Design for End-to-End Acceleration of Real-Time Vision Transformer Inference on Versal ACAP Architecture

Dong, P; Zhuang, J; Yang, Z; Ji, S; Li, Y; Xu, D; Huang, H; Hu, J; Jones, A K; Shi, Y; et al (September 2024, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems)

Full Text Available
CHARM 2.0: Composing Heterogeneous Accelerators for Deep Learning on Versal ACAP Architecture

https://doi.org/10.1145/3686163

Zhuang, Jinming; Lau, Jason; Ye, Hanchen; Yang, Zhuoping; Ji, Shixin; Lo, Jack; Denolf, Kristof; Neuendorffer, Stephen; Jones, Alex; Hu, Jingtong; et al (August 2024, ACM Transactions on Reconfigurable Technology and Systems)

Dense matrix multiply (MM) serves as one of the most heavily used kernels in deep learning applications. To cope with the high computation demands of these applications, heterogeneous architectures featuring both FPGA and dedicated ASIC accelerators have emerged as promising platforms. For example, the AMD/Xilinx Versal ACAP architecture combines general-purpose CPU cores and programmable logic with AI Engine processors optimized for AI/ML. An array of 400 AI Engine processors executing at 1 GHz can provide up to 6.4 TFLOPS performance for 32-bit floating-point (FP32) data. However, machine learning models often contain both large and small MM operations. While large MM operations can be parallelized efficiently across many cores, small MM operations typically cannot. We observe that executing some small MM layers from the BERT natural language processing model on a large, monolithic MM accelerator in Versal ACAP achieved less than 5% of the theoretical peak performance. Therefore, one key question arises: How can we design accelerators to fully use the abundant computation resources under limited communication bandwidth for end-to-end applications with multiple MM layers of diverse sizes? We identify the biggest system throughput bottleneck resulting from the mismatch between massive computation resources of one monolithic accelerator and the various MM layers of small sizes in the application. To resolve this problem, we propose the CHARM framework to compose multiple diverse MM accelerator architectures working concurrently on different layers within one application. CHARM includes analytical models which guide design space exploration to determine accelerator partitions and layer scheduling. To facilitate system designs, CHARM automatically generates code, enabling thorough onboard design verification.
We deploy the CHARM framework on four different deep learning applications in FP32, INT16, and INT8 data types, including BERT, ViT, NCF, and MLP, on the AMD/Xilinx Versal ACAP VCK190 evaluation board. Our experiments show that we achieve 1.46 TFLOPS, 1.61 TFLOPS, 1.74 TFLOPS, and 2.94 TFLOPS inference throughput for BERT, ViT, NCF, and MLP in FP32 data type, respectively, which correspond to 5.29×, 32.51×, 1.00×, and 1.00× throughput gains compared to one monolithic accelerator. CHARM achieves the maximum throughput of 1.91 TOPS, 1.18 TOPS, 4.06 TOPS, and 5.81 TOPS in the INT16 data type for the four applications. The maximum throughput achieved by CHARM in the INT8 data type is 3.65 TOPS, 1.28 TOPS, 10.19 TOPS, and 21.58 TOPS, respectively. We have open-sourced our tools, including detailed step-by-step guides to reproduce all the results presented in this paper and to enable other users to learn and leverage the CHARM framework and tools in their end-to-end systems: https://github.com/arc-research-lab/CHARM.
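The FP32 figures quoted in this abstract can be cross-checked with a little arithmetic. The sketch below is only a reader's sanity check, not part of CHARM: the per-engine rate and the monolithic baselines are derived here by division and are not stated in the abstract, and the function names are hypothetical. Note also that the "less than 5%" remark in the abstract refers to some small MM layers, so comparing it with whole-model baselines is a rough consistency check at best.

```python
# Figures quoted in the abstract.
NUM_AI_ENGINES = 400
CLOCK_HZ = 1e9              # 1 GHz
PEAK_FP32_TFLOPS = 6.4

# Derived (assumption): FP32 operations per cycle per AI Engine implied by the peak.
flops_per_cycle = PEAK_FP32_TFLOPS * 1e12 / (NUM_AI_ENGINES * CLOCK_HZ)

# Reported FP32 throughput (TFLOPS) and gain over one monolithic accelerator.
reported = {
    "BERT": (1.46, 5.29),
    "ViT": (1.61, 32.51),
    "NCF": (1.74, 1.00),
    "MLP": (2.94, 1.00),
}


def implied_baseline(throughput_tflops, gain):
    """Monolithic-accelerator throughput implied by a reported gain (derived)."""
    return throughput_tflops / gain


def fraction_of_peak(tflops):
    """Share of the 6.4 TFLOPS theoretical FP32 peak."""
    return tflops / PEAK_FP32_TFLOPS


if __name__ == "__main__":
    print(f"implied FLOPs/cycle per engine: {flops_per_cycle:.1f}")
    for model, (tflops, gain) in reported.items():
        base = implied_baseline(tflops, gain)
        print(
            f"{model}: reported {fraction_of_peak(tflops):.1%} of peak, "
            f"implied monolithic baseline {base:.3f} TFLOPS "
            f"({fraction_of_peak(base):.1%} of peak)"
        )
```

Under these assumptions, the implied monolithic baselines for BERT and ViT fall below 5% of the theoretical peak, which is in line with the utilization problem the abstract describes.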
Full Text Available
Sustainable AI Processing at the Edge

https://doi.org/10.1109/MM.2022.3220399

Ollivier, Sebastien; Li, Sheng; Tang, Yue; Cahoon, Stephen; Caginalp, Ryan; Chaudhuri, Chayanika; Zhou, Peipei; Tang, Xulong; Hu, Jingtong; Jones, Alex K. (January 2023, IEEE Micro)

Full Text Available
Toward Comprehensive Shifting Fault Tolerance for Domain-Wall Memories with PIETT

https://doi.org/10.1109/TC.2022.3188206

Ollivier, Sebastien; Longofono, Stephen; Dutta, Prayash; Hu, Jingtong; Bhanja, Sanjukta; Jones, Alex K. (July 2022, IEEE Transactions on Computers)

Full Text Available
POD-RACING: Bulk-Bitwise to Floating-Point Compute in Racetrack Memory for Machine Learning At the Edge

https://doi.org/10.1109/MM.2022.3195761

Ollivier, Sebastien; Zhang, Xinyi; Tang, Yue; Choudhuri, Chayanika; Hu, Jingtong; Jones, Alex K. (January 2022, IEEE Micro)

Full Text Available