NSF PAR Search | NSF Public Access Repository

Graph Learning at Scale: Characterizing and Optimizing Pre-Propagation GNNs

Yue, Zichao; Deng, Chenhui; Zhang, Zhiru (May 2025, The Annual Conference on Machine Learning and Systems (MLSys))

Free, publicly-accessible full text available May 12, 2026

CirSTAG: Circuit Stability Analysis on Graph-based Manifolds

https://doi.org/10.1109/DAC63849.2025.11132637

Cheng, Wuxinlin; Yuan, Yihang; Deng, Chenhui; Aghdaei, Ali; Zhang, Zhiru; Feng, Zhuo (June 2025, IEEE)

Free, publicly-accessible full text available June 22, 2026

SmoothE: Differentiable E-Graph Extraction

https://doi.org/10.1145/3669940.3707262

Cai, Yaohui; Yang, Kaixin; Deng, Chenhui; Yu, Cunxi; Zhang, Zhiru (March 2025, ACM)

Free, publicly-accessible full text available March 30, 2026

ARIES: An Agile MLIR-Based Compilation Flow for Reconfigurable Devices with AI Engines

https://doi.org/10.1145/3706628.3708870

Zhuang, Jinming; Xiang, Shaojie; Chen, Hongzheng; Zhang, Niansong; Yang, Zhuoping; Mao, Tony; Zhang, Zhiru; Zhou, Peipei (February 2025, ACM)

As AI continues to grow, modern applications are becoming more data- and compute-intensive, driving the development of specialized AI chips to meet these demands. One example is AMD's AI Engine (AIE), a dedicated hardware system that includes a 2D array of high-frequency very long instruction word (VLIW) vector processors to provide high computational throughput and reconfigurability. However, AIE's specialized architecture presents tremendous challenges in programming and compiler optimization. Existing AIE programming frameworks lack a clean abstraction to represent multi-level parallelism in AIE; programmers have to identify the parallelism within a kernel, partition it manually, and assign sub-tasks to different AIE cores to exploit that parallelism. These manual steps significantly lower programming productivity. Furthermore, some AIE architectures include FPGAs to provide extra flexibility, but there is no unified intermediate representation (IR) that captures these architectural differences. As a result, existing compilers can only optimize the AIE portions of the code, overlooking potential FPGA bottlenecks and leading to suboptimal performance. To address these limitations, we introduce ARIES, an agile multi-level intermediate representation (MLIR) based compilation flow for reconfigurable devices with AIEs. ARIES introduces a novel programming model that allows users to map kernels to separate AIE cores, exploiting task- and tile-level parallelism without restructuring code. It also includes a declarative scheduling interface to explore instruction-level parallelism within each core. At the IR level, we propose a unified MLIR-based representation for AIE architectures, both with and without FPGAs, facilitating holistic optimization and better portability across AIE device families.
For the General Matrix Multiply (GEMM) benchmark, ARIES achieves 4.92 TFLOPS, 15.86 TOPS, and 45.94 TOPS throughput for the FP32, INT16, and INT8 data types, respectively, on the Versal VCK190. Compared with the state-of-the-art (SOTA) work CHARM for AIE, ARIES improves the throughput by 1.17x, 1.59x, and 1.47x, respectively. For a ResNet residual layer, ARIES achieves up to 22.58x speedup compared with the optimized SOTA work Riallto on the Ryzen-AI NPU. ARIES is open-sourced on GitHub: https://github.com/arc-research-lab/Aries.
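As a rough illustration of the tile-level parallelism the abstract mentions, the sketch below splits a GEMM output into independent tiles and assigns each tile to a position on a small 2D core grid. This is not the ARIES programming model or API; the function names (`tile_tasks`, `tiled_matmul`), the round-robin assignment, and the 2x2 grid are all hypothetical choices made for illustration, and the "cores" are simulated sequentially in plain Python.

```python
from itertools import product

def matmul(a, b):
    """Reference GEMM on nested lists (no tiling)."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[p] * b[p][j] for p in range(inner)) for j in range(cols)]
            for row in a]

def tile_tasks(rows, cols, tile, grid_rows, grid_cols):
    """Split a rows x cols output into tiles of side `tile`.

    Each task is (core, row_range, col_range). The round-robin mapping of
    tiles onto a grid_rows x grid_cols core grid is an assumption made for
    this sketch, not a scheme taken from the paper.
    """
    tasks = []
    for ti, r0 in enumerate(range(0, rows, tile)):
        for tj, c0 in enumerate(range(0, cols, tile)):
            core = (ti % grid_rows, tj % grid_cols)
            tasks.append((core,
                          range(r0, min(r0 + tile, rows)),
                          range(c0, min(c0 + tile, cols))))
    return tasks

def tiled_matmul(a, b, tile=2, grid=(2, 2)):
    """Compute a @ b one output tile at a time."""
    rows, inner, cols = len(a), len(b), len(b[0])
    c = [[0] * cols for _ in range(rows)]
    for _core, row_range, col_range in tile_tasks(rows, cols, tile, *grid):
        # Output tiles do not overlap, so each task is independent and
        # could in principle run on its own core; here they run in sequence.
        for i, j in product(row_range, col_range):
            c[i][j] = sum(a[i][p] * b[p][j] for p in range(inner))
    return c

A = [[i + j for j in range(3)] for i in range(5)]      # 5 x 3
B = [[i * j + 1 for j in range(4)] for i in range(3)]  # 3 x 4
assert tiled_matmul(A, B) == matmul(A, B)
```

Because the output tiles are disjoint, the per-tile loops share no writes, which is the property that makes this kind of partitioning a natural unit of work to distribute across cores.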
Free, publicly-accessible full text available February 27, 2026

PIMsynth: A Unified Compiler Framework for Bit-Serial Processing-In-Memory Architectures

https://doi.org/10.1109/LCA.2025.3600588

Guo, Deyuan; Gholamrezaei, Mohammadhosein; Hofmann, Matthew; Venkat, Ashish; Zhang, Zhiru; Skadron, Kevin (January 2025, IEEE Computer Architecture Letters)

Free, publicly-accessible full text available January 1, 2026

Rapid GPU-Based Pangenome Graph Layout

https://doi.org/10.1109/SC41406.2024.00035

Li, Jiajie; Schmelzle, Jan-Niklas; Du, Yixiao; Heumos, Simon; Guarracino, Andrea; Guidi, Giulia; Prins, Pjotr; Garrison, Erik; Zhang, Zhiru (November 2024, IEEE)

Free, publicly-accessible full text available November 17, 2025

Differentiable Combinatorial Scheduling at Scale

Liu, Mingju; Li, Yingjie; Yin, Jiaqi; Zhang, Zhiru; Yu, Cunxi (July 2024, International Conference on Machine Learning (ICML))

Full Text Available

Learning from Students: Applying t-Distributions to Explore Accurate and Efficient Formats for LLMs

Dotzel, Jordan; Chen, Yuzong; Kotb, Bahaa; Prasad, Sushma; Wu, Gang; Li, Sheng; Abdelfattah, Mohamed S; Zhang, Zhiru (September 2024, Openreview)

Full Text Available

Differentiable Combinatorial Scheduling at Scale

Liu, Mingju; Li, Yingjie; Yin, Jiaqi; Zhang, Zhiru; Yu, Cunxi (July 2024, Proceedings of the 41st International Conference on Machine Learning (ICML))

Full Text Available

Polynormer: Polynomial-Expressive Graph Transformer in Linear Time

Deng, Chenhui; Yue, Zichao; Zhang, Zhiru (May 2024, International Conference on Learning Representations (ICLR))

Full Text Available
