NSF PAR Search | NSF Public Access Repository


COpter: Efficient Large-Scale Resource-Allocation via Continual Optimization

https://doi.org/10.1145/3731569.3764846

Subramanya, Suhas Jayaram; Dennis, Don Kurian; Smith, Virginia; Ganger, Gregory R (October 2025, ACM)

Free, publicly-accessible full text available October 12, 2026
Mirage: A Multi-Level Superoptimizer for Tensor Programs

Wu, Mengdi; Cheng, Xinhao; Liu, Shengyu; Shi, Chunan; Ji, Jianan; Ao, Man Kit; Velliengiri, Praveen; Miao, Xupeng; Padon, Oded; Jia, Zhihao (July 2025, USENIX)

Free, publicly-accessible full text available July 7, 2026
FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving

Ye, Zihao; Chen, Lequn; Lai, Ruihang; Lin, Wuwei; Zhang, Yineng; Wang, Stephanie; Chen, Tianqi; Kasikci, Baris; Grover, Vinod; Krishnamurthy, Arvind; et al (May 2025, MLSys 2025)

Free, publicly-accessible full text available May 12, 2026
H4H: Hybrid Convolution-Transformer Architecture Search for NPU-CIM Heterogeneous Systems for AR/VR Applications

Zhao, Yiwei; Li, Ziyun; Khwa, Win-San; Sun, Xiaoyu; Zhang, Sai Qian; Sarwar, Syed Shakib; Stangherlin, Kleber Hugo; Lu, Yi-Lun; Gomez, Jorge Tomas; Seo, Jae-sun; et al (January 2025, Proceedings of the Asia and South Pacific Design Automation Conference (ASP-DAC))

Low-latency and low-power edge AI is crucial for Virtual Reality and Augmented Reality applications. Recent advances demonstrate that hybrid models, combining convolution layers (CNN) and transformers (ViT), often achieve a superior accuracy/performance tradeoff on various computer vision and machine learning (ML) tasks. However, hybrid ML models can present system challenges for latency and energy efficiency due to their diverse nature in dataflow and memory access patterns. In this work, we leverage architecture heterogeneity from Neural Processing Units (NPU) and Compute-In-Memory (CIM) and explore diverse execution schemas to efficiently execute these hybrid models. We introduce H4H-NAS, a two-stage Neural Architecture Search (NAS) framework to automate the design of efficient hybrid CNN/ViT models for heterogeneous edge systems featuring both NPU and CIM. We propose a two-phase incremental supernet training in our NAS framework to resolve gradient conflicts between sampled subnets caused by different types of blocks in a hybrid model search space. Our H4H-NAS approach is also powered by a performance estimator built with NPU performance results measured on real silicon, and CIM performance based on industry IPs. H4H-NAS searches hybrid CNN-ViT models with fine granularity and achieves significant (up to 1.34%) top-1 accuracy improvement on ImageNet. Moreover, results from our algorithm/hardware co-design reveal up to 56.08% overall latency and 41.72% energy improvements by introducing heterogeneous computing over baseline solutions. Overall, our framework guides the design of hybrid network architectures and system architectures for NPU+CIM heterogeneous systems.
Free, publicly-accessible full text available January 20, 2026
Tartan: Microarchitecting a Robotic Processor

https://doi.org/10.1109/ISCA59077.2024.00047

Bakhshalipour, Mohammad; Gibbons, Phillip B (June 2024, IEEE)

Full Text Available
Atom: Low-Bit Quantization for Efficient and Accurate LLM Serving

Zhao, Yilong; Lin, Chien-Yu; Zhu, Kan; Ye, Zihao; Chen, Lequn; Zheng, Size; Ceze, Luis; Krishnamurthy, Arvind; Chen, Tianqi; Kasikci, Baris (May 2024, Proceedings of Machine Learning and Systems 2024)

Full Text Available
ACROBAT: Optimizing Auto-batching of Dynamic Deep Learning at Compile Time

Fegade, Pratik; Chen, Tianqi; Gibbons, Phillip; Mowry, Todd (May 2024, Proceedings of Machine Learning and Systems 2024)

Full Text Available
SpotServe: Serving Generative Large Language Models on Preemptible Instances

https://doi.org/10.1145/3620665.3640411

Miao, Xupeng; Shi, Chunan; Duan, Jiangfei; Xi, Xiaoli; Lin, Dahua; Cui, Bin; Jia, Zhihao (April 2024, ACM)

Full Text Available
SpecInfer: Accelerating Large Language Model Serving with Tree-based Speculative Inference and Verification

https://doi.org/10.1145/3620666.3651335

Miao, Xupeng; Oliaro, Gabriele; Zhang, Zhihao; Cheng, Xinhao; Wang, Zeyu; Zhang, Zhengxin; Wong, Rae Ying Yee; Zhu, Alan; Yang, Lijie; Shi, Xiaoxiang; et al (April 2024, ACM)

Full Text Available
