NSF PAR Search | NSF Public Access Repository

An Empirical Study of Microscaling Formats for Low-Precision LLM Training

Yang, Hanmei; Deng, Summer; Nagpal, Amit; Naumov, Maxim; Janani, Mohammad; Liu, Tongping; Guan, Hui (April 2025, 2025 IEEE 32nd Symposium on Computer Arithmetic (ARITH))

Free, publicly-accessible full text available April 17, 2026
AdapMTL: Adaptive Pruning Framework for Multitask Learning Model

Xiang, Mingcan; Tang, Jiaxun; Yang, Qizheng; Guan, Hui; Liu, Tongping (October 2024, 2024 ACM Multimedia)

Full Text Available
Understanding and Alleviating Memory Consumption in RLHF for LLMs

Zhou, Jin; Yang, Hanmei; Tang, Steven; Xiang, Mingcan; Guan, Hui; Liu, Tongping (October 2024, Machine Learning for Systems Workshop at NeurIPS 2024)

Full Text Available
Scaler: Efficient and Effective Cross Flow Analysis

https://doi.org/10.1145/3691620.3695473

Tang, Steven Jiaxun; Xiang, Mingcan; Wang, Yang; Wu, Bo; Chen, Jianjun; Liu, Tongping (October 2024, ACM)

Full Text Available
Improving Resource and Energy Efficiency for Cloud 3D through Excessive Rendering Reduction

https://doi.org/10.1145/3627703.3650064

Liu, Tianyi; Lucas, Jerry; He, Sen; Liu, Tongping; Wang, Xiaoyin; Wang, Wei (April 2024, EuroSys '24: Proceedings of the Nineteenth European Conference on Computer Systems)
NUMAlloc: A Faster NUMA Memory Allocator

https://doi.org/10.1145/3591195.3595276

Yang, Hanmei; Zhao, Xin; Zhou, Jin; Wang, Wei; Kundu, Sandip; Wu, Bo; Liu, Tongping (June 2023, ACM SIGPLAN International Symposium on Memory Management)

The NUMA architecture accommodates the hardware trend of an increasing number of CPU cores. It requires the cooperation of memory allocators to achieve good performance for multithreaded applications. Unfortunately, existing allocators do not support NUMA architecture well. This paper presents a novel memory allocator – NUMAlloc, that is designed for the NUMA architecture. NUMAlloc is centered on a binding-based memory management. On top of it, NUMAlloc proposes an “origin-aware memory management” to ensure the locality of memory allocations and deallocations, as well as a method called “incremental sharing” to balance the performance benefits and memory overhead of using transparent huge pages. According to our extensive evaluation, NUMAlloc has the best performance among all evaluated allocators, running 15.7% faster than the second-best allocator (mimalloc), and 20.9% faster than the default Linux allocator with reasonable memory overhead. NUMAlloc is also scalable to 128 threads and is ready for deployment.
Full Text Available
NUMAlloc: A Faster NUMA Memory Allocator

https://doi.org/10.1145/3591195.3595276

Yang, Hanmei; Zhao, Xin; Zhou, Jin; Wang, Wei; Kundu, Sandip; Wu, Bo; Guan, Hui; Liu, Tongping (June 2023, ACM)

The NUMA architecture accommodates the hardware trend of an increasing number of CPU cores. It requires the cooperation of memory allocators to achieve good performance for multithreaded applications. Unfortunately, existing allocators do not support NUMA architecture well. This paper presents a novel memory allocator – NUMAlloc, that is designed for the NUMA architecture. NUMAlloc is centered on a binding-based memory management. On top of it, NUMAlloc proposes an “origin-aware memory management” to ensure the locality of memory allocations and deallocations, as well as a method called “incremental sharing” to balance the performance benefits and memory overhead of using transparent huge pages. According to our extensive evaluation, NUMAlloc has the best performance among all evaluated allocators, running 15.7% faster than the second-best allocator (mimalloc), and 20.9% faster than the default Linux allocator with reasonable memory overhead. NUMAlloc is also scalable to 128 threads and is ready for deployment.
Deadlock prediction via generalized dependency

https://doi.org/10.1145/3533767.3534377

Zhou, Jinpeng; Yang, Hanmei; Lange, John; Liu, Tongping (July 2022, Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis)

Deadlocks are notorious bugs in multithreaded programs, causing serious reliability issues. However, they are difficult to fully expunge before deployment, as their appearance typically depends on specific inputs and thread schedules, so finding them requires the assistance of dynamic tools. Existing deadlock detection tools mainly focus on locks and cannot detect deadlocks related to condition variables. This paper presents a novel approach to fill this gap. It extends the classic lock dependency to generalized dependency by abstracting the signal for the condition variable as a special resource so that communication deadlocks can be modeled as hold-and-wait cycles as well. It further designs multiple practical mechanisms to record and analyze generalized dependencies. In the end, this paper presents the implementation of the tool, called UnHang. Experimental results on real applications show that UnHang is able to find all known deadlocks and uncover two new deadlocks. Overall, UnHang only imposes around 3% performance overhead and 8% memory overhead, making it a practical tool for the deployment environment.
Full Text Available
