NSF PAR Search | NSF Public Access Repository

ECLIP: Energy-efficient and Practical Co-Location of ML Inference on Spatially Partitioned GPUs

Quach, Ryan; Wang, Yidi; Jahanshahi, Ali; Wong, Daniel; Kim, Hyoseung (August 2025, IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED))

As AI inference becomes mainstream, research has begun to focus on improving the energy consumption of inference servers. Inference kernels commonly underutilize a GPU’s compute resources and waste power from idling components. To improve utilization and energy efficiency, multiple models can co-locate and share the GPU. However, typical GPU spatial partitioning techniques often experience significant overheads when reconfiguring spatial partitions, which can waste additional energy through repartitioning overheads or non-optimal partition configurations. In this paper, we present ECLIP, a framework to enable low-overhead energy-efficient kernel-wise resource partitioning between co-located inference kernels. ECLIP minimizes repartitioning overheads by pre-allocating pools of CU masked streams and assigns optimal CU assignments to groups of kernels through our resource allocation optimizer. Overall, ECLIP achieves an average of 13% improvement to throughput and 25% improvement to energy efficiency.
Free, publicly-accessible full text available August 6, 2026
PCCL: Energy-Efficient LLM Training with Power-Aware Collective Communication

https://doi.org/10.1109/ICCD63220.2024.00023

Jia, Ziyang; Bhuyan, Laxmi N; Wong, Daniel (November 2024, IEEE)

Full Text Available
Geographical Server Relocation: Opportunities and Challenges

Liu, Yejia; Li, Pengfei; Wong, Daniel; Ren, Shaolei (July 2024, 2024 HotCarbon Workshop on Sustainable Computer Systems)

The enormous growth of AI computing has led to a surging demand for electricity. To stem the resulting energy cost and environmental impact, this paper explores opportunities enabled by the increasing hardware heterogeneity and introduces the concept of Geographical Server Relocation (GSR). Specifically, GSR physically balances the available AI servers across geographically distributed data centers subject to AI computing demand and power capacity constraints in each location. The key idea of GSR is to relocate older and less energy-efficient servers to regions with more renewables, better water efficiencies and/or lower electricity prices. Our case study demonstrates that, even with modest flexibility of relocation, GSR can substantially reduce the total operational environmental footprints and operation costs of AI computing. We conclude this paper by discussing major challenges of GSR, including service migration, software management, and algorithms.
Full Text Available
Towards Explainable Monaural Speaker Separation with Auditory-based Training

Taherian, Hassan; Kalkhorani, Vahid Ahmadi; Pandey, Ashutosh; Wong, Daniel; Xu, Buye; Wang, DeLiang (September 2024, International Speech Communication Association)

Full Text Available
Characterizing In-Kernel Observability of Latency-Sensitive Request-Level Metrics with eBPF

https://doi.org/10.1109/ISPASS61541.2024.00013

Rezvani, Mohammadreza; Jahanshahi, Ali; Wong, Daniel (May 2024, IEEE)

Full Text Available
Leveraging Sound Localization to Improve Continuous Speaker Separation

https://doi.org/10.1109/ICASSP48485.2024.10446934

Taherian, Hassan; Pandey, Ashutosh; Wong, Daniel; Xu, Buye; Wang, DeLiang (April 2024, IEEE)

Continuous speaker separation aims to separate overlapping speakers in real-world environments like meetings, but it often falls short in isolating speech segments of a single speaker. This leads to split signals that adversely affect downstream applications such as automatic speech recognition and speaker diarization. Existing solutions like speaker counting have limitations. This paper presents a novel multi-channel approach for continuous speaker separation based on multi-input multi-output (MIMO) complex spectral mapping. This MIMO approach enables robust speaker localization by preserving inter-channel phase relations. Speaker localization as a byproduct of the MIMO separation model is then used to identify single-talker frames and reduce speaker splitting. We demonstrate that this approach achieves superior frame-level sound localization. Systematic experiments on the LibriCSS dataset further show that the proposed approach outperforms other methods, advancing state-of-the-art speaker separation performance.
Full Text Available
GCAPS: GPU Context-Aware Preemptive Priority-Based Scheduling for Real-Time Tasks

https://doi.org/10.4230/LIPIcs.ECRTS.2024.14

Wang, Yidi; Liu, Cong; Wong, Daniel; Kim, Hyoseung (January 2024, Schloss Dagstuhl – Leibniz-Zentrum für Informatik)
Pellizzoni, Rodolfo (Ed.)
Scheduling real-time tasks that utilize GPUs with analyzable guarantees poses a significant challenge due to the intricate interaction between CPU and GPU resources, as well as the complex GPU hardware and software stack. While much research has been conducted in the real-time research community, several limitations persist, including the absence or limited availability of GPU-level preemption, extended blocking times, and/or the need for extensive modifications to program code. In this paper, we propose GCAPS, a GPU Context-Aware Preemptive Scheduling approach for real-time GPU tasks. Our approach exerts control over GPU context scheduling at the device driver level and enables preemption of GPU execution based on task priorities by simply adding one-line macros to GPU segment boundaries. In addition, we provide a comprehensive response time analysis of GPU-using tasks for both our proposed approach as well as the default Nvidia GPU driver scheduling that follows a work-conserving round-robin policy. Through empirical evaluations and case studies, we demonstrate the effectiveness of the proposed approaches in improving taskset schedulability and response time. The results highlight significant improvements over prior work as well as the default scheduling approach, with up to 40% higher schedulability, while also achieving predictable worst-case behavior on Nvidia Jetson embedded platforms.
Full Text Available
WattWiser: Power Resource-Efficient Scheduling for Multi-Model Multi-GPU Inference Servers

Jahanshahi, Ali; Rezvani, Mohammadreza; Wong, Daniel (October 2023, 2023 IEEE 14th International Green and Sustainable Computing Conference (IGSC))

Full Text Available
Genome-scale resources in the infant gut symbiont Bifidobacterium breve reveal genetic determinants of colonization and host-microbe interactions

https://doi.org/10.1016/j.cell.2025.02.010

Shiver, Anthony L; Sun, Jiawei; Culver, Rebecca; Violette, Arvie; Wynter, Char; Nieckarz, Marta; Mattiello, Samara Paula; Sekhon, Prabhjot Kaur; Bottacini, Francesca; Friess, Lisa; et al (April 2025, Cell)

Bifidobacteria represent a dominant constituent of human gut microbiomes during infancy, influencing nutrition, immune development, and resistance to infection. Despite interest in bifidobacteria as a live biotic therapy, our understanding of colonization, host-microbe interactions, and the health-promoting effects of bifidobacteria is limited. To address these major knowledge gaps, we used a large-scale genetic approach to create a mutant fitness compendium in Bifidobacterium breve. First, we generated a high-density randomly barcoded transposon insertion pool and used it to determine fitness requirements during colonization of germ-free mice and chickens with multiple diets and in response to hundreds of in vitro perturbations. Second, to enable mechanistic investigation, we constructed an ordered collection of insertion strains covering 1,462 genes. We leveraged these tools to reveal community- and diet-specific requirements for colonization and to connect the production of immunomodulatory molecules to growth benefits. These resources will catalyze future investigations of this important beneficial microbe.
Free, publicly-accessible full text available April 1, 2026
Baleen: ML Admission & Prefetching for Flash Caches

Wong, Daniel Lin-Kit; Wu, Hao; Molder, Carson; Gunasekar, Sathya; Lu, Jimmy; Khandkar, Snehal; Sharma, Abhinav; Berger, Daniel S; Beckmann, Nathan; Ganger, Gregory R (February 2024, Usenix)

Flash caches are used to reduce peak backend load for throughput-constrained data center services, reducing the total number of backend servers required. Bulk storage systems are a large-scale example, backed by high-capacity but low-throughput hard disks, and using flash caches to provide a more cost-effective storage layer underlying everything from blobstores to data warehouses. However, flash caches must address the limited write endurance of flash by limiting the long-term average flash write rate to avoid premature wearout. To do so, most flash caches must use admission policies to filter cache insertions and maximize the workload-reduction value of each flash write. The Baleen flash cache uses coordinated ML admission and prefetching to reduce peak backend load. After learning painful lessons with our early ML policy attempts, we exploit a new cache residency model (which we call episodes) to guide model training. We focus on optimizing for an end-to-end system metric (Disk-head Time) that measures backend load more accurately than IO miss rate or byte miss rate. Evaluation using Meta traces from seven storage clusters shows that Baleen reduces Peak Disk-head Time (and hence the number of backend hard disks required) by 12% over state-of-the-art policies for a fixed flash write rate constraint. Baleen-TCO, which chooses an optimal flash write rate, reduces our estimated total cost of ownership (TCO) by 17%. Code and traces are available at https://www.pdl.cmu.edu/CILES/.
Full Text Available
