Search for: All records
Total Resources: 4
-
Finetuning large language models (LLMs) is essential for task adaptation, yet today’s serving stacks isolate inference and finetuning on separate GPU clusters, wasting resources and under-utilizing hardware. We introduce FlexLLM, the first system to co-serve LLM inference and PEFT-based finetuning on shared GPUs by fusing computation at the token level. FlexLLM's static compilation optimizations (dependent parallelization and graph pruning) significantly shrink activation memory, yielding end-to-end GPU memory savings of up to 80%. At runtime, a novel token-level finetuning mechanism paired with a hybrid token scheduler dynamically interleaves inference and training tokens within each co-serving iteration, meeting strict latency SLOs while maximizing utilization. In end-to-end benchmarks on LLaMA-3.1-8B, Qwen-2.5-14B, and Qwen-2.5-32B, FlexLLM maintains inference SLO compliance at up to 20 req/s, and improves finetuning throughput by 1.9-4.8x under heavy inference workloads and 2.5-6.8x under light loads, preserving over 76% of peak finetuning progress even at peak demand.
-
Langlet, Jonatan; Ben-Basat, Ran; Oliaro, Gabriele; Mitzenmacher, Michael; Yu, Minlan; Antichi, Gianni (Proceedings of ACM SIGCOMM)
Fine-grained network telemetry is becoming a modern datacenter standard and is the basis of essential applications such as congestion control, load balancing, and advanced troubleshooting. As network size increases and telemetry gets more fine-grained, there is tremendous growth in the amount of data that must be reported from switches to collectors to enable a network-wide view. As a consequence, it is increasingly hard to scale data collection systems. We introduce Direct Telemetry Access (DTA), a solution optimized for aggregating and moving hundreds of millions of reports per second from switches into queryable data structures in collectors' memory. DTA is lightweight and greatly reduces overheads at collectors. DTA is built on top of RDMA, and we propose novel and expressive reporting primitives to allow easy integration with existing state-of-the-art telemetry mechanisms such as INT or Marple. We show that DTA significantly improves telemetry collection rates. For example, when used with INT, it can collect and aggregate over 400M reports per second with a single server, improving over the Atomic MultiLog by up to 16x.
-
Miao, Xupeng; Oliaro, Gabriele; Zhang, Zhihao; Cheng, Xinhao; Wang, Zeyu; Zhang, Zhengxin; Wong, Rae_Ying Yee; Zhu, Alan; Yang, Lijie; Shi, Xiaoxiang; et al. (ACM)
-
Langlet, Jonatan; Ben-Basat, Ran; Ramanathan, Sivaramakrishnan; Oliaro, Gabriele; Mitzenmacher, Michael; Yu, Minlan; Antichi, Gianni (HotNets '21: Proceedings of the Twentieth ACM Workshop on Hot Topics in Networks)