Title: Flextron: Many-in-One Flexible Large Language Model
Training modern large language models (LLMs) is extremely resource-intensive, and repeatedly customizing them for deployment scenarios with limited compute and memory is impractical. This paper introduces Flextron, a network architecture and post-training model optimization framework that supports flexible model deployment. Flextron uses a nested elastic structure that adapts rapidly to user-defined latency and accuracy targets during inference without requiring additional fine-tuning. It is also input-adaptive, automatically routing tokens through sub-networks for improved efficiency and performance. The authors propose a sample-efficient training method and routing algorithms to systematically transform an already-trained LLM into a Flextron model. Evaluation on the GPT-3 and LLaMA-2 families demonstrates Flextron’s superior performance over end-to-end trained variants and other state-of-the-art elastic networks, all with a single pretraining run that consumes only 7.63% of the tokens used in the original pretraining.
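The abstract gives no implementation details, but the core idea of a nested elastic layer with input-adaptive routing can be illustrated with a short PyTorch-style sketch. Everything below (the class and parameter names, the candidate widths, the pooled router) is an assumption made for illustration rather than Flextron's actual design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ElasticFFN(nn.Module):
    """Hypothetical nested elastic feed-forward layer: every smaller width is a
    prefix of the full weight matrices, so one parameter set serves all sub-networks."""

    def __init__(self, d_model=1024, d_ff=4096, widths=(1024, 2048, 4096)):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff)
        self.down = nn.Linear(d_ff, d_model)
        self.widths = widths                            # candidate hidden widths
        self.router = nn.Linear(d_model, len(widths))   # tiny learned router

    def forward(self, x, width=None):
        # x: (batch, seq, d_model)
        if width is None:
            # Input-adaptive: pick one candidate width from pooled routing logits
            # (a real router would act per token and per layer).
            logits = self.router(x.mean(dim=(0, 1)))
            width = self.widths[int(logits.argmax())]
        # Nested slicing: the first `width` rows/columns form the sub-network.
        h = F.gelu(F.linear(x, self.up.weight[:width], self.up.bias[:width]))
        return F.linear(h, self.down.weight[:, :width], self.down.bias)
```

Under this sketch, a single set of weights serves every deployment target: passing an explicit `width` selects a fixed sub-network for a given latency budget, while leaving it unset lets the router choose from the input.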
Award ID(s):
2505865
PAR ID:
10631936
Publisher / Repository:
https://doi.org/10.48550/arXiv.2406.10260
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Training modern LLMs is extremely resource-intensive, and customizing them for various deployment scenarios characterized by limited compute and memory resources through repeated training is impractical. In this paper, we introduce Flextron, a network architecture and post-training model optimization framework supporting flexible model deployment. The Flextron architecture utilizes a nested elastic structure to rapidly adapt to specific user-defined latency and accuracy targets during inference with no additional fine-tuning required. It is also input-adaptive, and can automatically route tokens through its sub-networks for improved performance and efficiency. We present a sample-efficient training method and associated routing algorithms for systematically transforming an existing trained LLM into a Flextron model. We evaluate Flextron on the GPT-3 and Llama-2 families of LLMs, and demonstrate superior performance over multiple end-to-end trained variants and other state-of-the-art elastic networks, all with a single pretraining run that consumes a mere 7.63% of the tokens used in the original pretraining.
  2. Language models (LMs) are pretrained to imitate internet text, including content that would violate human preferences if generated by an LM: falsehoods, offensive comments, personally identifiable information, low-quality or buggy code, and more. Here, we explore alternative objectives for pretraining LMs in a way that also guides them to generate text aligned with human preferences. We benchmark five objectives for pretraining with human feedback across three tasks and study how they affect the trade-off between alignment and capabilities of pretrained LMs. We find a Pareto-optimal and simple approach among those we explored: conditional training, or learning a distribution over tokens conditioned on their human preference scores given by a reward model. Conditional training reduces the rate of undesirable content by up to an order of magnitude, both when generating without a prompt and with an adversarially chosen prompt. Moreover, conditional training maintains the downstream task performance of standard LM pretraining, both before and after task-specific finetuning. Pretraining with human feedback results in much better preference satisfaction than standard LM pretraining followed by finetuning with feedback, i.e., learning and then unlearning undesirable behavior. Our results suggest that we should move beyond imitation learning when pretraining LMs and incorporate human preferences from the start of training.
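As a rough illustration of the conditional-training objective described in this abstract, the sketch below tags each training document with a control token derived from its reward-model score before the ordinary next-token loss is computed. The tag names, the threshold, and the `reward_model`/`tokenizer` callables (assumed to follow a Hugging Face-style interface) are illustrative assumptions, not the paper's exact recipe.

```python
GOOD, BAD = "<|good|>", "<|bad|>"   # hypothetical preference control tokens

def make_conditional_batch(texts, tokenizer, reward_model, threshold=0.0):
    """Prepend a control token chosen from each document's reward-model score,
    so the LM learns p(tokens | preference tag). The tag vocabulary and the
    threshold here are illustrative choices."""
    tagged = []
    for text in texts:
        score = reward_model(text)                       # scalar preference score
        tagged.append((GOOD if score >= threshold else BAD) + text)
    # The standard LM loss is then computed on the tagged sequences.
    return tokenizer(tagged, return_tensors="pt", padding=True)
```

At generation time, conditioning on the `<|good|>` tag is what steers sampling toward preferred text, consistent with the reported reduction in undesirable generations both with and without adversarial prompts.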
  3. Scalable methods for optical transmission performance prediction using machine learning (ML) are studied in metro reconfigurable optical add-drop multiplexer (ROADM) networks. A cascaded learning framework is introduced to encompass the use of cascaded component models for end-to-end (E2E) optical path prediction augmented with different combinations of E2E performance data and models. Additional E2E optical path data and models are used to reduce the prediction error accumulation in the cascade. Off-line training (pre-trained prior to deployment) and transfer learning are used for component-level erbium-doped fiber amplifier (EDFA) gain models to ensure scalability. Considering channel power prediction, we show that the data collection process of the pre-trained EDFA model can be reduced to only 5% of the original training set using transfer learning. We evaluate the proposed method under three different topologies with field deployed fibers and achieve a mean absolute error of 0.16 dB with a single (one-shot) E2E measurement on the deployed 6-span system with 12 EDFAs. 
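A minimal sketch of the cascaded-prediction idea, under assumed interfaces: each component along the optical path (EDFA, fiber span) exposes a model mapping an input power spectrum to an output spectrum, the cascade composes these models in path order, and an optional end-to-end correction model fit on a few E2E measurements absorbs the accumulated error. All names and signatures below are illustrative.

```python
from typing import Callable, Optional, Sequence
import numpy as np

Spectrum = np.ndarray  # per-channel powers in dBm (assumed representation)

def cascade_predict(launch: Spectrum,
                    component_models: Sequence[Callable[[Spectrum], Spectrum]],
                    e2e_correction: Optional[Callable[[Spectrum], Spectrum]] = None) -> Spectrum:
    """Compose per-component models (EDFA gain, span loss, ...) in path order,
    then optionally apply an E2E correction fit on one-shot measurements."""
    powers = launch
    for model in component_models:
        powers = model(powers)           # each stage maps input to output spectrum
    if e2e_correction is not None:
        powers = e2e_correction(powers)  # reduces error accumulated along the cascade
    return powers
```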
  4. In this paper, we investigate the effect of pretraining CNNs on ImageNet on their performance when refined for steganalysis of digital images. In many cases, it seems that just 'seeing' a large number of images helps with the convergence of the network during the refinement, no matter what the pretraining task is. To achieve the best performance, the pretraining task should be related to steganalysis, even if it is done on completely mismatched cover and stego datasets. Furthermore, the pretraining does not need to be carried out for very long and can be done with limited computational resources. An additional advantage of the pretraining is that it is done on color images and can later be applied for steganalysis of color and grayscale images while still having on-par or better performance than detectors trained specifically for a given source. The refining process is also much faster than training the network from scratch. The most surprising part of the paper is that networks pretrained on JPEG images are a good starting point for spatial-domain steganalysis as well.
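The refinement setup investigated in this abstract can be sketched as follows: start from an ImageNet-pretrained backbone, replace the classification head with a two-class cover-versus-stego output, and fine-tune the whole network. The backbone, head, and optimizer below are illustrative choices, not the paper's exact configuration.

```python
import torch
import torchvision

def build_refined_detector(num_classes=2, lr=1e-4):
    """Start from an ImageNet-pretrained CNN and refine it end-to-end as a
    cover-vs-stego classifier (backbone and hyperparameters are assumptions)."""
    net = torchvision.models.resnet18(weights="IMAGENET1K_V1")  # pretrained backbone
    net.fc = torch.nn.Linear(net.fc.in_features, num_classes)   # new cover/stego head
    optimizer = torch.optim.AdamW(net.parameters(), lr=lr)      # refine all layers
    return net, optimizer
```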