Training modern LLMs is extremely resource-intensive, and customizing them through repeated training for deployment scenarios with limited compute and memory resources is impractical. In this paper, we introduce Flextron, a network architecture and post-training model optimization framework supporting flexible model deployment. The Flextron architecture utilizes a nested elastic structure to rapidly adapt to specific user-defined latency and accuracy targets during inference, with no additional fine-tuning required. It is also input-adaptive, and can automatically route tokens through its sub-networks for improved performance and efficiency. We present a sample-efficient training method and associated routing algorithms for systematically transforming an existing trained LLM into a Flextron model. We evaluate Flextron on the GPT-3 and LLaMA-2 families of LLMs, and demonstrate superior performance over multiple end-to-end trained variants and other state-of-the-art elastic networks, all with a single pretraining run that consumes a mere 7.63% of the tokens used in the original pretraining.
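To make the input-adaptive routing idea concrete, here is a minimal sketch of a router choosing among nested feed-forward widths. The class names, the width set, and the hard argmax routing are illustrative assumptions for this sketch, not Flextron's actual design or training procedure.

```python
import torch
import torch.nn as nn

class ElasticFFN(nn.Module):
    """Nested feed-forward block: narrower widths reuse the leading rows/columns
    of the full weight matrices, so one set of weights serves every width."""
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff)
        self.down = nn.Linear(d_ff, d_model)

    def forward(self, x, width):
        h = torch.relu(x @ self.up.weight[:width].T + self.up.bias[:width])
        return h @ self.down.weight[:, :width].T + self.down.bias

class TokenRouter(nn.Module):
    """Maps each token's hidden state to one of the candidate widths."""
    def __init__(self, d_model=512, widths=(512, 1024, 2048)):
        super().__init__()
        self.widths = widths
        self.proj = nn.Linear(d_model, len(widths))

    def forward(self, x):
        return self.proj(x).argmax(dim=-1)  # hard per-token choice at inference

# Hypothetical usage: route each token, then run it at its chosen width.
ffn, router = ElasticFFN(), TokenRouter()
tokens = torch.randn(4, 512)
for tok, idx in zip(tokens, router(tokens)):
    out = ffn(tok.unsqueeze(0), router.widths[idx])
```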
This content will become publicly available on May 12, 2026
Lumos: Efficient Performance Modeling and Estimation for Large-scale LLM Training
Training LLMs in distributed environments presents significant challenges due to the complexity of model execution, deployment systems, and the vast space of configurable strategies. Although various optimization techniques exist, achieving high efficiency in practice remains difficult. Accurate performance models that effectively characterize and predict a model’s behavior are essential for guiding optimization efforts and system-level studies. We propose Lumos, a trace-driven performance modeling and estimation toolkit for large-scale LLM training, designed to accurately capture and predict the execution behaviors of modern LLMs. We evaluate Lumos on a production ML cluster with up to 512 NVIDIA H100 GPUs using various GPT-3 variants, demonstrating that it can replay execution time with an average error of just 3.3%, along with other runtime details, across different models and configurations. Additionally, we validate its ability to estimate performance for new setups from existing traces, facilitating efficient exploration of model and deployment configurations.
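As a rough illustration of trace-driven estimation, the sketch below replays per-stream operator durations from a hypothetical trace and bounds step time by the slowest stream. Lumos models much richer runtime behavior (dependencies, overlap, communication), so this is only a toy approximation with made-up event names.

```python
from collections import defaultdict

def replay_step_time(trace_events):
    """Accumulate durations per stream and let the slowest stream bound the
    estimated iteration time (assumes perfect overlap across streams and
    purely sequential execution within each stream)."""
    per_stream = defaultdict(float)
    for ev in trace_events:
        per_stream[ev["stream"]] += ev["duration_us"]
    return max(per_stream.values()) if per_stream else 0.0

# Hypothetical trace: compute kernels overlapping with a collective.
events = [
    {"stream": "compute", "duration_us": 1200.0},
    {"stream": "compute", "duration_us": 800.0},
    {"stream": "comm", "duration_us": 1500.0},
]
print(replay_step_time(events))  # 2000.0
```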
- Award ID(s):
- 2326182
- PAR ID:
- 10598180
- Publisher / Repository:
- MLSys
- Date Published:
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
The burgeoning sophistication of Artificial Intelligence (AI) has catalyzed the rapid proliferation of Large Language Models (LLMs) within software development. These models are increasingly employed to automate the generation of functionally correct code, address complex computational problems, and facilitate the debugging of existing software systems. However, LLM-generated code often faces challenges due to inherent inefficiencies, including redundant logical structures, factually inconsistent content (hallucinations), and programming errors. To address these issues, our research rigorously evaluated the computational efficiency of Python code generated by three prominent LLMs: GPT-4o-Mini, GPT-3.5-Turbo, and GPT-4-Turbo. The evaluation metrics encompass execution time, memory utilization, and peak memory consumption, while maintaining the functional correctness of the generated code. Leveraging the EffiBench benchmark datasets within the Google Vertex AI Workbench environment, across a spectrum of machine configurations, the study implemented a consistent seed parameter to ensure experimental reproducibility. Furthermore, we investigated the impact of two distinct optimization strategies: Chain-of-Thought (CoT) prompting and model fine-tuning. Our findings reveal a significant enhancement in efficiency metrics for GPT-4o-Mini and GPT-3.5-Turbo when employing CoT prompting; however, this trend was not observed for GPT-4-Turbo. Based on its promising performance with CoT prompting, we selected the GPT-4o-Mini model for subsequent fine-tuning, aiming to further enhance both its computational efficiency and accuracy. However, contrary to our expectations, fine-tuning the GPT-4o-Mini model led to a discernible degradation in both its accuracy and computational efficiency. In conclusion, this study provides empirical evidence suggesting that the deployment of high-CPU machine configurations, in synergy with the utilization of the GPT-4o-Mini model and CoT prompting techniques, yields demonstrably more efficient and accurate LLM-generated Python code, particularly within computationally intensive application scenarios.
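For context, a minimal harness like the one below can collect comparable metrics (wall-clock time and peak Python heap usage) for a generated function. It is not the EffiBench tooling; the two_sum candidate and the repeat count are made-up examples.

```python
import time
import tracemalloc

def profile_solution(fn, *args, repeats=5):
    """Average wall-clock time and peak traced memory of a candidate solution.
    Only Python-level allocations are captured; native memory is not."""
    tracemalloc.start()
    start = time.perf_counter()
    for _ in range(repeats):
        result = fn(*args)
    elapsed = (time.perf_counter() - start) / repeats
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return {"result": result, "avg_seconds": elapsed, "peak_bytes": peak}

# Hypothetical LLM-generated candidate to profile.
def two_sum(nums, target):
    seen = {}
    for i, x in enumerate(nums):
        if target - x in seen:
            return [seen[target - x], i]
        seen[x] = i

print(profile_solution(two_sum, [2, 7, 11, 15], 9))
```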
-
Training modern large language models (LLMs) is extremely resource-intensive, and repeatedly customizing them for deployment scenarios with limited compute and memory is impractical. This paper introduces Flextron, a network architecture and post-training model optimization framework that supports flexible model deployment. Flextron uses a nested elastic structure that adapts rapidly to user-defined latency and accuracy targets during inference without requiring additional fine-tuning. It is also input-adaptive, automatically routing tokens through sub-networks for improved efficiency and performance. The authors propose a sample-efficient training method and routing algorithms to systematically transform an already-trained LLM into a Flextron model. Evaluation on the GPT-3 and LLaMA-2 families demonstrates Flextron's superior performance over end-to-end trained variants and other state-of-the-art elastic networks, all with a single pretraining run that consumes only 7.63% of the tokens used in the original pretraining.
-
With the increasing integration of machine learning into IoT devices, managing energy consumption and data transmission has become a critical challenge. Many IoT applications depend on complex computations performed on server-side infrastructure, necessitating efficient methods to reduce unnecessary data transmission. One promising solution involves deploying compact machine learning models near sensors, enabling intelligent identification and transmission of only relevant data frames. However, existing near-sensor models lack adaptability, as they require extensive pre-training and are often rigidly configured prior to deployment. This paper proposes a novel framework that fuses online learning, active learning, and knowledge distillation to enable adaptive, resource-efficient near-sensor intelligence. Our approach allows near-sensor models to dynamically fine-tune their parameters post-deployment using online learning, eliminating the need for extensive pre-labeling and training. Through a sequential training and execution process, the framework achieves continuous adaptability without prior knowledge of the deployment environment. To enhance performance while preserving model efficiency, we integrate knowledge distillation, enabling the transfer of critical insights from a larger teacher model to a compact student model. Additionally, active learning reduces the required training data while maintaining competitive performance. We validated our framework both on benchmark data from the MS COCO dataset and in a simulated IoT environment. The results demonstrate significant improvements in energy efficiency and data transmission optimization, highlighting the practical applicability of our method in real-world IoT scenarios.
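The standard formulations of the distillation and active-learning pieces mentioned above look roughly like the sketch below. The temperature, weighting, and least-confidence sampler are common defaults assumed for illustration, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Weighted sum of cross-entropy on hard labels and KL divergence to the
    teacher's temperature-softened distribution (classic knowledge distillation)."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

def select_uncertain(logits, k):
    """Toy active-learning step: pick the k frames the student is least
    confident about (lowest maximum softmax probability)."""
    confidence = F.softmax(logits, dim=-1).max(dim=-1).values
    return torch.topk(-confidence, k).indices

# Hypothetical batch of student/teacher logits over 10 classes.
s, t = torch.randn(8, 10), torch.randn(8, 10)
y = torch.randint(0, 10, (8,))
loss = distillation_loss(s, t, y)
picked = select_uncertain(s, k=3)
```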
-
The carbon footprint associated with large language models (LLMs) is a significant concern, encompassing emissions from their training, inference, experimentation, and storage processes, including operational and embodied carbon emissions. An essential aspect is accurately estimating the carbon impact of emerging LLMs even before their training, which heavily relies on GPU usage. Existing studies have reported the carbon footprint of LLM training, but only one tool, mlco2, can predict the carbon footprint of new neural networks prior to physical training. However, mlco2 has several serious limitations. It cannot extend its estimation to dense or mixture-of-experts (MoE) LLMs, disregards critical architectural parameters, focuses solely on GPUs, and cannot model embodied carbon footprints. Addressing these gaps, we introduce LLMCarbon, an end-to-end carbon footprint projection model designed for both dense and MoE LLMs. Compared to mlco2, LLMCarbon significantly enhances the accuracy of carbon footprint estimations for various LLMs. The source code is released at https://github.com/SotaroKaneda/MLCarbon.
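For a sense of what such a projection involves, a back-of-the-envelope operational-carbon estimate is sketched below. The power draw, PUE, and grid carbon intensity are placeholder assumptions, not parameters from LLMCarbon, and embodied carbon is omitted entirely.

```python
def operational_carbon_kg(num_gpus, hours, avg_power_w=700.0,
                          pue=1.1, grid_kgco2_per_kwh=0.4):
    """Energy in kWh scaled by datacenter PUE and grid carbon intensity.
    All constants here are illustrative placeholders."""
    energy_kwh = num_gpus * hours * avg_power_w / 1000.0
    return energy_kwh * pue * grid_kgco2_per_kwh

# e.g., a hypothetical 30-day run on 512 GPUs:
print(round(operational_carbon_kg(512, 30 * 24), 1), "kg CO2e")
```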