Youmu: Efficient Columnar Data Pipeline for LLM Training

Zhong, Tianle; Zhao, Jiechen; Su, Qiang; Fox, Geoffrey

Citation Details

This content will become publicly available on February 11, 2026

Youmu: Efficient Columnar Data Pipeline for LLM Training

Large language models (LLMs) training is extremely data-intensive, often involving over trillion-level tokens. Although LLM datasets are usually ingested and stored in columnar formats, they often need to be converted into another format for training, which incurs significant storage and maintenance costs due to extra data copies. While eliminating the conversion would save tens of terabytes of space in costly high performance storage, this work identifies challenges that drive us to re-think the entire data pipeline. Without conversion, we find that fine-grained random access patterns incur hundreds of times efficiency drops. Specifically, the existing data pipelines have two fundamental drawbacks: (1) They cannot efficiently support directly digesting data in columnar format due to default coarse-grained I/O; (2) Solutions to the first drawback sacrifice memory footprint to cache datasets. In this paper, we present Youmu, a new data pipeline that directly feeds fine-grained columnar data into GPUs, enabling cost-efficient LLM training. Meanwhile, Youmu maintains high training accuracy, whose perplexity outperforms widely adopted local shuffle by reducing 0.3-0.7 for pretraining. Compared to performance-optimal state-of-the-art, distributed memory-based pipelines, Youmu achieves comparable throughput with 80% less memory footprint. more »

Award ID(s):: 2411009

PAR ID:: 10634827

Author(s) / Creator(s):: Zhong, Tianle; Zhao, Jiechen; Su, Qiang; Fox, Geoffrey

Publisher / Repository:: https://openreview.net/forum?id=I2LF8QHaua

Date Published:: 2025-02-11

Format(s):: Medium: X

Location:: Eighth Conference on Machine Learning and Systems MLSys 2025 Santa Clara

Sponsoring Org:: National Science Foundation

Free Publicly Accessible Full Text
This content will become publicly available on February 11, 2026
Conference Paper:
The DOI is not currently available.

More Like this