DataComp-LM: In search of the next generation of training sets for language models

Li, Jeffrey; Fang, Alex; Smyrnis, Georgios; Ivgi, Maor; Jordan, Matt; Gadre, Samir; Bansal, Hritik; Guha, Etash; Keh, Sedrick; Arora, Kushal; Garg, Saurabh; Xin, Rui; Muennighoff, Niklas; Heckel, Reinhard; Mercat, Jean; Chen, Mayee; Gururangan, Suchin; Wortsman, Mitchell; Albalak, Alon; Bitton, Yonatan; Nezhurina, Marianna; Abbas, Amro; Hsieh, Cheng-Yu; Ghosh, Dhruba; Gardner, Josh; Kilian, Maciej; Zhang, Hanlin; Shao, Rulin; Pratt, Sarah; Sanyal, Sunny; Ilharco, Gabriel; Daras, Giannis; Marathe, Kalyani; Gokaslan, Aaron; Zhang, Jieyu; Chandu, Khyathi; Nguyen, Thao; Vasiljevic, Igor; Kakade, Sham; Song, Shuran; Sanghavi, Sujay; Faghri, Fartash; Oh, Sewoong; Zettlemoyer, Luke; Lo, Kyle; El-Nouby, Alaaeldin; Pouransari, Hadi; Toshev, Alexander; Wang, Stephanie; Groeneveld, Dirk; Soldaini, Luca; Koh, Pang_Wei; Jitsev, Jenia; Kollar, Thomas; Dimakis, Alexandros_G; Carmon, Yair; Dave, Achal; Schmidt, Ludwig; Shankar, Vaishaal

The authors introduce DataComp for Language Models (DCLM), a testbed for controlled dataset experiments aimed at improving language models. DCLM provides a standardized corpus of 240T tokens extracted from Common Crawl, effective pretraining recipes based on the OpenLM framework, and a broad suite of 53 downstream evaluations. Participants can experiment with data curation strategies such as deduplication, filtering, and data mixing at model scales ranging from 412M to 7B parameters. In their baseline experiments, the authors find that model-based filtering is key to assembling a high-quality training set. The resulting dataset, DCLM-Baseline, enables training a 7B parameter model from scratch to 64% 5-shot accuracy on MMLU with 2.6T training tokens. This is a 6.6 percentage point improvement on MMLU over MAP-Neo (the previous state of the art in open-data LMs), achieved with 40% less compute. The baseline model is also comparable to Mistral-7B-v0.3 and Llama 3 8B on MMLU (63% and 66%, respectively) and performs similarly on an average of 53 natural language understanding tasks, while using 6.6x less compute than Llama 3 8B. These findings emphasize the importance of dataset design for training LMs and establish a foundation for further research on data curation.

Award ID(s): 2505865

PAR ID: 10631930

Publisher / Repository: https://doi.org/10.48550/arXiv.2406.11794

Date Published: 2025-04-21

Format(s): Medium: X

Sponsoring Org: National Science Foundation
