NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

DataComp-LM: In search of the next generation of training sets for language models

Li, Jeffrey; Fang, Alex; Smyrnis, Georgios; Ivgi, Maor; Jordan, Matt; Gadre, Samir; Bansal, Hritik; Guha, Etash; Keh, Sedrick; Arora, Kushal; et al (April 2025, https://doi.org/10.48550/arXiv.2406.11794)

The authors introduce DataComp for Language Models (DCLM), a testbed for controlled dataset experiments aimed at improving language models. DCLM provides a standardized corpus of 240T tokens extracted from Common Crawl, effective pretraining recipes based on the OpenLM framework, and a broad suite of 53 downstream evaluations. Participants can experiment with dataset curation strategies such as deduplication, filtering, and data mixing at model scales ranging from 412M to 7B parameters. As a baseline, the authors find that model-based filtering is critical for assembling a high-quality training set. Their resulting dataset, DCLM-Baseline, enables training a 7B parameter model from scratch to achieve 64% 5-shot accuracy on MMLU with 2.6T training tokens. This represents a 6.6 percentage point improvement over MAP-Neo (the previous state-of-the-art in open-data LMs), while using 40% less compute. The baseline model is also comparable to Mistral-7B-v0.3 and Llama 3 8B on MMLU (63% and 66%), and performs similarly on an average of 53 NLU tasks, while using 6.6x less compute than Llama 3 8B. These findings emphasize the importance of dataset design for training LMs and establish a foundation for further research on data curation.
more » « less
Free, publicly-accessible full text available April 21, 2026
Serverless Straggler Mitigation using Error-Correcting Codes

https://doi.org/10.1109/ICDCS47774.2020.00019

Gupta, Vipul; Carrano, Dominic; Yang, Yaoqing; Shankar, Vaishaal; Courtade, Thomas; Ramchandran, Kannan (November 2020, 2020 IEEE 40th International Conference on Distributed Computing Systems (ICDCS))
null (Ed.)
Full Text Available
Serverless linear algebra

https://doi.org/10.1145/3419111.3421287

Shankar, Vaishaal; Krauth, Karl; Vodrahalli, Kailas; Pu, Qifan; Recht, Benjamin; Stoica, Ion; Ragan-Kelley, Jonathan; Jonas, Eric; Venkataraman, Shivaram (October 2020, SoCC '20: Proceedings of the 11th ACM Symposium on Cloud Computing)
null (Ed.)
Datacenter disaggregation provides numerous benefits to both the datacenter operator and the application designer. However switching from the server-centric model to a disaggregated model requires developing new programming abstractions that can achieve high performance while benefiting from the greater elasticity. To explore the limits of datacenter disaggregation, we study an application area that near-maximally benefits from current server-centric datacenters: dense linear algebra. We build NumPyWren, a system for linear algebra built on a disaggregated serverless programming model, and LAmbdaPACK, a companion domain-specific language designed for serverless execution of highly parallel linear algebra algorithms. We show that, for a number of linear algebra algorithms such as matrix multiply, singular value decomposition, Cholesky decomposition, and QR decomposition, NumPyWren's performance (completion time) is within a factor of 2 of optimized server-centric MPI implementations, and has up to 15% greater compute efficiency (total CPU-hours), while providing fault tolerance.
more » « less
Full Text Available
A Meta-Analysis of Overfitting in Machine Learning

Roelofs, Rebecca; Shankar, Vaishaal; Recht, Benjamin; Fridovich-Keil, Sara; Hardt, Moritz; Miller, John; Schmidt, Ludwig (January 2019, Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019)

Full Text Available

Search for: All records