NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

Retraining with Predicted Hard Labels Provably Increases Model Accuracy

Das, Rudrajit; Dhillon, Inderjit_S; Epasto, Alessandro; Javanmard, Adel; Mao, Jieming; Mirrokni, Vahab; Sanghavi, Sujay; Zhong, Peilin (May 2025, https://doi.org/10.48550/arXiv.2406.11206)

Training with noisy labels often yields suboptimal performance, but retraining a model with its own predicted hard labels (binary 1/0 outputs) has been empirically shown to improve accuracy. This paper provides the first theoretical characterization of this phenomenon. In the setting of linearly separable binary classification with randomly corrupted labels, the authors prove that retraining can indeed improve the population accuracy compared to initial training with noisy labels. Retraining also has practical implications for local label differential privacy (DP), where models are trained with noisy labels. The authors propose consensus-based retraining, where retraining is done selectively on samples for which the predicted label matches the given noisy label. This approach significantly improves DP training accuracy at no additional privacy cost. For example, training ResNet-18 on CIFAR-100 with ε = 3 label DP achieves over 6% accuracy improvement with consensus-based retraining.
more » « less
Free, publicly-accessible full text available May 7, 2026
Infilling Score: A Pretraining Data Detection Algorithm for Large Language Models

Raoof, Negin; Rout, Litu; Daras, Giannis; Sanghavi, Sujay; Caramanis, Constantine; Shakkottai, Sanjay; Dimakis, Alex (March 2025, ICLR 2025)

In pretraining data detection, the goal is to detect whether a given sentence is in the dataset used for training a Large Language Model LLM). Recent methods (such as Min-K % and Min-K%++) reveal that most training corpora are likely contaminated with both sensitive content and evaluation benchmarks, leading to inflated test set performance. These methods sometimes fail to detect samples from the pretraining data, primarily because they depend on statistics composed of causal token likelihoods. We introduce Infilling Score, a new test-statistic based on non-causal token likelihoods. Infilling Score can be computed for autoregressive models without re-training using Bayes rule. A naive application of Bayes rule scales linearly with the vocabulary size. However, we propose a ratio test-statistic whose computation is invariant to vocabulary size. Empirically, our method achieves a significant accuracy gain over state-of-the-art methods including Min-K%, and Min-K%++ on the WikiMIA benchmark across seven models with different parameter sizes. Further, we achieve higher AUC compared to reference-free methods on the challenging MIMIR benchmark. Finally, we create a benchmark dataset consisting of recent data sources published after the release of Llama-3; this benchmark provides a statistical baseline to indicate potential corpora used for Llama-3 training.
more » « less
Free, publicly-accessible full text available March 26, 2026
Adaptive and Optimal Second-order Optimistic Methods for Minimax Optimization

Jiang, Ruichen; Kavis, Ali; Jin, Qiujiang; Sanghavi, Sujay; Mokhtari, Aryan (December 2024, NeurIPS proceedings)

Full Text Available
DataComp-LM: In search of the next generation of training sets for language models

Li, Jeffrey; Fang, Alex; Smyrnis, Georgios; Ivgi, Maor; Jordan, Matt; Gadre, Samir; Bansal, Hritik; Guha, Etash; Keh, Sedrick; Arora, Kushal; et al (April 2025, https://doi.org/10.48550/arXiv.2406.11794)

The authors introduce DataComp for Language Models (DCLM), a testbed for controlled dataset experiments aimed at improving language models. DCLM provides a standardized corpus of 240T tokens extracted from Common Crawl, effective pretraining recipes based on the OpenLM framework, and a broad suite of 53 downstream evaluations. Participants can experiment with dataset curation strategies such as deduplication, filtering, and data mixing at model scales ranging from 412M to 7B parameters. As a baseline, the authors find that model-based filtering is critical for assembling a high-quality training set. Their resulting dataset, DCLM-Baseline, enables training a 7B parameter model from scratch to achieve 64% 5-shot accuracy on MMLU with 2.6T training tokens. This represents a 6.6 percentage point improvement over MAP-Neo (the previous state-of-the-art in open-data LMs), while using 40% less compute. The baseline model is also comparable to Mistral-7B-v0.3 and Llama 3 8B on MMLU (63% and 66%), and performs similarly on an average of 53 NLU tasks, while using 6.6x less compute than Llama 3 8B. These findings emphasize the importance of dataset design for training LMs and establish a foundation for further research on data curation.
more » « less
Free, publicly-accessible full text available April 21, 2026
On the Benefits of Multiple Gossip Steps in Communication-Constrained Decentralized Federated Learning

https://doi.org/10.1109/TPDS.2021.3138977

Hashemi, Abolfazl; Acharya, Anish; Das, Rudrajit; Vikalo, Haris; Sanghavi, Sujay; Dhillon, Inderjit S. (November 2022, IEEE Transactions on Parallel and Distributed Systems)

Full Text Available
Asymptotically-Optimal Gaussian Bandits with Side Observations

Atsidakou, Alexia; Papadigenopoulos, Orestis; Caramanis, Constantine; Sanghavi, Sujay; Shakkottai, Sanjay (January 2022, Proceedings of the 39th International Conference on Machine Learning)

Full Text Available
Nearly Horizon-Free Offline Reinforcement Learning

Ren, Tongzheng; Li, Jianlian; Dai, Bo; Du, Simon S; Sanghavi, Sujay (January 2021, Conference on Neural Information Processing Systems)

Full Text Available
Learning Distributions Generated by One-Layer ReLU Networks

Wu, Shanshan; Dimakis, A.G.; Sanghavi, Sujay (December 2019, Advances in neural information processing systems)

Full Text Available
Blocking Bandits

Basu, Soumya; Sen, Rajat; Sanghavi, Sujay; Shakkottai, Sanjay (January 2019, 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.)

Full Text Available
Learning a Compressed Sensing Measurement Matrix via Gradient Unrolling

Wu, Shanshan; Dimakis, Alexandros; Sanghavi, Sujay; Yu, Felix; Holtmann-Rice, Daniel; Storcheus, Dmitry; Rostamizadeh, Afshin; Kumar, Sanjiv (June 2019, International Conference on Machine Learning (ICML))

Linear encoding of sparse vectors is widely popular, but is commonly data-independent – missing any possible extra (but a priori unknown) structure beyond sparsity. In this paper we present a new method to learn linear encoders that adapt to data, while still performing well with the widely used l1 decoder. The convex l1 decoder prevents gradient propagation as needed in standard gradient-based training. Our method is based on the insight that unrolling the convex decoder into T projected subgradient steps can address this issue. Our method can be seen as a data-driven way to learn a compressed sensing measurement matrix. We compare the empirical performance of 10 algorithms over 6 sparse datasets (3 synthetic and 3 real). Our experiments show that there is indeed additional structure beyond sparsity in the real datasets; our method is able to discover it and exploit it to create excellent reconstructions with fewer measurements (by a factor of 1.1-3x) compared to the previous state-of-the-art methods. We illustrate an application of our method in learning label embeddings for extreme multi-label classification, and empirically show that our method is able to match or outperform the precision scores of SLEEC, which is one of the state-of-the-art embedding-based approaches.
more » « less
Full Text Available

« Prev Next »

Search for: All records