DRoP: Distributionally Robust Data Pruning

Vysogorets, Artem; Ahuja, Kartik; Kempe, Julia

This content will become publicly available on April 24, 2026

In the era of exceptionally data-hungry models, careful selection of the training data is essential to mitigate the extensive costs of deep learning. Data pruning offers a solution by removing redundant or uninformative samples from the dataset, which yields faster convergence and improved neural scaling laws. However, little is known about its impact on the classification bias of the trained models. We conduct the first systematic study of this effect and reveal that existing data pruning algorithms can produce highly biased classifiers. We present a theoretical analysis of the classification risk in a mixture of Gaussians to argue that choosing appropriate class pruning ratios, coupled with random pruning within classes, has the potential to improve worst-class performance. We thus propose DRoP, a distributionally robust approach to pruning, and empirically demonstrate its performance on standard computer vision benchmarks. In sharp contrast to existing algorithms, our proposed method continues improving distributional robustness at a tolerable drop in average performance as we prune more from the datasets.
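
The combination the abstract describes, per-class pruning ratios followed by random pruning within each class, can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how DRoP chooses the class ratios, so the `keep_fraction` mapping is a hypothetical input supplied by the caller, and the function name `prune_by_class` is likewise an assumption.

```python
import random
from collections import defaultdict


def prune_by_class(labels, keep_fraction, seed=0):
    """Return sorted indices of the samples retained after pruning.

    labels: sequence of class labels, one per sample.
    keep_fraction: dict mapping class label -> fraction of that class to keep.
        Hypothetical input: how these ratios are chosen is not specified here.
    """
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    kept = []
    for label, indices in by_class.items():
        # Per-class budget, then uniform random selection within the class.
        n_keep = round(keep_fraction[label] * len(indices))
        kept.extend(rng.sample(indices, n_keep))
    return sorted(kept)


# Usage: keep 30% of class 0 and 80% of class 1 (illustrative numbers only).
labels = [0] * 100 + [1] * 50
kept = prune_by_class(labels, {0: 0.3, 1: 0.8})
```

Keeping the within-class selection uniformly random means the only design decision left is the per-class budget, which is the quantity the abstract argues should be chosen carefully.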

Award ID(s): 1922658

PAR ID: 10649840

Author(s) / Creator(s): Vysogorets, Artem; Ahuja, Kartik; Kempe, Julia

Publisher / Repository: The International Conference on Learning Representations (ICLR 2025)

Date Published: 2025-04-24

Format(s): Medium: X

Sponsoring Org: National Science Foundation

Conference Paper: The DOI is not currently available.