In the era of exceptionally data-hungry models, careful selection of the training data is essential to mitigate the extensive costs of deep learning. Data pruning offers a solution by removing redundant or uninformative samples from the dataset, which yields faster convergence and improved neural scaling laws. However, little is known about its impact on the classification bias of the trained models. We conduct the first systematic study of this effect and reveal that existing data pruning algorithms can produce highly biased classifiers. At the same time, we argue that random data pruning with appropriate class ratios has the potential to improve worst-class performance. We propose a "fairness-aware" approach to pruning and empirically demonstrate its performance on standard computer vision benchmarks. In sharp contrast to existing algorithms, our proposed method continues to improve robustness at a tolerable drop in average performance as we prune more of the dataset.
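The class-ratio idea above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the optional per-class weights, and the uniform default are all assumptions, and it assumes integer class labels.

```python
import numpy as np

def fairness_aware_random_prune(labels, keep_fraction, class_ratios=None, seed=0):
    """Randomly prune a dataset while enforcing per-class keep budgets.

    labels: 1-D integer array of class labels.
    keep_fraction: overall fraction of samples to keep.
    class_ratios: optional dict class -> relative weight (uniform if None);
        up-weighting a class lets more of its samples survive pruning.
    Returns sorted indices of the retained samples.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    if class_ratios is None:
        class_ratios = {c: 1.0 for c in classes}
    total_keep = int(round(keep_fraction * len(labels)))
    weights = np.array([class_ratios[c] for c in classes], dtype=float)
    weights /= weights.sum()
    kept = []
    for c, w in zip(classes, weights):
        idx = np.flatnonzero(labels == c)
        # each class gets a share of the keep budget set by its weight,
        # capped at the number of samples it actually has
        n_keep = min(len(idx), int(round(w * total_keep)))
        kept.append(rng.choice(idx, size=n_keep, replace=False))
    return np.sort(np.concatenate(kept))
```

With uniform weights on an imbalanced dataset, minority classes keep a larger fraction of their samples than majority classes, which is one way a class-ratio constraint can protect worst-class performance.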
This content will become publicly available on May 6, 2026
Minimizing Data, Maximizing Performance: Generative Examples for Continual Task Learning
Synthetic data is emerging as a powerful tool in computer vision, offering advantages in privacy and security. As generative AI models advance, they enable the creation of large-scale, diverse datasets that eliminate concerns related to sensitive data sharing and costly data collection. However, fundamental questions arise: (1) Can synthetic data replace natural data in a continual learning (CL) setting? (2) How much synthetic data is sufficient to achieve a desired performance? (3) How well does a network trained on synthetic data generalize? To address these questions, we propose a sample minimization strategy for CL that enhances efficiency, generalization, and robustness by selectively removing uninformative or redundant samples during the training phase. We apply this method to a sequence of tasks derived from the GenImage dataset [35]. This setting allows us to train early tasks entirely on synthetic data and to analyze how well the acquired knowledge transfers to subsequent tasks or to evaluation on natural images. Furthermore, our method allows us to investigate the impact of removing potentially incorrect, redundant, or harmful training samples. We aim to maximize CL efficiency by removing uninformative images and to enhance robustness through adversarial training and data removal. We also study how the training order of synthetic and natural data, as well as the choice of generative model, affects CL performance and the amount of natural data required. Our findings provide key insights into how generative examples can be used for adaptive, efficient CL in evolving environments.
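The sample-removal step can be sketched as a simple score-and-drop rule. The scoring criterion here (a generic per-sample informativeness score, e.g. average training loss) is a hypothetical stand-in; the paper's actual selection criterion may differ.

```python
import numpy as np

def minimize_training_set(scores, drop_fraction):
    """Drop the lowest-scoring (least informative) samples.

    scores: per-sample informativeness values, higher = more informative
        (e.g. average loss; an illustrative choice, not the paper's).
    drop_fraction: fraction of the dataset to remove.
    Returns sorted indices of the samples to keep.
    """
    scores = np.asarray(scores)
    n_drop = int(round(drop_fraction * len(scores)))
    order = np.argsort(scores)      # ascending: least informative first
    return np.sort(order[n_drop:])  # keep everything above the cutoff
```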
- PAR ID: 10643929
- Publisher / Repository: Synthetic Data for Computer Vision Workshop at CVPR 2025
- Sponsoring Org: National Science Foundation
More Like this
Contrastive learning (CL), a self-supervised learning approach, can effectively learn visual representations from unlabeled data. Given the CL training data, generative models can be trained to generate synthetic data that supplements the real data. Using both synthetic and real data for CL training has the potential to improve the quality of the learned representations. However, synthetic data is usually of lower quality than real data, and naively adding it may not improve CL compared with using real data alone. To tackle this problem, we propose a data generation framework with two methods that improve CL training through joint sample generation and contrastive learning. The first approach generates hard samples for the main model: the generator is jointly learned with the main model to dynamically customize hard samples based on the main model's training state. In the second, a pair of data generators produces similar but distinct samples as positive pairs, and during joint learning the hardness of a positive pair is progressively increased by decreasing their similarity. Experimental results on multiple datasets show the superior accuracy and data efficiency of the proposed data generation methods applied to CL. For example, accuracy improvements of about 4.0%, 3.5%, and 2.6% for linear classification are observed on ImageNet-100, CIFAR-100, and CIFAR-10, respectively. In addition, up to 2× data efficiency for linear classification and up to 5× data efficiency for transfer learning are achieved.
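The progressive-hardness idea for positive pairs can be illustrated with a toy embedding-space sketch. The actual method learns generator networks jointly with the encoder; here the function name, the linear noise schedule, and the fixed seed are illustrative assumptions.

```python
import numpy as np

def positive_pair(anchor, step, total_steps, rng=None):
    """Produce a synthetic positive for `anchor` whose similarity to the
    anchor decreases as training progresses, mimicking a hardness schedule.

    anchor: unit-norm embedding vector.
    step / total_steps: training progress in [0, 1].
    """
    if rng is None:
        rng = np.random.default_rng(0)
    # hardness grows linearly: early positives are near-duplicates of the
    # anchor, late positives are distinct but still derived from it
    noise_scale = step / total_steps
    pos = anchor + noise_scale * rng.normal(size=anchor.shape)
    return pos / np.linalg.norm(pos)
```

An easy positive at step 0 is the anchor itself (cosine similarity 1); by the end of training the pair's similarity has dropped, so the contrastive loss must work harder to pull the pair together.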
-
Simulation-free methods for training continuous-time generative models construct probability paths between noise distributions and individual data samples. Recent works, such as Flow Matching, derived paths that are optimal for each data sample. However, these algorithms rely on independent data and noise samples, and do not exploit underlying structure in the data distribution when constructing probability paths. We propose Multisample Flow Matching, a more general framework that uses non-trivial couplings between data and noise samples while satisfying the correct marginal constraints. At a very small overhead cost, this generalization allows us to (i) reduce gradient variance during training, (ii) obtain straighter flows for the learned vector field, which lets us generate high-quality samples using fewer function evaluations, and (iii) obtain transport maps with lower cost in high dimensions, which has applications beyond generative modeling. Importantly, we do so in a completely simulation-free manner with a simple minimization objective. We show that our proposed methods improve sample consistency on downsampled ImageNet datasets and lead to better low-cost sample generation.
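A non-trivial coupling can be sketched as a within-minibatch assignment computed before the usual Flow Matching targets. The brute-force permutation search below is an illustrative stand-in for the paper's couplings (practical implementations would use an O(n³) assignment solver or entropic OT); because the coupling is a permutation, both marginals are preserved.

```python
import itertools
import numpy as np

def coupled_pairs(noise, data):
    """Pair each noise sample with a data sample by minimizing the total
    squared Euclidean cost over all permutations (tiny batches only)."""
    cost = ((noise[:, None, :] - data[None, :, :]) ** 2).sum(-1)
    n = len(noise)
    best = min(itertools.permutations(range(n)),
               key=lambda p: sum(cost[i, p[i]] for i in range(n)))
    return noise, data[list(best)]

def flow_matching_targets(x0, x1, t):
    """Standard linear path x_t = (1 - t) * x0 + t * x1 with regression
    target v = x1 - x0 for the learned vector field."""
    xt = (1.0 - t)[:, None] * x0 + t[:, None] * x1
    return xt, x1 - x0
```

Pairing nearby noise/data samples shortens the interpolation segments, which is the mechanism behind the straighter flows and lower gradient variance described above.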
-
Motivation: Publicly available k-space data used for training are inherently noisy, with no ground truth available.
Goal(s): To denoise k-space data in an unsupervised manner for downstream applications.
Approach: We apply the Generalized Stein's Unbiased Risk Estimate (GSURE) to multi-coil MRI to denoise images without access to ground truth. Subsequently, we train a generative model to show improved accelerated MRI reconstruction.
Results: We demonstrate that (1) GSURE can successfully remove noise from k-space; (2) generative priors learned on GSURE-denoised samples produce realistic synthetic samples; and (3) reconstruction performance on subsampled MRI improves when using priors trained on denoised images rather than on noisy samples.
Impact: We show that multi-coil data can be denoised without ground truth, and that deep generative models can be trained directly on noisy k-space in an unsupervised manner for improved accelerated reconstruction.
-
Modern generative models exhibit unprecedented capabilities to generate extremely realistic data. However, given the inherent compositionality of the real world, reliable use of these models in practical applications requires that they compose their capabilities, generating and reasoning over entirely novel samples never seen in the training distribution. Prior work demonstrates that recent vision diffusion models exhibit intriguing compositional generalization abilities, but also that they fail rather unpredictably. What are the reasons underlying this behavior? Which concepts does the model generally find difficult to compose into novel data? To address these questions, we perform a controlled study of compositional generalization in conditional diffusion models in a synthetic setting, varying different attributes of the training data and measuring the model's ability to generate samples out-of-distribution. Our results show that: (i) the compositional structure of the data-generating process governs the order in which capabilities, and the ability to compose them, emerge; (ii) learning individual concepts impacts performance on compositional tasks, multiplicatively explaining sudden emergence; and (iii) learning and composing capabilities is difficult under correlations. We hope our study inspires further grounded research on understanding capabilities and compositionality in generative models from a data-centric perspective.
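The controlled setting described above can be sketched as an attribute-grid split: enumerate every combination of generative attributes, then hold out chosen combinations for out-of-distribution evaluation. The attribute names below are hypothetical, not the paper's actual attributes.

```python
import itertools

def compositional_split(attribute_values, held_out):
    """Enumerate all attribute combinations of a synthetic data-generating
    process and hold out specific ones for out-of-distribution evaluation.

    attribute_values: list of value tuples, one per attribute.
    held_out: set of combinations never shown during training.
    Returns (train_combinations, test_combinations).
    """
    all_combos = list(itertools.product(*attribute_values))
    train = [c for c in all_combos if c not in held_out]
    test = [c for c in all_combos if c in held_out]
    return train, test
```

A model trained only on the `train` combinations is then asked to generate the `test` combinations, so any success on them must come from composing concepts it learned separately.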
