Embarrassingly Simple Dataset Distillation

Training large-scale models generally requires enormous amounts of training data. Dataset distillation aims to extract a small set of synthetic training samples from a large dataset such that a model trained on this small set achieves competitive performance on test data, thereby reducing both dataset size and training time. In this work, we tackle dataset distillation at its core by treating it directly as a bilevel optimization problem. Re-examining the foundational backpropagation-through-time method, we study its pronounced gradient variance, computational burden, and long-term dependencies. We introduce an improved method, Random Truncated Backpropagation Through Time (RaT-BPTT), to address these issues. RaT-BPTT couples truncation with a randomly placed backpropagation window, which stabilizes the gradients and speeds up optimization while still covering long-range dependencies. This allows us to establish a new state of the art in dataset distillation on a variety of standard benchmarks. A deeper dive into the nature of distilled data unveils pronounced intercorrelation: subsets of a distilled dataset tend to perform much worse than a directly distilled smaller dataset of the same size. Leveraging RaT-BPTT, we devise a boosting mechanism that generates distilled datasets containing subsets with near-optimal performance across different data budgets.
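To make the idea concrete, here is a minimal sketch of the random truncated unrolling described in the abstract, written in PyTorch. It is an illustration, not the authors' implementation: the helper `make_student`, the plain-SGD inner loop, hard labels, and the hyperparameters `N`, `T`, and `inner_lr` are assumptions, and details such as the inner optimizer and batching are omitted. A freshly initialized student network is unrolled on the distilled data for `N` steps, but the meta-gradient is backpropagated only through a window of `T` consecutive steps whose position is sampled at random.

```python
# Illustrative sketch of the RaT-BPTT idea (not the authors' code).
# Assumes a buffer-free student network (e.g., no batch norm).
import random
import torch
import torch.nn.functional as F

def rat_bptt_meta_loss(distilled_x, distilled_y, real_x, real_y,
                       make_student, N=40, T=10, inner_lr=0.01):
    student = make_student()
    names = [n for n, _ in student.named_parameters()]
    params = [p.detach().clone().requires_grad_(True) for p in student.parameters()]
    end = random.randint(T, N)            # window covers inner steps [end - T, end)
    for step in range(end):
        in_window = step >= end - T
        out = torch.func.functional_call(student, dict(zip(names, params)), (distilled_x,))
        inner_loss = F.cross_entropy(out, distilled_y)
        grads = torch.autograd.grad(inner_loss, params, create_graph=in_window)
        if in_window:                     # keep the graph: meta-gradient flows through these steps
            params = [p - inner_lr * g for p, g in zip(params, grads)]
        else:                             # truncate: plain SGD step, graph discarded
            params = [(p - inner_lr * g).detach().requires_grad_(True)
                      for p, g in zip(params, grads)]
    # Outer loss on real data; differentiating it w.r.t. distilled_x (which should
    # have requires_grad=True) gives the update direction for the distilled images.
    out = torch.func.functional_call(student, dict(zip(names, params)), (real_x,))
    return F.cross_entropy(out, real_y)
```

In an outer loop one would repeatedly call this function with `distilled_x.requires_grad_(True)`, backpropagate the returned loss, and update the distilled images with an outer optimizer such as Adam.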
- Award ID(s): 1922658
- PAR ID: 10534878
- Publisher / Repository: International Conference on Learning Representations
- Date Published:
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
- In classification problems, mislabeled data can have a dramatic effect on the capability of a trained model. The traditional way of dealing with mislabeled data is expert review, but this is not always practical, given the large volume of data in many classification datasets (such as image datasets supporting deep learning models) and the limited availability of human experts to review the data. Herein, we propose an ordered sample consensus (ORSAC) method to support data cleaning by flagging mislabeled data. The method is inspired by the random sample consensus (RANSAC) method for outlier detection. In short, it involves iteratively training and testing a model on different splits of the dataset, recording misclassifications, and flagging data that is frequently misclassified as probably mislabeled. We evaluate the method by purposefully mislabeling subsets of data and assessing its capability to find such data. Experiments on three datasets, a mosquito image dataset, CIFAR-10, and CIFAR-100, show that the method identifies mislabeled data with high accuracy across different mislabeling frequencies.
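As a rough illustration of the flagging loop described above (not the authors' code), the sketch below repeatedly trains a classifier on a random split, tests on the held-out part, counts how often each sample is misclassified, and flags frequent offenders; `make_model`, `n_rounds`, `test_frac`, and `threshold` are illustrative placeholders.

```python
# ORSAC-style flagging loop (illustrative sketch).
import numpy as np

def flag_suspect_labels(X, y, make_model, n_rounds=20, test_frac=0.3, threshold=0.5, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    miscounts = np.zeros(n)              # times each sample was misclassified
    tested = np.zeros(n)                 # times each sample landed in a test split
    for _ in range(n_rounds):
        test_idx = rng.choice(n, size=int(test_frac * n), replace=False)
        train_mask = np.ones(n, dtype=bool)
        train_mask[test_idx] = False
        model = make_model()             # any classifier with fit/predict
        model.fit(X[train_mask], y[train_mask])
        wrong = model.predict(X[test_idx]) != y[test_idx]
        miscounts[test_idx] += wrong
        tested[test_idx] += 1
    rate = np.divide(miscounts, tested, out=np.zeros(n), where=tested > 0)
    return np.where(rate >= threshold)[0]    # indices of probably mislabeled samples
```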
- Knowledge distillation aims at reducing model size without compromising much performance. Recent work has applied it to large vision-language (VL) Transformers and has shown that attention maps in their multi-head attention modules contain extensive intra-modal and cross-modal co-reference relations to be distilled. The standard approach is a one-to-one attention map distillation loss, i.e., the Teacher's first attention head instructs the Student's first head, the second teaches the second, and so forth, but this only works when the Teacher and Student have the same number of attention heads. To remove this constraint, we propose a new Attention Map Alignment Distillation (AMAD) method for Transformers with multi-head attention, which works for a Teacher and a Student with different numbers of attention heads. Specifically, we soft-align Teacher and Student attention heads using cosine similarity weighting: a Teacher head contributes more to the Student heads with which it has a higher similarity weight, and each Teacher head contributes to all Student heads by minimizing the divergence between the attention distributions of the soft-aligned heads. No head is left behind. This distillation approach operates like cross-attention. We experiment on distilling VL-T5 and BLIP, applying the AMAD loss to their T5, BERT, and ViT sub-modules. We show, in the vision-language setting, that AMAD outperforms conventional distillation methods on the VQA-2.0, COCO captioning, and Multi30K translation datasets. We further show that even without VL pre-training, the distilled VL-T5 models outperform corresponding VL pre-trained VL-T5 models that are further fine-tuned on ground-truth signals, and that fine-tuning distillation can also partly compensate for the absence of VL pre-training in BLIP models.
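The sketch below illustrates the kind of cosine-weighted, soft-aligned attention-map loss the abstract describes; it is an approximation for intuition, not the paper's implementation. It assumes row-stochastic attention maps of shape (batch, heads, length, length), spreads each Teacher head over all Student heads via a softmax over cosine similarities, and uses a KL divergence over the key dimension as the per-pair discrepancy; the `eps` smoothing is an assumption.

```python
# Cosine-weighted attention-map alignment loss (illustrative sketch).
import torch
import torch.nn.functional as F

def attention_alignment_loss(teacher_attn, student_attn, eps=1e-8):
    """teacher_attn: (B, Ht, L, L) row-stochastic maps; student_attn: (B, Hs, L, L)."""
    B, Ht, L, _ = teacher_attn.shape
    Hs = student_attn.shape[1]
    # Cosine similarity between flattened heads -> soft alignment weights per teacher head.
    t_flat = teacher_attn.reshape(B, Ht, 1, -1)
    s_flat = student_attn.reshape(B, 1, Hs, -1)
    weights = F.softmax(F.cosine_similarity(t_flat, s_flat, dim=-1), dim=-1)   # (B, Ht, Hs)
    # KL divergence over the key dimension for every (teacher head, student head) pair.
    # Note: the broadcasted intermediate has shape (B, Ht, Hs, L, L).
    t = teacher_attn.unsqueeze(2)        # (B, Ht, 1, L, L)
    s = student_attn.unsqueeze(1)        # (B, 1, Hs, L, L)
    kl = (t * (torch.log(t + eps) - torch.log(s + eps))).sum(-1).mean(-1)      # (B, Ht, Hs)
    return (weights * kl).mean()
```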
- Abundance-weighted averaging is a simple and common method for estimating taxon preferences (optima) for phosphorus (P) and other environmental drivers of freshwater-ecosystem health. These optima can then be used to develop transfer functions to infer current and/or past environmental conditions of aquatic ecosystems in water-quality assessments and/or paleolimnological studies. However, estimates of species' environmental preferences are influenced by the sample distribution and length of environmental gradients, which can differ between datasets used to develop and apply a transfer function. Here, we introduce a subsampling method to ensure a uniform and comparable distribution of samples along a P gradient in two similar ecosystems: the Everglades Protection Areas (EPA) and Big Cypress National Preserve (BICY) in South Florida, USA. Diatom optima were estimated for both wetlands using weighted averaging of untransformed and log-transformed periphyton mat total phosphorus (mat TP) values from the original datasets. We compared these estimates to those derived from random subsets of the original datasets. These subsets, referred to as "SUD" datasets, were created to ensure a uniform distribution of mat TP values along the gradient (both untransformed and log-transformed). We found that diatom assemblages in BICY and EPA were similar, dominated by taxa indicating oligotrophic conditions, and strongly influenced by P gradients. However, the original BICY datasets contained more samples with elevated mat TP concentrations than the EPA datasets, introducing a mathematical bias and resulting in a higher abundance of taxa with high mat TP optima in BICY. The weighted-averaged mat TP optima of BICY and EPA taxa were positively correlated across all four dataset types, with taxa optima of the SUD datasets exhibiting higher correlations than in the original datasets. Equalizing the mat TP sample distribution in the two datasets confirmed consistent mat TP estimates for diatom taxa between the two wetland complexes and improved transfer-function performance. Our findings suggest that diatom environmental preferences may be more reliable across regional scales than previously suggested and support the application of models developed in one region to another nearby region if environmental gradient lengths are equalized and data distribution along gradients is uniform.
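For intuition, the sketch below shows the two ingredients the abstract describes: abundance-weighted averaging of taxon mat TP optima, and drawing a subset that is roughly uniform along the TP gradient (a "SUD"-style subset). Variable names, the bin count, and the per-bin quota are assumptions, not the study's code; the same routines could be applied to log-transformed mat TP values.

```python
# Abundance-weighted optima and uniform-gradient subsampling (illustrative sketch).
import numpy as np

def weighted_average_optima(abundance, mat_tp):
    """abundance: (n_samples, n_taxa) relative abundances; mat_tp: (n_samples,) TP values.
    Returns each taxon's abundance-weighted TP optimum."""
    return (abundance * mat_tp[:, None]).sum(axis=0) / abundance.sum(axis=0)

def uniform_gradient_subsample(mat_tp, n_bins=10, per_bin=5, seed=0):
    """Draw up to `per_bin` samples from each equal-width TP bin along the gradient."""
    rng = np.random.default_rng(seed)
    edges = np.linspace(mat_tp.min(), mat_tp.max(), n_bins + 1)
    bin_id = np.clip(np.digitize(mat_tp, edges) - 1, 0, n_bins - 1)
    keep = []
    for b in range(n_bins):
        idx = np.where(bin_id == b)[0]
        if len(idx) > 0:
            keep.extend(rng.choice(idx, size=min(per_bin, len(idx)), replace=False))
    return np.sort(np.array(keep, dtype=int))
```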
- This paper presents an active distillation method for a local institution (e.g., a hospital) to find the best queries within its given budget to distill an on-server black-box model's predictive knowledge into a local surrogate with transparent parameterization. This allows local institutions to better understand the predictive reasoning of the black-box model in their own local context, or to further customize the distilled knowledge with private datasets that cannot be centralized and fed into the server model. The proposed method thus addresses several challenges of deploying machine learning (ML) in many industrial settings (e.g., healthcare analytics) with strong proprietary constraints, including: (1) the opaqueness of the server model's architecture, which prevents local users from understanding its predictive reasoning in their local data contexts; (2) the increasing cost and risk of uploading local data to the cloud for analysis; and (3) the need to customize the server model with private on-site data. We evaluated the proposed method on both benchmark and real-world healthcare data and observed significant improvements over existing local distillation methods. A theoretical analysis of the proposed method is also presented.
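As a purely conceptual sketch of the setting described above, the code below distills an opaque server model into a local surrogate under a query budget. The query-selection rule shown (picking the surrogate's most uncertain candidates) is an assumption for illustration, not the paper's criterion, and `query_black_box`, `pool_x`, and the hyperparameters are hypothetical.

```python
# Budgeted black-box distillation into a local surrogate (conceptual sketch).
import torch
import torch.nn.functional as F

def distill_black_box(query_black_box, surrogate, pool_x, budget=100,
                      batch=10, epochs=20, lr=1e-3):
    """query_black_box(x) -> the server model's soft predictions (opaque API).
    pool_x: (N, ...) local, unlabeled candidate queries."""
    opt = torch.optim.Adam(surrogate.parameters(), lr=lr)
    remaining = torch.arange(len(pool_x))
    queried_x, queried_p, spent = [], [], 0
    while spent < budget and len(remaining) > 0:
        with torch.no_grad():
            probs = F.softmax(surrogate(pool_x[remaining]), dim=-1)
            entropy = -(probs * probs.clamp_min(1e-12).log()).sum(-1)
        k = min(batch, budget - spent, len(remaining))
        pick = entropy.topk(k).indices                    # most uncertain candidates
        queried_x.append(pool_x[remaining[pick]])
        queried_p.append(query_black_box(pool_x[remaining[pick]]))  # spend query budget
        spent += k
        keep = torch.ones(len(remaining), dtype=torch.bool)
        keep[pick] = False
        remaining = remaining[keep]
        x_all, p_all = torch.cat(queried_x), torch.cat(queried_p)
        for _ in range(epochs):                           # refit surrogate on all answers so far
            opt.zero_grad()
            loss = F.kl_div(F.log_softmax(surrogate(x_all), dim=-1), p_all,
                            reduction="batchmean")
            loss.backward()
            opt.step()
    return surrogate
```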
 