For the task of image classification, researchers work arduously to develop the next state-of-the-art (SOTA) model, each benchmarking their own performance against that of their predecessors and peers. Unfortunately, the metric used most frequently to describe a model’s performance, average categorization accuracy, is often used in isolation. As the number of classes increases, such as in fine-grained visual categorization (FGVC), the amount of information conveyed by average accuracy alone dwindles. While its most glaring weakness is its failure to describe the model’s performance on a class-by-class basis, average accuracy also fails to describe how performance may vary from one trained model to another of the same architecture on the same dataset (both averaged across all categories and at the per-class level). We first demonstrate the magnitude of these variations across models and across class distributions based on attributes of the data, comparing results on different visual domains and different per-class image distributions, including long-tailed distributions and few-shot subsets. We then analyze the impact various FGVC methods have on overall and per-class variance. From this analysis, we highlight the importance of reporting and comparing methods based on information beyond overall accuracy, and we point out techniques that mitigate variance in FGVC results.
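The kind of reporting this abstract argues for can be illustrated with a minimal sketch. It assumes each trained model's test predictions are available as a plain list of class indices; the function names (`per_class_accuracy`, `accuracy_spread`) and the toy data are hypothetical and are not taken from the paper.

```python
from statistics import mean, pstdev

def per_class_accuracy(y_true, y_pred, num_classes):
    """Accuracy of each class; None for a class with no test samples."""
    correct = [0] * num_classes
    total = [0] * num_classes
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    return [c / n if n else None for c, n in zip(correct, total)]

def accuracy_spread(y_true, runs, num_classes):
    """Mean and population std of overall and per-class accuracy across runs."""
    overall = [mean(int(t == p) for t, p in zip(y_true, run)) for run in runs]
    per_run = [per_class_accuracy(y_true, run, num_classes) for run in runs]
    per_class = []
    for c in range(num_classes):
        accs = [r[c] for r in per_run if r[c] is not None]
        per_class.append((mean(accs), pstdev(accs)) if accs else None)
    return {
        "overall_mean": mean(overall),
        "overall_std": pstdev(overall),
        "per_class": per_class,  # list of (mean, std) tuples, one per class
    }

# Toy data (hypothetical): three trained models of the same architecture,
# evaluated on the same six test images from three classes.
y_true = [0, 0, 1, 1, 2, 2]
runs = [
    [0, 0, 1, 1, 2, 0],
    [0, 0, 1, 0, 2, 2],
    [0, 0, 0, 1, 0, 2],
]
summary = accuracy_spread(y_true, runs, num_classes=3)
print(summary["overall_mean"], summary["overall_std"])
for cls, stats in enumerate(summary["per_class"]):
    print(cls, stats)
```

In this toy case the overall accuracy differs only slightly between runs, while classes 1 and 2 swing between 0.5 and 1.0, which is the sort of per-class variation that a single averaged number hides.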
                    
                            
                            Elusive Images: Beyond Coarse Analysis for Fine-Grained Recognition
                        
                    
    
            While the community has seen many advances in recent years to address the challenging problem of Fine-grained Visual Categorization (FGVC), progress seems to be slowing—new state-of-the-art methods often distinguish themselves by improving top-1 accuracy by mere tenths of a percent. However, across all of the now-standard FGVC datasets, there remain sizeable portions of the test data that none of the current state-of-the-art (SOTA) models can successfully predict. This paper provides a framework for identifying and studying the errors that current methods make across diverse fine-grained datasets. Three models of difficulty—Prediction Overlap, Prediction Rank and Pair-wise Class Confusion—are employed to highlight the most challenging sets of images and classes. Extensive experiments apply a range of standard and SOTA methods, evaluating them on multiple FGVC domains and datasets. Insights acquired from coupling these difficulty paradigms with the careful analysis of experimental results suggest crucial areas for future FGVC research, focusing critically on the set of elusive images that none of the current models can correctly classify. Code is available at catalys1.github.io/elusive-images-fgvc. 
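The notion of elusive images, meaning test images that none of the evaluated models classifies correctly, can be sketched as follows. This is an illustration under stated assumptions only: it does not reproduce the paper's Prediction Overlap, Prediction Rank, or Pair-wise Class Confusion definitions, which the abstract does not spell out. The function names and the model names `model_a` and `model_b` are hypothetical.

```python
def correct_counts(y_true, predictions):
    """For each test image, count how many models predict its label correctly."""
    return [
        sum(preds[i] == t for preds in predictions.values())
        for i, t in enumerate(y_true)
    ]

def elusive_images(y_true, predictions):
    """Indices of test images that no model classifies correctly."""
    counts = correct_counts(y_true, predictions)
    return [i for i, n in enumerate(counts) if n == 0]

# Toy data (hypothetical): top-1 predictions of two models on four test images.
y_true = [0, 1, 2, 1]
predictions = {
    "model_a": [0, 2, 2, 0],
    "model_b": [0, 2, 1, 1],
}
counts = correct_counts(y_true, predictions)
elusive = elusive_images(y_true, predictions)
print(counts)
print(elusive)
```

Keeping the per-image counts rather than only the empty-intersection set makes it possible to also look at images that only a few models get right.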
        
    
- Award ID(s): 1651832
- PAR ID: 10630061
- Publisher / Repository: IEEE
- Date Published:
- ISSN: 2642-9381
- ISBN: 979-8-3503-1892-0
- Page Range / eLocation ID: 818 to 828
- Format(s): Medium: X
- Location: Waikoloa, HI, USA
- Sponsoring Org: National Science Foundation
More Like this
- 
            Fine-Grained Visual Classification (FGVC) datasets contain small sample sizes, along with significant intra-class variation and inter-class similarity. While prior work has addressed intra-class variation using localization and segmentation techniques, inter-class similarity may also affect feature learning and reduce classification performance. In this work, we address this problem using a novel optimization procedure for end-to-end neural network training on FGVC tasks. Our procedure, called Pairwise Confusion (PC), reduces overfitting by intentionally introducing confusion in the activations. With PC regularization, we obtain state-of-the-art performance on six of the most widely used FGVC datasets and demonstrate improved localization ability. PC is easy to implement, does not need excessive hyperparameter tuning during training, and does not add significant overhead during test time.
- 
            Existing image-to-image transformation approaches primarily focus on synthesizing visually pleasing data. Generating images with correct identity labels is challenging yet much less explored. It is even more challenging to deal with image transformation tasks with large deformation in poses, viewpoints, or scales while preserving the identity, such as face rotation and object viewpoint morphing. In this paper, we aim to transform an image of a fine-grained category to synthesize new images that preserve the identity of the input image, which can thereby benefit subsequent fine-grained image recognition and few-shot learning tasks. The generated images, transformed with large geometric deformation, do not necessarily need to be of high visual quality but are required to maintain as much identity information as possible. To this end, we adopt a model based on generative adversarial networks to disentangle the identity-related and identity-unrelated factors of an image. In order to preserve the fine-grained contextual details of the input image during the deformable transformation, a constrained nonalignment connection method is proposed to construct learnable highways between intermediate convolution blocks in the generator. Moreover, an adaptive identity modulation mechanism is proposed to transfer the identity information into the output image effectively. Extensive experiments on the CompCars and Multi-PIE datasets demonstrate that our model preserves the identity of the generated images much better than the state-of-the-art image-to-image transformation models, and as a result significantly boosts the visual recognition performance in fine-grained few-shot learning.
- 
            State-of-the-art (SOTA) semi-supervised learning (SSL) methods have been highly successful in leveraging a mix of labeled and unlabeled data, often via self-training or pseudo-labeling. During pseudo-labeling, the model's predictions on unlabeled data are used for training and may result in confirmation bias, where the model reinforces its own mistakes. In this work, we show that SOTA SSL methods often suffer from confirmation bias and demonstrate that this is often a result of using a poorly calibrated classifier for pseudo-labeling. We introduce BaM-SSL, an efficient Bayesian Model averaging technique that improves uncertainty quantification in SSL methods with limited computational or memory overhead. We demonstrate that BaM-SSL mitigates confirmation bias in SOTA SSL methods across the standard vision benchmarks CIFAR-10 and CIFAR-100, giving up to 16% improvement in test accuracy on the CIFAR-100 benchmark with 400 labels. Furthermore, we demonstrate its effectiveness in additional realistic and challenging problems, such as class-imbalanced datasets and photonics science.
- 
            Image manipulation localization (IML) is a critical technique in media forensics, focusing on identifying tampered regions within manipulated images. Most existing IML methods require extensive training on labeled datasets with both image-level and pixel-level annotations. These methods often struggle with new manipulation types and exhibit low generalizability. In this work, we propose a training-free IML approach using diffusion models. Our method adaptively selects an appropriate number of diffusion timesteps for each input image in the forward process and performs both conditional and unconditional reconstructions in the backward process without relying on external conditions. By comparing these reconstructions, we generate a localization map highlighting regions of manipulation based on inconsistencies. Extensive experiments were conducted using sixteen state-of-the-art (SoTA) methods across six IML datasets. The results demonstrate that our training-free method outperforms SoTA unsupervised and weakly-supervised techniques. Furthermore, our method competes effectively against fully-supervised methods on novel (unseen) manipulation types.