Title: Fair Comparison: Quantifying Variance in Results for Fine-grained Visual Categorization
For the task of image classification, researchers work arduously to develop the next state-of-the-art (SOTA) model, each benchmarking their own performance against that of their predecessors and of their peers. Unfortunately, the metric used most frequently to describe a model’s performance, average categorization accuracy, is often used in isolation. As the number of classes increases, such as in fine-grained visual categorization (FGVC), the amount of information conveyed by average accuracy alone dwindles. While its most glaring weakness is its failure to describe the model’s performance on a class-by-class basis, average accuracy also fails to describe how performance may vary from one trained model of the same architecture, on the same dataset, to another (both averaged across all categories and at the per-class level). We first demonstrate the magnitude of these variations across models and across class distributions based on attributes of the data, comparing results on different visual domains and different per-class image distributions, including long-tailed distributions and few-shot subsets. We then analyze the impact various FGVC methods have on overall and per-class variance. From this analysis, we both highlight the importance of reporting and comparing methods based on information beyond overall accuracy, as well as point out techniques that mitigate variance in FGVC results.
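The run-to-run and per-class variation the abstract describes can be made concrete with a short sketch. The accuracies below are synthetic stand-ins for real trained models; the point is only what a single average-accuracy number hides:

```python
import numpy as np

# Hypothetical per-class accuracies (rows: independently trained runs of
# the same architecture, columns: classes). Real values would come from
# evaluating each trained model on a shared test set.
rng = np.random.default_rng(0)
per_class_acc = rng.uniform(0.6, 1.0, size=(5, 10))  # 5 runs, 10 classes

overall_acc = per_class_acc.mean(axis=1)           # average accuracy per run
run_to_run_std = overall_acc.std(ddof=1)           # spread hidden by one number
per_class_std = per_class_acc.std(axis=0, ddof=1)  # how each class fluctuates across runs

print("overall accuracy per run:", overall_acc.round(3))
print("std of overall accuracy across runs:", round(float(run_to_run_std), 4))
print("worst-case per-class std:", round(float(per_class_std.max()), 4))
```

Reporting the per-class standard deviation alongside the mean is one way to make the comparisons the paper argues for.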
Award ID(s):
1651832
PAR ID:
10416680
Author(s) / Creator(s):
Date Published:
Journal Name:
2021 IEEE Winter Conference on Applications of Computer Vision (WACV)
Page Range / eLocation ID:
3308 to 3317
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. While the community has seen many advances in recent years to address the challenging problem of Fine-grained Visual Categorization (FGVC), progress seems to be slowing—new state-of-the-art methods often distinguish themselves by improving top-1 accuracy by mere tenths of a percent. However, across all of the now-standard FGVC datasets, there remain sizeable portions of the test data that none of the current state-of-the-art (SOTA) models can successfully predict. This paper provides a framework for identifying and studying the errors that current methods make across diverse fine-grained datasets. Three models of difficulty—Prediction Overlap, Prediction Rank and Pairwise Class Confusion—are employed to highlight the most challenging sets of images and classes. Extensive experiments apply a range of standard and SOTA methods, evaluating them on multiple FGVC domains and datasets. Insights acquired from coupling these difficulty paradigms with the careful analysis of experimental results suggest crucial areas for future FGVC research, focusing critically on the set of elusive images that none of the current models can correctly classify. Code is available at catalys1.github.io/elusive-images-fgvc. 
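The Prediction Overlap paradigm named above amounts to intersecting per-image correctness masks across models to surface images no model solves. A toy sketch (the masks are made up; real ones would come from actual model predictions) might look like:

```python
import numpy as np

# Per-image correctness masks for several models on the same test set:
# 1 = model predicted the image correctly, 0 = it did not.
correct = np.array([
    [1, 1, 0, 0, 1],   # model A
    [1, 0, 1, 0, 1],   # model B
    [1, 1, 1, 0, 0],   # model C
], dtype=bool)

n_correct = correct.sum(axis=0)            # how many models solve each image
elusive = np.flatnonzero(n_correct == 0)   # images every model misses
easy = np.flatnonzero(n_correct == correct.shape[0])  # images every model solves

print("elusive image indices:", elusive.tolist())  # -> [3]
print("images all models solve:", easy.tolist())   # -> [0]
```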
  2. Key recognition tasks such as fine-grained visual categorization (FGVC) have benefited from increasing attention among computer vision researchers. The development and evaluation of new approaches relies heavily on benchmark datasets; such datasets are generally built primarily with categories that have images readily available, omitting categories with insufficient data. This paper takes a step back and rethinks dataset construction, focusing on intelligent image collection driven by: (i) the inclusion of all desired categories, and, (ii) the recognition performance on those categories. Based on a small, author-provided initial dataset, the proposed system recommends which categories the authors should prioritize collecting additional images for, with the intent of optimizing overall categorization accuracy. We show that mock datasets built using this method outperform datasets built without such a guiding framework. Additional experiments give prospective dataset creators intuition into how, based on their circumstances and goals, a dataset should be constructed. 
  3. Nature and wildlife observation is the practice of noting both the occurrence and abundance of plant or animal species at a specific location and time. Common examples of this type of activity are bird watching (birding), insect collecting, and plant observation (botanizing), and these are widely accepted as both recreational and scientific activities in their respective fields. However, many highly-similar species are difficult to disambiguate; identifying an observed specimen requires expert knowledge and experience in many cases. This hard problem is called Fine-grained Visual Categorization (FGVC) and focuses on differentiating between hard-to-distinguish object classes. Examples of such fine-level classification include discriminating between similar species of plants and animals or identifying the make and model of vehicles, instead of recognizing these objects at a coarse level. An FGVC example of butterflies is shown in Figure 1. These two species have similar colors and shapes, but the patterns on the wings are distinct. When presented with near-identical poses as in the figure, this classification can be performed very effectively by a machine. However, in more extreme conditions of pose, illumination, occlusion, etc., the task becomes much harder. While machines struggle in such scenarios, humans can still find the needed visual cues and differences by factoring in the pose of the butterfly and comparing patterns on common parts; in part, because humans can infer an object's rough 3D shape, understand the lighting and camera angle, and even envision what it would look like from another pose. Humans have developed a 3D understanding of a butterfly because we have seen moving butterflies previously. What if machines had the same information about the object? Information such as object pose, camera angle, object texture, and part labels would undoubtedly help improve performance on the FGVC task. 
  4. Human scene categorization is characterized by its remarkable speed. While many visual and conceptual features have been linked to this ability, significant correlations exist between feature spaces, impeding our ability to determine their relative contributions to scene categorization. Here, we used a whitening transformation to decorrelate a variety of visual and conceptual features and assess the time course of their unique contributions to scene categorization. Participants (both sexes) viewed 2250 full-color scene images drawn from 30 different scene categories while having their brain activity measured through 256-channel EEG. We examined the variance explained at each electrode and time point of visual event-related potential (vERP) data from nine different whitened encoding models. These ranged from low-level features obtained from filter outputs to high-level conceptual features requiring human annotation. The amount of category information in the vERPs was assessed through multivariate decoding methods. Behavioral similarity measures were obtained in separate crowdsourced experiments. We found that all nine models together contributed 78% of the variance of human scene similarity assessments and were within the noise ceiling of the vERP data. Low-level models explained earlier vERP variability (88 ms after image onset), whereas high-level models explained later variance (169 ms). Critically, only high-level models shared vERP variability with behavior. Together, these results suggest that scene categorization is primarily a high-level process, but reliant on previously extracted low-level features. 
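The whitening transformation used above to decorrelate feature spaces can be illustrated with a generic ZCA-whitening sketch. The features here are synthetic and the procedure is a standard one, not necessarily the paper's exact pipeline:

```python
import numpy as np

# ZCA whitening: decorrelate the columns of a feature matrix so that each
# feature's unique contribution can be assessed without shared variance.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
X[:, 1] += 0.8 * X[:, 0]            # inject correlation between two features

Xc = X - X.mean(axis=0)              # center each feature
cov = Xc.T @ Xc / (len(Xc) - 1)
vals, vecs = np.linalg.eigh(cov)     # eigendecomposition of the covariance
W = vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T  # ZCA whitening matrix
Xw = Xc @ W

cov_w = Xw.T @ Xw / (len(Xw) - 1)
print(np.allclose(cov_w, np.eye(4), atol=1e-6))  # covariance is now ~identity
```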
  5. Fine-Grained Visual Classification (FGVC) datasets contain small sample sizes, along with significant intra-class variation and inter-class similarity. While prior work has addressed intra-class variation using localization and segmentation techniques, inter-class similarity may also affect feature learning and reduce classification performance. In this work, we address this problem using a novel optimization procedure for the end-to-end neural network training on FGVC tasks. Our procedure, called Pairwise Confusion (PC) reduces overfitting by intentionally introducing confusion in the activations. With PC regularization, we obtain state-of-the-art performance on six of the most widely-used FGVC datasets and demonstrate improved localization ability. PC is easy to implement, does not need excessive hyperparameter tuning during training, and does not add significant overhead during test time. 
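The Pairwise Confusion idea, penalizing the distance between softmax outputs of paired training samples, can be sketched as follows. The pairing scheme (consecutive batch halves) and the weight `lam` are simplified stand-ins, not the paper's exact formulation:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def pc_loss(logits, labels, lam=10.0):
    """Cross-entropy plus a Pairwise Confusion style penalty: the mean
    squared Euclidean distance between softmax outputs of paired samples,
    which discourages over-confident, sample-specific activations."""
    p = softmax(logits)
    ce = -np.log(p[np.arange(len(labels)), labels]).mean()
    half = len(p) // 2
    confusion = ((p[:half] - p[half:2 * half]) ** 2).sum(axis=1).mean()
    return ce + lam * confusion

rng = np.random.default_rng(0)
logits = rng.normal(size=(8, 5))
labels = rng.integers(0, 5, size=8)
print(pc_loss(logits, labels))            # cross-entropy plus confusion penalty
print(pc_loss(np.zeros((8, 5)), labels))  # identical outputs -> penalty term is zero
```

In a real training loop this scalar would replace the plain cross-entropy objective; everything else stays the same, which is why the method adds little overhead.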