Facial attribute prediction is a facial analysis task that describes images using natural language features. While many works have attempted to optimize prediction accuracy on CelebA, the largest and most widely used facial attribute dataset, few works have analyzed the accuracy of the dataset's attribute labels. In this paper, we seek to do just that. Despite the popularity of CelebA, we find through quantitative analysis that there are widespread inconsistencies and inaccuracies in its attribute labeling. We estimate that at least one third of all images have one or more incorrect labels, and reliable predictions are impossible for several attributes due to inconsistent labeling. Our results demonstrate that classifiers struggle with many CelebA attributes not because they are difficult to predict, but because they are poorly labeled. This indicates that the CelebA dataset is flawed as a facial analysis tool and may not be suitable as a generic evaluation benchmark for imbalanced classification.
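To make the kind of label audit described above concrete, the following is a minimal sketch, assuming a hypothetical manually re-annotated subset (`relabeled_subset.csv`) that shares the attribute columns of the official annotation file; the file names and this auditing procedure are illustrative assumptions, not taken from the paper.

```python
# Sketch: estimate per-attribute and per-image label error rates by comparing
# the original CelebA annotations against a hypothetical re-annotated subset.
# File names and column layout are assumptions, not artifacts of the paper.
import pandas as pd

original = pd.read_csv("list_attr_celeba.csv", index_col="image_id")   # original CelebA attributes
relabeled = pd.read_csv("relabeled_subset.csv", index_col="image_id")  # hypothetical manual re-annotation

common = original.loc[relabeled.index, relabeled.columns]  # align on the audited images/attributes
disagree = (common != relabeled)

per_attribute_error = disagree.mean().sort_values(ascending=False)
per_image_error = disagree.any(axis=1).mean()

print(per_attribute_error.head(10))                        # attributes with the most label noise
print(f"images with >=1 disagreeing label: {per_image_error:.1%}")
```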
Improving Evaluation of Facial Attribute Prediction Models
CelebA is the most common and largest-scale dataset used to evaluate methods for facial attribute prediction, an important benchmark in imbalanced classification and face analysis. However, we argue that the evaluation metrics and baseline models currently used to compare the performance of different methods are insufficient for determining which approaches are best at classifying highly imbalanced attributes. We are able to obtain results comparable to the current state of the art using a ResNet-18 model trained with binary cross-entropy, a substantially less sophisticated approach than related work. We also show that we can obtain near-state-of-the-art results on accuracy using a model trained with just 10% of CelebA, and on balanced accuracy simply by maximizing recall for imbalanced attributes at the expense of all other metrics. To deal with these issues, we suggest several improvements to model evaluation, including better metrics, stronger baselines, and increased awareness of the limitations of the dataset.
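As a concrete illustration of the kind of plain baseline described above, the sketch below trains a ResNet-18 with binary cross-entropy on CelebA's 40 attributes and reports per-attribute balanced accuracy. The single-epoch loop, hyperparameters, and use of `torchvision.datasets.CelebA` are illustrative assumptions, not the paper's exact setup.

```python
# Sketch: ResNet-18 + binary cross-entropy baseline for CelebA attribute
# prediction, evaluated with per-attribute balanced accuracy.
# Hyperparameters and data handling are illustrative, not the paper's setup.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

device = "cuda" if torch.cuda.is_available() else "cpu"

tfm = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
train_set = datasets.CelebA("data", split="train", target_type="attr", transform=tfm, download=True)
test_set = datasets.CelebA("data", split="test", target_type="attr", transform=tfm, download=True)

model = models.resnet18(weights=None)
model.fc = nn.Linear(model.fc.in_features, 40)        # one logit per attribute
model = model.to(device)

criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

model.train()
for images, attrs in DataLoader(train_set, batch_size=64, shuffle=True):
    images, attrs = images.to(device), attrs.float().to(device)
    optimizer.zero_grad()
    loss = criterion(model(images), attrs)
    loss.backward()
    optimizer.step()

# Balanced accuracy per attribute: mean of true-positive and true-negative rates.
model.eval()
tp = tn = fp = fn = torch.zeros(40)
with torch.no_grad():
    for images, attrs in DataLoader(test_set, batch_size=256):
        preds = (model(images.to(device)) > 0).cpu()
        attrs = attrs.bool()
        tp = tp + (preds & attrs).sum(0)
        tn = tn + (~preds & ~attrs).sum(0)
        fp = fp + (preds & ~attrs).sum(0)
        fn = fn + (~preds & attrs).sum(0)

balanced_acc = 0.5 * (tp / (tp + fn) + tn / (tn + fp))
print("mean balanced accuracy:", balanced_acc.mean().item())
```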
- Award ID(s): 1909707
- PAR ID: 10342431
- Date Published:
- Journal Name: IEEE International Conference on Automatic Face and Gesture Recognition
- Volume: 16
- Page Range / eLocation ID: 1 to 7
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
- In the field of healthcare, electronic health records (EHR) serve as crucial training data for developing machine learning models for diagnosis, treatment, and the management of healthcare resources. However, medical datasets are often imbalanced in terms of sensitive attributes such as race/ethnicity, gender, and age. Machine learning models trained on class-imbalanced EHR datasets perform significantly worse in deployment for individuals of the minority classes compared to those from majority classes, which may lead to inequitable healthcare outcomes for minority groups. To address this challenge, we propose Minority Class Rebalancing through Augmentation by Generative modeling (MCRAGE), a novel approach to augment imbalanced datasets using samples generated by a deep generative model. The MCRAGE process involves training a Conditional Denoising Diffusion Probabilistic Model (CDDPM) capable of generating high-quality synthetic EHR samples from underrepresented classes. We use this synthetic data to augment the existing imbalanced dataset, resulting in a more balanced distribution across all classes, which can be used to train less biased downstream models. We measure the performance of MCRAGE versus alternative approaches using accuracy, F1 score, and AUROC of these downstream models. We provide theoretical justification for our method in terms of recent convergence results for DDPMs. (An illustrative sketch of this rebalancing step appears after this list.)
- Evaluating classification accuracy is a key component of the training and validation stages of thematic map production, and the choice of metric has profound implications for both the success of the training process and the reliability of the final accuracy assessment. We explore key considerations in selecting and interpreting loss and assessment metrics in the context of data imbalance, which arises when the classes have unequal proportions within the dataset or landscape being mapped. The challenges involved in calculating single, integrated measures that summarize classification success, especially for datasets with considerable data imbalance, have led to much confusion in the literature. This confusion arises from a range of issues, including a lack of clarity over the redundancy of some accuracy measures, the importance of calculating final accuracy from population-based statistics, the effects of class imbalance on accuracy statistics, and the differing roles of accuracy measures when used for training and final evaluation. In order to characterize classification success at the class level, users typically generate averages from the class-based measures. These averages are sometimes generated at the macro level, by taking averages of the individual-class statistics, or at the micro level, by aggregating values within a confusion matrix and then calculating the statistic. We show that the micro-averaged producer’s accuracy (recall), user’s accuracy (precision), and F1-score, as well as weighted macro-averaged statistics where the class prevalences are used as weights, are all equivalent to each other and to the overall accuracy, and thus are redundant and should be avoided. Our experiment, using a variety of loss metrics for training, suggests that the choice of loss metric is not as complex as it might appear to be, despite the range of choices available, which include cross-entropy (CE), weighted CE, and micro- and macro-Dice. The highest, or close to highest, accuracies in our experiments were obtained by using CE loss for models trained with balanced data; for models trained with imbalanced data, the highest accuracies were obtained by using weighted CE loss. We recommend that, since weighted CE loss used with balanced training is equivalent to CE, weighted CE loss is a good all-round choice. Although Dice loss is commonly suggested as an alternative to CE loss when classes are imbalanced, micro-averaged Dice is similar to overall accuracy, and thus is particularly poor for training with imbalanced data. Furthermore, although macro-Dice resulted in models with high accuracy when the training used balanced data, when the training used imbalanced data, the accuracies were lower than for weighted CE. In summary, this paper provides readers with an overview of accuracy and loss metric terminology, insight regarding the redundancy of some measures, and guidance regarding best practices. (A numerical check of the micro-averaging identity follows the list.)
- Deep Generative Networks (DGNs) are extensively employed in Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), and their variants to approximate the data manifold and distribution. However, training samples are often distributed non-uniformly on the manifold, due to the cost or convenience of collection. For example, the CelebA dataset contains a large fraction of smiling faces. This non-uniformity will be reproduced when sampling from the trained DGN, which is not always preferred, e.g., for fairness or data augmentation. In response, we develop MaGNET, a novel and theoretically motivated latent space sampler for any pre-trained DGN that produces samples uniformly distributed on the learned manifold. We perform a range of experiments on several datasets and DGNs; e.g., for the state-of-the-art StyleGAN2 trained on the FFHQ dataset, uniform sampling via MaGNET increases distribution precision by 4.1% and recall by 3.0% and decreases gender bias by 41.2%, without requiring labels or retraining. Since a uniform sample distribution does not imply a uniform semantic distribution, we also explore how semantic attributes of generated samples vary under MaGNET sampling. Colab notebook and code at bit.ly/magnet-sampling. (A conceptual sketch of manifold-uniform resampling follows the list.)
- GPS spoofing attacks are a severe threat to unmanned aerial vehicles. These attacks manipulate the true state of the unmanned aerial vehicles, potentially misleading the system without raising alarms. Several techniques, including machine learning, have been proposed to detect these attacks. Most of the studies applied machine learning models without identifying the best hyperparameters, using feature selection and importance techniques, or ensuring that the used dataset is unbiased and balanced. However, no current studies have discussed the impact of model parameters and dataset characteristics on the performance of machine learning models; therefore, this paper fills this gap by evaluating the impact of hyperparameters, regularization parameters, dataset size, correlated features, and imbalanced datasets on the performance of six of the most commonly used machine learning techniques. These models are Classification and Regression Decision Tree, Artificial Neural Network, Random Forest, Logistic Regression, Gaussian Naïve Bayes, and Support Vector Machine. Thirteen features extracted from legitimate and simulated GPS attack signals are used to perform this investigation. The evaluation was performed in terms of four metrics: accuracy, probability of misdetection, probability of false alarm, and probability of detection. The results indicate that hyperparameters, regularization parameters, correlated features, dataset size, and imbalanced datasets adversely affect a machine learning model’s performance. The results also show that the Classification and Regression Decision Tree classifier has an accuracy of 99.99%, a probability of detection of 99.98%, a probability of misdetection of 0.2%, and a probability of false alarm of 1.005%, after removing correlated features and using tuned parameters on a balanced dataset. Random Forest can achieve an accuracy of 99.94%, a probability of detection of 99.6%, a probability of misdetection of 0.4%, and a probability of false alarm of 1.01% under similar conditions. (A sketch of this evaluation workflow follows the list.)
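For the MCRAGE-style augmentation described in the first related item, the rebalancing step could look roughly like the sketch below. The `generate` callable is only a stand-in for a trained conditional generative model (not the authors' CDDPM), and the toy feature layout is hypothetical.

```python
# Sketch of the rebalancing step: top up every minority class with synthetic
# samples until all classes match the majority-class count.
# `generate` is a placeholder for a trained conditional generative model
# (e.g., a conditional DDPM); here it is faked with Gaussian noise around
# per-class means purely so the example runs end to end.
import numpy as np

rng = np.random.default_rng(0)

def generate(n, class_label, X, y):
    """Placeholder generator: samples around the class mean (NOT a real CDDPM)."""
    X_c = X[y == class_label]
    return rng.normal(X_c.mean(axis=0), X_c.std(axis=0) + 1e-6, size=(n, X.shape[1]))

def rebalance(X, y):
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    X_new, y_new = [X], [y]
    for c, n_c in zip(classes, counts):
        if n_c < target:
            synth = generate(target - n_c, c, X, y)
            X_new.append(synth)
            y_new.append(np.full(target - n_c, c))
    return np.concatenate(X_new), np.concatenate(y_new)

# Toy imbalanced dataset: 900 samples of class 0, 100 of class 1.
X = np.vstack([rng.normal(0, 1, (900, 5)), rng.normal(2, 1, (100, 5))])
y = np.array([0] * 900 + [1] * 100)
X_bal, y_bal = rebalance(X, y)
print(np.unique(y_bal, return_counts=True))   # both classes now have 900 samples
```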
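The redundancy claim in the second related item, that micro-averaged recall, precision, and F1 all collapse to overall accuracy for single-label classification, can be checked numerically. The sketch below uses randomly generated labels purely to demonstrate the identity; it does not use data from that study.

```python
# Numerical check that micro-averaged precision, recall, and F1 equal overall
# accuracy for a multi-class, single-label problem (so they add no information).
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

rng = np.random.default_rng(42)
y_true = rng.choice(4, size=10_000, p=[0.7, 0.2, 0.07, 0.03])   # imbalanced classes
y_pred = np.where(rng.random(10_000) < 0.7, y_true,             # 70% predicted correctly,
                  rng.integers(0, 4, size=10_000))              # the rest at random

print("overall accuracy :", accuracy_score(y_true, y_pred))
print("micro precision  :", precision_score(y_true, y_pred, average="micro"))
print("micro recall     :", recall_score(y_true, y_pred, average="micro"))
print("micro F1         :", f1_score(y_true, y_pred, average="micro"))
# All four numbers are identical; plain macro-averaged versions would differ.
```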
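The goal in the third related item, sampling uniformly on a generator's learned manifold, can be illustrated with generic importance resampling: weight each latent draw by the generator's local volume element divided by the prior density, then resample. This is a conceptual toy (tiny MLP generator, brute-force Jacobians), not the MaGNET implementation.

```python
# Conceptual sketch of manifold-uniform sampling from a pre-trained generator:
# draw latents from the prior, weight each draw by the generator's local volume
# element sqrt(det(J^T J)) divided by the prior density, then resample.
# The tiny MLP "generator" and sample sizes are toy choices, not MaGNET itself.
import torch
from torch.autograd.functional import jacobian

torch.manual_seed(0)
g = torch.nn.Sequential(torch.nn.Linear(2, 16), torch.nn.Tanh(), torch.nn.Linear(16, 3))
prior = torch.distributions.MultivariateNormal(torch.zeros(2), torch.eye(2))

z = prior.sample((512,))                                  # latents from the usual prior
log_w = []
for z_i in z:
    J = jacobian(g, z_i)                                  # 3x2 Jacobian of the generator at z_i
    vol = torch.sqrt(torch.det(J.T @ J))                  # local volume element on the manifold
    log_w.append(torch.log(vol) - prior.log_prob(z_i))    # uniform target / pushforward proposal
log_w = torch.stack(log_w)

# Importance resampling: latents kept with probability proportional to their weight.
idx = torch.multinomial(torch.softmax(log_w, dim=0), num_samples=512, replacement=True)
uniform_samples = g(z[idx])                               # approximately uniform on the learned manifold
print(uniform_samples.shape)
```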
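Finally, the evaluation workflow in the last related item (dropping correlated features, tuning hyperparameters, and reporting detection-oriented metrics on a balanced dataset) looks roughly like the sketch below. The synthetic feature matrix, the 0.9 correlation threshold, and the parameter grid are placeholders, not the study's actual 13 GPS features or settings.

```python
# Sketch of the evaluation workflow described above: drop highly correlated
# features, tune a decision tree on a balanced dataset, then report accuracy,
# probability of detection, misdetection, and false alarm.
# The synthetic data and the 0.9 correlation threshold are placeholders.
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=5000, n_features=13, n_informative=8,
                           n_redundant=3, weights=[0.5, 0.5], random_state=0)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(13)])

# Remove features whose absolute pairwise correlation exceeds 0.9.
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
X = X.drop(columns=[c for c in upper.columns if (upper[c] > 0.9).any()])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0, stratify=y)

# Hyperparameter tuning with cross-validated grid search.
grid = GridSearchCV(DecisionTreeClassifier(random_state=0),
                    {"max_depth": [5, 10, None], "min_samples_leaf": [1, 5, 20]},
                    scoring="accuracy", cv=5)
grid.fit(X_train, y_train)

tn, fp, fn, tp = confusion_matrix(y_test, grid.predict(X_test)).ravel()
print("accuracy                :", accuracy_score(y_test, grid.predict(X_test)))
print("probability of detection:", tp / (tp + fn))   # recall on the attack class
print("prob. of misdetection   :", fn / (tp + fn))   # attacks that go unnoticed
print("prob. of false alarm    :", fp / (fp + tn))   # legitimate signals flagged
```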