This content will become publicly available on December 1, 2025

Title: An ORSAC method for data cleaning inspired by RANSAC
In classification problems, mislabeled data can have a dramatic effect on the capability of a trained model. The traditional way of dealing with mislabeled data is expert review. However, this is not always practical, due to the large volume of data in many classification datasets, such as image datasets supporting deep learning models, and the limited availability of human experts to review the data. Herein, we propose an ordered sample consensus (ORSAC) method to support data cleaning by flagging mislabeled data. The method is inspired by the random sample consensus (RANSAC) method for outlier detection. In short, it involves iteratively training and testing a model on different splits of the dataset, recording misclassifications, and flagging data that is frequently misclassified as likely mislabeled. We evaluate the method by purposefully mislabeling subsets of data and assessing its ability to find such data. We demonstrate on three datasets (a mosquito image dataset, CIFAR-10, and CIFAR-100) that the method reliably identifies mislabeled data with a high degree of accuracy, with performance assessed across different mislabeling frequencies.
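The iterated-split idea described in the abstract can be illustrated with a short sketch: repeatedly hold out part of the data, record which samples are misclassified when held out, and flag samples with a high miss rate as candidates for review. This is a minimal sketch of that general idea, not the paper's implementation; the paper works with deep models on image data, whereas the classifier, number of rounds, and flagging threshold below are illustrative assumptions.

```python
# Minimal sketch of the iterated-split idea: repeatedly train/test on random
# splits, count how often each sample is misclassified when it is held out,
# and flag frequently misclassified samples as potentially mislabeled.
# Classifier, number of rounds, and threshold are illustrative assumptions.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ShuffleSplit

def flag_suspect_labels(X, y, n_rounds=50, test_size=0.2, flag_threshold=0.5, seed=0):
    """Return indices of samples with a high held-out misclassification rate."""
    n = len(y)
    times_tested = np.zeros(n)   # how often each sample landed in a test split
    times_missed = np.zeros(n)   # how often it was misclassified when tested
    splitter = ShuffleSplit(n_splits=n_rounds, test_size=test_size, random_state=seed)
    for train_idx, test_idx in splitter.split(X):
        clf = RandomForestClassifier(n_estimators=100, random_state=seed)
        clf.fit(X[train_idx], y[train_idx])
        preds = clf.predict(X[test_idx])
        times_tested[test_idx] += 1
        times_missed[test_idx] += (preds != y[test_idx])
    miss_rate = np.divide(times_missed, times_tested,
                          out=np.zeros(n), where=times_tested > 0)
    return np.where(miss_rate >= flag_threshold)[0], miss_rate
```

Flagged indices would then be passed to a human reviewer rather than removed automatically, matching the abstract's framing of the method as support for data cleaning.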
Award ID(s): 2322335
PAR ID: 10556070
Author(s) / Creator(s): ; ;
Publisher / Repository: International Journal of Informatics and Communication Technology (IJ-ICT)
Date Published:
Journal Name: International Journal of Informatics and Communication Technology (IJ-ICT)
Volume: 13
Issue: 3
ISSN: 2252-8776
Page Range / eLocation ID: 484–498
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this
1. Seafood mislabeling occurs when a market label is inaccurate, primarily in terms of species identity, but also regarding weight, geographic origin, or other characteristics. This widespread problem allows cheaper or illegally caught species to be marketed as species desirable to consumers. Previous studies have identified red snapper (Lutjanus campechanus) as one of the most frequently mislabeled seafood species in the United States. To quantify how common mislabeling of red snapper is across North Carolina, the Seafood Forensics class at the University of North Carolina at Chapel Hill used DNA barcoding to analyze samples sold as "red snapper" purchased from restaurants, seafood markets, and grocery stores in ten counties. Of 43 samples successfully sequenced and identified, 90.7% were mislabeled. Only one grocery store chain (of four chains tested) accurately labeled red snapper. The mislabeling rate for restaurants and seafood markets was 100%. Vermilion snapper (Rhomboplites aurorubens) and tilapia (Oreochromis aureus and O. niloticus) were the species most frequently substituted for red snapper (13 of 39 mislabeled samples for each taxon, or 26 of 39 in total). This study builds on previous mislabeling research by collecting samples of a specific species in a confined geographic region, allowing local vendors and policy makers to better understand the scope of red snapper mislabeling in North Carolina. The methodology is also a model for other academic institutions to engage undergraduate researchers in mislabeling data collection, sample processing, and analysis.
2. Data augmentation is a powerful technique for improving performance in applications such as image and text classification tasks. Yet, there is little rigorous understanding of why and how various augmentations work. In this work, we consider a family of linear transformations and study their effects on the ridge estimator in an over-parametrized linear regression setting. First, we show that transformations which preserve the labels of the data can improve estimation by enlarging the span of the training data. Second, we show that transformations which mix data can improve estimation by playing a regularization role. Finally, we validate our theoretical insights on MNIST. Based on these insights, we propose an augmentation scheme that searches over the space of transformations by how uncertain the model is about the transformed data. We validate the proposed scheme on image and text datasets. For example, our method outperforms RandAugment by 1.24% on CIFAR-100 using Wide-ResNet-28-10. Furthermore, we achieve accuracy comparable to the state-of-the-art Adversarial AutoAugment on the CIFAR-10, CIFAR-100, SVHN, and ImageNet datasets.
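As a rough illustration of the uncertainty-guided search described in the abstract above, the sketch below picks, among a set of candidate label-preserving transformations, the one the current model is most uncertain about (highest predictive entropy). The transformation list, the predict_proba interface, and the entropy criterion are assumptions made for illustration, not the authors' exact procedure.

```python
# Hedged sketch: choose the candidate augmentation the model is least
# confident about, measured by predictive entropy. The model interface and
# the transformation set are illustrative assumptions.
import numpy as np

def entropy(p, eps=1e-12):
    return -np.sum(p * np.log(p + eps))

def most_uncertain_augmentation(model, x, transforms):
    """Return the transformed version of x that maximizes predictive entropy."""
    candidates = [t(x) for t in transforms]
    probs = [model.predict_proba(c[None, ...])[0] for c in candidates]
    scores = [entropy(p) for p in probs]
    return candidates[int(np.argmax(scores))]
```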
3. This article develops a conformal prediction method for classification tasks that can adapt to random label contamination in the calibration sample, often leading to more informative prediction sets with stronger coverage guarantees compared to existing approaches. This is obtained through a precise characterization of the coverage inflation (or deflation) suffered by standard conformal inferences in the presence of label contamination, which is then made actionable through a new calibration algorithm. Our solution can leverage different modelling assumptions about the contamination process, while requiring no knowledge of the underlying data distribution or of the inner workings of the classification model. The empirical performance of the proposed method is demonstrated through simulations and an application to object classification with the CIFAR-10H image data set.
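For context, the sketch below shows standard split-conformal calibration for classification, the baseline whose coverage the article characterizes under label contamination; the contamination-adaptive correction proposed in the article is not reproduced here. The 1 - p(true label) nonconformity score and the predict_proba interface are common choices assumed for illustration.

```python
# Standard split-conformal calibration sketch (not the contamination-adaptive
# method): calibrate a score threshold on held-out labeled data, then return
# the set of labels whose nonconformity score falls below it.
import numpy as np

def conformal_threshold(model, X_cal, y_cal, alpha=0.1):
    probs = model.predict_proba(X_cal)
    scores = 1.0 - probs[np.arange(len(y_cal)), y_cal]   # nonconformity scores
    n = len(scores)
    level = np.ceil((n + 1) * (1 - alpha)) / n            # finite-sample correction
    return np.quantile(scores, min(level, 1.0))

def prediction_set(model, x, threshold):
    probs = model.predict_proba(x[None, ...])[0]
    return np.where(1.0 - probs <= threshold)[0]          # labels kept in the set
```

Label contamination in the calibration sample biases a threshold computed this way, which is the effect the article quantifies and corrects.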
4. Modern machine learning algorithms typically require large amounts of labeled training data to fit a reliable model. To minimize the cost of data collection, researchers often employ techniques such as crowdsourcing and web scraping. However, web data and human annotations are known to exhibit high margins of error, resulting in sizable amounts of incorrect labels. Poorly labeled training data can cause models to overfit to the noise distribution, crippling performance in real-world applications. In this work, we investigate the viability of using data augmentation in conjunction with semi-supervised learning to improve the label noise robustness of image classification models. We conduct several experiments using noisy variants of the CIFAR-10 image classification dataset to benchmark our method against existing algorithms. Experimental results show that our augmentative SSL approach improves upon the state-of-the-art.
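The combination of augmentation and semi-supervised learning described above can be illustrated with a generic consistency-regularization training step, in the spirit of pseudo-labeling methods such as FixMatch; it is a sketch under assumed components (weak_aug, strong_aug, a confidence threshold), not the authors' exact algorithm.

```python
# Generic consistency-regularization step: supervised loss on labeled data plus
# an unsupervised loss that pushes strongly augmented views to match confident
# pseudo-labels obtained from weakly augmented views. Components are illustrative.
import torch
import torch.nn.functional as F

def ssl_step(model, x_labeled, y_labeled, x_unlabeled, weak_aug, strong_aug,
             threshold=0.95, lambda_u=1.0):
    # Supervised loss on the (possibly noisy) labeled batch.
    loss_sup = F.cross_entropy(model(x_labeled), y_labeled)

    # Pseudo-labels from weakly augmented unlabeled data, without gradients.
    with torch.no_grad():
        probs = F.softmax(model(weak_aug(x_unlabeled)), dim=1)
        conf, pseudo = probs.max(dim=1)
        mask = (conf >= threshold).float()   # keep only confident pseudo-labels

    # Consistency loss: strongly augmented views should match the pseudo-labels.
    per_sample = F.cross_entropy(model(strong_aug(x_unlabeled)), pseudo,
                                 reduction="none")
    loss_unsup = (per_sample * mask).mean()
    return loss_sup + lambda_u * loss_unsup
```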