skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Noise in Datasets: What Are the Impacts on Classification Performance? [Noise in Datasets: What Are the Impacts on Classification Performance?]
Classification is one of the fundamental tasks in machine learning. The quality of data is important in con- structing any machine learning model with good prediction performance. Real-world data often suffer from noise which is usually referred to as errors, irregularities, and corruptions in a dataset. However, we have no control over the quality of data used in classification tasks. The presence of noise in a dataset poses three major negative consequences, viz. (i) a decrease in the classification accuracy (ii) an increase in the complexity of the induced classifier (iii) an increase in the training time. Therefore, it is important to systematically explore the effects of noise in classification performance. Even though there have been published studies on the effect of noise either for some particular learner or for some particular noise type, there is a lack of study where the impact of different noise on different learners has been investigated. In this work, we focus on both scenar- ios: various learners and various noise types and provide a detailed analysis of their effects on the prediction performance. We use five different classifiers (J48, Naive Bayes, Support Vector Machine, k-Nearest Neigh- bor, Random Forest) and 10 benchmark datasets from the UCI machine learning repository and three publicly available image datasets. Our results can be used to guide the development of noise handling mechanisms.  more » « less
Award ID(s):
1946231
PAR ID:
10319530
Author(s) / Creator(s):
;
Date Published:
Journal Name:
Proceedings of the 11th International Conference on Pattern Recognition Applications and Methods
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. null (Ed.)
    Text classification is a fundamental problem, and recently, deep neural networks (DNN) have shown promising results in many natural language tasks. However, their human-level performance relies on high-quality annotations, which are time-consuming and expensive to collect. As we move towards large inexpensive datasets, the inherent label noise degrades the generalization of DNN. While most machine learning literature focuses on building complex networks to handle noise, in this work, we evaluate model-agnostic methods to handle inherent noise in large scale text classification that can be easily incorporated into existing machine learning workflows with minimal interruption. Specifically, we conduct a point-by-point comparative study between several noise-robust methods on three datasets encompassing three popular classification models. To our knowledge, this is the first time such a comprehensive study in text classification encircling popular models and model-agnostic loss methods has been conducted. In this study, we describe our learning and demonstrate the application of our approach, which outperformed baselines by up to 10% in classification accuracy while requiring no network modifications. 
    more » « less
  2. null (Ed.)
    Predicting coarse-grain variations in workload behavior during execution is essential for dynamic resource optimization of processor systems. Researchers have proposed various methods to first classify workloads into phases and then learn their long-term phase behavior to predict and anticipate phase changes. Early studies on phase prediction proposed table-based phase predictors. More recently, simple learning-based techniques such as decision trees have been explored. However, more recent advances in machine learning have not been applied to phase prediction so far. Furthermore, existing phase predictors have been studied only in connection with specific phase classifiers even though there is a wide range of classification methods. Early work in phase classification proposed various clustering methods that required access to source code. Some later studies used performance monitoring counters, but they only evaluated classifiers for specific contexts such as thermal modeling. In this work, we perform a comprehensive study of source-oblivious phase classification and prediction methods using hardware counters. We adapt classification techniques that were used with different inputs in the past and compare them to state-of-the-art hardware counter based classifiers. We further evaluate the accuracy of various phase predictors when coupled with different phase classifiers and evaluate a range of advanced machine learning techniques, including SVMs and LSTMs for workload phase prediction. We apply classification and prediction approaches to SPEC workloads running on an Intel Core-i9 platform. Results show that a two-level kmeans clustering combined with SVM-based phase change prediction provides the best tradeoff between accuracy and long-term stability. Additionally, the SVM predictor reduces the average prediction error by 80% when compared to a table-based predictor. 
    more » « less
  3. Random Forests (RFs) are a commonly used machine learning method for classification and regression tasks spanning a variety of application domains, including bioinformatics, business analytics, and software optimization. While prior work has focused primarily on improving performance of the training of RFs, many applications, such as malware identification, cancer prediction, and banking fraud detection, require fast RF classification. In this work, we accelerate RF classification on GPU and FPGA. In order to provide efficient support for large datasets, we propose a hierarchical memory layout suitable to the GPU/FPGA memory hierarchy. We design three RF classification code variants based on that layout, and we investigate GPU- and FPGA-specific considerations for these kernels. Our experimental evaluation, performed on an Nvidia Xp GPU and on a Xilinx Alveo U250 FPGA accelerator card using publicly available datasets on the scale of millions of samples and tens of features, covers various aspects. First, we evaluate the performance benefits of our hierarchical data structure over the standard compressed sparse row (CSR) format. Second, we compare our GPU implementation with cuML, a machine learning library targeting Nvidia GPUs. Third, we explore the performance/accuracy tradeoff resulting from the use of different tree depths in the RF. Finally, we perform a comparative performance analysis of our GPU and FPGA implementations. Our evaluation shows that, while reporting the best performance on GPU, our code variants outperform the CSR baseline both on GPU and FPGA. For high accuracy targets, our GPU implementation yields a 5-9 × speedup over CSR, and up to a 2 × speedup over Nvidia’s cuML library. 
    more » « less
  4. This paper proposes to enable deep learning for generic machine learning tasks. Our goal is to allow deep learning to be applied to data which are already represented in instance feature tabular format for a better classification accuracy. Because deep learning relies on spatial/temporal correlation to learn new feature representation, our theme is to convert each instance of the original dataset into a synthetic matrix format to take the full advantage of the feature learning power of deep learning methods. To maximize the correlation of the matrix , we use 0/1 optimization to reorder features such that the ones with strong correlations are adjacent to each other. By using a two dimensional feature reordering, we are able to create a synthetic matrix, as an image, to represent each instance. Because the synthetic image preserves the original feature values and data correlation, existing deep learning algorithms, such as convolutional neural networks (CNN), can be applied to learn effective features for classification. Our experiments on 20 generic datasets, using CNN as the deep learning classifier, confirm that enabling deep learning to generic datasets has clear performance gain, compared to generic machine learning methods. In addition, the proposed method consistently outperforms simple baselines of using CNN for generic dataset. As a result, our research allows deep learning to be broadly applied to generic datasets for learning and classification 
    more » « less
  5. The overall purpose of this paper is to demonstrate how data preprocessing, training size variation, and subsampling can dynamically change the performance metrics of imbalanced text classification. The methodology encompasses using two different supervised learning classification approaches of feature engineering and data preprocessing with the use of five machine learning classifiers, five imbalanced sampling techniques, specified intervals of training and subsampling sizes, statistical analysis using R and tidyverse on a dataset of 1000 portable document format files divided into five labels from the World Health Organization Coronavirus Research Downloadable Articles of COVID-19 papers and PubMed Central databases of non-COVID-19 papers for binary classification that affects the performance metrics of precision, recall, receiver operating characteristic area under the curve, and accuracy. One approach that involves labeling rows of sentences based on regular expressions significantly improved the performance of imbalanced sampling techniques verified by performing statistical analysis using a t-test documenting performance metrics of iterations versus another approach that automatically labels the sentences based on how the documents are organized into positive and negative classes. The study demonstrates the effectiveness of ML classifiers and sampling techniques in text classification datasets, with different performance levels and class imbalance issues observed in manual and automatic methods of data processing. 
    more » « less