skip to main content


Title: mCRF and mRD: Two Classification Methods Based on a Novel Multiclass Label Noise Filtering Learning Framework
Mitigating label noise is a crucial problem in classification. Noise filtering is an effective method of dealing with label noise which does not need to estimate the noise rate or rely on any loss function. However, most filtering methods focus mainly on binary classification, leaving the more difficult counterpart problem of multiclass classification relatively unexplored. To remedy this deficit, we present a definition for label noise in a multiclass setting and propose a general framework for a novel label noise filtering learning method for multiclass classification. Two examples of noise filtering methods for multiclass classification, multiclass complete random forest (mCRF) and multiclass relative density, are derived from their binary counterparts using our proposed framework. In addition, to optimize the NI_threshold hyperparameter in mCRF, we propose two new optimization methods: a new voting cross-validation method and an adaptive method that employs a 2-means clustering algorithm. Furthermore, we incorporate SMOTE into our label noise filtering learning framework to handle the ubiquitous problem of imbalanced data in multiclass classification. We report experiments on both synthetic data sets and UCI benchmarks to demonstrate our proposed methods are highly robust to label noise in comparison with state-of-the-art baselines. All code and data results are available at https://github.com/syxiaa/Multiclass-Label-Noise-Filtering-Learning.  more » « less
Award ID(s):
1631776
NSF-PAR ID:
10286755
Author(s) / Creator(s):
; ; ; ; ; ;
Date Published:
Journal Name:
IEEE Transactions on Neural Networks and Learning Systems
ISSN:
2162-237X
Page Range / eLocation ID:
1 to 15
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. AIDS is a syndrome caused by the HIV. During the progression of AIDS, a patient's immune system is weakened, which increases the patient's susceptibility to infections and diseases. Although antiretroviral drugs can effectively suppress HIV, the virus mutates very quickly and can become resistant to treatment. In addition, the virus can also become resistant to other treatments not currently being used through mutations, which is known in the clinical research community as cross-resistance. Since a single HIV strain can be resistant to multiple drugs, this problem is naturally represented as a multilabel classification problem. Given this multilabel relationship, traditional single-label classification methods often fail to effectively identify the drug resistances that may develop after a particular virus mutation. In this work, we propose a novel multilabel Robust Sample Specific Distance (RSSD) method to identify multiclass HIV drug resistance. Our method is novel in that it can illustrate the relative strength of the drug resistance of a reverse transcriptase (RT) sequence against a given drug nucleoside analog and learn the distance metrics for all the drug resistances. To learn the proposed RSSDs, we formulate a learning objective that maximizes the ratio of the summations of a number of ℓ1-norm distances, which is difficult to solve in general. To solve this optimization problem, we derive an efficient, nongreedy iterative algorithm with rigorously proved convergence. Our new method has been verified on a public HIV type 1 drug resistance data set with over 600 RT sequences and five nucleoside analogs. We compared our method against several state-of-the-art multilabel classification methods, and the experimental results have demonstrated the effectiveness of our proposed method. 
    more » « less
  2. With the advances in machine learning for the diagnosis of Alzheimer’s disease (AD), most studies have focused on either identifying the subject’s status through classification algorithms or on predicting their cognitive scores through regression methods, neglecting the potential association between these two tasks. Motivated by the need to enhance the prospects for early diagnosis along with the ability to predict future disease states, this study proposes a deep neural network based on modality fusion, kernelization, and tensorization that perform multiclass classification and longitudinal regression simultaneously within a unified multitask framework. This relationship between multiclass classification and longitudinal regression is found to boost the efficacy of the final model in dealing with both tasks. Different multimodality scenarios are investigated, and complementary aspects of the multimodal features are exploited to simultaneously delineate the subject’s label and predict related cognitive scores at future timepoints using baseline data. The main intent in this multitask framework is to consolidate the highest accuracy possible in terms of precision, sensitivity, F1 score, and area under the curve (AUC) in the multiclass classification task while maintaining the highest similarity in the MMSE score as measured through the correlation coefficient and the RMSE for all time points under the prediction task, with both tasks, run simultaneously under the same set of hyperparameters. The overall accuracy for multiclass classification of the proposed KTMnet method is 66.85 ± 3.77. The prediction results show an average RMSE of 2.32 ± 0.52 and a correlation of 0.71 ± 5.98 for predicting MMSE throughout the time points. These results are compared to state-of-the-art techniques reported in the literature. A discovery from the multitasking of this consolidated machine learning framework is that a set of hyperparameters that optimize the prediction results may not necessarily be the same as those that would optimize the multiclass classification. In other words, there is a breakpoint beyond which enhancing further the results of one process could lead to the downgrading in accuracy for the other. 
    more » « less
  3. Aidong Zhang ; Huzefa Rangwala (Ed.)
    In many scenarios, 1) data streams are generated in real time; 2) labeled data are expensive and only limited labels are available in the beginning; 3) real-world data is not always i.i.d. and data drift over time gradually; 4) the storage of historical streams is limited. This learning setting limits the applicability and availability of many Machine Learning (ML) algorithms. We generalize the learning task under such setting as a semi-supervised drifted stream learning with short lookback problem (SDSL). SDSL imposes two under-addressed challenges on existing methods in semi-supervised learning and continuous learning: 1) robust pseudo-labeling under gradual shifts and 2) anti-forgetting adaptation with short lookback. To tackle these challenges, we propose a principled and generic generation-replay framework to solve SDSL. To achieve robust pseudo-labeling, we develop a novel pseudo-label classification model to leverage supervised knowledge of previously labeled data, unsupervised knowledge of new data, and, structure knowledge of invariant label semantics. To achieve adaptive anti-forgetting model replay, we propose to view the anti-forgetting adaptation task as a flat region search problem. We propose a novel minimax game-based replay objective function to solve the flat region search problem and develop an effective optimization solver. Experimental results demonstrate the effectiveness of the proposed method. 
    more » « less
  4. Active learning is commonly used to train label-efficient models by adaptively selecting the most informative queries. However, most active learning strategies are designed to either learn a representation of the data (e.g., embedding or metric learning) or perform well on a task (e.g., classification) on the data. However, many machine learning tasks involve a combination of both representation learning and a task-specific goal. Motivated by this, we propose a novel unified query framework that can be applied to any problem in which a key component is learning a representation of the data that reflects similarity. Our approach builds on similarity or nearest neighbor (NN) queries which seek to select samples that result in improved embeddings. The queries consist of a reference and a set of objects, with an oracle selecting the object most similar (i.e., nearest) to the reference. In order to reduce the number of solicited queries, they are chosen adaptively according to an information theoretic criterion. We demonstrate the effectiveness of the proposed strategy on two tasks - active metric learning and active classification - using a variety of synthetic and real world datasets. In particular, we demonstrate that actively selected NN queries outperform recently developed active triplet selection methods in a deep metric learning setting. Further, we show that in classification, actively selecting class labels can be reformulated as a process of selecting the most informative NN query, allowing direct application of our method. 
    more » « less
  5. null (Ed.)
    3D object detection is an important yet demanding task that heavily relies on difficult to obtain 3D annotations. To reduce the required amount of supervision, we propose 3DIoUMatch, a novel semi-supervised method for 3D object detection applicable to both indoor and outdoor scenes. We leverage a teacher-student mutual learning framework to propagate information from the labeled to the unlabeled train set in the form of pseudo-labels. However, due to the high task complexity, we observe that the pseudo-labels suffer from significant noise and are thus not directly usable. To that end, we introduce a confidence-based filtering mechanism, inspired by FixMatch. We set confidence thresholds based upon the predicted objectness and class probability to filter low-quality pseudo-labels. While effective, we observe that these two measures do not sufficiently capture localization quality. We therefore propose to use the estimated 3D IoU as a localization metric and set category-aware self-adjusted thresholds to filter poorly localized proposals. We adopt VoteNet as our backbone detector on indoor datasets while we use PV-RCNN on the autonomous driving dataset, KITTI. Our method consistently improves state-of-the-art methods on both ScanNet and SUN-RGBD benchmarks by significant margins under all label ratios (including fully labeled setting). For example, when training using only 10% labeled data on ScanNet, 3DIoUMatch achieves 7.7 absolute improvement on mAP@0.25 and 8.5 absolute improvement on mAP@0.5 upon the prior art. On KITTI, we are the first to demonstrate semi-supervised 3D object detection and our method surpasses a fully supervised baseline from 1.8% to 7.6% under different label ratio and categories. 
    more » « less