Active learning methods, like uncertainty sampling, combined with probabilistic prediction techniques have achieved success in various problems like image classification and text classification. For more complex multivariate prediction tasks, the relationships between labels play an important role in designing structured classifiers with better performance. However, computational time complexity limits prevalent probabilistic methods from effectively supporting active learning. Specifically, while non-probabilistic methods based on structured support vector ma-chines can be tractably applied to predicting cuts and bipartite matchings, conditional random fields are intractable for these structures. We propose an adversarial approach for active learning with structured prediction domains that is tractable for cuts and matching. We evaluate this approach algorithmically in two important structured prediction problems: multi-label classification and object tracking in videos. We demonstrate better accuracy and computational efficiency for our proposed method.
more »
« less
Statistical Analysis of Imbalanced Classification with Training Size Variation and Subsampling on Datasets of Research Papers in Biomedical Literature
The overall purpose of this paper is to demonstrate how data preprocessing, training size variation, and subsampling can dynamically change the performance metrics of imbalanced text classification. The methodology encompasses using two different supervised learning classification approaches of feature engineering and data preprocessing with the use of five machine learning classifiers, five imbalanced sampling techniques, specified intervals of training and subsampling sizes, statistical analysis using R and tidyverse on a dataset of 1000 portable document format files divided into five labels from the World Health Organization Coronavirus Research Downloadable Articles of COVID-19 papers and PubMed Central databases of non-COVID-19 papers for binary classification that affects the performance metrics of precision, recall, receiver operating characteristic area under the curve, and accuracy. One approach that involves labeling rows of sentences based on regular expressions significantly improved the performance of imbalanced sampling techniques verified by performing statistical analysis using a t-test documenting performance metrics of iterations versus another approach that automatically labels the sentences based on how the documents are organized into positive and negative classes. The study demonstrates the effectiveness of ML classifiers and sampling techniques in text classification datasets, with different performance levels and class imbalance issues observed in manual and automatic methods of data processing.
more »
« less
- Award ID(s):
- 2131207
- PAR ID:
- 10478838
- Publisher / Repository:
- MDPI, https://doi.org/10.3390/make5040095
- Date Published:
- Journal Name:
- Machine Learning and Knowledge Extraction
- Volume:
- 5
- Issue:
- 4
- ISSN:
- 2504-4990
- Page Range / eLocation ID:
- 1953-1978
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
In the era of big data, data-driven based classification has become an essential method in smart manufacturing to guide production and optimize inspection. The industrial data obtained in practice is usually time-series data collected by soft sensors, which are highly nonlinear, nonstationary, imbalanced, and noisy. Most existing soft-sensing machine learning models focus on capturing either intra-series temporal dependencies or pre-defined inter-series correlations, while ignoring the correlation between labels as each instance is associated with multiple labels simultaneously. In this paper, we propose a novel graph based soft-sensing neural network (GraSSNet) for multivariate time-series classification of noisy and highly-imbalanced soft-sensing data. The proposed GraSSNet is able to 1) capture the inter-series and intra-series dependencies jointly in the spectral domain; 2) exploit the label correlations by superimposing label graph that built from statistical co-occurrence information; 3) learn features with attention mechanism from both textual and numerical domain; and 4) leverage unlabeled data and mitigate data imbalance by semi-supervised learning. Comparative studies with other commonly used classifiers are carried out on Seagate soft sensing data, and the experimental results validate the competitive performance of our proposed method.more » « less
-
We give relative error coresets for training linear classifiers with a broad class of loss functions, including the logistic loss and hinge loss. Our construction achieves $$(1\pm \epsilon)$$ relative error with $$\tilde O(d \cdot \mu_y(X)^2/\epsilon^2)$$ points, where $$\mu_y(X)$$ is a natural complexity measure of the data matrix $$X \in \mathbb{R}^{n \times d}$$ and label vector $$y \in \{-1,1\}^n$$, introduced in Munteanu et al. 2018. Our result is based on subsampling data points with probabilities proportional to their \textit{$$\ell_1$$ Lewis weights}. It significantly improves on existing theoretical bounds and performs well in practice, outperforming uniform subsampling along with other importance sampling methods. Our sampling distribution does not depend on the labels, so can be used for active learning. It also does not depend on the specific loss function, so a single coreset can be used in multiple training scenarios.more » « less
-
Furht, Borko; Khoshgoftaar, Taghi (Ed.)Acquiring labeled datasets often incurs substantial costs primarily due to the requirement of expert human intervention to produce accurate and reliable class labels. In the modern data landscape, an overwhelming proportion of newly generated data is unlabeled. This paradigm is especially evident in domains such as fraud detection and datasets for credit card fraud detection. These types of data have their own difficulties associated with being highly class imbalanced, which poses its own challenges to machine learning and classification. Our research addresses these challenges by extensively evaluating a novel methodology for synthesizing class labels for highly imbalanced credit card fraud data. The methodology uses an autoencoder as its underlying learner to effectively learn from dataset features to produce an error metric for use in creating new binary class labels. The methodology aims to automatically produce new labels with minimal expert input. These class labels are then used to train supervised classifiers for fraud detection. Our empirical results show that the synthesized labels are of high enough quality to produce classifiers that significantly outperform a baseline learner comparison when using area under the precision-recall curve (AUPRC). We also present results of varying levels of positive-labeled instances and their effect on classifier performance. Results show that AUPRC performance improves as more instances are labeled positive and belong to the minority class. Our methodology thereby effectively addresses the concerns of high class imbalance in machine learning by creating new and effective class labels.more » « less
-
Recent advances in machine learning have led to increased deployment of black-box classifiers across a wide variety of applications. In many such situations there is a critical need to both reliably assess the performance of these pre-trained models and to perform this assessment in a label-efficient manner (given that labels may be scarce and costly to collect). In this paper, we introduce an active Bayesian approach for assessment of classifier performance to satisfy the desiderata of both reliability and label-efficiency. We begin by developing inference strategies to quantify uncertainty for common assessment metrics such as accuracy, misclassification cost, and calibration error. We then propose a general framework for active Bayesian assessment using inferred uncertainty to guide efficient selection of instances for labeling, enabling better performance assessment with fewer labels. We demonstrate significant gains from our proposed active Bayesian approach via a series of systematic empirical experiments assessing the performance of modern neural classifiers (e.g., ResNet and BERT) on several standard image and text classification datasets.more » « less