Unsupervised feature selection aims to select a subset from the original features that are most useful for the downstream tasks without external guidance information. While most unsupervised feature selection methods focus on ranking features based on the intrinsic properties of data, most of them do not pay much attention to the relationships between features, which often leads to redundancy among the selected features. In this paper, we propose a two-stage Second-Order unsupervised Feature selection via knowledge contrastive disTillation (SOFT) model that incorporates the second-order covariance matrix with the first-order data matrix for unsupervised feature selection. In the first stage, we learn a sparse attention matrix that can represent second-order relations between features by contrastively distilling the intrinsic structure. In the second stage, we build a relational graph based on the learned attention matrix and perform graph segmentation. To this end, we conduct feature selection by only selecting one feature from each cluster to decrease the feature redundancy. Experimental results on 12 public datasets show that SOFT outperforms classical and recent state-of-the-art methods, which demonstrates the effectiveness of our proposed method. Moreover, we also provide rich in-depth experiments to further explore several key factors of SOFT.
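The second stage described above — build a relational graph over features, segment it, and keep one feature per cluster — can be sketched as follows. This is an illustrative stand-in, not the SOFT model itself: SOFT learns the graph from a distilled attention matrix, whereas this sketch uses a plain Pearson correlation matrix with an assumed threshold `tau`, and picks the highest-variance feature as each cluster's representative.

```python
import numpy as np

def select_by_graph_segmentation(X, tau=0.7):
    """Build a relational graph over features (edge when |corr| > tau),
    segment it into connected components, and keep one feature per
    component. Sketch only: the correlation graph and the variance-based
    representative choice are assumptions, not the SOFT procedure."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    n = corr.shape[0]
    adj = (corr > tau) & ~np.eye(n, dtype=bool)   # relational graph
    seen, selected = set(), []
    for start in range(n):
        if start in seen:
            continue
        # BFS to collect one connected component (a feature cluster)
        comp, queue = [], [start]
        while queue:
            u = queue.pop()
            if u in seen:
                continue
            seen.add(u)
            comp.append(u)
            queue.extend(np.flatnonzero(adj[u]))
        comp = np.array(comp)
        # keep the highest-variance feature of the cluster
        selected.append(int(comp[np.argmax(X[:, comp].var(axis=0))]))
    return sorted(selected)
```

On data with two nearly duplicate features, the duplicates land in one cluster and only one survives, which is exactly the redundancy reduction the abstract describes.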
This content will become publicly available on November 20, 2026
Designing unsupervised mixed‐type feature selection techniques using the heterogeneous correlation matrix
Abstract Real-life data often include both numerical and categorical features. When categorical features are ordinal, the Pearson correlation matrix (CM) can be extended to a heterogeneous CM (HCM), which combines Pearson's correlations (numerical-numerical), polyserial correlations (numerical-ordinal) and polychoric correlations (ordinal-ordinal). HCM entries are comparable, enabling assessment of pairwise linear dependencies. An added benefit is the computation of p-values for pairwise uncorrelation tests, forming a heterogeneous p-values matrix (HPM). While the HCM has been used for unsupervised feature extraction (UFE), that is, transforming features into informative representations (e.g., PCA), its application to unsupervised feature selection (UFS), that is, selecting relevant features, remains unexplored. This paper proposes two HCM-based UFS methods for mixed-type features. These methods, called UFS-rHCM and UFS-cHCM, iteratively remove redundant features using the HCM, either row-wise (UFS-rHCM) or cell-wise (UFS-cHCM). The HPM determines the stopping point, enabling a statistically grounded approach to selecting the number of features. We also introduce a visualization tool for assessing feature importance and ranking. The performance of our methods is evaluated on simulated and real datasets.
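The row-wise variant can be sketched as a loop that repeatedly drops the feature whose row of the correlation matrix shows the most redundancy. This is only a sketch of the idea: it substitutes a plain Pearson CM for the heterogeneous CM (so it ignores the polyserial/polychoric cases), and a fixed target count `k` for the HPM-based stopping rule described in the abstract.

```python
import numpy as np

def ufs_rcm_sketch(X, k):
    """Row-wise redundancy removal in the spirit of UFS-rHCM:
    repeatedly drop the feature whose row of the correlation matrix
    has the largest mean absolute off-diagonal entry, until k remain.
    Assumptions: plain Pearson CM instead of the HCM, and a fixed
    target k instead of the HPM-based stopping point."""
    keep = list(range(X.shape[1]))
    while len(keep) > k:
        corr = np.abs(np.corrcoef(X[:, keep], rowvar=False))
        np.fill_diagonal(corr, 0.0)
        # mean absolute correlation of each remaining feature
        redundancy = corr.mean(axis=1)
        keep.pop(int(np.argmax(redundancy)))
    return keep
```

With two nearly duplicate numerical features, one of the pair has the most redundant row and is removed first, while an independent feature survives.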
- Award ID(s): 2209974
- PAR ID: 10648598
- Publisher / Repository: Wiley-Blackwell
- Date Published:
- Journal Name: International Statistical Review
- ISSN: 0306-7734
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
Raw datasets collected for fake news detection usually contain noise such as missing values. To improve the performance of machine-learning-based fake news detection, this paper proposes a novel data preprocessing method to handle the missing values. Specifically, missing values are imputed for both categorical and numerical features: categorical features are imputed with the most frequent value in the column, while numerical features are imputed with the column mean. In addition, TF-IDF vectorization is applied in feature extraction to filter out irrelevant features. Experimental results show that a Multi-Layer Perceptron (MLP) classifier with the proposed data preprocessing method outperforms baselines and improves prediction accuracy by more than 15%.
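The imputation rules in the abstract (mode for categorical, mean for numerical) plus TF-IDF vectorization can be sketched in a few lines of pandas/scikit-learn. The column names and toy values are illustrative, not from the paper's dataset.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy fake-news-style frame with missing values; columns are illustrative.
df = pd.DataFrame({
    "source": ["blog", None, "news", "blog"],   # categorical
    "shares": [120.0, 80.0, None, 40.0],        # numerical
    "text":   ["breaking news today", "totally fake claim",
               "verified report", "fake breaking claim"],
})

# Categorical: impute with the most frequent value in the column.
df["source"] = df["source"].fillna(df["source"].mode()[0])
# Numerical: impute with the column mean.
df["shares"] = df["shares"].fillna(df["shares"].mean())

# TF-IDF vectorization of the text field.
tfidf = TfidfVectorizer()
X_text = tfidf.fit_transform(df["text"])
```

After this step the frame has no missing values and `X_text` is a sparse TF-IDF matrix ready for a downstream classifier such as an MLP.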
This paper proposes to enable deep learning for generic machine learning tasks. Our goal is to allow deep learning to be applied to data already represented in instance-feature tabular format for better classification accuracy. Because deep learning relies on spatial/temporal correlation to learn new feature representations, our theme is to convert each instance of the original dataset into a synthetic matrix format to take full advantage of the feature-learning power of deep learning methods. To maximize the correlation within the matrix, we use 0/1 optimization to reorder features so that strongly correlated ones are adjacent to each other. By using a two-dimensional feature reordering, we are able to create a synthetic matrix, as an image, to represent each instance. Because the synthetic image preserves the original feature values and data correlation, existing deep learning algorithms, such as convolutional neural networks (CNNs), can be applied to learn effective features for classification. Our experiments on 20 generic datasets, using a CNN as the deep learning classifier, confirm that enabling deep learning on generic datasets yields a clear performance gain compared to generic machine learning methods. In addition, the proposed method consistently outperforms simple baselines that apply a CNN directly to generic datasets. As a result, our research allows deep learning to be broadly applied to generic datasets for learning and classification.
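The reorder-then-reshape idea can be sketched as follows. The paper solves the ordering with 0/1 optimization; this sketch substitutes a greedy nearest-neighbour chain over the absolute correlation matrix, which is an assumption made here for brevity, not the paper's method.

```python
import numpy as np

def greedy_feature_order(X):
    """Order features so strongly correlated ones are adjacent.
    Greedy nearest-neighbour stand-in for the 0/1 optimization
    used in the paper."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    np.fill_diagonal(corr, -1.0)
    order = [0]
    remaining = set(range(1, X.shape[1]))
    while remaining:
        last = order[-1]
        nxt = max(remaining, key=lambda j: corr[last, j])
        order.append(nxt)
        remaining.remove(nxt)
    return order

def to_image(x, order, shape):
    """Reorder one instance and reshape it into a 2-D 'image'."""
    return np.asarray(x)[order].reshape(shape)
```

Each reordered instance can then be reshaped into a small matrix and fed to a CNN as a single-channel image.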
All features of any data type are universally equipped with categorical nature revealed through histograms. A contingency table framed by two histograms affords directional and mutual associations based on rescaled conditional Shannon entropies for any feature-pair. The heatmap of the mutual association matrix of all features becomes a roadmap showing which features are highly associative with which features. We develop our data analysis paradigm called categorical exploratory data analysis (CEDA) with this heatmap as a foundation. CEDA is demonstrated to provide new resolutions for two topics: multiclass classification (MCC) with one single categorical response variable and response manifold analytics (RMA) with multiple response variables. We compute visible and explainable information contents with multiscale and heterogeneous deterministic and stochastic structures in both topics. MCC involves all feature-group specific mixing geometries of labeled high-dimensional point-clouds. Upon each identified feature-group, we devise an indirect distance measure, a robust label embedding tree (LET), and a series of tree-based binary competitions to discover and present asymmetric mixing geometries. Then, a chain of complementary feature-groups offers a collection of mixing geometric pattern-categories with multiple perspective views. RMA studies a system’s regulating principles via multiple dimensional manifolds jointly constituted by targeted multiple response features and selected major covariate features. This manifold is marked with categorical localities reflecting major effects. Diverse minor effects are checked and identified across all localities for heterogeneity. Both MCC and RMA information contents are computed for data’s information content with predictive inferences as by-products. We illustrate CEDA developments via Iris data and demonstrate its applications on data taken from the PITCHf/x database.
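A directional association from a contingency table, based on rescaled conditional Shannon entropy, can be sketched as below. The rescaling shown is the uncertainty-coefficient form 1 - H(Y|X)/H(Y) (Theil's U); the exact rescaling used in CEDA may differ, so treat this as an illustrative version rather than the paper's definition.

```python
import numpy as np

def directional_association(table):
    """Directional association of row-feature X on column-feature Y
    from a contingency table: 1 - H(Y|X)/H(Y), i.e. the fraction of
    Y's entropy explained by X. Illustrative rescaling; CEDA's exact
    formula may differ."""
    P = table / table.sum()
    py = P.sum(axis=0)            # marginal of column-feature Y
    px = P.sum(axis=1)            # marginal of row-feature X
    H_y = -np.sum(py[py > 0] * np.log(py[py > 0]))
    H_y_given_x = 0.0
    for i in range(P.shape[0]):
        if px[i] > 0:
            cond = P[i] / px[i]   # conditional distribution of Y given X=i
            cond = cond[cond > 0]
            H_y_given_x += px[i] * -np.sum(cond * np.log(cond))
    return 1.0 - H_y_given_x / H_y
```

The measure is 1 when X determines Y exactly and 0 when the two features are independent; computing it in both directions for every feature-pair fills the kind of association matrix the heatmap visualizes.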
Manufacturing process signatures reflect the process stability and anomalies that potentially lead to detrimental effects on the manufactured outcomes. Sensing technologies, especially in-situ image sensors, are widely used to capture process signatures for diagnostics and prognostics. This imaging data is crucial evidence for process signature characterization and monitoring. A critical aspect of process signature analysis is identifying the unique patterns in an image that differ from the generic behavior of the manufacturing process in order to detect anomalies. It is equivalent to separating the “unique features” and process-wise (or phase-wise) “shared features” from the same image and recognizing the transient anomaly, i.e., recognizing the outlier “unique features”. In state-of-the-art literature, image-based process signature analysis relies on conventional feature extraction procedures, which limit the “view” of information to each image and cannot decouple the shared and unique features. Consequently, the features extracted are less interpretable, and the anomaly detection method cannot distinguish the abnormality in the current process signature from the process-wise evolution. Targeting this limitation, this study proposes personalized feature extraction (PFE) to decouple process-wise shared features and transient unique features from a sensor image and further develops process signature characterization and anomaly detection strategies. The PFE algorithm is designed for heterogeneous data with shared features. Supervised and unsupervised anomaly detection strategies are developed upon PFE features to remove the shared features from a process signature and examine the unique features for abnormality. 
The proposed method is demonstrated on two datasets: (i) selected data from the 2018 AM Benchmark Test Series from the National Institute of Standards and Technology (NIST), and (ii) thermal measurements in additive manufacturing of a thin-walled structure of Ti–6Al–4V. The results highlight the power of personalized modeling in extracting features from manufacturing imaging data.
