

Title: Nonlinear feature selection using sparsity-promoted centroid-encoder
Abstract

The contribution of our work is two-fold. First, we propose a novel feature selection technique, sparsity-promoted centroid-encoder (SCE). The model uses the nonlinear mapping of artificial neural networks to reconstruct a sample as its class centroid and, at the same time, applies an ℓ1-penalty to the weights of a sparsity-promoting layer, placed between the input and the first hidden layer, to select discriminative features from the input data. Using the proposed method, we designed a feature selection framework that first ranks each feature and then compiles the optimal set using validation samples. The second part of our study investigates the role of stochastic optimization, such as Adam, in minimizing the ℓ1-norm. The empirical analysis shows that the hyper-parameters of Adam (mini-batch size, learning rate, etc.) play a crucial role in promoting feature sparsity in SCE. We apply our technique to numerous real-world data sets and find that it significantly outperforms other state-of-the-art methods, including LassoNet, stochastic gates (STG), feature selection networks (FsNet), supervised concrete autoencoder (CAE), deep feature selection (DFS), and random forest (RF).
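As an illustration, here is a minimal PyTorch sketch of the core construction described above: a multiplicative sparsity-promoting layer between the input and the first hidden layer, a reconstruction target equal to each sample's class centroid, and an ℓ1 penalty on the sparsity layer's weights. The architecture, hyper-parameters, and names are placeholders, not the authors' reference implementation.

```python
import torch
import torch.nn as nn

class SparsityPromotedCentroidEncoder(nn.Module):
    """Sketch of SCE: a per-feature sparsity layer followed by an encoder/decoder
    trained to map each sample onto its class centroid."""
    def __init__(self, d_in, d_hidden):
        super().__init__()
        # Sparsity-promoting layer: one multiplicative weight per input feature.
        self.gate = nn.Parameter(torch.ones(d_in))
        self.net = nn.Sequential(
            nn.Linear(d_in, d_hidden), nn.ReLU(),
            nn.Linear(d_hidden, d_in),
        )

    def forward(self, x):
        return self.net(x * self.gate)


def sce_loss(model, x, y, centroids, lam=1e-3):
    """Centroid-reconstruction error plus l1 penalty on the sparsity layer."""
    recon = model(x)
    target = centroids[y]                      # class centroid of each sample
    mse = ((recon - target) ** 2).sum(dim=1).mean()
    l1 = model.gate.abs().sum()                # drives most gates toward zero
    return mse + lam * l1
```

After training (e.g., with Adam), features would be ranked by the magnitude of the learned gate weights and the final subset chosen on validation samples, mirroring the two-stage framework described in the abstract.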

 
PAR ID:
10468830
Author(s) / Creator(s):
Publisher / Repository:
Springer Science + Business Media
Date Published:
Journal Name:
Neural Computing and Applications
Volume:
35
Issue:
29
ISSN:
0941-0643
Format(s):
Medium: X
Size(s):
p. 21883-21902
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    We propose a new method for supervised learning, the “principal components lasso” (“pcLasso”). It combines the lasso (ℓ1) penalty with a quadratic penalty that shrinks the coefficient vector toward the feature matrix's leading principal components (PCs). pcLasso can be especially powerful if the features are preassigned to groups. In that case, pcLasso shrinks each group‐wise component of the solution toward the leading PCs of that group. The pcLasso method also carries out selection of feature groups. We provide some theory and illustrate the method on some simulated and real data examples.
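    For concreteness, below is the single-group pcLasso objective as we reconstruct it from the description above, writing X = UDV^T for the SVD of the centered feature matrix with singular values d_1 ≥ d_2 ≥ …, and λ, θ for the tuning parameters; the quadratic term vanishes along the leading PC, so coefficients are shrunk toward it.

```latex
\min_{\beta}\;\; \tfrac{1}{2}\,\lVert y - X\beta \rVert_2^2
\;+\; \lambda\,\lVert \beta \rVert_1
\;+\; \tfrac{\theta}{2}\,\beta^{\top} V \,\mathrm{diag}\!\bigl(d_1^2 - d_j^2\bigr)\, V^{\top} \beta
```

    In the grouped case, one such quadratic term would be added per feature group, computed from that group's own SVD.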

     
  2. Understanding the fundamental mechanism behind the success of deep neural networks is one of the key challenges in the modern machine learning literature. Despite numerous attempts, a solid theoretical analysis is yet to be developed. In this paper, we develop a novel unified framework to reveal a hidden regularization mechanism through the lens of convex optimization. We first show that the training of multiple three-layer ReLU sub-networks with weight decay regularization can be equivalently cast as a convex optimization problem in a higher dimensional space, where sparsity is enforced via a group ℓ1-norm regularization. Consequently, ReLU networks can be interpreted as high-dimensional feature selection methods. More importantly, we then prove that the equivalent convex problem can be globally optimized by a standard convex optimization solver in time polynomial in the number of samples and the data dimension when the width of the network is fixed. Finally, we numerically validate our theoretical results via experiments involving both synthetic and real datasets.
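    For reference, the group ℓ1 (equivalently, ℓ2,1) regularizer that such convex reformulations rely on takes the generic form below, where the variables are partitioned into groups v_g; the precise grouping used in the paper's reformulation is not reproduced here.

```latex
\mathcal{R}(v) \;=\; \sum_{g} \lVert v_g \rVert_2,
\qquad \text{which drives entire groups } v_g \text{ to zero at once.}
```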
  3. Sparsification of neural networks is an effective complexity-reduction method for improving efficiency and generalizability. We consider the problem of learning a one-hidden-layer convolutional neural network with ReLU activation via gradient descent under sparsity-promoting penalties. It is known that when the input data are Gaussian distributed, no-overlap networks (without penalties) in regression problems with a ground truth can be learned in polynomial time with high probability. We propose a relaxed variable splitting method that integrates thresholding and gradient descent to overcome the non-smoothness of the loss function. Sparsity in the network weights is realized during the optimization (training) process. We prove that under L1, L0, and transformed-L1 penalties, no-overlap networks can be learned with high probability, and the iterative weights converge to a global limit which is a transformation of the true weights under a novel thresholding operation. Numerical experiments confirm the theoretical findings and compare the trade-off between accuracy and sparsity among the penalties.
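    A minimal NumPy sketch of the basic thresholding-plus-gradient-descent pattern behind such schemes is shown below for the L1 penalty only; the paper's relaxed variable splitting method and its L0/transformed-L1 variants differ in detail, and the step size and penalty weight here are placeholders.

```python
import numpy as np

def soft_threshold(w, tau):
    """Proximal operator of the l1 penalty: shrinks entries toward zero."""
    return np.sign(w) * np.maximum(np.abs(w) - tau, 0.0)

def thresholded_gradient_step(w, grad_fn, lr=0.1, lam=0.01):
    """One iteration: a gradient step on the smooth loss, then thresholding.
    grad_fn(w) must return the gradient of the smooth (unpenalized) loss at w."""
    w = w - lr * grad_fn(w)               # descend on the smooth part of the loss
    return soft_threshold(w, lr * lam)    # realize sparsity during training
```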
  4. Abstract

    Deep learning methods hold strong promise for identifying biomarkers for clinical application. However, current approaches for psychiatric classification or prediction do not allow direct interpretation of the original features. In the present study, we introduce a sparse deep neural network (DNN) approach to identify sparse and interpretable features for schizophrenia (SZ) case–control classification. An L0-norm regularization is implemented on the input layer of the network for sparse feature selection, which can later be interpreted based on importance weights. We applied the proposed approach to a large multi-study cohort with gray matter volume (GMV) and single nucleotide polymorphism (SNP) data for SZ classification. A total of 634 individuals served as training samples, and the classification model was evaluated for generalizability on three independent datasets with different scanning protocols (N = 394, 255, and 160, respectively). We examined the classification power of pure GMV features, as well as combined GMV and SNP features. Empirical experiments demonstrated that the sparse DNN slightly outperformed an independent component analysis + support vector machine (ICA + SVM) framework and more effectively fused GMV and SNP features for SZ discrimination, with an average error rate of 28.98% on external data. The importance weights suggested that the DNN model prioritized frontal and superior temporal gyrus regions for SZ classification at high sparsity, with parietal regions further included at lower sparsity, echoing previous literature. The results validate the application of the proposed approach to SZ classification and promise extended utility on other data modalities and traits, which ultimately may result in clinically useful tools.
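    One standard differentiable surrogate for an L0 penalty on an input layer is the hard-concrete gate of Louizos et al. (2018); the PyTorch sketch below shows that generic construction, not necessarily the exact scheme used in this study, with the commonly used default parameters.

```python
import torch
import torch.nn as nn

class L0InputGates(nn.Module):
    """Per-feature input gates trained with a differentiable l0 surrogate
    (hard-concrete gates, Louizos et al., 2018)."""
    def __init__(self, d_in, beta=2/3, gamma=-0.1, zeta=1.1):
        super().__init__()
        self.log_alpha = nn.Parameter(torch.zeros(d_in))  # per-feature gate logits
        self.beta, self.gamma, self.zeta = beta, gamma, zeta

    def forward(self, x):
        if self.training:
            # Stochastic, reparameterized relaxation of a Bernoulli gate.
            u = torch.rand_like(self.log_alpha).clamp(1e-6, 1 - 1e-6)
            s = torch.sigmoid((u.log() - (1 - u).log() + self.log_alpha) / self.beta)
        else:
            s = torch.sigmoid(self.log_alpha)
        z = (s * (self.zeta - self.gamma) + self.gamma).clamp(0.0, 1.0)
        return x * z

    def expected_l0(self):
        """Expected number of open gates; added to the loss as the l0 penalty."""
        c = torch.log(torch.tensor(-self.gamma / self.zeta))
        return torch.sigmoid(self.log_alpha - self.beta * c).sum()
```

    With such gates, per-feature importance weights could be read off the learned gate parameters after training, analogous to the importance weights described above.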

     
  5. Unstructured neural network pruning is an effective technique that can significantly reduce the theoretical model size, computation demand, and energy consumption of large neural networks without compromising accuracy. However, a number of fundamental questions about pruning remain unanswered. For example, do pruned neural networks contain the same representations as the original network? Is pruning a compression or an evolution process? Does pruning only work on trained neural networks? What is the role and value of the uncovered sparsity structure? In this paper, we strive to answer these questions by analyzing three unstructured pruning methods (magnitude-based pruning, post-pruning re-initialization, and random sparse initialization). We conduct extensive experiments using the Singular Vector Canonical Correlation Analysis (SVCCA) tool to study and contrast the layer representations of pruned and original ResNet, VGG, and ConvNet models. We make several interesting observations: 1) pruned neural network models evolve to substantially different representations while still maintaining similar accuracy; 2) initialized sparse models can achieve reasonably good accuracy compared to well-engineered pruning methods; 3) sparsity structures discovered by pruning models are not inherently important or useful.
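    As a concrete reference point, here is a minimal PyTorch sketch of unstructured magnitude-based pruning, the first of the three methods analyzed above: the smallest-magnitude entries of a weight tensor are zeroed out by a binary mask. The sparsity level is illustrative.

```python
import torch

def magnitude_prune(weight, sparsity=0.9):
    """Zero out the smallest-magnitude entries of a weight tensor (unstructured pruning).
    Returns the pruned weights and the binary mask that was applied."""
    k = int(sparsity * weight.numel())          # number of weights to remove
    if k == 0:
        return weight, torch.ones_like(weight)
    threshold = weight.abs().flatten().kthvalue(k).values
    mask = (weight.abs() > threshold).float()   # keep only large-magnitude weights
    return weight * mask, mask
```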