

Title: Structured Sparsity of Convolutional Neural Networks via Nonconvex Sparse Group Regularization
Convolutional neural networks (CNN) have been hugely successful recently with superior accuracy and performance in various imaging applications, such as classification, object detection, and segmentation. However, a highly accurate CNN model requires millions of parameters to be trained and utilized. Even a slight increase in performance requires significantly more parameters, obtained by adding more layers and/or increasing the number of filters per layer. In practice, many of these weight parameters turn out to be redundant and extraneous, so the original, dense model can be replaced by a compressed version attained by imposing inter- and intra-group sparsity onto the layer weights during training. In this paper, we propose a nonconvex family of sparse group lasso that blends nonconvex regularization (e.g., transformed ℓ1, ℓ1 − ℓ2, and ℓ0), which induces sparsity on the individual weights, with ℓ2,1 regularization on the output channels of a layer. We apply variable splitting to the proposed regularization to develop an algorithm that consists of two steps per iteration: gradient descent and thresholding. Numerical experiments on various CNN architectures demonstrate the effectiveness of the nonconvex family of sparse group lasso in network sparsification, with test accuracy on par with the current state of the art.
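The two-step iteration described above (gradient descent followed by thresholding) can be pictured with a minimal proximal-gradient sketch. The code below is an illustration under stated assumptions, not the paper's exact variable-splitting algorithm: it composes an elementwise ℓ0 hard-thresholding step with a channel-wise ℓ2,1 group-thresholding step after a plain gradient step, and the function names (hard_threshold, group_soft_threshold, sparse_group_step) and the convention that rows of the weight matrix correspond to output channels are assumptions of the sketch.

```python
import numpy as np

def hard_threshold(w, lam):
    """Elementwise proximal map of lam * ||w||_0: keep an entry only if
    w_i^2 / 2 > lam, i.e. |w_i| > sqrt(2 * lam)."""
    out = w.copy()
    out[w ** 2 <= 2.0 * lam] = 0.0
    return out

def group_soft_threshold(w, lam):
    """Proximal map of lam * ||w||_{2,1} with one group per row (output
    channel): shrink each row's l2 norm by lam, zero the row if it is smaller."""
    norms = np.linalg.norm(w, axis=1, keepdims=True)
    scale = np.maximum(1.0 - lam / np.maximum(norms, 1e-12), 0.0)
    return w * scale

def sparse_group_step(w, grad, lr, lam_sparse, lam_group):
    """One illustrative iteration: gradient descent on the data loss,
    then the two thresholding operators applied in sequence."""
    w = w - lr * grad                               # gradient descent step
    w = hard_threshold(w, lr * lam_sparse)          # elementwise sparsity (l0 here)
    return group_soft_threshold(w, lr * lam_group)  # channel-level (l2,1) sparsity
```

Applying the two proximal maps back to back is a simplification; the paper's variable splitting handles the combined regularizer differently, and the ℓ0 map above would be replaced by the transformed-ℓ1 or ℓ1 − ℓ2 thresholding when those penalties are used.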
Award ID(s):
1854434 1924548 1952644
NSF-PAR ID:
10249353
Author(s) / Creator(s):
Editor(s):
Tabacu, Lucia
Date Published:
Journal Name:
Frontiers in Applied Mathematics and Statistics
ISSN:
2297-4687
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Edge machine learning can deliver low-latency and private artificial intelligence (AI) services for mobile devices by leveraging computation and storage resources at the network edge. This paper presents an energy-efficient edge processing framework to execute deep learning inference tasks at the edge computing nodes whose wireless connections to mobile devices are prone to channel uncertainties. Aimed at minimizing the sum of computation and transmission power consumption with probabilistic quality-of-service (QoS) constraints, we formulate a joint inference tasking and downlink beamforming problem that is characterized by a group sparse objective function. We provide a statistical learning based robust optimization approach to approximate the highly intractable probabilistic-QoS constraints by nonconvex quadratic constraints, which are further reformulated as matrix inequalities with a rank-one constraint via matrix lifting. We design a reweighted power minimization approach by iteratively reweighted ℓ1 minimization with difference-of-convex-functions (DC) regularization and updating weights, where the reweighted approach is adopted for enhancing group sparsity whereas the DC regularization is designed for inducing rank-one solutions. Numerical results demonstrate that the proposed approach outperforms other state-of-the-art approaches.
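The reweighting mechanism itself is easy to sketch in isolation. The toy code below is an illustration on a plain least-squares problem, assuming nothing from the paper beyond the idea of iteratively reweighted group-ℓ1: it omits the beamforming model, the probabilistic-QoS constraints, the matrix lifting, and the DC regularization, and the name reweighted_group_l1 and all parameters are hypothetical.

```python
import numpy as np

def reweighted_group_l1(A, b, groups, lam=0.1, n_outer=5, n_inner=500, eps=1e-6):
    """Toy iteratively reweighted group-l1 scheme for 0.5*||Ax - b||^2 plus a
    weighted group-lasso penalty. `groups` is a list of index arrays
    partitioning the columns of A."""
    lr = 1.0 / np.linalg.norm(A, 2) ** 2       # safe step size for the gradient part
    x = np.zeros(A.shape[1])
    w = np.ones(len(groups))                   # per-group reweighting factors
    for _ in range(n_outer):
        for _ in range(n_inner):               # proximal gradient on the weighted problem
            x = x - lr * A.T @ (A @ x - b)     # gradient step on the data fit
            for g, idx in enumerate(groups):   # group soft-thresholding
                norm = np.linalg.norm(x[idx])
                x[idx] *= max(1.0 - lr * lam * w[g] / max(norm, 1e-12), 0.0)
        # near-zero groups get large weights on the next pass, enhancing group sparsity
        w = 1.0 / (np.array([np.linalg.norm(x[idx]) for idx in groups]) + eps)
    return x
```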
  2. Objectively differentiating patient mental states based on electrical activity, as opposed to overt behavior, is a fundamental neuroscience problem with medical applications, such as identifying patients in locked-in state vs. coma. Electroencephalography (EEG), which detects millisecond-level changes in brain activity across a range of frequencies, allows for assessment of external stimulus processing by the brain in a non-invasive manner. We applied machine learning methods to 26-channel EEG data of 24 fluent Deaf signers watching videos of sign language sentences (comprehension condition), and the same videos reversed in time (non-comprehension condition), to objectively separate vision-based high-level cognition states. While spectrotemporal parameters of the stimuli were identical in comprehension vs. non-comprehension conditions, the neural responses of participants varied based on their ability to linguistically decode visual data. We aimed to determine which subset of parameters (specific scalp regions or frequency ranges) would be necessary and sufficient for high classification accuracy of comprehension state. Optical flow, characterizing the distribution of velocities of objects in an image, was calculated for each pixel of the stimulus videos using the MATLAB Vision toolbox. Coherence between optical flow in the stimulus and the EEG neural response (per video, per participant) was then computed using canonical component analysis with the NoiseTools toolbox. Peak correlations were extracted for each frequency for each electrode, participant, and video. A set of standard ML algorithms was applied to the entire dataset (26 channels, frequencies from 0.2 Hz to 12.4 Hz, binned in 1 Hz increments), with consistent out-of-sample 100% accuracy for frequencies in the 0.2–1 Hz range for all regions, and above 80% accuracy for frequencies < 4 Hz. Sparse Optimal Scoring (SOS) was then applied to the EEG data to reduce the dimensionality of the features and improve model interpretability. SOS with an elastic-net penalty resulted in out-of-sample classification accuracy of 98.89%. The sparsity pattern in the model indicated that frequencies between 0.2–4 Hz were primarily used in the classification, suggesting that the underlying data may be group sparse. Further, SOS with a group lasso penalty was applied to regional subsets of electrodes (anterior, posterior, left, right). All trials achieved greater than 97% out-of-sample classification accuracy. The sparsity patterns from the trials using 1 Hz bins over individual regions consistently indicated that frequencies between 0.2–1 Hz were primarily used in the classification, with the anterior and left regions performing best with 98.89% and 99.17% classification accuracy, respectively. While the sparsity pattern may not be the unique optimal model for a given trial, the high classification accuracy indicates that these models have accurately identified common neural responses to visual linguistic stimuli. Cortical tracking of spectro-temporal change in the visual signal of sign language appears to rely on lower frequencies proportional to the N400/P600 time-domain evoked response potentials, indicating that visual language comprehension is grounded in predictive processing mechanisms.
  3. Sparse regression and feature extraction are the cornerstones of knowledge discovery from massive data. Their goal is to discover interpretable and predictive models that provide simple relationships among scientific variables. While the statistical tools for model discovery are well established in the context of linear regression, their generalization to nonlinear regression in material modeling is highly problem-specific and insufficiently understood. Here we explore the potential of neural networks for automatic model discovery and induce sparsity by a hybrid approach that combines two strategies: regularization and physical constraints. We integrate the concept of Lp regularization for subset selection with constitutive neural networks that leverage our domain knowledge in kinematics and thermodynamics. We train our networks with both synthetic and real data, and perform several thousand discovery runs to infer common guidelines and trends: L2 regularization or ridge regression is unsuitable for model discovery; L1 regularization or lasso promotes sparsity, but induces strong bias that may aggressively change the results; only L0 regularization allows us to transparently fine-tune the trade-off between interpretability and predictability, simplicity and accuracy, and bias and variance. With these insights, we demonstrate that Lp regularized constitutive neural networks can simultaneously discover both interpretable models and physically meaningful parameters. We anticipate that our findings will generalize to alternative discovery techniques such as sparse and symbolic regression, and to other domains such as biology, chemistry, or medicine. Our ability to automatically discover material models from data could have tremendous applications in generative material design and open new opportunities to manipulate matter, alter properties of existing materials, and discover new materials with user-defined properties.
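For concreteness, the Lp penalties being compared can be written down directly. The snippet below is a minimal sketch of the penalty term only (the constitutive neural networks and the training loop are not reproduced), with the hypothetical helper name lp_penalty.

```python
import numpy as np

def lp_penalty(theta, p, eps=1e-8):
    """Lp penalty of a parameter vector: p = 2 gives ridge, p = 1 gives lasso,
    and p = 0 counts the nonzero parameters (subset selection)."""
    theta = np.asarray(theta, dtype=float)
    if p == 0:
        return float(np.count_nonzero(np.abs(theta) > eps))
    return float(np.sum(np.abs(theta) ** p))

theta = np.array([0.0, 0.5, 0.0, 2.0])
print(lp_penalty(theta, 2), lp_penalty(theta, 1), lp_penalty(theta, 0))  # 4.25 2.5 2.0
```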
  4. We study sparsification of convolutional neural networks (CNN) by a relaxed variable splitting method of ℓ0 and transformed-ℓ1 (Tℓ1) penalties, with application to complex curves such as texts written in different fonts, and words written with trembling hands simulating those of Parkinson's disease patients. The CNN contains 3 convolutional layers, each followed by max pooling, and finally a fully connected layer which contains the largest number of network weights. With the ℓ0 penalty, we achieved over 99% test accuracy in distinguishing shaky vs. regular fonts or handwriting, with above 86% of the weights in the fully connected layer being zero. Comparable sparsity and test accuracy are also reached with a proper choice of Tℓ1 penalty.
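The transformed-ℓ1 penalty referenced here has a simple closed form, ρ_a(x) = Σ_i (a + 1)|x_i| / (a + |x_i|), which tends to the ℓ0 count as a → 0 and to the ℓ1 norm as a → ∞. Below is a minimal sketch of the penalty itself (hypothetical helper name transformed_l1; the CNN and the relaxed variable splitting scheme are not reproduced).

```python
import numpy as np

def transformed_l1(x, a=1.0):
    """Transformed-l1 penalty sum_i (a+1)|x_i| / (a + |x_i|): approaches the
    number of nonzeros as a -> 0 and the l1 norm as a -> infinity."""
    x = np.abs(np.asarray(x, dtype=float))
    return float(np.sum((a + 1.0) * x / (a + x)))

x = np.array([0.0, 0.1, 1.0, 5.0])
print(transformed_l1(x, a=0.01))   # ~2.93, close to the l0 count of 3
print(transformed_l1(x, a=100.0))  # ~5.91, close to ||x||_1 = 6.1
```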