Catastrophic forgetting undermines the effectiveness of deep neural networks (DNNs) in scenarios such as continual learning and lifelong learning. While several methods have been proposed to tackle this problem, there is limited work explaining why they work well. This paper aims to better explain a widely used technique for avoiding catastrophic forgetting: quadratic regularization. We show that quadratic regularizers prevent forgetting of past tasks by interpolating between the current and previous values of the model parameters at every training iteration. Over multiple training iterations, this interpolation operation reduces the learning rates of the more important model parameters, thereby minimizing their movement. Our analysis also reveals two drawbacks of quadratic regularization: (a) the dependence of parameter interpolation on training hyperparameters, which often leads to training instability, and (b) the assignment of lower importance to deeper layers, which are generally where forgetting occurs in DNNs. Via a simple modification to the order of operations, we show that these drawbacks can be easily avoided, resulting in 6.2% higher average accuracy at 4.5% lower average forgetting. We confirm the robustness of our results by training over 2000 models in different settings.
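To make the interpolation view concrete, here is a minimal NumPy sketch of one SGD step with an EWC-style quadratic penalty; the single-task setup and the names (fisher, theta_prev, lam) are illustrative assumptions, not the paper's exact formulation.

    import numpy as np

    # Minimal sketch (not the paper's code): one SGD step on a loss with a
    # quadratic penalty of the form (lam / 2) * fisher * (theta - theta_prev)^2,
    # where fisher is a per-parameter importance estimate (e.g. diagonal Fisher).
    rng = np.random.default_rng(0)
    theta = rng.normal(size=5)          # current parameters
    theta_prev = rng.normal(size=5)     # parameters after the previous task
    fisher = rng.uniform(0.0, 1.0, 5)   # hypothetical importance weights
    grad = rng.normal(size=5)           # gradient of the current-task loss
    lr, lam = 0.1, 1.0                  # learning rate and penalty strength

    # Standard update with the quadratic penalty added to the gradient.
    theta_reg = theta - lr * (grad + lam * fisher * (theta - theta_prev))

    # The same step rewritten as interpolation: a plain SGD step, then a blend
    # towards theta_prev with per-parameter coefficient a = lr * lam * fisher.
    a = lr * lam * fisher
    theta_sgd = theta - lr * grad
    theta_interp = (1.0 - a) * theta_sgd + a * theta_prev

    # The two forms agree up to an O(lr^2) term; more important parameters
    # (larger fisher) get a larger a, i.e. an effectively smaller learning rate.
    print(np.max(np.abs(theta_reg - theta_interp)))  # small, on the lr^2 scale

The interpolation coefficient grows with the importance estimate, which is the mechanism the abstract describes: important parameters are pulled back towards their previous values more strongly at every iteration.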
Training Thinner and Deeper Neural Networks: Jumpstart Regularization
Neural networks are more expressive when they have multiple layers. In turn, conventional training methods are only successful if the depth does not lead to numerical issues such as exploding or vanishing gradients, which occur less frequently when the layers are sufficiently wide. However, increasing width to attain greater depth entails the use of heavier computational resources and leads to overparameterized models. These subsequent issues have been partially addressed by model compression methods such as quantization and pruning, some of which rely on normalization-based regularization of the loss function to make the effect of most parameters negligible. In this work, we instead propose to use regularization to prevent neurons from dying or becoming linear, a technique we denote jumpstart regularization. In comparison to conventional training, we obtain neural networks that are thinner, deeper and, most importantly, more parameter-efficient.
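The abstract does not spell out the penalty, so the following is only an illustrative guess at the idea rather than the paper's method: a hinge-style term that encourages each ReLU unit's pre-activations to take both signs on a batch, so that the unit is neither dead (never active) nor effectively linear (always active). The margin eps and the exact form are assumptions.

    import numpy as np

    def relu(x):
        return np.maximum(x, 0.0)

    def jumpstart_penalty(Z, eps=0.1):
        # Z: (batch, units) pre-activations of one hidden ReLU layer.
        z_max = Z.max(axis=0)           # per-unit best case over the batch
        z_min = Z.min(axis=0)           # per-unit worst case over the batch
        dead = relu(eps - z_max)        # > 0 if the unit never fires
        linear = relu(eps + z_min)      # > 0 if the unit always fires
        return np.mean(dead + linear)

    rng = np.random.default_rng(0)
    X = rng.normal(size=(32, 16))
    W = rng.normal(size=(16, 8)) * 0.01   # tiny weights, so Z is dominated by b
    b = np.full(8, 2.0)                   # large bias: units are always active
    Z = X @ W + b
    print(jumpstart_penalty(Z))           # large: the layer behaves linearly

In training, such a term would simply be added to the task loss, analogously to weight decay.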
- PAR ID: 10315359
- Date Published:
- Journal Name: International Conference on the Integration of Constraint Programming, Artificial Intelligence, and Operations Research
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
-
Deep-learning models have become pervasive tools in science and engineering. However, their energy requirements now increasingly limit their scalability [1]. Deep-learning accelerators [2–9] aim to perform deep learning energy-efficiently, usually targeting the inference phase and often by exploiting physical substrates beyond conventional electronics. Approaches so far [10–22] have been unable to apply the backpropagation algorithm to train unconventional novel hardware in situ. The advantages of backpropagation have made it the de facto training method for large-scale neural networks, so this deficiency constitutes a major impediment. Here we introduce a hybrid in situ–in silico algorithm, called physics-aware training, that applies backpropagation to train controllable physical systems. Just as deep learning realizes computations with deep neural networks made from layers of mathematical functions, our approach allows us to train deep physical neural networks made from layers of controllable physical systems, even when the physical layers lack any mathematical isomorphism to conventional artificial neural network layers. To demonstrate the universality of our approach, we train diverse physical neural networks based on optics, mechanics and electronics to experimentally perform audio and image classification tasks. Physics-aware training combines the scalability of backpropagation with the automatic mitigation of imperfections and noise achievable with in situ algorithms. Physical neural networks have the potential to perform machine learning faster and more energy-efficiently than conventional electronic processors and, more broadly, can endow physical systems with automatically designed physical functionalities, for example, for robotics [23–26], materials [27–29] and smart sensors [30–32]. (A toy sketch of the physical-forward, digital-backward idea appears after this list.)
-
Freezing layers in deep neural networks has been shown to enhance generalization and accelerate training, yet the underlying mechanisms remain unclear. This paper investigates the impact of frozen layers from the perspective of linear separability, examining how untrained, randomly initialized layers influence feature representations and model performance. Using multilayer perceptrons trained on MNIST, CIFAR-10, and CIFAR-100, we systematically analyze the effects of freezing layers and of network architecture. While prior work attributes the benefits of frozen layers to Cover's theorem, which suggests that nonlinear transformations improve linear separability, we find that this explanation is insufficient. Instead, our results indicate that the observed improvements in generalization and convergence stem from other mechanisms. We hypothesize that freezing may have similar effects to other regularization techniques and that it may smooth the loss landscape to facilitate training. Furthermore, we identify key architectural factors, such as network overparameterization and the use of skip connections, that modulate the effectiveness of frozen layers. These findings offer new insights into the conditions under which freezing layers can optimize deep learning performance, informing future work on neural architecture search. (A minimal sketch of training with a frozen layer appears after this list.)
-
Several works have aimed to explain why overparameterized neural networks generalize well when trained by Stochastic Gradient Descent (SGD). The consensus explanation that has emerged credits the randomized nature of SGD for the bias of the training process towards low-complexity models and, thus, for implicit regularization. We take a careful look at this explanation in the context of image classification with common deep neural network architectures. We find that if we do not regularize explicitly, then SGD can easily be made to converge to poorly generalizing, high-complexity models: all it takes is to first train on a random labeling of the data before switching to proper training with the correct labels. In contrast, we find that in the presence of explicit regularization, pretraining with random labels has no detrimental effect on SGD. We believe that our results give evidence that explicit regularization plays a far more important role in the success of overparameterized neural networks than has been understood until now. Specifically, by penalizing complicated models independently of their fit to the data, regularization affects training dynamics also far away from optima, making simple models that fit the data well discoverable by local methods such as SGD. (A schematic of this two-phase protocol appears after this list.)
-
Material and biological sciences frequently generate large amounts of microscope data that require 3D object-level segmentation. Often, the objects of interest have a common geometry, for example spherical, ellipsoidal, or cylindrical shapes. Neural networks have become a popular approach for object detection, but they are often limited by their training dataset and have difficulty adapting to new data. In this paper, we propose a volumetric object detection approach for microscopy volumes composed of fibrous structures, using deep centroid regression and geometric regularization. To this end, we train encoder-decoder networks for segmentation and centroid regression. We use the regression information combined with prior system knowledge to propose cylindrical objects and enforce geometric regularization in the segmentation. We train our networks on synthetic data and then test the trained networks on several experimental datasets. Our approach shows competitive results against other 3D segmentation methods when tested on the synthetic data and outperforms those methods across different datasets. (A toy sketch of the cylindrical-prior step appears after this list.)
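For the physics-aware training entry above, the following is a toy sketch of the hybrid idea as the abstract describes it: the forward pass runs through a (here simulated) physical system that can only be evaluated, while gradients come from an imperfect but differentiable digital model. The single-parameter setup, both functions, and all names are illustrative assumptions, not the authors' implementation.

    import numpy as np

    rng = np.random.default_rng(0)

    def physical_layer(x, theta, rng):
        # Stand-in for a controllable physical system: we can evaluate it
        # (with noise and imperfections) but cannot differentiate through it.
        return np.tanh(theta * x + 0.05 * x**2) + 0.01 * rng.normal(size=x.shape)

    def digital_model_grad(x, theta):
        # d/d(theta) of the imperfect digital model tanh(theta * x), used in
        # place of the unavailable physical gradient (in silico backward pass).
        return (1.0 - np.tanh(theta * x) ** 2) * x

    theta = 0.1                            # controllable physical parameter
    x = rng.normal(size=64)
    target = np.tanh(1.5 * x)              # behaviour we want the system to learn
    lr = 0.5

    for _ in range(200):
        y = physical_layer(x, theta, rng)                # in situ forward pass
        err = y - target                                 # loss = 0.5 * mean(err**2)
        g = np.mean(err * digital_model_grad(x, theta))  # in silico backward pass
        theta -= lr * g

    print(round(float(theta), 2))  # moves from 0.1 towards ~1.5 despite noise and mismatch

Because the error is measured on the (here simulated) physical output, imperfections and noise are folded into the training signal even though the gradient itself comes from the digital model.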
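For the frozen-layer entry above, this is a minimal sketch of the setup under study: the first layer of a small MLP keeps its random initialization (it is simply excluded from the update), and only the output layer is trained. The toy dataset and sizes are placeholders, not the paper's MNIST/CIFAR experiments.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(256, 20))
    y = (X[:, 0] * X[:, 1] > 0).astype(float)      # a simple nonlinear target

    W1 = rng.normal(size=(20, 64)) / np.sqrt(20)   # frozen layer: never updated
    b1 = np.zeros(64)
    w2 = np.zeros(64)                              # trainable output weights
    b2 = 0.0
    lr = 0.1

    for _ in range(1000):
        H = np.maximum(X @ W1 + b1, 0.0)           # random, frozen ReLU features
        p = 1.0 / (1.0 + np.exp(-(H @ w2 + b2)))   # logistic output
        err = p - y                                # gradient of the cross-entropy
        w2 -= lr * H.T @ err / len(y)              # only the unfrozen layer moves
        b2 -= lr * err.mean()

    print(round(float(np.mean((p > 0.5) == y)), 2))  # accuracy with frozen features

In a deep-learning framework, the same effect is usually obtained by disabling gradient tracking for the chosen layers or by omitting their parameters from the optimizer.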
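For the entry on explicit regularization and random labels, the following schematic illustrates only the two-phase protocol: first fit a random labeling of the data, then continue training on the correct labels, optionally with an explicit L2 penalty (weight decay). A tiny logistic-regression model stands in for the paper's deep networks, so this does not reproduce the paper's findings.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 50))
    w_true = rng.normal(size=50)
    y_true = (X @ w_true > 0).astype(float)
    y_random = rng.permutation(y_true)             # a random labeling of the data

    def train(w, y, steps, lr=0.1, weight_decay=0.0):
        for _ in range(steps):
            p = 1.0 / (1.0 + np.exp(-(X @ w)))
            grad = X.T @ (p - y) / len(y) + weight_decay * w
            w = w - lr * grad
        return w

    # Without explicit regularization: random-label phase, then correct labels.
    w = train(np.zeros(50), y_random, steps=500)
    w = train(w, y_true, steps=500)

    # With explicit regularization (weight decay) in both phases.
    w_reg = train(np.zeros(50), y_random, steps=500, weight_decay=1e-2)
    w_reg = train(w_reg, y_true, steps=500, weight_decay=1e-2)

    acc = lambda w: float(np.mean(((X @ w) > 0) == y_true))
    print(round(acc(w), 2), round(acc(w_reg), 2))  # training accuracy of each run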
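For the fiber-segmentation entry, this toy sketch shows one way centroid regression can be combined with a cylindrical prior, in the spirit of the abstract: per-voxel offsets regressed towards the fiber axis are averaged per slice to estimate the axis, and the known radius then turns that estimate into a cylindrical proposal mask. The network outputs are faked here, and all shapes, radii, and names are assumptions rather than the authors' pipeline.

    import numpy as np

    rng = np.random.default_rng(0)
    D = H = W = 32
    z, y, x = np.meshgrid(np.arange(D), np.arange(H), np.arange(W), indexing="ij")

    # Ground-truth fiber: a cylinder of radius 3 along z through (y, x) = (16, 16).
    radius = 3.0
    dist_to_axis = np.sqrt((y - 16) ** 2 + (x - 16) ** 2)
    gt = dist_to_axis <= radius

    # Pretend network outputs: noisy foreground scores and noisy per-voxel
    # offsets pointing from each voxel towards the fiber axis.
    seg_score = gt + 0.3 * rng.normal(size=gt.shape)
    off_y = 16 - y + 0.5 * rng.normal(size=y.shape)
    off_x = 16 - x + 0.5 * rng.normal(size=x.shape)

    # Geometric step: move foreground voxels to their regressed axis points,
    # average per z-slice to estimate the axis, then re-impose the known
    # cylindrical shape as a proposal mask.
    fg = seg_score > 0.5
    axis_y = np.array([(y + off_y)[fg & (z == k)].mean() for k in range(D)])
    axis_x = np.array([(x + off_x)[fg & (z == k)].mean() for k in range(D)])
    proposal = np.sqrt((y - axis_y[z]) ** 2 + (x - axis_x[z]) ** 2) <= radius

    iou = (proposal & gt).sum() / (proposal | gt).sum()
    print(round(float(iou), 2))   # the cylindrical proposal closely matches gt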