Catastrophic forgetting undermines the effectiveness of deep neural networks (DNNs) in scenarios
such as continual learning and lifelong learning. While several methods have been proposed to
tackle this problem, there is limited work explaining why these methods work well. This paper
aims to better explain a widely used technique for avoiding catastrophic forgetting:
quadratic regularization. We show that quadratic regularizers prevent forgetting of past tasks by
interpolating current and previous values of model parameters at every training iteration. Over
multiple training iterations, this interpolation operation reduces the learning rates of more important
model parameters, thereby minimizing their movement. Our analysis also reveals two drawbacks
of quadratic regularization: (a) dependence of parameter interpolation on training hyperparameters,
which often leads to training instability, and (b) assignment of lower importance to deeper layers,
which are generally where forgetting occurs in DNNs. Via a simple modification to the order
of operations, we show these drawbacks can be easily avoided, resulting in 6.2% higher average
accuracy at 4.5% lower average forgetting. We confirm the robustness of our results by training over
2000 models in different settings.
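
The interpolation view described above can be checked numerically. The following is a minimal sketch, not the paper's implementation: it assumes a generic quadratic penalty of the form (lam / 2) * sum(importance * (theta - theta_prev)**2) added to the task loss, with plain gradient descent. The names `penalized_step`, `interpolated_step`, `importance`, `lr`, and `lam` are illustrative and do not come from the abstract.

```python
import numpy as np


def penalized_step(theta, theta_prev, grad, importance, lr, lam):
    """One gradient step on task_loss + (lam / 2) * sum(importance * (theta - theta_prev)**2).

    `grad` is the gradient of the task loss alone (assumed given).
    """
    penalty_grad = lam * importance * (theta - theta_prev)
    return theta - lr * (grad + penalty_grad)


def interpolated_step(theta, theta_prev, grad, importance, lr, lam):
    """The same update, rewritten as interpolation followed by a task-gradient step.

    alpha = lr * lam * importance is the per-parameter interpolation weight;
    it depends on the learning rate and the regularization strength.
    """
    alpha = lr * lam * importance
    return (1.0 - alpha) * theta + alpha * theta_prev - lr * grad


# Hypothetical values, chosen only to exercise the two forms.
rng = np.random.default_rng(0)
theta = rng.normal(size=5)        # current parameters
theta_prev = rng.normal(size=5)   # parameters after the previous task
grad = rng.normal(size=5)         # task-loss gradient at theta
importance = rng.uniform(0.0, 1.0, size=5)

a = penalized_step(theta, theta_prev, grad, importance, lr=0.1, lam=2.0)
b = interpolated_step(theta, theta_prev, grad, importance, lr=0.1, lam=2.0)
print(np.allclose(a, b))
```

Under these assumptions the interpolation weight is the product of learning rate, regularization strength, and importance, which is one way to see why the interpolation depends on training hyperparameters.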
Training Thinner and Deeper Neural Networks: Jumpstart Regularization
Neural networks are more expressive when they have multiple layers. In turn, conventional training methods are only successful if the depth does not lead to numerical issues such as exploding or vanishing gradients, which occur less frequently when the layers are sufficiently wide. However, increasing width to attain greater depth entails the use of heavier computational resources and leads to overparameterized models. These subsequent issues have been partially addressed by model compression methods such as quantization and pruning, some of which rely on normalization-based regularization of the loss function to make the effect of most parameters negligible. In this work, we propose instead to use regularization for preventing neurons from dying or becoming linear, a technique which we denote as jumpstart regularization. In comparison to conventional training, we obtain neural networks that are thinner, deeper, and - most importantly - more parameter-efficient.
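
The abstract does not give the regularizer itself, so the sketch below only illustrates the two conditions it aims to prevent. It assumes ReLU units and uses working definitions that are assumptions here: a unit is "dead" if its pre-activation is never positive on a sample of inputs, and "linear" if it is always positive. The function name `neuron_status` and the sample values are hypothetical.

```python
import numpy as np


def neuron_status(pre_activations):
    """Flag dead and linear units of a ReLU layer on a batch of inputs.

    pre_activations: array of shape (num_samples, num_units) holding the
    values fed into the ReLU (assumed layout).
    Returns two boolean arrays of shape (num_units,).
    """
    active = pre_activations > 0
    dead = ~active.any(axis=0)    # never active on this sample
    linear = active.all(axis=0)   # always active on this sample
    return dead, linear


# Hypothetical pre-activations: 3 samples, 3 units.
pre = np.array([
    [-1.0, 2.0, 0.5],
    [-0.3, 1.0, -0.2],
    [-2.0, 0.1, 1.5],
])
dead, linear = neuron_status(pre)
print(dead)    # unit 0 is never active
print(linear)  # unit 1 is always active
```

A diagnostic of this kind only reports the condition on the sampled inputs; it says nothing about how the proposed regularization term is constructed.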
- Award ID(s):
- 2104583
- PAR ID:
- 10315359
- Date Published:
- Journal Name:
- International Conference on the Integration of Constraint Programming, Artificial Intelligence, and Operations Research
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
Deep-learning models have become pervasive tools in science and engineering. However, their energy requirements now increasingly limit their scalability [1]. Deep-learning accelerators [2–9] aim to perform deep learning energy-efficiently, usually targeting the inference phase and often by exploiting physical substrates beyond conventional electronics. Approaches so far [10–22] have been unable to apply the backpropagation algorithm to train unconventional novel hardware in situ. The advantages of backpropagation have made it the de facto training method for large-scale neural networks, so this deficiency constitutes a major impediment. Here we introduce a hybrid in situ–in silico algorithm, called physics-aware training, that applies backpropagation to train controllable physical systems. Just as deep learning realizes computations with deep neural networks made from layers of mathematical functions, our approach allows us to train deep physical neural networks made from layers of controllable physical systems, even when the physical layers lack any mathematical isomorphism to conventional artificial neural network layers. To demonstrate the universality of our approach, we train diverse physical neural networks based on optics, mechanics and electronics to experimentally perform audio and image classification tasks. Physics-aware training combines the scalability of backpropagation with the automatic mitigation of imperfections and noise achievable with in situ algorithms. Physical neural networks have the potential to perform machine learning faster and more energy-efficiently than conventional electronic processors and, more broadly, can endow physical systems with automatically designed physical functionalities, for example, for robotics [23–26], materials [27–29] and smart sensors [30–32].
-
Several works have aimed to explain why overparameterized neural networks generalize well when trained by Stochastic Gradient Descent (SGD). The consensus explanation that has emerged credits the randomized nature of SGD for the bias of the training process towards low-complexity models and, thus, for implicit regularization. We take a careful look at this explanation in the context of image classification with common deep neural network architectures. We find that if we do not regularize explicitly, then SGD can be easily made to converge to poorly-generalizing, high-complexity models: all it takes is to first train on a random labeling of the data, before switching to properly training with the correct labels. In contrast, we find that in the presence of explicit regularization, pretraining with random labels has no detrimental effect on SGD. We believe that our results give evidence that explicit regularization plays a far more important role in the success of overparameterized neural networks than what has been understood until now. Specifically, by penalizing complicated models independently of their fit to the data, regularization affects training dynamics also far away from optima, making simple models that fit the data well discoverable by local methods, such as SGD.
-
Among the most successful methods for sparsifying deep (neural) networks are those that adaptively mask the network weights throughout training. By examining this masking, or dropout, in the linear case, we uncover a duality between such adaptive methods and regularization through the so-called "η-trick" that casts both as iteratively reweighted optimizations. We show that any dropout strategy that adapts to the weights in a monotonic way corresponds to an effective subquadratic regularization penalty, and therefore leads to sparse solutions. We obtain the effective penalties for several popular sparsification strategies, which are remarkably similar to classical penalties commonly used in sparse optimization. Considering variational dropout as a case study, we demonstrate similar empirical behavior between the adaptive dropout method and classical methods on the task of deep network sparsification, validating our theory.
-
Material and biological sciences frequently generate large amounts of microscope data that require 3D object-level segmentation. Often, the objects of interest have a common geometry, for example spherical, ellipsoidal, or cylindrical shapes. Neural networks have become a popular approach for object detection, but they are often limited by their training dataset and have difficulties adapting to new data. In this paper, we propose a volumetric object detection approach for microscopy volumes composed of fibrous structures by using deep centroid regression and geometric regularization. To this end, we train encoder-decoder networks for segmentation and centroid regression. We use the regression information combined with prior system knowledge to propose cylindrical objects and enforce geometric regularization in the segmentation. We train our networks on synthetic data and then test the trained networks on several experimental datasets. Our approach shows competitive results against other 3D segmentation methods when tested on the synthetic data and outperforms those other methods across different datasets.