<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Dropout as an implicit gating mechanism for continual learning</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>06/01/2020</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10222895</idno>
					<idno type="doi">10.1109/CVPRW50498.2020.00124</idno>
					<title level='j'>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops</title>
<idno></idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Seyed Iman Mirzadeh</author><author>Mehrdad Farajtabar</author><author>Hassan Ghasemzadeh</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[In recent years, neural networks have demonstrated an outstanding ability to achieve complex learning tasks across various domains. However, they suffer from the "catastrophic forgetting" problem when they face a sequence of learning tasks, where they forget the old ones as they learn new tasks. This problem is also highly related to the "stability-plasticity dilemma". The more plastic the network, the easier it can learn new tasks, but the faster it also forgets previous ones. Conversely, a stable network cannot learn new tasks as fast as a very plastic network. However, it is more reliable to preserve the knowledge it has learned from the previous tasks. Several solutions have been proposed to overcome the forgetting problem by making the neural network parameters more stable, and some of them have mentioned the significance of dropout in continual learning. However, their relationship has not been sufficiently studied yet. In this paper, we investigate this relationship and show that a stable network with dropout learns a gating mechanism such that for different tasks, different paths of the network are active. Our experiments show that the stability achieved by this implicit gating plays a very critical role in leading to performance comparable to or better than other involved continual learning algorithms to overcome catastrophic forgetting.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>The stability-plasticity dilemma is a well-known problem for both artificial and biological neural networks <ref type="bibr">[19]</ref>. Intelligent systems need plasticity to learn new knowledge and adapt to new environments, while they require stability to prevent forgetting previous knowledge. If a network is very plastic but not stable, it can learn new tasks quickly, but it also forgets previous ones easily. This is known as the catastrophic forgetting problem <ref type="bibr">[18]</ref>. On the other hand, a network can be very stable and preserve the knowledge of previous tasks, but then it cannot easily adapt to unseen ones. The code and the appendix are available at: <ref type="url">https://github.com/imirzadeh/stable-continual-learning</ref></p><p>We motivate our paper by illustrating the stability-plasticity dilemma on a standard continual learning dataset in Figure <ref type="figure">1</ref>. The tasks in this dataset are generated by continually rotating the MNIST digits. The red and blue lines represent two algorithms, respectively: (1) Stochastic Gradient Descent (SGD) with Dropout <ref type="bibr">[10]</ref> and (2) SGD without Dropout. The network trained without dropout can quickly pick up new tasks (plasticity); however, it forgets previous ones as we move on to subsequent tasks. On the other hand, the network trained with dropout retains previous knowledge significantly better (stability), at the small cost of a slight performance drop.</p><p>To the best of our knowledge, the work by <ref type="bibr">[8]</ref> is the first to empirically study the importance of the dropout technique in the continual learning setting. They hypothesize that dropout increases the optimal size of the network by regularizing and constraining the capacity to be just barely sufficient to perform the first task. 
However, by observing some inconsistent results on dissimilar tasks, they suggested that dropout may have other beneficial effects too. More recently, the effectiveness of dropout has been demonstrated in a comprehensive study across several architectures and datasets <ref type="bibr">[13,</ref><ref type="bibr">26]</ref>. However, many important questions about the relationship between dropout and catastrophic forgetting remain unanswered. One such question is: "How does dropout help the network overcome catastrophic forgetting, beyond acting as a regularizer?". It is well established that dropout works as a regularizer <ref type="bibr">[25]</ref>. However, several other regularizers (e.g., the L2 norm) fail to help the network in a continual learning setting <ref type="bibr">[12]</ref>.</p><p>In this paper, we analyze the impact of dropout on network stability and study network behavior in the presence of dropout. We show that dropout networks behave like networks with a gating mechanism, and that the created task-specific pathways are retained and remain consistent during the sequential learning of tasks. Finally, we show that training with dropout gives a stable yet flexible network that outperforms several other methods when those methods do not use dropout, even if they are equipped with an external memory of previous examples.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Work</head><p>Several continual learning methods have been proposed to tackle catastrophic forgetting. Following <ref type="bibr">[13]</ref>, we can categorize these algorithms into three general groups.</p><p>The first group consists of replay-based methods that build and store a memory of the knowledge learned from old tasks, known as experience replay. iCaRL <ref type="bibr">[23]</ref> learns in a class-incremental way by having a fixed memory that stores samples close to the center of each class. Averaged Gradient Episodic Memory (A-GEM) <ref type="bibr">[6]</ref> is another example of these methods, which builds a dynamic episodic memory of parameter gradients during the learning process. Very recently, Hindsight Anchor Learning (HAL) <ref type="bibr">[5]</ref> proposed keeping some "anchor" points of past tasks and using these points to update knowledge on the current task.</p><p>The methods in the second group use explicit regularization techniques to supervise the learning algorithm such that the network parameters remain consistent during the learning process. Elastic Weight Consolidation (EWC) <ref type="bibr">[12]</ref> uses the Fisher information matrix as a proxy for weights' importance and guides the gradient updates. Orthogonal Gradient Descent (OGD) <ref type="bibr">[7]</ref> uses the projection of the prediction gradients from new tasks on the subspace of previous tasks' gradients to protect the learned knowledge. The idea of using knowledge distillation <ref type="bibr">[11,</ref><ref type="bibr">20,</ref><ref type="bibr">21]</ref> has also been found to be a successful regularizer in several works <ref type="bibr">[15,</ref><ref type="bibr">14]</ref>.</p><p>Finally, in parameter isolation methods, in addition to a potentially shared part, different subsets of the model parameters are dedicated to each task. 
This approach can be viewed as a flexible gating mechanism, which enhances stability and controls plasticity by activating different gates for each task. <ref type="bibr">[17]</ref> proposes a neuroscience-inspired method for a context-dependent gating signal, such that only sparse, mostly non-overlapping patterns of units are active for any one task. PackNet <ref type="bibr">[16]</ref> implements a controlled version of gating by using network pruning techniques to free up parameters after finishing each task and thus sequentially "pack" multiple tasks into a single network. Gating mechanisms have been found to be very effective in several works. In a comprehensive set of experiments, PackNet was shown to be one of the most reliable methods <ref type="bibr">[13]</ref>, and adding context-dependent gates to other methods such as EWC improved their performance drastically <ref type="bibr">[17]</ref>.</p><p>In the following sections, we show that a stable network trained with dropout learns a reliable gating mechanism. We note that the majority of the mentioned methods incur extra computation and memory costs, while a stable dropout network is much more memory- and computation-efficient.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Dropout and Network Stability</head><p>Dropout <ref type="bibr">[10]</ref> is a well-established technique in deep learning, which is also well-studied theoretically <ref type="bibr">[2,</ref><ref type="bibr">24,</ref><ref type="bibr">9,</ref><ref type="bibr">25]</ref>. It was originally designed to prevent the co-adaptation of neurons in a network. It increases the stability of neural networks and has been employed successfully in various domains. In the training phase of a dropout network, at each example presentation, neurons are dropped with probability 1 - p, and the network is trained in the standard way. In the inference phase, the weights are re-scaled in proportion to the dropout probability.</p><p>There are various viewpoints on dropout. In this paper, we are interested in regarding dropout as a method for sparse coding and regularization, and in leveraging the associated theoretical insights to study the relationship between dropout and continual learning.</p><p>Consider neuron i of layer h in a neural network and define the activity of the neuron by:</p><formula n="1">S^h_i(I) = \sum_{l&lt;h} \sum_j w^{hl}_{ij} \delta^l_j \sigma(S^l_j(I))</formula><p>where I is the input vector and w^{hl}_{ij} represents the weight from neuron j of layer l to neuron i of layer h. \delta^l_j is the binary Bernoulli gating random variable indicating whether the neuron is kept by dropout (i.e., P(\delta^l_j = 1) = p^l_j) or dropped. Under the assumption that the \delta^l_j are independent, and that dropout has not been applied to previous layers, <ref type="bibr">[2]</ref> showed that if we apply dropout to layer h, the variance of the activation of each neuron follows:</p><formula n="2">Var(S^h_i) = \sum_{l&lt;h} \sum_j (w^{hl}_{ij})^2 \sigma(S^l_j)^2 p^l_j (1 - p^l_j)</formula><p>where \sigma(S^l_j) denotes the output of neuron j at a previous layer l. Therefore, to obtain stable activation behavior, the variance of the activation of a neuron should be minimized. This happens if p^l_j is close to either 0 or 1 (so that p^l_j (1 - p^l_j) is small). 
Note that we cannot directly control w, as it is updated by the loss objective.</p><p>One consequence of Equation ( <ref type="formula">2</ref>) is that in a stable dropout network, the neural activation is very sparse. This yields a skewed, asymmetric distribution of neuron activity inside the network <ref type="bibr">[2]</ref>. This skewed, asymmetric distribution has close connections to the gating mechanism. Such a distribution of neural activity in several animals is believed to be responsible for an optimal trade-off between stability and plasticity <ref type="bibr">[3]</ref>. This firing pattern implements a gating mechanism inside the brain that is not only plastic enough to learn new tasks but also stable enough to preserve the knowledge learned from previous tasks. <ref type="bibr">[3]</ref> reported neural activity in several biological brains that is in line with the neural activity in dropout networks, as shown by <ref type="bibr">[2]</ref> and our experiments.</p><p>Training with dropout also has another consequence: dropout most heavily regularizes the neurons that contribute to uncertain predictions (i.e., semi-active neurons whose activations are not close to either 0 or 1) <ref type="bibr">[25,</ref><ref type="bibr">2]</ref>. Intuitively, for a network of gates and switches, this means that dropout regularization pushes neurons to be either active or inactive. EWC <ref type="bibr">[12]</ref> is also built upon the same intuition of penalizing changes to certain weights and allowing the less certain parameters to handle learning new tasks. When the model finishes task t and reaches task t + 1, this regularization creates new gates by either enabling or disabling such neurons. 
Decaying the learning rate during the continual learning experience also helps dropout increase model stability, since it preserves the gates for a longer time.</p><p>In conclusion, dropout regularization helps create gates in the network by pushing neurons to be either highly active or highly inactive during the learning experience. In addition, when facing new tasks, the regularization mechanism changes the semi-active neurons more than the active or inactive ones, which helps preserve the task-specific pathways when learning subsequent tasks.</p></div>
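To make Equation (2) concrete, the following sketch (an illustrative check we add here, not part of the original experiments) compares the analytical variance of a single dropout-gated neuron against a Monte-Carlo estimate. The input activations and weights are fixed; only the Bernoulli mask is random, so the p(1 - p) factor is what drives the variance:

```python
import numpy as np

rng = np.random.default_rng(0)

# Fixed inputs and weights for one neuron; only the dropout mask is random.
n = 50
x = rng.normal(size=n)          # sigma(S_j^l): activations of the previous layer
w = rng.normal(size=n)          # w_ij^{hl}: incoming weights
p = 0.5                         # P(delta_j = 1): probability a unit is kept

# Analytical variance from Equation (2): sum_j w_j^2 sigma_j^2 p (1 - p)
var_analytic = np.sum(w**2 * x**2) * p * (1 - p)

# Monte-Carlo estimate over many independent dropout masks
masks = rng.binomial(1, p, size=(200_000, n))
samples = masks @ (w * x)       # S = sum_j w_j * delta_j * sigma_j
var_empirical = samples.var()

print(var_analytic, var_empirical)
```

Repeating this with p near 0 or 1 shrinks both quantities toward zero, which is the stability argument: near-deterministic gates give near-deterministic activations.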
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Experimental Setup</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Datasets</head><p>We perform our experiments on two standard continual learning benchmarks: Permuted MNIST <ref type="bibr">[8]</ref> and Rotated MNIST. Each task of the Permuted MNIST dataset is generated by shuffling the pixels of the images such that the permutation is the same between images of the same task but differs across tasks. Each permutation is chosen randomly; thus, the difficulty of the tasks is the same. We use the original MNIST images as the first task. Rotated MNIST is generated by continually rotating the original MNIST images. Here, task 1 is to classify standard MNIST digits, and each subsequent task rotates the previous task's images by a further 10 degrees (e.g., task 2 by 10 degrees, task 3 by 20 degrees, and so on).</p></div>
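The two benchmarks can be generated along the following lines. This is a hedged sketch using NumPy and SciPy on a generic image stack rather than the authors' released code; the function names are ours:

```python
import numpy as np
from scipy.ndimage import rotate

def make_permuted_tasks(images, num_tasks, seed=0):
    """Task 0 is the original data; each later task applies its own fixed
    pixel permutation, shared by all images of that task."""
    rng = np.random.default_rng(seed)
    flat = images.reshape(len(images), -1)
    tasks = [images]
    for _ in range(num_tasks - 1):
        perm = rng.permutation(flat.shape[1])
        tasks.append(flat[:, perm].reshape(images.shape))
    return tasks

def make_rotated_tasks(images, num_tasks, step_deg=10.0):
    """Task t rotates the original images by t * step_deg degrees
    (task 0: 0 degrees, task 1: 10 degrees, ...)."""
    return [rotate(images, angle=t * step_deg, axes=(1, 2), reshape=False)
            for t in range(num_tasks)]
```

Each permutation only reorders pixels, so every permuted task has the same per-image pixel multiset and, as the text notes, the same difficulty.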
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Training Settings</head><p>In this section, we first describe our training setting for Sections 5.1, 5.2, and 5.4. We use PyTorch <ref type="bibr">[22]</ref> for the implementation of all experiments and report the average and standard deviation of the validation accuracy over five runs. For all experiments, we use a multi-layer perceptron (MLP) with two hidden layers, each with 100 ReLU neurons. Moreover, each network is trained on each task for only five epochs, to be consistent with several other benchmarks <ref type="bibr">[7]</ref>. We compare standard SGD training with Elastic Weight Consolidation (EWC) <ref type="bibr">[12]</ref>, A-GEM <ref type="bibr">[6]</ref>, and Orthogonal Gradient Descent (OGD) <ref type="bibr">[7]</ref>. Multi-Task Learning (MTL) serves as an upper bound, in which the network is trained in a multi-task setting (i.e., data of previous tasks are always available and used for training). All results except those for SGD with dropout are cited directly from <ref type="bibr">[7]</ref>, as the datasets, training epochs, and optimizers are the same. For SGD with dropout, we use a batch size of 64 and the standard SGD optimizer with a learning rate of 0.01 and momentum of 0.8. Furthermore, we found that learning rate decay helps network stability dramatically, and we decay the learning rate by a factor of 0.8 after finishing each task. We experimented with different dropout probabilities and found that values between 0.2 and 0.6 work well. However, for simplicity, we use a dropout probability of 0.5 for all reported results unless stated otherwise. 
We note that all the methods except SGD+Dropout are trained without dropout and without learning rate decay, since our main goal is to measure the performance gains of those methods that are not due to these stability techniques.</p><p>For our scaled experiment (Section 5.3), we extend the number of tasks to 20 rather than 5 to verify that our analysis holds. We use a two-layer MLP with 256 ReLU neurons in each layer. For each task, the network is trained for 5 epochs. The dropout parameter and learning rate decay remain the same as in the previous section. For this experiment, we use two metrics from <ref type="bibr">[4,</ref><ref type="bibr">6]</ref> to evaluate continual learning algorithms when the number of tasks is large:</p><p>1. Average Accuracy: The average validation accuracy after the model has been trained sequentially up to task t, defined by:</p><formula n="3">A_t = \frac{1}{t} \sum_{i=1}^{t} a_{t,i}</formula><p>where a_{t,i} is the validation accuracy on dataset i when the model finished learning task t.</p></div>
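The sequential training procedure described above can be sketched in PyTorch. This is our own minimal reconstruction under the stated hyper-parameters (the function names and the task format are assumptions, not the authors' released code): a two-hidden-layer MLP with dropout, trained per task with SGD (learning rate 0.01, momentum 0.8) and a per-task learning-rate decay factor of 0.8:

```python
import torch
import torch.nn as nn

def make_mlp(p_drop=0.5, hidden=100, n_in=784, n_out=10):
    # Two-hidden-layer MLP with ReLU units and dropout after each hidden layer.
    return nn.Sequential(
        nn.Linear(n_in, hidden), nn.ReLU(), nn.Dropout(p_drop),
        nn.Linear(hidden, hidden), nn.ReLU(), nn.Dropout(p_drop),
        nn.Linear(hidden, n_out),
    )

def train_continual(model, tasks, epochs=5, lr=0.01, momentum=0.8, decay=0.8):
    """Train sequentially on each task (an iterable of (x, y) batches);
    multiply the learning rate by `decay` after finishing each task."""
    loss_fn = nn.CrossEntropyLoss()
    for task in tasks:
        opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=momentum)
        for _ in range(epochs):
            for x, y in task:
                opt.zero_grad()
                loss_fn(model(x), y).backward()
                opt.step()
        lr *= decay                 # per-task learning-rate decay
    return model
```

Recreating the optimizer at each task boundary is one simple way to apply the decayed rate; a scheduler stepped once per task would work equally well.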
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Forgetting Measure:</head><p>The average forgetting after the model has been trained sequentially on all tasks. Forgetting is defined as the decrease in performance on each task between its peak accuracy and its accuracy after the continual learning experience has finished. For a continual learning dataset with T sequential tasks, it is defined as:</p><formula n="4">F = \frac{1}{T-1} \sum_{i=1}^{T-1} \left( \max_{t \in \{1, \dots, T-1\}} a_{t,i} - a_{T,i} \right)</formula><p>Finally, in our code repository, we provide scripts to reproduce the results with suggested hyper-parameters.</p></div>
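Both metrics can be computed from a matrix of validation accuracies, where entry (t, i) is the accuracy on task i after finishing training on task t. The helpers below are our own 0-indexed sketch of the two definitions:

```python
import numpy as np

def average_accuracy(acc, t):
    """A_t = (1/t) * sum_{i=1..t} a_{t,i}; acc[t-1, i] holds the accuracy
    on task i+1 after finishing training on task t (rows/cols 0-indexed)."""
    return acc[t - 1, :t].mean()

def forgetting(acc):
    """Mean over tasks of peak accuracy minus final accuracy; the last
    task is excluded, since it cannot have been forgotten yet."""
    T = acc.shape[0]
    drops = [acc[:T - 1, i].max() - acc[T - 1, i] for i in range(T - 1)]
    return float(np.mean(drops))
```

For example, with three tasks where task 1 peaks at 0.90 and ends at 0.70, and task 2 peaks at 0.85 and ends at 0.75, the forgetting measure is (0.20 + 0.10) / 2 = 0.15.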
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Results</head><p>In this section, we perform several experiments to show the impact of dropout on model stability.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.">Forgetting Curve in Stable Networks</head><p>In our first experiment, we show that it is feasible to increase the stability of a network by compromising a small amount of its plasticity in exchange for a considerable amount of stability. Figures <ref type="figure">2</ref> and <ref type="figure">3</ref> show the evolution of validation accuracy throughout continual learning over five tasks on Permuted MNIST and Rotated MNIST, respectively. For each dataset, we train networks in three different settings:</p><p>&#8226; (a) Training the network without dropout and without learning rate decay, obtaining a highly plastic network.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>&#8226; (b)</head><p>Training the network with a small dropout probability (p = 0.25) and with learning rate decay, obtaining a network more stable than the one in part (a).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>&#8226; (c)</head><p>Training with moderate dropout (p = 0.5) and applying learning rate decay, which yields a highly stable network.</p><p>We would like to clarify that the x-axis in both figures denotes time in the continual learning experience. Since the learning experience consists of five tasks, each trained for five epochs, the x-axis ranges from one to twenty-five. The reported numbers at each step are calculated by averaging the accuracy over five different runs. We can observe from Figures <ref type="figure">2</ref> and <ref type="figure">3</ref> that the plastic networks in (a) learn new tasks better and faster than the more stable ones, but they also forget old tasks very quickly. Networks with moderate plasticity in (b) learn more slowly than the highly plastic ones in (a), but they also forget at a slower rate. Finally, the highly stable networks in (c) have the slowest forgetting curve, thanks to the switching gates of dropout. However, the stability comes at a cost: compromised flexibility, which means learning new tasks at a slower rate. We emphasize the main goal of this experiment: it is possible to obtain stable networks by compromising the right amount of plasticity, and, unlike OGD <ref type="bibr">[7]</ref> and A-GEM <ref type="bibr">[6]</ref>, with no additional techniques such as replay memory or corrected gradient directions. We will see in Section 5.4 that improving stability plays a much more important role than the mentioned techniques. We will show that stable networks trained with SGD can outperform other continual learning methods when those methods do not exploit these simple yet effective stability techniques.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.">Dropout and Gating Mechanism</head><p>In this experiment, we show that training with dropout implicitly produces different gates in the network, such that for each task, only a certain subset of the network parameters is active. We counted the number of times a specific neuron was active (fired) for each task, and compared this behavior throughout the sequential learning process for all the tasks. Figure <ref type="figure">4</ref> shows the heatmap of the activation behavior of the first-layer neurons of two networks (with and without dropout) trained with SGD after finishing five Permuted MNIST tasks. We index the 100 neurons on the x-axis (from 0 to 99) and plot the heatmap of their activation on the y-axis, indexed by tasks. In other words, each cell represents the frequency of activation on the validation data of that task. We note that the number of times a neuron can fire for each MNIST task is between 0 and 10000 (the size of the validation set). For better presentation, we normalize this number by dividing each value by 10000 so that all values are between 0 and 1.</p><p>The first interesting observation from Figure <ref type="figure">4</ref> is that the activity pattern of the neurons of the network trained with dropout is sparser than in the case without dropout. Some neurons are very active, and some very inactive. This is in contrast to the behavior of the network without dropout, where almost all the neurons are very active for all tasks. Moreover, if we focus on the behavior of a single neuron of the network with dropout, we see that the neuron is active for some tasks but inactive for the others. Only a few of them are always active for all tasks. This behavior shows the gating mechanism of the network trained with dropout. The second interesting observation is the evolution of activation sparsity as the model learns more and more tasks. 
In other words, fewer and fewer neurons remain free to be activated for later tasks. This is due to the fact that the network's remaining capacity fills up as training continues.</p><p>It is notable that the gating mechanism is most useful when the pathways for a task remain consistent and almost invariant while training on subsequent tasks. When the network is learning task t, dropout helps to produce some gating for the forward propagation. However, if the gates for this task are not preserved throughout the sequential learning process and change while the network is learning task t + 1, then the network will forget task t. Figure <ref type="figure">5</ref> shows the activation patterns of task 1 for the first layer of a network trained with dropout, at two different times: (1) right after learning the first task (the beginning of continual learning), and (2) after learning the final task (the end of learning). As illustrated, the activation behavior and the gating are fairly consistent, and the pathways are preserved through time.</p><p>Finally, we note that although the illustrated examples are only for five tasks of Permuted MNIST, the same pattern of behavior exists for networks trained on the Rotated MNIST tasks and when the number of tasks increases.</p></div>
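The per-task activation frequencies behind these heatmaps can be gathered with a simple counting routine. The sketch below is our own helper (not the authors' code) and assumes the first layer is exposed as a Linear+ReLU module; it records, for each task, the fraction of validation examples on which each neuron fires:

```python
import numpy as np
import torch
import torch.nn as nn

@torch.no_grad()
def activation_frequency(first_layer, val_loaders):
    """For each task, count how often each first-layer neuron fires
    (output > 0) on that task's validation set, normalized to [0, 1]."""
    rows = []
    for loader in val_loaders:          # one validation loader per task
        counts, total = None, 0
        for x, _ in loader:
            fired = (first_layer(x) > 0).float().sum(dim=0)
            counts = fired if counts is None else counts + fired
            total += x.shape[0]
        rows.append((counts / total).numpy())
    return np.stack(rows)               # shape: (num_tasks, num_neurons)
```

Plotting the returned matrix as a heatmap (tasks on one axis, neurons on the other) gives the kind of sparsity pattern the section describes; comparing rows for the same task at different training times shows whether the pathways are preserved.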
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3.">Increasing Tasks</head><p>In this section, we show that the stability of dropout training remains effective as the number of tasks increases.</p><p>Figure <ref type="figure">6</ref> compares the evolution of average accuracy (Equation ( <ref type="formula">3</ref>)) for a stable versus a plastic network. The graph shows the average and three standard deviations over five different runs. The stable networks have a final average accuracy of 78.7 (&#177;0.2) with a forgetting statistic (Equation ( <ref type="formula">4</ref>)) of 0.13 (&#177;0.02), while these metrics for the plastic networks are 59.2 (&#177;2.7) and 0.39 (&#177;0.03), respectively.</p><p>In the appendix, we compare the stable dropout networks with various state-of-the-art continual learning methods on 20 tasks.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.4.">Comparison with Other Methods</head><p>In this experiment, we compare the stable SGD+Dropout network with several other continual learning methods. The goal is to assess the significance of "network stability" compared to methods that focus on other aspects of catastrophic forgetting to tackle this problem.</p><p>Tables <ref type="table">1</ref> and 2 compare several continual learning algorithms on the Permuted and Rotated MNIST datasets. SGD+Dropout outperforms all the continual learning methods on old tasks and achieves an acceptable accuracy on new tasks for both datasets. One interesting observation from both tables is that SGD+Dropout achieves near-optimal accuracy even on new tasks, although it is not the best. The reason is that the network is very stable, and because of the stability-plasticity trade-off, it has lost some of its flexibility.</p><p>Finally, we note that all the other methods except SGD, EWC, and SGD+Dropout use some 200 data points per task to calculate gradient information from previous tasks (e.g., OGD) or in the form of episodic memory (e.g., A-GEM).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Conclusion and Future Work</head><p>In this paper, we studied the relationship between dropout and continual learning. We showed that the key to understanding this relationship is studying network stability. Furthermore, our analysis and experiments demonstrated that the dropout method can be viewed as an implicit gating mechanism, which yields a network that is both stable and plastic. Our experiments showed that the consistent gating mechanism resulting from dropout can outperform various popular continual learning methods.</p><p>The effectiveness of the dropout method suggests that focusing directly on the stability of neural networks is an effective approach to tackling catastrophic forgetting. One interesting research direction is to modify the dropout method to gain more control over the gating mechanism, possibly by exploiting the structural similarity between sequential tasks and neural activation patterns, similar to ideas proposed in the transfer learning literature <ref type="bibr">[1]</ref>. Studying the effect of dropout on network behavior in different continual learning settings is also a promising direction. Our preliminary results show that dropout networks remain robust even when trained on an increased number of sequential tasks.</p></div></body>
		</text>
</TEI>
