<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>A Simple, Fast Algorithm for Continual Learning from High-Dimensional Data</title></titleStmt>
			<publicationStmt>
				<publisher>OpenReview.net</publisher>
				<date>05/30/2023</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10471249</idno>
					<idno type="doi"></idno>
					<title level='j'>Eleventh International Conference on Learning Representations</title>

					<author>Neil Ashtekar</author><author>Vasant G Honavar</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[As an alternative to resource-intensive deep learning approaches to the continual learning problem, we propose a simple, fast algorithm inspired by adaptive resonance theory (ART). To cope with the curse of dimensionality and avoid catastrophic forgetting, we apply incremental principal component analysis (IPCA) to the model's previously learned weights. Experiments show that this approach approximates the performance achieved using static PCA and is competitive with continual deep learning methods. Our implementation is available at https://github.com/neil-ash/ART-IPCA]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">INTRODUCTION</head><p>In the continual learning setting, we wish to efficiently learn a sequence of tasks without forgetting how to perform previously learned tasks. Many approaches have been proposed to address this problem, with recent work focused mainly on deep learning <ref type="bibr">(Kirkpatrick et al., 2017;</ref><ref type="bibr">Rebuffi et al., 2017;</ref><ref type="bibr">Yoon et al., 2018)</ref>. While these approaches boast impressive predictive power, they are limited by a host of drawbacks including high computational costs, large memory requirements, and poor performance on nontrivial continual learning benchmarks with long task sequences and task-identification information withheld <ref type="bibr">(Farquhar &amp; Gal, 2018;</ref><ref type="bibr">D&#237;az-Rodr&#237;guez et al., 2018)</ref>.</p><p>We propose an alternative, matching-based algorithm which combines ideas from ART with IPCA to improve predictive performance for high-dimensional data. ART is an attractive solution to continual learning because of its proven stability guarantees which imply that catastrophic forgetting will not occur with respect to the training data <ref type="bibr">(Grossberg, 2020)</ref>. PCA is a well-known linear dimensionality reduction technique which can be effectively used as a data preprocessing step for both supervised and unsupervised learning tasks. Many incremental versions of PCA have been proposed to efficiently update principal component estimates without recomputing eigenvectors from the full covariance matrix <ref type="bibr">(Weng et al., 2003;</ref><ref type="bibr">Zhao et al., 2006;</ref><ref type="bibr">Ross et al., 2008)</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">PROPOSED MODEL</head><p>We begin by formally defining our notation and problem statement. We are given a sequence of tasks T_1 . . . T_T, where each task T_i has a corresponding dataset D_i = (X_i, y_i). We focus on supervised classification tasks with features X_i &#8712; R^(n_i&#215;d) and categorical labels y_i &#8712; N^(n_i&#215;1), though our proposed model can easily be extended to unsupervised clustering. We reduce our data's dimension from d to k using the IPCA matrix P &#8712; R^(k&#215;d), whose rows consist of the top-k estimated eigenvectors of the data's covariance matrix. The weights in the original space and the IPCA-reduced space are denoted by w_org &#8712; R^(d&#215;1) and w_red &#8712; R^(k&#215;1) respectively, with w_red^i = P_i w_org denoting the weight in the reduced space under iteration i of IPCA.</p><p>Our proposed model does not require task identifiers, though this information may be optionally included to improve performance. It is applicable to class-incremental, task-incremental, and domain-incremental learning, and only requires that features have consistent dimensionality across all tasks. Algorithm 1 describes the training procedure, while the inference procedure and intuition behind the model are discussed in Appendix A.1. Time and space complexity are given in Appendix A.2.</p><p>One question remains: how can the model continue learning as the IPCA transformation changes across tasks? We answer this question by observing that the updated version of IPCA can be applied to transform the model's weights (learned in the original space), since the weights are simply a linear combination of training samples. In other words, learning in the current IPCA-reduced space results in the same weights as previously learning in the original space, then applying the current version of IPCA. This argument is formalized in Proposition 1, with proof and further explanation in Appendix A.3.</p><p>Proposition 1. Let x_1 . . . x_m be the training samples which have matched to weight w under some previous IPCA transformation(s) P_1 . . . P_(i-1)<ref type="foot">foot_0</ref>. If these samples x_1 . . . x_m still match to weight w under the current IPCA transformation P_i, then learning the weight w_red^i in the reduced space is equivalent to first learning the weight w_org in the original space, then applying IPCA P_i.</p></div>
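As a concrete illustration of the setup above, the following minimal sketch uses scikit-learn's IncrementalPCA (the implementation the experiments report using) to project a weight stored in the original space into the current reduced space and then re-project it after an IPCA update. The dimensions and random data are illustrative, not the paper's settings.

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

rng = np.random.default_rng(0)
d, k = 64, 8  # illustrative original/reduced dimensions (the paper uses d = 784)

ipca = IncrementalPCA(n_components=k)

# Task 1 arrives: update IPCA with all of the task's data.
X1 = rng.normal(size=(200, d))
ipca.partial_fit(X1)

# A weight is learned and stored in the ORIGINAL space
# (here simply an average of a few samples).
w_org = X1[:5].mean(axis=0)

# Matching happens in the reduced space: w_red^i = P_i @ w_org.
P = ipca.components_  # shape (k, d): top-k estimated eigenvectors
w_red = P @ w_org

# Task 2 arrives: IPCA is updated, and the SAME stored w_org is simply
# re-projected under the new transformation.
X2 = rng.normal(size=(200, d))
ipca.partial_fit(X2)
w_red_new = ipca.components_ @ w_org
```

Because the weight is kept in the original space, nothing is lost when the transformation changes; only the projection is recomputed.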
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">EMPIRICAL RESULTS</head><p>We evaluate our proposed algorithm on the Split MNIST and Split Fashion-MNIST benchmarks in the single-headed setting with task identifiers unavailable at inference. Our proposed IPCA model is compared against two variations: (1) with no PCA and (2) with static PCA learned on the entire dataset (serving as expected lower and upper bounds for IPCA performance). The results in the top half of Table <ref type="table">1</ref> are consistent with these expectations. To contextualize our model's performance, we include results from <ref type="bibr">Sokar et al. (2021)</ref> for three continual deep learning baselines in the bottom half of Table <ref type="table">1</ref>; details are given in Appendix A.5.3.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A.1 ALGORITHM DESCRIPTION AND INTUITION</head><p>A verbal description of Algorithm 1 is as follows. Each time a new task arrives, the IPCA matrix is updated using all of the task's data. Then, for each training sample in the task, the best-matching weight is computed using a similarity metric in the IPCA-reduced space of dimension k. If the best-matching weight is sufficiently similar to the sample and corresponds to the correct class, the weight is updated in the original space of dimension d. Otherwise, a new weight is initialized to the current sample in the original space, with its corresponding class label recorded. At inference, we simply compute similarity scores between the test sample and all learned weights in the reduced space, then predict the class corresponding to the weight with the largest similarity score.</p><p>The main idea behind our proposed algorithm is to learn weights in the original space and perform matching in the IPCA-reduced space. Storing the weights in the original space allows the model to adapt to changes in the input representation caused by IPCA updates. This adaptation is done by simply applying the current version of IPCA to the previously learned weights.</p></div>
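The training and inference procedures described verbally above can be sketched in a few lines. This is a simplified reading of Algorithm 1, not the authors' reference implementation (that is at the repository linked in the abstract): the names train_task, predict, and rho are hypothetical, and cosine similarity with threshold rho stands in for the similarity metric and vigilance test.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def train_task(X, y, P, weights, labels, counts, rho=0.5):
    """One pass over a task's samples. Weights live in the ORIGINAL space;
    matching is done in the reduced space via the current IPCA matrix P."""
    for x, label in zip(X, y):
        if weights:
            scores = [cosine(P @ x, P @ w) for w in weights]
            j = int(np.argmax(scores))
        if weights and scores[j] >= rho and labels[j] == label:
            # Sufficiently similar winner of the correct class:
            # update it in the original space with learning rate 1/n_s.
            counts[j] += 1
            weights[j] += (x - weights[j]) / counts[j]
        else:
            # Otherwise initialize a new weight to the current sample
            # and record its class label.
            weights.append(x.astype(float).copy())
            labels.append(label)
            counts.append(1)
    return weights, labels, counts

def predict(x, P, weights, labels):
    """Predict the class of the weight with the largest similarity score."""
    scores = [cosine(P @ x, P @ w) for w in weights]
    return labels[int(np.argmax(scores))]
```

With task identifiers withheld, the same weight pool and label list are shared across all tasks; the competition among weights is what protects earlier knowledge.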
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A.2 TIME AND SPACE COMPLEXITY</head><p>The space complexity of our model during training is O(kd + Wd), where W is the total number of learned weights: during training, both the current IPCA matrix and the weights in the original space must be stored. The time complexity for training is O(Nkd + TWkd + NWk) given N total training samples and T tasks. This consists of the time required for learning/applying IPCA updates to the data, applying IPCA updates to the weights, and performing matching.</p><p>The space complexity of our trained model is O(kd + Wk), and the time complexity for inference on a single sample is also O(kd + Wk): the IPCA matrix and learned weights must be stored in order to transform new data and compute matches at inference. These bounds assume that the learned weights are stored in the reduced space of dimension k after all training is complete; this small implementation detail is included in the training time complexity given above. Note that the number of weights W can be controlled by setting the vigilance parameter &#961;, with smaller values of &#961; resulting in fewer weights and coarser granularity.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A.3 APPROXIMATION OF STATIC PCA</head><p>Proposition 1 attempts to establish an equivalence between learning incrementally in the original space and learning statically in the reduced space. If Proposition 1 holds for all weights w &#8712; W and for all iterations i = 1 . . . T, then the weights learned in the IPCA-reduced space will be equivalent to those learned in the static PCA-reduced space, modulo the approximation error incurred by IPCA.</p><p>Here, "static PCA" refers to first learning and applying PCA to all of the training data, then learning the weights solely in the PCA-reduced space.</p><p>Proof of Proposition 1. This equivalence can be easily shown using linearity, with coefficients &#945;_j determined by the learning rate:</p><p>w_red^i = &#931;_(j=1)^m &#945;_j (P_i x_j)<lb/>= P_i (&#931;_(j=1)^m &#945;_j x_j)<lb/>= P_i w_org (1)</p><p>The top line in (1) represents learning in the current reduced space, while the bottom two lines represent previously learning in the original space, then applying the current version of IPCA.</p><p>In practice, it is unlikely that Proposition 1 will hold exactly, due to its strong assumption that matches will be the same across iterations of IPCA. However, we do expect matches to be similar (though not identical) across iterations of IPCA. Our experimental results in Table <ref type="table">1</ref> demonstrate that this approximate equivalence holds in practice.</p></div>
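The linearity argument can be checked numerically. In this sketch, P is a random matrix standing in for the current IPCA transformation P_i, and the samples and coefficients are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, m = 30, 5, 4
P = rng.normal(size=(k, d))   # stand-in for the current IPCA matrix P_i
xs = rng.normal(size=(m, d))  # samples x_1 ... x_m matched to weight w
alphas = rng.random(m)        # coefficients determined by the learning rate

# Top line of (1): learn directly in the current reduced space.
w_red = sum(a * (P @ x) for a, x in zip(alphas, xs))

# Bottom two lines of (1): learn in the original space, then apply P_i.
w_org = sum(a * x for a, x in zip(alphas, xs))

# The two are identical up to floating-point error.
assert np.allclose(w_red, P @ w_org)
```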
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A.4 RELATION TO ART</head><p>Our proposed algorithm is inspired by ART, and essentially performs incremental winner-take-all clustering. However, there are several differences between our algorithm and ART-based algorithms.</p><p>Our algorithm differs from ART in that it (1) does not include distinct bottom-up and top-down matching phases, (2) uses linear rather than fuzzy intersection weight updates, and (3) avoids match tracking. Specifically, our algorithm uses the alternative to match tracking described in <ref type="bibr">Anagnostopoulos &amp; Georgiopoulos (2003)</ref>, which satisfies Fuzzy ARTMAP's incremental learning principle.</p><p>Because of these differences, some of ART's properties do not apply to our proposed algorithm. For example, the use of fuzzy intersection weight updates ensures that categories only shrink, implying that weights do not repeat previously held values <ref type="bibr">(Carpenter et al., 1992)</ref>. (We do not use fuzzy intersection because it is nonlinear and thus incompatible with Proposition 1.) Still, our algorithm retains arguably the most important property of ART: the use of competition to protect previously learned knowledge. For further details on the ART framework and its applications in machine learning, we direct readers to the review article by <ref type="bibr">Grossberg (2020)</ref> and the survey by <ref type="bibr">da Silva et al. (2019)</ref>.</p></div><div xmlns="http://www.tei-c.org/ns/1.0"><head>A.5 DETAILS AND DISCUSSION OF EMPIRICAL RESULTS</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A.5.1 PROBLEM SETTING</head><p>We evaluate our model on the commonly used Split-MNIST and Split-Fashion-MNIST datasets in Tables <ref type="table">1</ref> and <ref type="table">2</ref>. Each dataset has 10 classes, divided into 5 binary classification problems: classify 0 vs 1, 2 vs 3, 4 vs 5, 6 vs 7, and 8 vs 9. Results are reported as the average performance across each of the 5 tasks after learning all tasks sequentially (i.e. tasks 1 through 5 are first learned in order, then task 1 is evaluated, then task 2 is evaluated, etc.).</p><p>Our evaluation was conducted in the single-headed setting with task identifiers unavailable to the model. In our setting, this is equivalent to class-incremental learning (CIL). We chose this setting because it satisfies rigorous definitions of continual learning and is significantly more challenging than the multi-headed setting <ref type="bibr">(Farquhar &amp; Gal, 2018)</ref>, in which task identifiers are provided at inference. Note that the multi-headed setting can be trivially solved by training isolated models for each task. For completeness, we include our proposed model's performance in the multi-headed setting, equivalent to task-incremental learning (TIL), in Table <ref type="table">2</ref>. Many continual deep learning approaches achieve comparably strong performance in this easier setting <ref type="bibr">(Hsu et al., 2018)</ref>.</p></div><div xmlns="http://www.tei-c.org/ns/1.0"><head>A.5.2 COMPARISON TO PCA VARIATIONS</head><p>In Tables <ref type="table">1</ref> and <ref type="table">2</ref>, we compare our proposed IPCA model against variations with no form of PCA and with static PCA trained on the entire dataset. Assuming that PCA improves matching performance, we expect the "No PCA" and "Static PCA" models to serve as approximate lower and upper bounds for "IPCA" model performance. In the single-headed setting, we find that our proposed model with IPCA significantly outperforms the model learned without PCA and does not significantly differ from the model learned with static PCA (using paired t-tests with significance level &#945; = 0.05). In the easier multi-headed setting, we find that all models offer strong performance.</p></div>
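The benchmark construction described in A.5.1 amounts to partitioning a 10-class dataset by consecutive label pairs. A minimal sketch, with a hypothetical helper split_tasks:

```python
import numpy as np

def split_tasks(X, y, n_tasks=5):
    """Split a 10-class dataset into binary tasks: {0,1}, {2,3}, ..., {8,9}."""
    tasks = []
    for t in range(n_tasks):
        classes = (2 * t, 2 * t + 1)        # the two classes of task t
        mask = np.isin(y, classes)          # select samples of those classes
        tasks.append((X[mask], y[mask]))
    return tasks
```

The resulting list of (X_i, y_i) pairs is presented to the learner in order; no task identifier accompanies a sample at inference in the single-headed setting.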
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A.5.3 DEEP LEARNING BASELINES</head><p>To understand our proposed model's performance in the context of recent work on continual learning, we include three deep learning baselines in the bottom half of Table <ref type="table">1</ref>, one from each of the three main categories of continual learning approaches. Elastic weight consolidation (EWC) <ref type="bibr">(Kirkpatrick et al., 2017)</ref> is a regularization-based approach, incremental classifier and representation learning (iCaRL) <ref type="bibr">(Rebuffi et al., 2017)</ref> is a rehearsal-based approach, and dynamically expandable networks (DEN) <ref type="bibr">(Yoon et al., 2018)</ref> is an architecture-based approach. These results are taken from <ref type="bibr">Sokar et al. (2021)</ref>, which includes results from <ref type="bibr">Hsu et al. (2018)</ref> and van de Ven &amp; Tolias (2019). While we do not expect our model to outperform all deep learning methods in terms of predictive accuracy, our results show that we achieve comparable performance with a much simpler approach.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A.5.4 EXPERIMENTAL DETAILS</head><p>The following describes how we evaluated our proposed algorithm in Tables <ref type="table">1</ref> and <ref type="table">2</ref>. We use 2000 train samples and 500 test samples per class. Our results are computed over 25 trials. In each trial, the training data is randomly shuffled, resulting in different presentation orderings across trials; this is the only source of randomness in our algorithm. We use PCA/IPCA to reduce to dimension k = 200 for the MNIST dataset and k = 250 for the Fashion-MNIST dataset, which corresponds to preserving slightly more than 95% of the variance in the training data. Both datasets have original dimension d = 784. We use vigilance parameter &#961; = 0.5, cosine similarity function S(&#8226;, &#8226;), and adaptive learning rate 1/n_s, where n_s denotes the total number of samples which have matched with the winning weight (i.e. the weights are simply an average of their corresponding samples). These values were determined through basic hyperparameter tuning on a validation set with 500 samples per class. We use the scikit-learn <ref type="bibr">(Pedregosa et al., 2011)</ref> implementation of IPCA based on <ref type="bibr">Ross et al. (2008)</ref>.</p></div>
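The adaptive learning rate 1/n_s mentioned above makes each weight an exact running average of the samples that matched it, which a quick check confirms (the sample values here are arbitrary):

```python
import numpy as np

samples = np.array([[2.0, 0.0], [4.0, 2.0], [0.0, 4.0]])

# Initialize the weight to the first matched sample, then apply the
# update w <- w + (x - w) / n_s for each subsequent match.
w = samples[0].copy()
n_s = 1
for x in samples[1:]:
    n_s += 1
    w += (x - w) / n_s

# After all updates, w equals the plain average of its matched samples.
```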
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A.6 FUTURE WORK</head><p>There are several potential extensions of our proposed algorithm. First, note that the main idea behind Algorithm 1, formalized in Proposition 1, could conceivably be applied to other types of incremental learning algorithms beyond IPCA and ART. For example, an incremental version of linear discriminant analysis (LDA) <ref type="bibr">(Pang et al., 2005)</ref> could be used as an alternative to IPCA.</p><p>Second, it would be interesting to consider hierarchical extensions of IPCA. This is a way to perform local dimensionality reduction <ref type="bibr">(Chakrabarti &amp; Mehrotra, 2000)</ref> to increase the amount of retained variance at lower dimensions. Specifically, a global version of IPCA could be applied when clustering data with ART in an unsupervised manner, then local versions of IPCA could be applied to the data within each cluster. This process can be repeated to achieve hierarchical representation learning or hierarchical clustering. Proposition 1 allows this to be done in the continual learning setting as data arrives sequentially through time.</p><p>Finally, it would be interesting to further explore methods like ART for continual learning. Recent work on continual learning has focused on retrofitting deep learning methods in order to avoid catastrophic forgetting. ART avoids catastrophic forgetting due to the inherent properties of its architecture and algorithms. A deeper understanding of these properties could lead to innovative variants and extensions of ART for continual learning.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0"><p>The phrase "sample x matches to weight w" means that w is the weight with the maximum similarity score with respect to x in line 7 of Algorithm 1. Weight w is also referred to as the "winner" of ART's competition.</p></note>
		</body>
		</text>
</TEI>
