<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Continual learning: a feature extraction formalization, an efficient algorithm, and barriers</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>2022</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10450560</idno>
					<idno type="doi"></idno>
					<title level='j'>Advances in neural information processing systems</title>
<idno type="issn">1049-5258</idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Andrej Risteski</author>
					<author>Binghui Peng</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Continual learning is an emerging paradigm in machine learning, wherein a model is exposed in an online fashion to data from multiple different distributions (i.e. environments), and is expected to adapt to the distribution change. Precisely, the goal is to perform well in the new environment, while simultaneously retaining the performance on the previous environments (i.e. avoid “catastrophic forgetting”). While this setup has enjoyed a lot of attention in the applied community, there hasn’t been theoretical work that even formalizes the desired guarantees. In this paper, we propose a framework for continual learning through the framework of feature extraction—namely, one in which features, as well as a classifier, are being trained with each environment. When the features are linear, we design an efficient gradient-based algorithm DPGrad, that is guaranteed to perform well on the current environment, as well as avoid catastrophic forgetting. In the general case, when the features are non-linear, we show such an algorithm cannot exist, whether efficient or not.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>In the last few years, there has been an increasingly large focus in the modern machine learning community on settings which go beyond iid data. This has resulted in the proliferation of new concepts and settings such as out-of-distribution generalization <ref type="bibr">[16]</ref>, domain generalization <ref type="bibr">[3]</ref>, multi-task learning <ref type="bibr">[41]</ref>, and continual learning <ref type="bibr">[25]</ref>. Continual learning, which is the focus of this paper, concerns learning through a sequence of environments, with the hope of retaining old knowledge while adapting to new environments.</p><p>Unfortunately, despite a lot of interest in the applied community - as evidenced by a multitude of NeurIPS and ICML workshops <ref type="bibr">[26,</ref><ref type="bibr">12,</ref><ref type="bibr">30]</ref> - approaches with formal theoretical guarantees are few and far between. The main reason is that, much like its cousin fields of out-of-distribution generalization and multi-task learning, the setting comes with some "intuitive" desiderata - but no formal definitions. What is worse, it is often clear that, without strong assumptions on the data, the problem is woefully ill-defined.</p><p>The intuitive desiderata the continual learning community has settled on are as follows: an algorithm is exposed (in an online fashion) to data coming sequentially from different distributions (typically called "environments", a term inspired by a robot/agent interacting with different environments). Moreover, the goal is to keep the size of the model being trained fixed, and to make sure the model performs well on the current environment while simultaneously maintaining good performance on the previously seen environments. 
In continual learning parlance, this is termed "resistance to catastrophic forgetting".</p><p>It is clear that some of the above desiderata are shared with well-studied learning theory settings (e.g. online learning, lifelong learning), while some aspects differ. For example, in online learning, we don't care about catastrophic forgetting (or we only do so in some averaged sense); in lifelong learning, it's not necessary to keep the size of the model fixed. It is also clear that absent some assumptions on the data and the model being trained, these desiderata cannot possibly be satisfied: why would there even exist a model of some fixed size that performs well on both past environments and current ones - let alone one that gets updated in an online fashion?</p><p>A feature-extraction formalization of continual learning Our paper formalizes a setting for continual learning through the lens of feature extraction: the model maintains a fixed number of (trainable) features, as well as a linear classifier on top of said features. The features are updated for every new environment, with the objective that the features are such that a good linear classifier exists for the new environment, while the previously trained linear classifiers (on the updated features) are still good for the past environments. The reason the linear classifiers from previous rounds are not allowed to be updated is storage efficiency: in order to tune the prompts, one would need to store the training data from previous tasks, which brings storage overhead and potential privacy concerns. Such settings are very common in practice - examples include systems involving large amounts of data of a streaming nature (e.g. Google searches, YouTube, a robotic agent interacting with a continual stream of environments) - and it would be prohibitive to store the data for later fine-tuning. 
The number of features is kept fixed for the same reason: if we were to add new features for every new environment, the model size (and hence the storage requirements) would grow.</p><p>We prove two main results for our setting.</p><p>1. When the features are a linear function of the input data, and a good set of features exists, we design an efficient algorithm, named doubly projected gradient descent, or DPGrad, that achieves good accuracy on all environments and resists catastrophic forgetting. Our algorithm, while novel, bears some resemblance to a class of projection-based algorithms used in practice <ref type="bibr">[11,</ref><ref type="bibr">5]</ref> - we hope the theoretical analysis can shed insight onto large-scale continual learning.</p></div><div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Our results</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Problem formulation</head><p>In a continual learning problem, the learner has sequential access to k environments. In the i-th (i ∈ [k]) environment, the data is drawn i.i.d. from an underlying distribution D_i over R^d × R, denoted (x, y) ∼ D_i, where x ∈ R^d is the input and y ∈ R is the label. Motivated by the empirical success of representation learning <ref type="bibr">[2,</ref><ref type="bibr">9]</ref>, we formulate the continual learning problem through the feature extraction view: the learner is required to learn a common feature mapping (also known as a representation function) R : R^d → R^r that maps the input data x ∈ R^d to a low-dimensional representation R(x) ∈ R^r (r ≪ d), along with a sequence of task-dependent linear classifiers (also known as linear prompts) v_1, . . . , v_k ∈ R^r on top of the representation. Precisely, the prediction for the i-th environment is made by f_i(x) = ⟨v_i, R(x)⟩.</p><p>As this is a first-cut study, we focus on the realizable and proper learning setting. <ref type="foot">1</ref> That is, we assume the existence of a feature mapping R in the function class H (which is known in advance) and a sequence of linear predictors v_1, . . . , v_k such that for any i ∈ [k] and any data (x, y) ∼ D_i, y = ⟨v_i, R(x)⟩ (realizable). The learner is required to output a function R that belongs to the hypothesis class H (i.e. the learner is proper).</p><p>Remark 2.1 (Known environment identity). Our model requires knowledge of the environment identity at test time, and thus falls into the category of incremental task learning. We note there is also empirical research on unknown environment identity, which would be an interesting direction for future work (see <ref type="bibr">Section 7</ref>).</p><p>The guarantee that we wish our learning algorithm to obtain is as follows: Definition 2.2 (Goal of Continual Learning). 
Let d, k, r ∈ N with r ≪ d, k, and let ε ∈ (0, 1/2). Let H be a function class of feature mappings from R^d to R^r. The continual learning problem involves k environments D_1, . . . , D_k. We assume there exists a function R⋆ ∈ H and a sequence of linear classifiers v⋆_1, . . . , v⋆_k ∈ R^r such that for any (x, y) ∼ D_i (i ∈ [k]), the label satisfies y = ⟨v⋆_i, R⋆(x)⟩. The continual learner has sequential access to the environments D_1, . . . , D_k, as well as access to arbitrarily many samples per environment<ref type="foot">foot_3</ref>. The goal is to learn a representation function R ∈ H and a sequence of linear prompts v_1, . . . , v_k ∈ R^r that achieve good accuracy on the current task and do not suffer from catastrophic forgetting. Formally, in the i-th environment (i ∈ [k]), the learner optimizes the feature mapping R and the linear classifier v_i (without changing v_1, . . . , v_{i-1}) and aims to satisfy:</p><p>• Avoid catastrophic forgetting: during the execution of the i-th task, the algorithm guarantees that E_{(x,y)∼D_j} (⟨v_j, R(x)⟩ - y)^2 ≤ ε for all j = 1, . . . , i - 1.</p><p>• Good accuracy on the current task: at the end of the i-th task, the algorithm guarantees that E_{(x,y)∼D_i} (⟨v_i, R(x)⟩ - y)^2 ≤ ε.</p><p>For a linear feature mapping, the representation function can be written in the linear form R(x) = U^⊤ x for some U ∈ R^{d×r}, which implies that the i-th environment is generated by a linear model. That is, defining w_i := U⋆ v⋆_i ∈ R^d, the label satisfies y = ⟨w_i, x⟩ for (x, y) ∼ D_i.</p><p>Remark 2.3 (The benefit of continual learning with linear features). Note that for linear features it is in principle possible to just learn a sequence of linear classifiers w_1, . . . , w_k ∈ R^d separately, without learning a low-dimensional featurizer. Choosing an r-dimensional featurizer confers memory efficiency (O(kr + dr) vs. O(dk)) and sample efficiency (O(r) vs. O(d) samples per task in the asymptotic regime k → ∞). 
Furthermore, the linear case is a sandbox that can be mathematically analyzed and can generate insights for the nonlinear case as well.</p></div>
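To make the linear instantiation of this setup concrete, the following sketch generates a realizable sequence of environments: a shared ground-truth featurizer U⋆ ∈ R^{d×r} and per-task classifiers v⋆_i, with labels y = ⟨v⋆_i, U⋆^⊤ x⟩. This is an illustrative sketch only (NumPy is an assumed dependency; the dimensions and the Gaussian input distribution are hypothetical choices consistent with Assumption 2.4, not prescribed by the paper):

```python
import numpy as np

def make_linear_environments(d=100, r=20, k=5, n=1000, seed=0):
    """Sample a realizable continual learning instance with linear features.

    Returns the ground truth (U_star, v_stars) and a list of k datasets,
    where environment i has labels y = <v_star_i, U_star^T x>.
    """
    rng = np.random.default_rng(seed)
    U_star = rng.standard_normal((d, r))                  # shared featurizer
    v_stars = [rng.standard_normal(r) for _ in range(k)]  # task-specific prompts
    envs = []
    for v in v_stars:
        X = rng.standard_normal((n, d))                   # isotropic inputs
        y = X @ (U_star @ v)                              # equivalently <v, U_star^T x>
        envs.append((X, y))
    return U_star, v_stars, envs
```

Note that w_i = U⋆ v⋆_i, so all k regressors lie in a common r-dimensional subspace - this is exactly the structure the featurizer is meant to exploit.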
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">DPGrad: Efficient gradient based method for linear features</head><p>For the case of linear features, we propose an efficient gradient-based algorithm, which we term DPGrad (pseudocode in Algorithm 1), that provably learns the representation while avoiding catastrophic forgetting. Towards stating the result, we make a few technical assumptions. Assumption 2.4 (Distribution assumption). For any i ∈ [k], we assume D_i has zero mean and is in isotropic position, that is, E_{x∼D_i}[x] = 0 and E_{x∼D_i}[x x^⊤] = I_d.</p><p>This assumption is largely for convenience. In fact, one can replace the isotropic condition with a general bounded-covariance assumption; our algorithm still works with an extra preprocessing step, and the sample complexity scales with the condition number of the covariance matrix. Assumption 2.6 (Range assumption). For any i ∈ [k], w_i has bounded norm, i.e., ‖w_i‖_2 ≤ D. Assumption 2.7 (Signal assumption). For any i ∈ [k], let W_i = span(w_1, . . . , w_i), let W_{i,⊥} be the space perpendicular to W_i, and let P_{W_i}, P_{W_{i,⊥}} be the corresponding projection operators. We assume either w_i belongs to W_{i-1}, or ‖P_{W_{i-1},⊥} w_i‖_2 ∈ [1/D, D]. Assumption 2.8 (Bit complexity assumption). For any i ∈ [k], each entry of w_i is an integer multiple of δ.</p><p>Remark 2.9. The Range and Signal assumptions are standard in the statistical learning literature. The former ensures an upper bound on ‖w_i‖_2 and the latter ensures that a new task provides enough "signal" for new features. They are used to set the learning rate and the number of gradient iterations.</p><p>Remark 2.10. The Bit complexity assumption states that w_i can be described with a finite number of bits, and is mostly for convenience - namely so that we can argue we exactly recover w_i - which makes calculations involving projections of features learned in the past cleaner. Since the number of gradient iterations depends only logarithmically on δ, one can relax the bit complexity restriction to only approximately recovering the ground truth features, up to a polynomially small (in d, k, D, ε) precision. 
Our argument can still go through at the cost of some additional error analysis.</p><p>The main result is then as follows: Theorem 2.11 (Continual learning with linear features). Let k, d, r ∈ N with r ≪ k, d, and let ε ∈ (0, 1/2). When the features are a linear function of the input data, under Assumption 2.4 and Assumptions 2.6-2.8, with high probability, DPGrad provably achieves good loss and avoids catastrophic forgetting. In particular, during the execution of the i-th environment, DPGrad always ensures E_{(x,y)∼D_j} (⟨v_j, U^⊤ x⟩ - y)^2 ≤ ε for all j = 1, . . . , i - 1, and at the end of the i-th environment, DPGrad ensures E_{(x,y)∼D_i} (⟨v_i, U^⊤ x⟩ - y)^2 ≤ ε.</p></div>
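As an aside, the preprocessing step alluded to under Assumption 2.4 is standard whitening. A minimal sketch, under the assumption that NumPy is available and with `eps` as an illustrative regularizer (the paper does not prescribe this particular implementation):

```python
import numpy as np

def whiten(X, eps=1e-8):
    """Map samples X (n x d) to approximately zero-mean, isotropic position.

    Estimates the mean and covariance, then applies Sigma^{-1/2} (x - mu).
    """
    mu = X.mean(axis=0)
    Xc = X - mu
    Sigma = Xc.T @ Xc / len(X)
    evals, evecs = np.linalg.eigh(Sigma)
    W = evecs @ np.diag(1.0 / np.sqrt(evals + eps)) @ evecs.T
    return Xc @ W
```

Whitening keeps each environment linear: under z = Σ^{-1/2}(x - μ), the regressor w_i simply becomes Σ^{1/2} w_i (plus a constant offset from the mean), which is why the sample complexity picks up a dependence on the condition number of Σ.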
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3">Fundamental obstructions for non-linear features</head><p>The continual learning setup with non-linear features turns out to be much more difficult - even without computational constraints. Our result rules out the existence of a proper continual learner, even when all environment distributions are uniform and the representation function is realizable by a two-layer convolutional neural network. Theorem 2.12 (Barrier for continual learning with non-linear features). Let k, r ≥ 2 and d ≥ 3. There exists a class of non-linear feature mappings and a sequence of environments such that no (proper) continual learning algorithm can guarantee to achieve less than 1/1000-error over all environments with probability at least 1/2, under the feature extraction formalization of Definition 2.2.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Related work</head><p>Continual learning in practice The study of continual learning (or lifelong learning) dates back to the work of <ref type="bibr">[36]</ref>, and it has received a surge of research interest in recent years <ref type="bibr">[15,</ref><ref type="bibr">19,</ref><ref type="bibr">11,</ref><ref type="bibr">5,</ref><ref type="bibr">14,</ref><ref type="bibr">31,</ref><ref type="bibr">34,</ref><ref type="bibr">17,</ref><ref type="bibr">29,</ref><ref type="bibr">39,</ref><ref type="bibr">18]</ref>. A central challenge in the field is avoiding catastrophic forgetting <ref type="bibr">[24,</ref><ref type="bibr">23]</ref>, which the work of <ref type="bibr">[15]</ref> observed happens during gradient-based training of neural networks. While there is a large amount of empirical work, we briefly summarize the dominant approaches. The regularization-based approach alleviates catastrophic forgetting by posing constraints on the updates of the neural weights. The elastic weight consolidation (EWC) approach <ref type="bibr">[19]</ref> adds a weighted ℓ2 regularization term to the objective function that penalizes the movement of neural weights. The orthogonal gradient descent (OGD) algorithm from <ref type="bibr">[11,</ref><ref type="bibr">5]</ref> enforces the gradient update to be orthogonal to the gradient directions of previous tasks (by viewing the gradients as high-dimensional vectors). The memory-replay approach stores data from previous tasks and alleviates catastrophic forgetting by rehearsing it in later tasks. <ref type="bibr">[31]</ref> introduces experience replay to continual learning. <ref type="bibr">[14]</ref> trains a deep generative model (e.g., a GAN) to simulate past datasets for future use. The dynamic-architecture approach dynamically adjusts the neural network architecture to incorporate new knowledge and avoid forgetting. 
The progressive neural network <ref type="bibr">[34]</ref> blocks changes to the existing network and expands the architecture by allocating a new subnet to be trained with the new information. We refer the interested reader to more complete surveys <ref type="bibr">[25,</ref><ref type="bibr">7]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Continual learning in theory</head><p>In comparison to the vast empirical literature, theoretical works are few. The work of <ref type="bibr">[6]</ref> characterizes the memory requirement of continual learning when the environment identity is unknown. The works <ref type="bibr">[35,</ref><ref type="bibr">27,</ref><ref type="bibr">1,</ref><ref type="bibr">4]</ref> provide sample complexity guarantees for lifelong learning. Their approaches can be roughly categorized into the duplicate-and-fine-tuning paradigm: the algorithm maintains a weighted combination over a family of representation functions, and the focus is on sample complexity guarantees. By contrast, we focus on the feature extraction paradigm and learn linear prompts on top of a single representation function. Both the duplicate-and-fine-tuning and the feature extraction paradigms have been extensively investigated in the literature; detailed discussions can be found in <ref type="bibr">[7]</ref>, and we provide a brief comparison. From an algorithmic perspective, learning a weighted combination over a family of representation functions (i.e. duplicate-and-fine-tuning) is much easier, as one can always initiate a new representation function for a new task. This algorithmic convenience allows the previous literature to focus more on generalization and sample complexity guarantees, culminating in the recent work of <ref type="bibr">[4]</ref>. We note again that learning a single representation function with task-specific linear prompts is much more challenging, but has practical benefits, e.g. memory efficiency. For example, in NLP applications, the base representation function (e.g. BERT <ref type="bibr">[9]</ref>) is already overparameterized and contains billions of parameters. It would then be formidable to maintain a large number of such base models and learn a linear combination over them. 
We mention several more works that are morally related in Appendix A.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Continual learning with linear feature</head><p>We restate our main result for linear feature mappings. When the features are a linear function of the input data, under Assumption 2.4 and Assumptions 2.6-2.8, with high probability, DPGrad provably achieves good loss and avoids catastrophic forgetting. In particular, during the execution of the i-th environment, DPGrad always ensures E_{(x,y)∼D_j} (⟨v_j, U^⊤ x⟩ - y)^2 ≤ ε for all j = 1, . . . , i - 1, and at the end of the i-th environment, DPGrad ensures E_{(x,y)∼D_i} (⟨v_i, U^⊤ x⟩ - y)^2 ≤ ε.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Algorithm</head><p>A complete and formal description of DPGrad is presented in Algorithm 1. DPGrad simultaneously updates the matrix of features U as well as the linear classifier v_i using gradient descent - with the restriction that the update of U only occurs along directions that are orthogonal to the column and row span of the previous feature matrix. Intuitively, the projection guarantees that existing features that have been learned are not erased or interfered with by new environments. Due to the quadratic nature of the loss, and the appearance of "cross terms", this turns out to require both column and row orthogonality, and interestingly deviates from the practically common OGD method <ref type="bibr">[11,</ref><ref type="bibr">5]</ref>.</p><p>In more detail, at the beginning of the i-th (i ∈ [k]) environment, DPGrad adds Gaussian noise to the feature matrix U and the linear classifier v_i, to generate a good initialization for U and v_i. Subsequently, we perform gradient descent on both the feature mapping matrix U and the linear classifier v_i - except that U is only updated along directions orthogonal to the maintained column span and row span. At the end of each environment, DPGrad has a post-processing step that recovers the ground truth w_i by rounding each entry of U v_i to the nearest multiple of δ, <ref type="foot">3</ref> and then updates the column and row spans if the orthogonal component is non-negligible. The reason for the latter step is that we only need to preserve the row space when encountering new features.</p><p>Parameters. We use α to denote the initialization scale, η to denote the learning rate, and T to denote the number of iterations for each task; α and η are polynomially small in the parameters D, d, k and the accuracy ε. We use N(0, 1)^{n_1×n_2} to denote a size n_1 × n_2 matrix whose entries are drawn i.i.d. from the Gaussian N(0, 1). 
For each i ∈ [k] and t ∈ [0 : T], let U_{i,t} denote the feature matrix at the t-th iteration of the i-th environment (after performing the gradient update), and define v_{i,t} similarly. DPGrad includes a projection step at the end of the i-th environment; we use U_{i,end} to denote the feature matrix after this projection. We use W_i (resp. V_i) to denote the column (resp. row) space maintained at the end of the i-th environment. Let W_⊥ denote the subspace orthogonal to W and define V_⊥ similarly. Let P_W, P_V, P_{W_⊥}, P_{V_⊥} be the projections onto W, V, W_⊥, V_⊥ respectively.</p><p>Algorithm 1: Doubly projected gradient descent (DPGrad)</p></div>
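To make the doubly projected update rule concrete, here is a minimal NumPy sketch of DPGrad on the continual matrix factorization view (receive targets w_i one at a time, fit U v_i ≈ w_i). The hyperparameters `eta`, `T`, `alpha` are illustrative choices rather than the polynomially small settings from the analysis, and the rounding post-processing of the pseudocode is omitted:

```python
import numpy as np

def project_orth(basis):
    """Projector onto the orthogonal complement of span(columns of basis)."""
    n = basis.shape[0]
    if basis.shape[1] == 0:
        return np.eye(n)
    Q, _ = np.linalg.qr(basis)
    return np.eye(n) - Q @ Q.T

def dpgrad(ws, r, eta=0.05, T=5000, alpha=1e-3, seed=0):
    """Sketch of DPGrad: sequentially fit U v_i ~ w_i, updating U only along
    directions orthogonal to the maintained column span W and row span V."""
    rng = np.random.default_rng(seed)
    d = ws[0].shape[0]
    U = np.zeros((d, r))
    W = np.zeros((d, 0))   # column span of committed features
    V = np.zeros((r, 0))   # row span of committed features
    vs = []
    for w in ws:
        P_Wp, P_Vp = project_orth(W), project_orth(V)
        # Gaussian initialization, injected only in orthogonal directions
        U = U + alpha * P_Wp @ rng.standard_normal((d, r)) @ P_Vp
        v = alpha * rng.standard_normal(r)
        for _ in range(T):
            res = U @ v - w
            U = U - eta * P_Wp @ np.outer(res, v) @ P_Vp  # doubly projected step
            v = v - eta * U.T @ res                        # unprojected step for v
        vs.append(v)
        # commit the new feature direction if its orthogonal part is non-negligible
        w_B = P_Wp @ w
        if np.linalg.norm(w_B) > 1e-6:
            W = np.column_stack([W, w_B / np.linalg.norm(w_B)])
            x = P_Vp @ v
            V = np.column_stack([V, x / np.linalg.norm(x)])
    return U, vs
```

Because updates to U are annihilated by P_{V_⊥} on the committed row directions, U v_j is preserved exactly for every earlier prompt v_j - this is the mechanism behind "avoid catastrophic forgetting" in the linear case.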
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Analysis</head><p>We sketch the analysis of DPGrad and prove Theorem 2.11. Due to space limitations, the detailed proof is deferred to Appendix B. The proof proceeds in the following four steps:</p><p>1. The first step, presented in Section 4.2.1, reduces continual learning to a problem of continual matrix factorization, which allows us to focus on a more algebraically friendly objective function.</p><p>2. We then present some basic linear-algebraic facts to decompose the feature mapping matrix U, its gradient, and the loss into orthogonal components. The orthogonality of the gradient update allows us to decouple the process of leveraging existing features from the process of learning a new feature, as reflected in the loss terms and gradient update rules. See Section 4.2.2 for details.</p><p>3. In Section 4.2.3, we zoom into a single environment, and prove that DPGrad provably converges to a global optimum, assuming the feature matrix U from the previous environment is well-conditioned. This step contains the major bulk of our analysis: the objective function of continual matrix factorization is non-convex, and no regularization or spectral initialization is used. (We cannot re-initialize, lest we destroy progress from prior environments.)</p><p>4. Finally, in Section 4.2.4, we inductively prove that DPGrad converges and the feature matrix is always well-conditioned. This wraps up the entire proof.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.1">Reduction</head><p>We first recall the formal statement of the continual matrix factorization problem.</p><p>In a continual matrix factorization problem, the algorithm receives w_i ∈ R^d in the i-th step, and it is required to maintain a matrix U ∈ R^{d×r} and output a vector v_i ∈ R^r such that U v_i ≈ w_i, while U v_j remains close to w_j for all j &lt; i.</p><p>The key observation is that running DPGrad on the original continual learning objective <ref type="bibr">(2)</ref> is equivalent to running it on the matrix factorization objective: under the isotropy of Assumption 2.4, the population loss satisfies E_{(x,y)∼D_i} (⟨v_i, U^⊤ x⟩ - y)^2 = ‖U v_i - w_i‖_2^2.</p><p>Combining the above observations, it suffices to analyse DPGrad for continual matrix factorization and prove Eq. ( <ref type="formula">3</ref>) and Eq. ( <ref type="formula">4</ref>).</p></div>
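Under isotropy, the population squared loss of a pair (U, v) on a linear environment with regressor w equals ‖U v - w‖². This identity, which drives the reduction, admits a quick Monte Carlo sanity check (a sketch assuming NumPy and isotropic Gaussian inputs; the dimensions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
d, r = 8, 3
U = rng.standard_normal((d, r))
v = rng.standard_normal(r)
w = rng.standard_normal(d)

# Monte Carlo estimate of E_{x ~ N(0, I_d)} (<v, U^T x> - <w, x>)^2
X = rng.standard_normal((200_000, d))
mc = np.mean((X @ U @ v - X @ w) ** 2)

# Closed form predicted by the reduction
closed_form = np.linalg.norm(U @ v - w) ** 2
assert abs(mc - closed_form) / closed_form < 0.05   # agree up to sampling error
```

The derivation is one line: E(⟨U v - w, x⟩²) = (U v - w)^⊤ E[x x^⊤] (U v - w) = ‖U v - w‖² when E[x x^⊤] = I_d.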
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.2">Decomposition</head><p>We first provide some basic linear-algebraic facts about orthogonal decompositions. For any i ∈ [k], let w_i = w_{i,A} + w_{i,B}, where w_{i,A} ∈ W_{i-1} and w_{i,B} ∈ W_{i-1,⊥}. Note this decomposition is unique. We focus on the case ‖w_{i,B}‖_2 ∈ [1/D, D] in the following statements; the case ‖w_{i,B}‖_2 = 0 carries over easily. (These are the only two cases, per Assumption 2.7.) Similarly, let U_{i,t} = U_{i,A,t} + U_{i,B,t}, where each column of U_{i,A,t} lies in W_{i-1} and each column of U_{i,B,t} lies in W_{i-1,⊥}. (Note, again, U_{i,A,t} and U_{i,B,t} are unique.) We further write U_{i,B,t} = U_{i,1,t} + U_{i,2,t}, where the columns of U_{i,2,t} lie in W_{i-1,⊥} \ span{w_{i,B}}. Finally, denote U_{i,1,t} = w_{i,B} x_{i,t}^⊤ for a vector x_{i,t} ∈ R^r.</p><p>We summarize the decompositions mentioned above, with a few additional observations, in the lemma below: Lemma 4.4 (Orthogonal decomposition). For any i ∈ [k] and any t ∈ [0 : T], there exists a unique decomposition of U_{i,t}, w_i and v_{i,t} of the form above. Here we use column(A), row(A) to denote the column and row space of a matrix A, and write column(A) ∈ W if the column space of A is a subspace of W.</p><p>Since U_{i,A,t} remains unchanged for t ∈ [0 : T], we abbreviate it as U_{i,A} hereafter. We next provide the exact gradient update of each component under the squared loss ‖U_{i,t} v_{i,t} - w_i‖_2^2 and orthogonal projection. Lemma 4.5 (Gradient formula). For any i ∈ [k], the gradient update (after projection) obeys the relations <ref type="bibr">(1)</ref> </p><p>We perform a similar decomposition for the loss function.</p><p>Lemma 4.6 (Loss formula).</p><p>Decoupling existing features from "new" features We now offer some intuitive explanation for the decomposition. The first loss term in Eq. ( <ref type="formula">5</ref>) quantifies the error with already-learned features. That is, the matrix U_{i,A} stores existing features that have been learned, and it remains unchanged during the execution of the i-th environment; it remains to optimize v_{i,1,t} such that U_{i,A} v_{i,1,t} matches w_{i,A}. 
The second and last loss terms quantify the loss on a new feature, where w_{i,B} is the new feature component, and the matrix U_{i,2,t} can be thought of as random noise. Intuitively, one should hope x_{i,t}^⊤ v_{i,2,t} = 1, so that this matches the new component w_{i,B}. At the same time, one hopes U_{i,2,t} would disappear, or at least, ‖U_{i,2,t} v_{i,2,t}‖_2 → 0 as t → ∞.</p></div>
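The orthogonal decompositions used above are mechanical to compute. The following sketch (assuming NumPy; the variable names mirror the text, and building the projector via QR is an illustrative choice) splits w_i and the columns of U against W_{i-1} = span(w_1, . . . , w_{i-1}):

```python
import numpy as np

def decompose(U, w, prev_ws):
    """Split w and the columns of U against W = span(prev_ws).

    Returns (w_A, w_B, U_A, U_B) with w = w_A + w_B and U = U_A + U_B,
    where w_A and the columns of U_A lie in W, while w_B and the columns
    of U_B lie in the orthogonal complement W_perp.
    """
    d = w.shape[0]
    if prev_ws:
        Q, _ = np.linalg.qr(np.column_stack(prev_ws))
        P_W = Q @ Q.T                   # projector onto W_{i-1}
    else:
        P_W = np.zeros((d, d))          # W_0 is the trivial subspace
    w_A, w_B = P_W @ w, w - P_W @ w
    U_A, U_B = P_W @ U, U - P_W @ U
    return w_A, w_B, U_A, U_B
```

Uniqueness of the decomposition is immediate here: P_W is a fixed linear map, so each component is determined by W alone.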
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.3">Convergence</head><p>For a fixed environment, we prove that w.h.p. DPGrad converges and the loss approaches zero, given that the initial feature matrix U_{i,A} is well-conditioned. Lemma 4.7. For any i ∈ [k],</p><p>where σ_min(U_{i,A}) and σ_max(U_{i,A}) denote the minimum and maximum non-zero singular values of the matrix U_{i,A}.</p><p>Outline of the proof. DPGrad ensures existing features are preserved and only optimizes the linear classifier over them; hence a linear convergence rate can easily be derived for the first loss term, given that the feature matrix is well-conditioned. The key part is controlling the terms that capture learning with new features, i.e., the second and last loss terms, where both the feature mapping U_{i,B} and the linear prompt v_i get updated. In this case, the objective is non-convex and non-smooth. Our analysis draws inspiration from the recent work of <ref type="bibr">[40]</ref>, and divides the optimization process into two stages. We prove that DPGrad first approaches a nice initialization position with high probability, and then show linear convergence. <ref type="foot">4</ref> To be concrete, in the first stage, we prove (1) x_{i,t}^⊤ v_{i,2,t} moves closer to 1, and (2) ‖x_{i,t} - ‖w_{i,B}‖_2 v_{i,2,t}‖_2 ≈ 0. That is, the second loss term of Eq. ( <ref type="formula">5</ref>) decreases to a small constant while the pair x_{i,t}, v_{i,2,t} remains balanced and roughly equal up to scaling. Meanwhile, we note ‖U_{i,2,t}‖ is non-increasing, though the last loss term could still increase because ‖v_{i,2,t}‖_2 increases. In the second stage, we prove by induction that ‖U_{i,2,t}^⊤ v_{i,2,t}‖_2 and |x_{i,t}^⊤ v_{i,2,t} - 1| decay at a linear rate (hence converging to a global optimum), and ‖x_{i,t} - ‖w_{i,B}‖_2 v_{i,2,t}‖_2 ≈ 0.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.4">Induction</head><p>Lemma 4.7 proves rapid convergence of DPGrad for one single environment. To extend the argument to the whole sequence of environments, we need to ensure that (1) the feature matrix is always well-conditioned and (2) catastrophic forgetting does not happen. For (1), we need to analyse the limiting point of DPGrad (there are infinitely many optimal solutions to Eq. ( <ref type="formula">3</ref>)) and make sure it is well-balanced and orthogonal to the previous row/column spaces. For (2), we make use of the orthogonality of DPGrad's updates.</p><p>Proof Sketch of Theorem 2.11. Thanks to the reduction established in Section 4.2.1, it suffices to prove Eq. (3) and Eq. (4). For each environment i (i ∈ [k]), we inductively prove that (1) DPGrad achieves good accuracy on the current environment, i.e., ‖U_{i,T} v_i - w_i‖_2 ≤ εδ; and (2) the feature matrix U_i remains well-conditioned. When w_{i,B} = 0, one can prove the feature matrix does not change, i.e., U_{i,end} = U_{i-1,end}; when ‖w_{i,B}‖_2 ∈ [1/D, D], one can show the feature matrix U remains well-conditioned after the projection step. The last claim can be derived from the orthogonality. This wraps up the proof of Theorem 2.11.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Lower bound for non-linear features</head><p>We next consider continual learning with a non-linear feature mapping. Learning with non-linear features turns out to be much more difficult, and our main result rules out the possibility of a (proper) continual learner. We restate the formal statement; the detailed proof is deferred to Appendix C. Theorem 2.12 (Barrier for continual learning with non-linear features). Let k, r ≥ 2 and d ≥ 3. There exists a class of non-linear feature mappings and a sequence of environments such that no (proper) continual learning algorithm can guarantee to achieve less than 1/1000-error over all environments with probability at least 1/2, under the feature extraction formalization of Definition 2.2.</p><p>Our lower bound is constructed from a simple family of two-layer convolutional neural networks with quadratic activation functions. The input distribution is assumed to be uniform and the target function is a polynomial over the input. The first environment is constructed such that multiple global optima exist (hence the optimization task is under-constrained). However, if the wrong optimum is picked, then when the second environment is revealed, the non-linearity makes it impossible to switch back and forth.</p><p>Proof Sketch. It suffices to take k = 2, d = 3, r = 2, and we sketch the construction here. For both environments, we assume the input data are drawn uniformly at random from B_3(0, 1), the unit ball in R^3 centered at the origin. The hypothesis class H consists of all two-layer convolutional neural networks with a single kernel of size 2 and the quadratic activation function. That is, the representation function is parameterized by w ∈ R^2 and takes the form R_w(x) = (⟨w, x_{1:2}⟩^2, ⟨w, x_{2:3}⟩^2) ∈ R^2, where x ∈ R^3 and x_{i:j} ∈ R^{j-i+1} is the vector consisting of the i-th through j-th entries of x. 
The hard sequence of environments is drawn from the following distribution: (1) the target function f_1 of the first environment is f_1(x) = x_2^2; (2) the target function f_2 of the second environment equals f_2(x) = x_3^2 with probability 1/2, and equals f_2(x) = x_1^2 with probability 1/2. Note the continual learning task is realizable, and one can prove that no (proper) continual learning algorithm can guarantee to achieve less than 1/1000-error on both environments with probability at least 1/2.</p><p>Though our lower bound instance uses a polynomial activation function, this assumption is not essential - in Appendix C, we prove similar lower bounds with the ReLU activation function.</p></div>
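The case analysis behind this sketch can be checked numerically. In the sketch below (assuming NumPy; the two candidate kernels w ∝ e_1 and w ∝ e_2 are, up to scaling, the only featurizers realizing these targets, per a short coefficient-matching argument), we verify on a finite sample that whichever global optimum the learner commits to in environment 1, one of the two realizations of f_2 cannot be fit without breaking the frozen classifier v_1:

```python
import numpy as np

def R(w, X):
    """Feature map R_w(x) = (<w, x_{1:2}>^2, <w, x_{2:3}>^2), rows of X in R^3."""
    return np.stack([(X[:, :2] @ w) ** 2, (X[:, 1:] @ w) ** 2], axis=1)

def best_fit_err(w, X, y, v=None):
    """Squared error of <v, R_w(x)> against y; fits v by least squares if not given."""
    F = R(w, X)
    if v is None:
        v, *_ = np.linalg.lstsq(F, y, rcond=None)
    return np.mean((F @ v - y) ** 2), v

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 3))
f1 = X[:, 1] ** 2                      # first environment: x_2^2

# Two (essentially the only) global optima for environment 1:
candidates = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
for w in candidates:
    err1, v1 = best_fit_err(w, X, f1)
    assert err1 < 1e-12                # both realize f_1 exactly

# Whichever w the learner commits to, one realization of f_2 defeats it:
for w, bad_f2 in [(candidates[0], X[:, 2] ** 2),   # w ~ e_1 fails on f_2 = x_3^2
                  (candidates[1], X[:, 0] ** 2)]:  # w ~ e_2 fails on f_2 = x_1^2
    _, v1 = best_fit_err(w, X, f1)
    ok = False
    for wp in candidates:              # search over new featurizers for task 2
        err2, _ = best_fit_err(wp, X, bad_f2)
        err1_after, _ = best_fit_err(wp, X, f1, v=v1)  # v1 is frozen
        if err2 < 1e-12 and err1_after < 1e-12:
            ok = True
    assert not ok   # no choice fits the new task while preserving the old one
```

Since the adversary picks the "bad" f_2 with probability 1/2, any proper learner incurs constant error on some environment with probability at least 1/2, matching the theorem.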
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Experiments</head><p>For linear feature functions, we perform simulations on a synthetic dataset to verify the practicality of DPGrad and compare its performance with vanilla SGD and orthogonal gradient descent (OGD), a close practical cousin of our algorithm. In our simulations, we set d = 100, r = 20, k = 500, and the ground truth U⋆, V⋆ are drawn from a Gaussian. The input data are sampled from N(0, I_d) and we draw 1000 samples for each task. Additional details about the setup can be found in Appendix D. Moreover, in Appendix E, we provide additional experimental results on two popular benchmarks, Rotated MNIST and Permuted MNIST. Since DPGrad is designed specifically for linear regression, we provide two variants of DPGrad (without provable guarantees on their performance, of course) - one is a modification suitable for multi-class classification, the other is a modification suitable for non-linear featurizers. Detailed numbers and figures can be found in Appendix E. In brief, both variants alleviate catastrophic forgetting and perform much better than vanilla SGD. Furthermore, the performance of both is much more stable than OGD and their accuracy remains at a high level across tasks.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7">Conclusion</head><p>In this paper, we initiate a study of continual learning through the feature extraction lens, proposing an efficient gradient-based algorithm, DPGrad, for the linear case, and a fundamental impossibility result in the general case. Our work leaves several interesting future directions. First, it would be interesting to generalize DPGrad to non-linear feature mappings (perhaps even without provable guarantees) and conduct an empirical study of its performance. Second, our impossibility result does not rule out an improper continual learner, and in general, one can always maintain a task-specific representation function and achieve good performance over all environments. It would thus be interesting to investigate the fundamental memory-accuracy trade-offs.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_0"><p>36th Conference on Neural Information Processing Systems (NeurIPS 2022).</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1"><p>When the features are allowed to be a non-linear function of the input, we show that continual learning is not possible in general. Namely, we construct an instance for which, even if a good set of features exists, the online nature of the setting, as well as the fact that the linear classifiers for past environments are not allowed to be updated, can force the algorithm to "commit" to linear classifiers such that either catastrophic forgetting or poor performance on the current environment has to occur.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_2"><p>We note it is possible to extend our algorithmic result to the non-realizable setting, provided the label has symmetric sub-Gaussian noise.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_3"><p>The results easily extend to the finite sample case using standard techniques. We focus on the population results to keep the focus on the online nature of the environments.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_4"><p>This is the only place where we use the bit complexity assumption.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_5"><p>We note most existing works on matrix factorization or matrix sensing either require some fine-grained initialization (e.g. spectral initialization<ref type="bibr">[8]</ref>) or adding a regularization term that enforces smoothness<ref type="bibr">[13]</ref>, none of which are applicable in our setting.</p></note>
		</body>
		</text>
</TEI>
