NSF PAR Search | NSF Public Access Repository

Continual Learning in Linear Classification on Separable Data

Evron, Itay; Moroshko, Edward; Buzaglo, Gon; Khriesh, Maroun; Marjeh, Badea; Srebro, Nati; Soudry, Daniel (January 2023, ICML)

Full Text Available
How catastrophic can catastrophic forgetting be in linear regression?

Evron, Itay; Moroshko, Edward; Ward, Rachel; Srebro, Nathan; Soudry, Daniel (January 2022, Conference on Learning Theory)

Full Text Available
On the Implicit Bias of Initialization Shape: Beyond Infinitesimal Mirror Descent

Azulay, Shahar; Moroshko, Edward; Nacson, MS; Woodworth, Blake; Srebro, Nathan; Globerson, Amir; Soudry, Daniel (July 2021, Proceedings of Machine Learning Research)
Recent work has highlighted the role of initialization scale in determining the structure of the solutions that gradient methods converge to. In particular, it was shown that large initialization leads to the neural tangent kernel regime solution, whereas small initialization leads to so-called “rich regimes”. However, the initialization structure is richer than the overall scale alone and involves relative magnitudes of different weights and layers in the network. Here we show that these relative scales, which we refer to as initialization shape, play an important role in determining the learned model. We develop a novel technique for deriving the inductive bias of gradient flow and use it to obtain closed-form implicit regularizers for multiple cases of interest.
Full Text Available
Implicit Bias in Deep Linear Classification: Initialization Scale vs Training Accuracy

Moroshko, Edward; Gunasekar, Suriya; Woodworth, Blake; Lee, Jason D; Srebro, Nathan; Soudry, Daniel (July 2020, Advances in neural information processing systems)
We provide a detailed asymptotic study of gradient flow trajectories and their implicit optimization bias when minimizing the exponential loss over "diagonal linear networks". This is the simplest model displaying a transition between "kernel" and non-kernel ("rich" or "active") regimes. We show how the transition is controlled by the relationship between the initialization scale and how accurately we minimize the training loss. Our results indicate that some limit behaviors of gradient descent only kick in at ridiculous training accuracies (well beyond 10^{-100}). Moreover, the implicit bias at reasonable initialization scales and training accuracies is more complex and not captured by these limits.
Full Text Available
Kernel and Rich Regimes in Overparametrized Models

Woodworth, Blake; Gunasekar, Suriya; Lee, Jason D; Moroshko, Edward; Savarese, Pedro; Golan, Itay; Soudry, Daniel; Srebro, Nathan (February 2020, Conference on Learning Theory (COLT))

A recent line of work studies overparametrized neural networks in the “kernel regime,” i.e., when during training the network behaves as a kernelized linear predictor, and thus, training with gradient descent has the effect of finding the corresponding minimum RKHS norm solution. This stands in contrast to other studies which demonstrate how gradient descent on overparametrized networks can induce rich implicit biases that are not RKHS norms. Building on an observation by Chizat and Bach (2018), we show how the scale of the initialization controls the transition between the “kernel” (aka lazy) and “rich” (aka active) regimes and affects generalization properties in multilayer homogeneous models. We provide a complete and detailed analysis for a family of simple depth-D linear networks that exhibit an interesting and meaningful transition between the kernel and rich regimes, and highlight an interesting role for the width of the models. We further demonstrate this transition empirically for matrix factorization and multilayer non-linear networks.
Full Text Available
Kernel and Rich Regimes in Overparametrized Models

Woodworth, Blake; Gunasekar, Suriya; Lee, Jason D.; Moroshko, Edward; Savarese, Pedro; Golan, Itay; Soudry, Daniel; Srebro, Nati (January 2020, COLT 2020)

Full Text Available