The success of gradient descent in ML and especially for learning neural networks is remarkable and robust. In the context of how the brain learns, one aspect of gradient descent that appears biologically difficult to realize (if not implausible) is that its updates rely on feedback from later layers to earlier layers through the same connections. Such bidirected links are relatively few in brain networks, and even when reciprocal connections exist, they may not be equiweighted. Random Feedback Alignment (Lillicrap et al., 2016), where the backward weights are random and fixed, has been proposed as a bioplausible alternative and found to be effective empirically. We investigate how and when feedback alignment (FA) works, focusing on one of the most basic problems with layered structure n×m, the goal is to find a low rank factorization Zn×rWr×m that minimizes the error ∥ZW−Y∥F. Gradient descent solves this problem optimally. We show that FA finds the optimal solution when r≥rank(Y). We also shed light on how FA works. It is observed empirically that the forward weight matrices and (random) feedback matrices come closer during FA updates. Our analysis rigorously derives this phenomenon and shows how it facilitates convergence of FA*, a closely related variant of FA. We also show that FA can be far from optimal when r
more »
« less
How and When Random Feedback Works: A Case Study of LowRank Matrix Factorization
The success of gradient descent in ML and especially for learning neural networks is remarkable and robust. In the context of how the brain learns, one aspect of gradient descent that appears biologically difficult to realize (if not implausible) is that its updates rely on feedback from later layers to earlier layers through the same connections. Such bidirected links are relatively few in brain networks, and even when reciprocal connections exist, they may not be equiweighted. Random Feedback Alignment (Lillicrap et al., 2016), where the backward weights are random and fixed, has been proposed as a bioplausible alternative and found to be effective empirically. We investigate how and when feedback alignment (FA) works, focusing on one of the most basic problems with layered structure n×m, the goal is to find a low rank factorization Zn×rWr×m that minimizes the error ∥ZW−Y∥F. Gradient descent solves this problem optimally. We show that FA finds the optimal solution when r≥rank(Y). We also shed light on how FA works. It is observed empirically that the forward weight matrices and (random) feedback matrices come closer during FA updates. Our analysis rigorously derives this phenomenon and shows how it facilitates convergence of FA*, a closely related variant of FA. We also show that FA can be far from optimal when r
more »
« less
 NSFPAR ID:
 10379990
 Date Published:
 Journal Name:
 AISTATS
 Format(s):
 Medium: X
 Sponsoring Org:
 National Science Foundation
More Like this


Modern deep neural networks (DNNs) often require high memory consumption and large computational loads. In order to deploy DNN algorithms efficiently on edge or mobile devices, a series of DNN compression algorithms have been explored, including factorization methods. Factorization methods approximate the weight matrix of a DNN layer with the multiplication of two or multiple lowrank matrices. However, it is hard to measure the ranks of DNN layers during the training process. Previous works mainly induce lowrank through implicit approximations or via costly singular value decomposition (SVD) process on every training step. The former approach usually induces a high accuracy loss while the latter has a low efficiency. In this work, we propose SVD training, the first method to explicitly achieve lowrank DNNs during training without applying SVD on every step. SVD training first decomposes each layer into the form of its fullrank SVD, then performs training directly on the decomposed weights. We add orthogonality regularization to the singular vectors, which ensure the valid form of SVD and avoid gradient vanishing/exploding. Lowrank is encouraged by applying sparsityinducing regularizers on the singular values of each layer. Singular value pruning is applied at the end to explicitly reach a lowrank model. We empirically show that SVD training can significantly reduce the rank of DNN layers and achieve higher reduction on computation load under the same accuracy, comparing to not only previous factorization methods but also stateoftheart filter pruning methods.more » « less

Overparametrization is an important technique in training neural networks. In both theory and practice, training a larger network allows the optimization algorithm to avoid bad local optimal solutions. In this paper we study a closely related tensor decomposition problem: given an lth order tensor in (Rd)⊗l of rank r (where r≪d), can variants of gradient descent find a rank m decomposition where m>r? We show that in a lazy training regime (similar to the NTK regime for neural networks) one needs at least m=Ω(dl−1), while a variant of gradient descent can find an approximate tensor when m=O∗(r2.5llogd). Our results show that gradient descent on overparametrized objective could go beyond the lazy training regime and utilize certain lowrank structure in the data.more » « less

We propose a new randomized optimization method for highdimensional problems which can be seen as a generalization of coordinate descent to random subspaces. We show that an adaptive sampling strategy for the random subspace significantly outperforms the oblivious sampling method, which is the common choice in the recent literature. The adaptive subspace can be efficiently generated by a correlated random matrix ensemble whose statistics mimic the input data. We prove that the improvement in the relative error of the solution can be tightly characterized in terms of the spectrum of the data matrix, and provide probabilistic upperbounds. We then illustrate the consequences of our theory with data matrices of different spectral decay. Extensive experimental results show that the proposed approach offers significant speed ups in machine learning problems including logistic regression, kernel classification with random convolution layers and shallow neural networks with rectified linear units. Our analysis is based on convex analysis and Fenchel duality, and establishes connections to sketching and randomized matrix decompositions.more » « less

null (Ed.)Recent theoretical works have characterized the dynamics of wide shallow neural networks trained via gradient descent in an asymptotic meanfield limit when the width tends towards infinity. At initialization, the random sampling of the parameters leads to deviations from the meanfield limit dictated by the classical Central Limit Theorem (CLT). However, since gradient descent induces correlations among the parameters, it is of interest to analyze how these fluctuations evolve. Here, we use a dynamical CLT to prove that the asymptotic fluctuations around the mean limit remain bounded in mean square throughout training. The upper bound is given by a MonteCarlo resampling error, with a variance that that depends on the 2norm of the underlying measure, which also controls the generalization error. This motivates the use of this 2norm as a regularization term during training. Furthermore, if the meanfield dynamics converges to a measure that interpolates the training data, we prove that the asymptotic deviation eventually vanishes in the CLT scaling. We also complement these results with numerical experiments.more » « less