Can linear systems be solved faster than matrix multiplication? While there has been remarkable progress for the special cases of graph structured linear systems, in the general setting, the bit complexity of solving an $n \times n$ linear system $Ax=b$ is $\tilde{O}(n^\omega)$, where $\omega < 2.372864$ is the matrix multiplication exponent. Improving on this has been an open problem even for sparse linear systems with poly$(n)$ condition number. In this paper, we present an algorithm that solves linear systems in sparse matrices asymptotically faster than matrix multiplication for any $\omega > 2$. This speedup holds for any input matrix $A$ with $o(n^{\omega 1}/\log(\kappa(A)))$ nonzeros, where $\kappa(A)$ is the condition number of $A$. For poly$(n)$conditioned matrices with $\tilde{O}(n)$ nonzeros, and the current value of $\omega$, the bit complexity of our algorithm to solve to within any $1/\text{poly}(n)$ error is $O(n^{2.331645})$. Our algorithm can be viewed as an efficient, randomized implementation of the block Krylov method via recursive low displacement rank factorizations. It is inspired by the algorithm of [Eberly et al. ISSAC `06 `07] for inverting matrices over finite fields. In our analysis of numerical stability, we develop matrix anticoncentration techniques to bound the smallest eigenvalue and the smallest gap inmore »
Learning Fast Algorithms for Linear Transforms Using Butterfly Factorizations
Fast linear transforms are ubiquitous in machine learning, including the discrete Fourier transform, discrete cosine transform, and other structured transformations such as convolutions. All of these transforms can be represented by dense matrixvector multiplication, yet each has a specialized and highly efficient (subquadratic) algorithm. We ask to what extent handcrafting these algorithms and implementations is necessary, what structural prior they encode, and how much knowledge is required to automatically learn a fast algorithm for a provided structured transform. Motivated by a characterization of fast matrixvector multiplication as products of sparse matrices, we introduce a parameterization of divideandconquer methods that is capable of representing a large class of transforms. This generic formulation can automatically learn an efficient algorithm for many important transforms; for example, it recovers the O(N logN) CooleyTukey FFT algorithm to machine precision, for dimensions N up to 1024. Furthermore, our method can be incorporated as a lightweight replacement of generic matrices in machine learning pipelines to learn efficient and compressible transformations. On a standard task of compressing a single hiddenlayer network, our method exceeds the classification accuracy of unconstrained matrices on CIFAR10 by 3.9 points—the first time a structured approach has done so—with 4X faster inference speed and more »
 Award ID(s):
 1763481
 Publication Date:
 NSFPAR ID:
 10137134
 Journal Name:
 Proceedings of Machine Learning Research
 Volume:
 97
 Page Range or eLocationID:
 15171527
 ISSN:
 26403498
 Sponsoring Org:
 National Science Foundation
More Like this


We study the $\ell_p$ regression problem, which requires finding $\mathbf{x}\in\mathbb R^{d}$ that minimizes $\\mathbf{A}\mathbf{x}\mathbf{b}\_p$ for a matrix $\mathbf{A}\in\mathbb R^{n \times d}$ and response vector $\mathbf{b}\in\mathbb R^{n}$. There has been recent interest in developing subsampling methods for this problem that can outperform standard techniques when $n$ is very large. However, all known subsampling approaches have run time that depends exponentially on $p$, typically, $d^{\mathcal{O}(p)}$, which can be prohibitively expensive. We improve on this work by showing that for a large class of common \emph{structured matrices}, such as combinations of lowrank matrices, sparse matrices, and Vandermonde matrices, there are subsampling based methods for $\ell_p$ regression that depend polynomially on $p$. For example, we give an algorithm for $\ell_p$ regression on Vandermonde matrices that runs in time $\mathcal{O}(n\log^3 n+(dp^2)^{0.5+\omega}\cdot\text{polylog}\,n)$, where $\omega$ is the exponent of matrix multiplication. The polynomial dependence on $p$ crucially allows our algorithms to extend naturally to efficient algorithms for $\ell_\infty$ regression, via approximation of $\ell_\infty$ by $\ell_{\mathcal{O}(\log n)}$. Of practical interest, we also develop a new subsampling algorithm for $\ell_p$ regression for arbitrary matrices, which is simpler than previous approaches for $p \ge 4$.

Modern neural network architectures use structured linear transformations, such as lowrank matrices, sparse matrices, permutations, and the Fourier transform, to improve inference speed and reduce memory usage compared to general linear maps. However, choosing which of the myriad structured transformations to use (and its associated parameterization) is a laborious task that requires trading off speed, space, and accuracy. We consider a different approach: we introduce a family of matrices called kaleidoscope matrices (Kmatrices) that provably capture any structured matrix with nearoptimal space (parameter) and time (arithmetic operation) complexity. We empirically validate that Kmatrices can be automatically learned within endtoend pipelines to replace handcrafted procedures, in order to improve model quality. For example, replacing channel shuffles in ShuffleNet improves classification accuracy on ImageNet by up to 5%. Kmatrices can also simplify handengineered pipelineswe replace filter bank feature computation in speech data preprocessing with a learnable kaleidoscope layer, resulting in only 0.4% loss in accuracy on the TIMIT speech recognition task. In addition, Kmatrices can capture latent structure in models: for a challenging permuted image classification task, adding a Kmatrix to a standard convolutional architecture can enable learning the latent permutation and improve accuracy by over 8 points. We provide a practicallymore »

We develop effective approximation methods for unitary matrices. In our formulation, a unitary matrix is represented as a product of rotations in twodimensional subspaces, socalled Givens rotations. Instead of the quadratic dimension dependence when applying a dense matrix, applying such an approximation scales with the number factors, each of which can be implemented efficiently. Consequently, in settings where an approximation is once computed and then applied many times, such an effective representation becomes advantageous. Although efficient Givens factorizations are not possible for generic unitary operators, we show that minimizing a sparsityinducing objective with a coordinate descent algorithm on the unitary group yields good factorizations for structured matrices. Canonical applications of such a setup are orthogonal basis transforms. We demonstrate that our methods improve the approximate representation of the graph Fourier transform, the matrix obtained when diagonalizing a graph Laplacian.

We develop effective approximation methods for unitary matrices. In our formulation, a unitary matrix is represented as a product of rotations in twodimensional subspaces, socalled Givens rotations. Instead of the quadratic dimension dependence when applying a dense matrix, applying such an approximation scales with the number factors, each of which can be implemented efficiently. Consequently, in settings where an approximation is once computed and then applied many times, such an effective representation becomes advantageous. Although efficient Givens factorizations are not possible for generic unitary operators, we show that minimizing a sparsityinducing objective with a coordinate descent algorithm on the unitary group yields good factorizations for structured matrices. Canonical applications of such a setup are orthogonal basis transforms. We demonstrate that our methods improve the approximate representation of the graph Fourier transform, the matrix obtained when diagonalizing a graph Laplacian.