Title: Solving Sparse Linear Systems Faster than Matrix Multiplication

Can linear systems be solved faster than matrix multiplication? While there has been remarkable progress for the special cases of graph structured linear systems, in the general setting, the bit complexity of solving an $n \times n$ linear system $Ax=b$ is $\tilde{O}(n^\omega)$, where $\omega < 2.372864$ is the matrix multiplication exponent. Improving on this has been an open problem even for sparse linear systems with poly$(n)$ condition number.
In this paper, we present an algorithm that solves linear systems in sparse matrices asymptotically faster than matrix multiplication for any $\omega > 2$. This speedup holds for any input matrix $A$ with $o(n^{\omega -1}/\log(\kappa(A)))$ non-zeros, where $\kappa(A)$ is the condition number of $A$. For poly$(n)$-conditioned matrices with $\tilde{O}(n)$ nonzeros, and the current value of $\omega$, the bit complexity of our algorithm to solve to within any $1/\text{poly}(n)$ error is $O(n^{2.331645})$.
Our algorithm can be viewed as an efficient, randomized implementation of the block Krylov method via recursive low displacement rank factorizations. It is inspired by the algorithm of [Eberly et al. ISSAC `06 `07] for inverting matrices over finite fields. In our analysis of numerical stability, we develop matrix anti-concentration techniques to bound the smallest eigenvalue and the smallest gap in eigenvalues of semi-random matrices. more »« less

In a Merlin–Arthur proof system, the proof verifier (Arthur) accepts valid proofs (from Merlin) with probability 1, and rejects invalid proofs with probability arbitrarily close to 1. The running time of such a system is defined to be the length of Merlin’s proof plus the running time of Arthur. We provide new Merlin–Arthur proof systems for some key problems in fine-grained complexity. In several cases our proof systems have optimal running time. Our main results include:

Certifying that a list ofnintegers has no 3-SUM solution can be done in Merlin–Arthur time$$\tilde{O}(n)$$$\stackrel{~}{O}\left(n\right)$. Previously, Carmosino et al. [ITCS 2016] showed that the problem has a nondeterministic algorithm running in$$\tilde{O}(n^{1.5})$$$\stackrel{~}{O}\left({n}^{1.5}\right)$time (that is, there is a proof system with proofs of length$$\tilde{O}(n^{1.5})$$$\stackrel{~}{O}\left({n}^{1.5}\right)$and a deterministic verifier running in$$\tilde{O}(n^{1.5})$$$\stackrel{~}{O}\left({n}^{1.5}\right)$time).

Counting the number ofk-cliques with total edge weight equal to zero in ann-node graph can be done in Merlin–Arthur time$${\tilde{O}}(n^{\lceil k/2\rceil })$$$\stackrel{~}{O}\left({n}^{\lceil k/2\rceil}\right)$(where$$k\ge 3$$$k\ge 3$). For oddk, this bound can be further improved for sparse graphs: for example, counting the number of zero-weight triangles in anm-edge graph can be done in Merlin–Arthur time$${\tilde{O}}(m)$$$\stackrel{~}{O}\left(m\right)$. Previous Merlin–Arthur protocols by Williams [CCC’16] and Björklund and Kaski [PODC’16] could only countk-cliques in unweighted graphs, and had worse running times for smallk.

Computing the All-Pairs Shortest Distances matrix for ann-node graph can be done in Merlin–Arthur time$$\tilde{O}(n^2)$$$\stackrel{~}{O}\left({n}^{2}\right)$. Note this is optimal, as the matrix can have$$\Omega (n^2)$$$\Omega \left({n}^{2}\right)$nonzero entries in general. Previously, Carmosino et al. [ITCS 2016] showed that this problem has an$$\tilde{O}(n^{2.94})$$$\stackrel{~}{O}\left({n}^{2.94}\right)$nondeterministic time algorithm.

Certifying that ann-variablek-CNF is unsatisfiable can be done in Merlin–Arthur time$$2^{n/2 - n/O(k)}$$${2}^{n/2-n/O\left(k\right)}$. We also observe an algebrization barrier for the previous$$2^{n/2}\cdot \textrm{poly}(n)$$${2}^{n/2}\xb7\text{poly}\left(n\right)$-time Merlin–Arthur protocol of R. Williams [CCC’16] for$$\#$$$\#$SAT: in particular, his protocol algebrizes, and we observe there is no algebrizing protocol fork-UNSAT running in$$2^{n/2}/n^{\omega (1)}$$${2}^{n/2}/{n}^{\omega \left(1\right)}$time. Therefore we have to exploit non-algebrizing properties to obtain our new protocol.

Certifying a Quantified Boolean Formula is true can be done in Merlin–Arthur time$$2^{4n/5}\cdot \textrm{poly}(n)$$${2}^{4n/5}\xb7\text{poly}\left(n\right)$. Previously, the only nontrivial result known along these lines was an Arthur–Merlin–Arthur protocol (where Merlin’s proof depends on some of Arthur’s coins) running in$$2^{2n/3}\cdot \textrm{poly}(n)$$${2}^{2n/3}\xb7\text{poly}\left(n\right)$time.

Due to the centrality of these problems in fine-grained complexity, our results have consequences for many other problems of interest. For example, our work implies that certifying there is no Subset Sum solution tonintegers can be done in Merlin–Arthur time$$2^{n/3}\cdot \textrm{poly}(n)$$${2}^{n/3}\xb7\text{poly}\left(n\right)$, improving on the previous best protocol by Nederlof [IPL 2017] which took$$2^{0.49991n}\cdot \textrm{poly}(n)$$${2}^{0.49991n}\xb7\text{poly}\left(n\right)$time.

Jambulapati, Arun; Li, Jerry; Musco, Christopher; Shiragur, Kirankumar; Sidford, Aaron; Tian, Kevin(
, Advances in Neural Information Processing Systems)

We develop a general framework for finding approximately-optimal preconditioners for solving linear systems. Leveraging this framework we obtain improved runtimes for fundamental preconditioning and linear system solving problems including the following.
\begin{itemize}
\item \textbf{Diagonal preconditioning.} We give an algorithm which, given positive definite $\mathbf{K} \in \mathbb{R}^{d \times d}$ with $\mathrm{nnz}(\mathbf{K})$ nonzero entries, computes an $\epsilon$-optimal diagonal preconditioner in time $\widetilde{O}(\mathrm{nnz}(\mathbf{K}) \cdot \mathrm{poly}(\kappa^\star,\epsilon^{-1}))$, where $\kappa^\star$ is the optimal condition number of the rescaled matrix.
\item \textbf{Structured linear systems.} We give an algorithm which, given $\mathbf{M} \in \mathbb{R}^{d \times d}$ that is either the pseudoinverse of a graph Laplacian matrix or a constant spectral approximation of one, solves linear systems in $\mathbf{M}$ in $\widetilde{O}(d^2)$ time.
\end{itemize}
Our diagonal preconditioning results improve state-of-the-art runtimes of $\Omega(d^{3.5})$ attained by general-purpose semidefinite programming, and our solvers improve state-of-the-art runtimes of $\Omega(d^{\omega})$ where $\omega > 2.3$ is the current matrix multiplication constant.
We attain our results via new algorithms for a class of semidefinite programs (SDPs) we call \emph{matrix-dictionary approximation SDPs}, which we leverage to solve an associated problem we call \emph{matrix-dictionary recovery}.

Wang, Ruosong; Woodruff, David P.(
, ACM Transactions on Algorithms)

An \ell _p oblivious subspace embedding is a distribution over r \times n matrices \Pi such that for any fixed n \times d matrix A , \[ \Pr _{\Pi }[\textrm {for all }x, \ \Vert Ax\Vert _p \le \Vert \Pi Ax\Vert _p \le \kappa \Vert Ax\Vert _p] \ge 9/10,\] where r is the dimension of the embedding, \kappa is the distortion of the embedding, and for an n -dimensional vector y , \Vert y\Vert _p = (\sum _{i=1}^n |y_i|^p)^{1/p} is the \ell _p -norm. Another important property is the sparsity of \Pi , that is, the maximum number of non-zero entries per column, as this determines the running time of computing \Pi A . While for p = 2 there are nearly optimal tradeoffs in terms of the dimension, distortion, and sparsity, for the important case of 1 \le p \lt 2 , much less was known. In this article, we obtain nearly optimal tradeoffs for \ell _1 oblivious subspace embeddings, as well as new tradeoffs for 1 \lt p \lt 2 . Our main results are as follows: (1) We show for every 1 \le p \lt 2 , any oblivious subspace embedding with dimension r has distortion \[ \kappa = \Omega \left(\frac{1}{\left(\frac{1}{d}\right)^{1 / p} \log ^{2 / p}r + \left(\frac{r}{n}\right)^{1 / p - 1 / 2}}\right).\] When r = {\operatorname{poly}}(d) \ll n in applications, this gives a \kappa = \Omega (d^{1/p}\log ^{-2/p} d) lower bound, and shows the oblivious subspace embedding of Sohler and Woodruff (STOC, 2011) for p = 1 is optimal up to {\operatorname{poly}}(\log (d)) factors. (2) We give sparse oblivious subspace embeddings for every 1 \le p \lt 2 . Importantly, for p = 1 , we achieve r = O(d \log d) , \kappa = O(d \log d) and s = O(\log d) non-zero entries per column. The best previous construction with s \le {\operatorname{poly}}(\log d) is due to Woodruff and Zhang (COLT, 2013), giving \kappa = \Omega (d^2 {\operatorname{poly}}(\log d)) or \kappa = \Omega (d^{3/2} \sqrt {\log n} \cdot {\operatorname{poly}}(\log d)) and r \ge d \cdot {\operatorname{poly}}(\log d) ; in contrast our r = O(d \log d) and \kappa = O(d \log d) are optimal up to {\operatorname{poly}}(\log (d)) factors even for dense matrices. We also give (1) \ell _p oblivious subspace embeddings with an expected 1+\varepsilon number of non-zero entries per column for arbitrarily small \varepsilon \gt 0 , and (2) the first oblivious subspace embeddings for 1 \le p \lt 2 with O(1) -distortion and dimension independent of n . Oblivious subspace embeddings are crucial for distributed and streaming environments, as well as entrywise \ell _p low-rank approximation. Our results give improved algorithms for these applications.

Yang, Kang; Sarkar, Pratik; Weng, Chenkai; Wang, Xiao(
, Proceedings of the ACM Conference on Computer and Communications Security)

null
(Ed.)

Zero-knowledge (ZK) proofs with an optimal memory footprint have attracted a lot of attention, because such protocols can easily prove very large computation with a small memory requirement. Such ZK protocol only needs O(M) memory for both parties, where M is the memory required to verify the statement in the clear. In this paper, we propose several new ZK protocols in this setting, which improve the concrete efficiency and, at the same time, enable sublinear amortized communication for circuits with some notion of relaxed uniformity.
1. In the circuit-based model, where the computation is represented as a circuit over a field, our ZK protocol achieves a communication complexity of 1 field element per non-linear gate for any field size while keeping the computation very cheap.
We implemented our protocol, which shows extremely high efficiency and affordability. Compared to the previous best-known implementation, we achieve 6×–7× improvement in computation and 3×– 7× improvement in communication. When running on intro-level AWS instances, our protocol only needs one US dollar to prove one trillion AND gates (or 2.5 US dollars for one trillion multiplication gates over a 61-bit field).
2. In the setting where part of the computation can be represented as a set of polynomials, we can achieve communication sublinear to the polynomial size: the communication only depends on the input size and the highest degree of all polynomials, independent of the number of polynomials and the number of multiplications in the polynomials.
Using the improved ZK protocol, we can prove matrix multiplication with communication proportional to the input size, rather than the number of multiplications. Proving the multiplication of two 1024 × 1024 matrices, our implementation, with one thread and 1 GB of memory, only needs 10 seconds and communicates 25 MB, 35× faster than the state-of-the-art protocol Virgo that would need more than 140 GB of memory for the same task.

Dalzell, Alexander M.; Hunter-Jones, Nicholas; Brandão, Fernando G. S. L.(
, Communications in Mathematical Physics)

Abstract

We study the distribution over measurement outcomes of noisy random quantum circuits in the regime of low fidelity, which corresponds to the setting where the computation experiences at least one gate-level error with probability close to one. We model noise by adding a pair of weak, unital, single-qubit noise channels after each two-qubit gate, and we show that for typical random circuit instances, correlations between the noisy output distribution$$p_{\text {noisy}}$$${p}_{\text{noisy}}$and the corresponding noiseless output distribution$$p_{\text {ideal}}$$${p}_{\text{ideal}}$shrink exponentially with the expected number of gate-level errors. Specifically, the linear cross-entropy benchmarkFthat measures this correlation behaves as$$F=\text {exp}(-2s\epsilon \pm O(s\epsilon ^2))$$$F=\text{exp}(-2s\u03f5\pm O\left(s{\u03f5}^{2}\right))$, where$$\epsilon $$$\u03f5$is the probability of error per circuit location andsis the number of two-qubit gates. Furthermore, if the noise is incoherent—for example, depolarizing or dephasing noise—the total variation distance between the noisy output distribution$$p_{\text {noisy}}$$${p}_{\text{noisy}}$and the uniform distribution$$p_{\text {unif}}$$${p}_{\text{unif}}$decays at precisely the same rate. Consequently, the noisy output distribution can be approximated as$$p_{\text {noisy}}\approx Fp_{\text {ideal}}+ (1-F)p_{\text {unif}}$$${p}_{\text{noisy}}\approx F{p}_{\text{ideal}}+(1-F){p}_{\text{unif}}$. In other words, although at least one local error occurs with probability$$1-F$$$1-F$, the errors are scrambled by the random quantum circuit and can be treated as global white noise, contributing completely uniform output. Importantly, we upper bound the average total variation error in this approximation by$$O(F\epsilon \sqrt{s})$$$O\left(F\u03f5\sqrt{s}\right)$. Thus, the “white-noise approximation” is meaningful when$$\epsilon \sqrt{s} \ll 1$$$\u03f5\sqrt{s}\ll 1$, a quadratically weaker condition than the$$\epsilon s\ll 1$$$\u03f5s\ll 1$requirement to maintain high fidelity. The bound applies if the circuit size satisfies$$s \ge \Omega (n\log (n))$$$s\ge \Omega (nlog(n\left)\right)$, which corresponds to onlylogarithmic depthcircuits, and if, additionally, the inverse error rate satisfies$$\epsilon ^{-1} \ge {\tilde{\Omega }}(n)$$${\u03f5}^{-1}\ge \stackrel{~}{\Omega}\left(n\right)$, which is needed to ensure errors are scrambled faster thanFdecays. The white-noise approximation is useful for salvaging the signal from a noisy quantum computation; for example, it was an underlying assumption in complexity-theoretic arguments that noisy random quantum circuits cannot be efficiently sampled classically, even when the fidelity is low. Our method is based on a map from second-moment quantities in random quantum circuits to expectation values of certain stochastic processes for which we compute upper and lower bounds.

Peng, Richard, and Vempala, Santosh S. Solving Sparse Linear Systems Faster than Matrix Multiplication. Retrieved from https://par.nsf.gov/biblio/10253016. Proceedings of the 2021 ACM-SIAM Symposium on Discrete Algorithms, SODA 2021, Virtual Conference, January 10 - 13, 2021 . Web. doi:10.1137/1.9781611976465.31.

Peng, Richard, & Vempala, Santosh S. Solving Sparse Linear Systems Faster than Matrix Multiplication. Proceedings of the 2021 ACM-SIAM Symposium on Discrete Algorithms, SODA 2021, Virtual Conference, January 10 - 13, 2021, (). Retrieved from https://par.nsf.gov/biblio/10253016. https://doi.org/10.1137/1.9781611976465.31

Peng, Richard, and Vempala, Santosh S.
"Solving Sparse Linear Systems Faster than Matrix Multiplication". Proceedings of the 2021 ACM-SIAM Symposium on Discrete Algorithms, SODA 2021, Virtual Conference, January 10 - 13, 2021 (). Country unknown/Code not available. https://doi.org/10.1137/1.9781611976465.31.https://par.nsf.gov/biblio/10253016.

@article{osti_10253016,
place = {Country unknown/Code not available},
title = {Solving Sparse Linear Systems Faster than Matrix Multiplication},
url = {https://par.nsf.gov/biblio/10253016},
DOI = {10.1137/1.9781611976465.31},
abstractNote = {Can linear systems be solved faster than matrix multiplication? While there has been remarkable progress for the special cases of graph structured linear systems, in the general setting, the bit complexity of solving an $n \times n$ linear system $Ax=b$ is $\tilde{O}(n^\omega)$, where $\omega < 2.372864$ is the matrix multiplication exponent. Improving on this has been an open problem even for sparse linear systems with poly$(n)$ condition number. In this paper, we present an algorithm that solves linear systems in sparse matrices asymptotically faster than matrix multiplication for any $\omega > 2$. This speedup holds for any input matrix $A$ with $o(n^{\omega -1}/\log(\kappa(A)))$ non-zeros, where $\kappa(A)$ is the condition number of $A$. For poly$(n)$-conditioned matrices with $\tilde{O}(n)$ nonzeros, and the current value of $\omega$, the bit complexity of our algorithm to solve to within any $1/\text{poly}(n)$ error is $O(n^{2.331645})$. Our algorithm can be viewed as an efficient, randomized implementation of the block Krylov method via recursive low displacement rank factorizations. It is inspired by the algorithm of [Eberly et al. ISSAC `06 `07] for inverting matrices over finite fields. In our analysis of numerical stability, we develop matrix anti-concentration techniques to bound the smallest eigenvalue and the smallest gap in eigenvalues of semi-random matrices.},
journal = {Proceedings of the 2021 ACM-SIAM Symposium on Discrete Algorithms, SODA 2021, Virtual Conference, January 10 - 13, 2021},
author = {Peng, Richard and Vempala, Santosh S.},
editor = {null}
}

Warning: Leaving National Science Foundation Website

You are now leaving the National Science Foundation website to go to a non-government website.

Website:

NSF takes no responsibility for and exercises no control over the views expressed or the accuracy of
the information contained on this site. Also be aware that NSF's privacy policy does not apply to this site.