skip to main content

Title: Improved Convergence for $\ell_\infty$ and $\ell_1$ Regression via Iteratively Reweighted Least Squares
The iteratively reweighted least squares method (IRLS) is a popular technique used in practice for solving regression problems. Various versions of this method have been proposed, but their theoretical analyses failed to capture the good practical performance. In this paper we propose a simple and natural version of IRLS for solving $\ell_\infty$ and $\ell_1$ regression, which provably converges to a $(1+\epsilon)$-approximate solution in $O(m^{1/3}\log(1/\epsilon)/\epsilon^{2/3} + \log m/\epsilon^2)$ iterations, where $m$ is the number of rows of the input matrix. Interestingly, this running time is independent of the conditioning of the input, and the dominant term of the running time depends sublinearly in $\epsilon^{-1}$, which is atypical for the optimization of non-smooth functions. This improves upon the more complex algorithms of Chin et al. (ITCS '12), and Christiano et al. (STOC '11) by a factor of at least $1/\epsilon^2$, and yields a truly efficient natural algorithm for the slime mold dynamics (Straszak-Vishnoi, SODA '16, ITCS '16, ITCS '17).  more » « less
Award ID(s):
1718342 1750333
Author(s) / Creator(s):
Date Published:
Journal Name:
Proceedings of Machine Learning Research
Page Range / eLocation ID:
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. We give two new quantum algorithms for solving semidefinite programs (SDPs) providing quantum speed-ups. We consider SDP instances with m constraint matrices, each of dimension n, rank at most r, and sparsity s. The first algorithm assumes an input model where one is given access to an oracle to the entries of the matrices at unit cost. We show that it has run time O~(s^2 (sqrt{m} epsilon^{-10} + sqrt{n} epsilon^{-12})), with epsilon the error of the solution. This gives an optimal dependence in terms of m, n and quadratic improvement over previous quantum algorithms (when m ~~ n). The second algorithm assumes a fully quantum input model in which the input matrices are given as quantum states. We show that its run time is O~(sqrt{m}+poly(r))*poly(log m,log n,B,epsilon^{-1}), with B an upper bound on the trace-norm of all input matrices. In particular the complexity depends only polylogarithmically in n and polynomially in r. We apply the second SDP solver to learn a good description of a quantum state with respect to a set of measurements: Given m measurements and a supply of copies of an unknown state rho with rank at most r, we show we can find in time sqrt{m}*poly(log m,log n,r,epsilon^{-1}) a description of the state as a quantum circuit preparing a density matrix which has the same expectation values as rho on the m measurements, up to error epsilon. The density matrix obtained is an approximation to the maximum entropy state consistent with the measurement data considered in Jaynes' principle from statistical mechanics. As in previous work, we obtain our algorithm by "quantizing" classical SDP solvers based on the matrix multiplicative weight update method. One of our main technical contributions is a quantum Gibbs state sampler for low-rank Hamiltonians, given quantum states encoding these Hamiltonians, with a poly-logarithmic dependence on its dimension, which is based on ideas developed in quantum principal component analysis. We also develop a "fast" quantum OR lemma with a quadratic improvement in gate complexity over the construction of Harrow et al. [Harrow et al., 2017]. We believe both techniques might be of independent interest. 
    more » « less
  2. We consider the problem of maximizing the multilinear extension of a submodular function subject a single matroid constraint or multiple packing constraints with a small number of adaptive rounds of evaluation queries. We obtain the first algorithms with low adaptivity for submodular maximization with a matroid constraint. Our algorithms achieve a $1-1/e-\epsilon$ approximation for monotone functions and a $1/e-\epsilon$ approximation for non-monotone functions, which nearly matches the best guarantees known in the fully adaptive setting. The number of rounds of adaptivity is $O(\log^2{n}/\epsilon^3)$, which is an exponential speedup over the existing algorithms. We obtain the first parallel algorithm for non-monotone submodular maximization subject to packing constraints. Our algorithm achieves a $1/e-\epsilon$ approximation using $O(\log(n/\epsilon) \log(1/\epsilon) \log(n+m)/ \epsilon^2)$ parallel rounds, which is again an exponential speedup in parallel time over the existing algorithms. For monotone functions, we obtain a $1-1/e-\epsilon$ approximation in $O(\log(n/\epsilon)\log(m)/\epsilon^2)$ parallel rounds. The number of parallel rounds of our algorithm matches that of the state of the art algorithm for solving packing LPs with a linear objective (Mahoney et al., 2016). Our results apply more generally to the problem of maximizing a diminishing returns submodular (DR-submodular) function. 
    more » « less
  3. We consider the problem of performing linear regression over a stream of d-dimensional examples, and show that any algorithm that uses a subquadratic amount of memory exhibits a slower rate of convergence than can be achieved without memory constraints. Specifically, consider a sequence of labeled examples (a_1,b_1), (a_2,b_2)..., with a_i drawn independently from a d-dimensional isotropic Gaussian, and where b_i = + \eta_i, for a fixed x in R^d with ||x||= 1 and with independent noise \eta_i drawn uniformly from the interval [-2^{-d/5},2^{-d/5}]. We show that any algorithm with at most d^2/4 bits of memory requires at least \Omega(d \log \log \frac{1}{\epsilon}) samples to approximate x to \ell_2 error \epsilon with probability of success at least 2/3, for \epsilon sufficiently small as a function of d. In contrast, for such \epsilon, x can be recovered to error \epsilon with probability 1-o(1) with memory O\left(d^2 \log(1/\epsilon)\right) using d examples. This represents the first nontrivial lower bounds for regression with super-linear memory, and may open the door for strong memory/sample tradeoffs for continuous optimization. 
    more » « less
  4. We present an algorithm that, with high probability, generates a random spanning tree from an edge-weighted undirected graph in \Otil(n^{5/3 }m^{1/3}) time\footnote{The \Otil(\cdot) notation hides \poly(\log n) factors}. The tree is sampled from a distribution where the probability of each tree is proportional to the product of its edge weights. This improves upon the previous best algorithm due to Colbourn et al. that runs in matrix multiplication time, O(n^\omega). For the special case of unweighted graphs, this improves upon the best previously known running time of \tilde{O}(\min\{n^{\omega},m\sqrt{n},m^{4/3}\}) for m >> n^{7/4} (Colbourn et al. '96, Kelner-Madry '09, Madry et al. '15). The effective resistance metric is essential to our algorithm, as in the work of Madry et al., but we eschew determinant-based and random walk-based techniques used by previous algorithms. Instead, our algorithm is based on Gaussian elimination, and the fact that effective resistance is preserved in the graph resulting from eliminating a subset of vertices (called a Schur complement). As part of our algorithm, we show how to compute \eps-approximate effective resistances for a set SS of vertex pairs via approximate Schur complements in \Otil(m+(n + |S|)\eps^{-2}) time, without using the Johnson-Lindenstrauss lemma which requires \Otil( \min\{(m + |S|)\eps^{-2}, m+n\eps^{-4} +|S|\eps^{-2}\}) time. We combine this approximation procedure with an error correction procedure for handing edges where our estimate isn't sufficiently accurate. 
    more » « less
  5. We study local symmetry breaking problems in the Congest model, focusing on ruling set problems, which generalize the fundamental Maximal Independent Set (MIS) problem. The time (round) complexity of MIS (and ruling sets) have attracted much attention in the Local model. Indeed, recent results (Barenboim et al., FOCS 2012, Ghaffari SODA 2016) for the MIS problem have tried to break the long-standing O(log n)-round "barrier" achieved by Luby's algorithm, but these yield o(log n)-round complexity only when the maximum degree Delta is somewhat small relative to n. More importantly, these results apply only in the Local model. In fact, the best known time bound in the Congest model is still O(log n) (via Luby's algorithm) even for moderately small Delta (i.e., for Delta = Omega(log n) and Delta = o(n)). Furthermore, message complexity has been largely ignored in the context of local symmetry breaking. Luby's algorithm takes O(m) messages on m-edge graphs and this is the best known bound with respect to messages. Our work is motivated by the following central question: can we break the Theta(log n) time complexity barrier and the Theta(m) message complexity barrier in the Congest model for MIS or closely-related symmetry breaking problems? This paper presents progress towards this question for the distributed ruling set problem in the Congest model. A beta-ruling set is an independent set such that every node in the graph is at most beta hops from a node in the independent set. We present the following results: - Time Complexity: We show that we can break the O(log n) "barrier" for 2- and 3-ruling sets. We compute 3-ruling sets in O(log n/log log n) rounds with high probability (whp). More generally we show that 2-ruling sets can be computed in O(log Delta (log n)^(1/2 + epsilon) + log n/log log n) rounds for any epsilon > 0, which is o(log n) for a wide range of Delta values (e.g., Delta = 2^(log n)^(1/2-epsilon)). These are the first 2- and 3-ruling set algorithms to improve over the O(log n)-round complexity of Luby's algorithm in the Congest model. - Message Complexity: We show an Omega(n^2) lower bound on the message complexity of computing an MIS (i.e., 1-ruling set) which holds also for randomized algorithms and present a contrast to this by showing a randomized algorithm for 2-ruling sets that, whp, uses only O(n log^2 n) messages and runs in O(Delta log n) rounds. This is the first message-efficient algorithm known for ruling sets, which has message complexity nearly linear in n (which is optimal up to a polylogarithmic factor). 
    more » « less