
Title: Sublinear Time Approximation of Text Similarity Matrices
We study algorithms for approximating pairwise similarity matrices that arise in natural language processing. Generally, computing a similarity matrix for $n$ data points requires $\Omega(n^2)$ similarity computations. This quadratic scaling is a significant bottleneck, especially when similarities are computed via expensive functions, e.g., via transformer models. Approximation methods reduce this quadratic complexity, often by using a small subset of exactly computed similarities to approximate the remainder of the complete pairwise similarity matrix. Significant work focuses on the efficient approximation of positive semidefinite (PSD) similarity matrices, which arise, e.g., in kernel methods. However, much less is understood about indefinite (non-PSD) similarity matrices, which often arise in NLP. Motivated by the observation that many of these matrices are still somewhat close to PSD, we introduce a generalization of the popular \emph{Nyström method} to the indefinite setting. Our algorithm can be applied to any similarity matrix and runs in sublinear time in the size of the matrix, producing a rank-$s$ approximation with just $O(ns)$ similarity computations. We show that our method, along with a simple variant of CUR decomposition, performs very well in approximating a variety of similarity matrices arising in NLP tasks. We demonstrate high accuracy of the approximated similarity matrices on document classification, sentence similarity, and cross-document coreference tasks.
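To make the landmark idea concrete, here is a minimal NumPy sketch of a generic Nyström-style rank-s approximation that computes only O(ns) similarities and uses an eigendecomposition-based pseudo-inverse so the landmark core may be indefinite. The `similarity` callable, the cosine stand-in, and the parameter choices are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def nystrom_approx(similarity, points, s, rng=None):
    """Rank-s Nystrom-style approximation of a pairwise similarity matrix.

    `similarity(u, v)` is an (expensive) scalar similarity function and
    `points` is the list of n items.  Only O(n*s) similarities are computed.
    Generic landmark sketch; not the exact algorithm of the paper.
    """
    rng = np.random.default_rng(rng)
    n = len(points)
    landmarks = rng.choice(n, size=s, replace=False)

    # C: similarities between all n points and the s landmarks  (n x s)
    C = np.array([[similarity(points[i], points[j]) for j in landmarks]
                  for i in range(n)])
    # W: similarities among the landmarks themselves             (s x s)
    W = C[landmarks, :]

    # Pseudo-inverse via symmetric eigendecomposition; this also works when W
    # is indefinite (has negative eigenvalues), unlike Cholesky-based Nystrom.
    vals, vecs = np.linalg.eigh((W + W.T) / 2)
    keep = np.abs(vals) > 1e-10 * np.abs(vals).max()
    W_pinv = (vecs[:, keep] / vals[keep]) @ vecs[:, keep].T

    return C @ W_pinv @ C.T          # approximates the full n x n matrix


# Example: cosine similarity on random embeddings, a cheap stand-in for an
# expensive transformer-based similarity function.
emb = np.random.default_rng(0).normal(size=(500, 64))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
cosine = lambda u, v: float(u @ v)
K_hat = nystrom_approx(cosine, list(emb), s=40, rng=1)
```

With cosine similarity the true matrix is PSD, so this run only exercises the mechanics; the eigendecomposition route matters precisely when the landmark block has negative eigenvalues.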
Authors:
Award ID(s):
1763618 2046235
Publication Date:
NSF-PAR ID:
10326696
Journal Name:
Proceedings of the AAAI Conference on Artificial Intelligence
ISSN:
2159-5399
Sponsoring Org:
National Science Foundation
More Like this
  1. We consider the development of practical stochastic quasi-Newton methods, and in particular Kronecker-factored block-diagonal BFGS and L-BFGS methods, for training deep neural networks (DNNs). In DNN training, the number of variables and gradient components n is often on the order of tens of millions, and the Hessian has n^2 elements. Consequently, computing and storing a full n × n BFGS approximation, or storing a modest number of (step, change in gradient) vector pairs for use in an L-BFGS implementation, is out of the question. In our proposed methods, we approximate the Hessian by a block-diagonal matrix and use the structure of the gradient and Hessian to further approximate these blocks, each of which corresponds to a layer, as the Kronecker product of two much smaller matrices (see the Kronecker-product sketch after this list). This is analogous to the approach in KFAC, which computes a Kronecker-factored block-diagonal approximation to the Fisher matrix in a stochastic natural gradient method. Because of the indefinite and highly variable nature of the Hessian in a DNN, we also propose a new damping approach to keep the upper as well as the lower bounds of the BFGS and L-BFGS approximations bounded. In tests on autoencoder feed-forward network models with either nine or thirteen layers applied to three datasets, our methods outperformed or performed comparably to KFAC and state-of-the-art first-order stochastic methods.
  2. We investigate the approximability of the following optimization problem. The input is an $n \times n$ matrix $A = (A_{ij})$ with real entries and an origin-symmetric convex body $K \subset \mathbb{R}^n$ that is given by a membership oracle. The task is to compute (or approximate) the maximum of the quadratic form $\sum_{i=1}^{n} \sum_{j=1}^{n} A_{ij} x_i x_j = \langle x, Ax \rangle$ as $x$ ranges over $K$ (a toy special case, with $K$ the Euclidean ball, is sketched after this list). This is a rich and expressive family of optimization problems; for different choices of matrices $A$ and convex bodies $K$ it includes a diverse range of optimization problems like max-cut, Grothendieck/non-commutative Grothendieck inequalities, small set expansion and more. While the literature studied these special cases using case-specific reasoning, here we develop a general methodology for treatment of the approximability and inapproximability aspects of these questions. The underlying geometry of $K$ plays a critical role; we show under commonly used complexity assumptions that polytime constant-approximability necessitates that $K$ has a type-2 constant that grows slowly with $n$. However, we show that even when the type-2 constant is bounded, this problem sometimes exhibits strong hardness of approximation. Thus, even within the realm of type-2 bodies, the approximability landscape is nuanced and subtle. However, the link that we establish between optimization and geometry of Banach spaces allows us to devise a generic algorithmic approach to the above problem. We associate to each convex body a new (higher dimensional) auxiliary set that is not convex, but is approximately convex when $K$ has a bounded type-2 constant. If our auxiliary set has an approximate separation oracle, then we design an approximation algorithm for the original quadratic optimization problem, using an approximate version of the ellipsoid method. Even though our hardness result implies that such an oracle does not exist in general, this new question can be solved in specific cases of interest by implementing a range of classical tools from functional analysis, most notably the deep factorization theory of linear operators. Beyond encompassing the scenarios in the literature for which constant-factor approximation algorithms were found, our generic framework implies that for convex sets with bounded type-2 constant, constant-factor approximability is preserved under the following basic operations: (a) subspaces, (b) quotients, (c) Minkowski sums, (d) complex interpolation. This yields a rich family of new examples where constant-factor approximations are possible, which were beyond the reach of previous methods. We also show (under commonly used complexity assumptions) that for symmetric norms and unitarily invariant matrix norms the type-2 constant nearly characterizes the approximability of quadratic maximization.
  3. We develop effective approximation methods for unitary matrices. In our formulation, a unitary matrix is represented as a product of rotations in two-dimensional subspaces, so-called Givens rotations. Instead of the quadratic dimension dependence incurred when applying a dense matrix, applying such an approximation scales with the number of factors, each of which can be implemented efficiently (see the Givens-rotation sketch after this list). Consequently, in settings where an approximation is computed once and then applied many times, such an effective representation becomes advantageous. Although efficient Givens factorizations are not possible for generic unitary operators, we show that minimizing a sparsity-inducing objective with a coordinate descent algorithm on the unitary group yields good factorizations for structured matrices. Canonical applications of such a setup are orthogonal basis transforms. We demonstrate that our methods improve the approximate representation of the graph Fourier transform, the matrix obtained when diagonalizing a graph Laplacian.
  4. In this paper, we propose a new class of operator factorization methods to discretize the integral fractional Laplacian $(-\Delta)^{\alpha/2}$ for $\alpha \in (0, 2)$. One main advantage is that our method can easily increase numerical accuracy by using high-degree Lagrange basis functions, while keeping its scheme structure and computer implementation unchanged. Moreover, it results in a symmetric (multilevel) Toeplitz differentiation matrix, enabling efficient computation via the fast Fourier transform (see the Toeplitz/FFT sketch after this list). If constant or linear basis functions are used, our method has an accuracy of ${\mathcal O}(h^2)$, while ${\mathcal O}(h^4)$ for quadratic basis functions, with $h$ a small mesh size. This accuracy can be achieved for any $\alpha \in (0, 2)$ and can be further increased if higher-degree basis functions are chosen. Numerical experiments are provided to approximate the fractional Laplacian and solve fractional Poisson problems. They show that if the solution of the fractional Poisson problem satisfies $u \in C^{m, l}(\bar{\Omega})$ for $m \in \mathbb{N}$ and $0 < l < 1$, our method has an accuracy of ${\mathcal O}(h^{\min\{m+l,\,2\}})$ for constant and linear basis functions, while ${\mathcal O}(h^{\min\{m+l,\,4\}})$ for quadratic basis functions. Additionally, our method can be readily applied to approximate generalized fractional Laplacians with symmetric kernel functions, and a numerical study on the tempered fractional Poisson problem demonstrates its efficiency.

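For the first related item, a minimal NumPy illustration of why a Kronecker-factored Hessian block is cheap to apply: the identity (A ⊗ B) vec(X) = vec(B X Aᵀ) means the large block never has to be formed. The matrices below are random placeholders, not the actual curvature factors of that method.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 64, 32                        # toy "layer" dimensions
A = rng.normal(size=(n, n))          # small Kronecker factor   (n x n)
B = rng.normal(size=(m, m))          # small Kronecker factor   (m x m)
X = rng.normal(size=(m, n))          # one layer's gradient, as a matrix

# Dense route: materialise the (m*n) x (m*n) block and multiply -- the cost a
# full per-layer Hessian (or BFGS) approximation would incur.
H_block = np.kron(A, B)                              # (m*n)^2 entries
dense = H_block @ X.flatten(order="F")               # column-major vec(X)

# Kronecker route: (A kron B) vec(X) = vec(B X A^T), never forming H_block.
fast = (B @ X @ A.T).flatten(order="F")

print(np.allclose(dense, fast))      # True: same product at a fraction of the cost
```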
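For the second item, a toy special case only: when K is the Euclidean unit ball, the maximum of ⟨x, Ax⟩ equals the largest eigenvalue of the symmetric part of A (assuming that eigenvalue is nonnegative). The few lines below verify this numerically; they are meant to make the objective concrete, not to reflect the general bodies or the ellipsoid-based machinery of the abstract.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
A = rng.normal(size=(n, n))          # arbitrary real matrix, as in the problem statement

# Over the unit ball, x^T A x = x^T ((A + A^T)/2) x, maximised by the top
# eigenvector of the symmetric part (when its top eigenvalue is >= 0).
sym = (A + A.T) / 2
vals, vecs = np.linalg.eigh(sym)
exact, x_star = vals[-1], vecs[:, -1]

# Sanity checks: the maximiser attains the value, random unit vectors never exceed it.
assert np.isclose(x_star @ A @ x_star, exact)
samples = rng.normal(size=(1000, n))
samples /= np.linalg.norm(samples, axis=1, keepdims=True)
assert (np.einsum("ij,jk,ik->i", samples, A, samples) <= exact + 1e-9).all()
print(exact)
```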
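For the Givens-rotation item, a small sketch of the cost argument: each rotation touches only two coordinates, so applying k rotations to a vector costs O(k), versus O(n^2) for a dense orthogonal matrix. The random indices and angles are placeholders; the coordinate-descent procedure that learns the factors in the quoted work is not reproduced here.

```python
import numpy as np

def apply_givens(x, rotations):
    """Apply a product of Givens rotations to x, first rotation first.

    Each rotation (i, j, theta) touches only coordinates i and j, so
    applying k rotations costs O(k), versus O(n^2) for a dense matrix.
    """
    y = x.astype(float).copy()
    for i, j, theta in rotations:
        c, s = np.cos(theta), np.sin(theta)
        yi, yj = y[i], y[j]
        y[i], y[j] = c * yi - s * yj, s * yi + c * yj
    return y


# Verify against the explicitly assembled dense orthogonal matrix.
rng = np.random.default_rng(0)
n, k = 64, 300
rotations = [(int(i), int(j), float(t))
             for i, j, t in zip(rng.integers(0, n, k),
                                rng.integers(0, n, k),
                                rng.uniform(0, 2 * np.pi, k))
             if i != j]

Q = np.eye(n)
for i, j, theta in rotations:
    G = np.eye(n)
    c, s = np.cos(theta), np.sin(theta)
    G[i, i] = G[j, j] = c
    G[i, j], G[j, i] = -s, s
    Q = G @ Q                        # dense build, for the comparison only

x = rng.normal(size=n)
print(np.allclose(Q @ x, apply_givens(x, rotations)))   # True
```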
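For the fractional-Laplacian item, the computational point is that a symmetric Toeplitz differentiation matrix can be applied in O(n log n) by embedding it in a circulant and using the FFT. Below is a generic sketch of that trick with a placeholder kernel; the actual entries of the paper's differentiation matrix are not reproduced.

```python
import numpy as np

def toeplitz_matvec(c, x):
    """Multiply the symmetric Toeplitz matrix with first column c by x.

    The matrix is embedded in a 2n x 2n circulant, which the FFT
    diagonalises, so the product costs O(n log n) instead of O(n^2).
    """
    n = len(c)
    col = np.concatenate([c, [0.0], c[-1:0:-1]])   # circulant first column, length 2n
    y = np.concatenate([x, np.zeros(n)])
    return np.real(np.fft.ifft(np.fft.fft(col) * np.fft.fft(y))[:n])


# Check against an explicit dense Toeplitz product.
rng = np.random.default_rng(0)
n = 1024
c = 1.0 / (1.0 + np.arange(n)) ** 2                # placeholder kernel, not the paper's weights
x = rng.normal(size=n)
T = c[np.abs(np.arange(n)[:, None] - np.arange(n)[None, :])]   # dense T[i, j] = c[|i - j|]
print(np.allclose(T @ x, toeplitz_matvec(c, x)))   # True
```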