skip to main content

This content will become publicly available on July 4, 2024

Title: Data‐driven linear complexity low‐rank approximation of general kernel matrices: A geometric approach
Summary A general, rectangular kernel matrix may be defined as where is a kernel function and where and are two sets of points. In this paper, we seek a low‐rank approximation to a kernel matrix where the sets of points and are large and are arbitrarily distributed, such as away from each other, “intermingled”, identical, and so forth. Such rectangular kernel matrices may arise, for example, in Gaussian process regression where corresponds to the training data and corresponds to the test data. In this case, the points are often high‐dimensional. Since the point sets are large, we must exploit the fact that the matrix arises from a kernel function, and avoid forming the matrix, and thus ruling out most algebraic techniques. In particular, we seek methods that can scale linearly or nearly linearly with respect to the size of data for a fixed approximation rank. The main idea in this paper is to geometrically select appropriate subsets of points to construct a low rank approximation. An analysis in this paper guides how this selection should be performed.  more » « less
Award ID(s):
Author(s) / Creator(s):
; ;
Date Published:
Journal Name:
Numerical Linear Algebra with Applications
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Kernel matrices, as well as weighted graphs represented by them, are ubiquitous objects in machine learning, statistics and other related fields. The main drawback of using kernel methods (learning and inference using kernel matrices) is efficiency – given n input points, most kernel-based algorithms need to materialize the full n × n kernel matrix before performing any subsequent computation, thus incurring Ω(n^2) runtime. Breaking this quadratic barrier for various problems has therefore, been a subject of extensive research efforts. We break the quadratic barrier and obtain subquadratic time algorithms for several fundamental linear-algebraic and graph processing primitives, including approximating the top eigenvalue and eigenvector, spectral sparsification, solving lin- ear systems, local clustering, low-rank approximation, arboricity estimation and counting weighted triangles. We build on the recently developed Kernel Density Estimation framework, which (after preprocessing in time subquadratic in n) can return estimates of row/column sums of the kernel matrix. In particular, we de- velop efficient reductions from weighted vertex and weighted edge sampling on kernel graphs, simulating random walks on kernel graphs, and importance sampling on matrices to Kernel Density Estimation and show that we can generate samples from these distributions in sublinear (in the support of the distribution) time. Our reductions are the central ingredient in each of our applications and we believe they may be of independent interest. We empirically demonstrate the efficacy of our algorithms on low-rank approximation (LRA) and spectral sparsi- fication, where we observe a 9x decrease in the number of kernel evaluations over baselines for LRA and a 41x reduction in the graph size for spectral sparsification. 
    more » « less
  2. Kernel methods offer the flexibility to learn complex relationships in modern, large data sets while enjoying strong theoretical guarantees on quality. Unfortunately, these methods typically require cubic running time in the data set size, a prohibitive cost in the large-data setting. Random feature maps (RFMs) and the Nyström method both consider low-rank approximations to the kernel matrix as a potential solution. But, in order to achieve desirable theoretical guarantees, the former may require a prohibitively large number of features J+, and the latter may be prohibitively expensive for high-dimensional problems. We propose to combine the simplicity and generality of RFMs with a data-dependent feature selection scheme to achieve desirable theoretical approximation properties of Nyström with just O(\log J+) features. Our key insight is to begin with a large set of random features, then reduce them to a small number of weighted features in a data-dependent, computationally efficient way, while preserving the statistical guarantees of using the original large set of features. We demonstrate the efficacy of our method with theory and experiments-including on a data set with over 50 million observations. In particular, we show that our method achieves small kernel matrix approximation error and better test set accuracy with provably fewer random features than state-of-the-art methods. 
    more » « less
  3. Summary

    This paper presents an efficient method to perform structured matrix approximation by separation and hierarchy (SMASH), when the original dense matrix is associated with a kernel function. Given the points in a domain, a tree structure is first constructed based on an adaptive partition of the computational domain to facilitate subsequent approximation procedures. In contrast to existing schemes based on either analytic or purely algebraic approximations, SMASH takes advantage of both approaches and greatly improves efficiency. The algorithm follows a bottom‐up traversal of the tree and is able to perform the operations associated with each node on the same level in parallel. A strong rank‐revealing factorization is applied to the initial analytic approximation in theseparationregime so that a special structure is incorporated into the final nested bases. As a consequence, the storage is significantly reduced on one hand and a hierarchy of the original grid is constructed on the other hand. Due to this hierarchy, nested bases at upper levels can be computed in a similar way as the leaf level operations but on coarser grids. The main advantages of SMASH include its simplicity of implementation, its flexibility to construct various hierarchical rank structures, and a low storage cost. The efficiency and robustness of SMASH are demonstrated through various test problems arising from integral equations, structured matrices, etc.

    more » « less
  4. Abstract Kernelized Gram matrix $W$ constructed from data points $\{x_i\}_{i=1}^N$ as $W_{ij}= k_0( \frac{ \| x_i - x_j \|^2} {\sigma ^2} ) $ is widely used in graph-based geometric data analysis and unsupervised learning. An important question is how to choose the kernel bandwidth $\sigma $, and a common practice called self-tuned kernel adaptively sets a $\sigma _i$ at each point $x_i$ by the $k$-nearest neighbor (kNN) distance. When $x_i$s are sampled from a $d$-dimensional manifold embedded in a possibly high-dimensional space, unlike with fixed-bandwidth kernels, theoretical results of graph Laplacian convergence with self-tuned kernels have been incomplete. This paper proves the convergence of graph Laplacian operator $L_N$ to manifold (weighted-)Laplacian for a new family of kNN self-tuned kernels $W^{(\alpha )}_{ij} = k_0( \frac{ \| x_i - x_j \|^2}{ \epsilon \hat{\rho }(x_i) \hat{\rho }(x_j)})/\hat{\rho }(x_i)^\alpha \hat{\rho }(x_j)^\alpha $, where $\hat{\rho }$ is the estimated bandwidth function by kNN and the limiting operator is also parametrized by $\alpha $. When $\alpha = 1$, the limiting operator is the weighted manifold Laplacian $\varDelta _p$. Specifically, we prove the point-wise convergence of $L_N f $ and convergence of the graph Dirichlet form with rates. Our analysis is based on first establishing a $C^0$ consistency for $\hat{\rho }$ which bounds the relative estimation error $|\hat{\rho } - \bar{\rho }|/\bar{\rho }$ uniformly with high probability, where $\bar{\rho } = p^{-1/d}$ and $p$ is the data density function. Our theoretical results reveal the advantage of the self-tuned kernel over the fixed-bandwidth kernel via smaller variance error in low-density regions. In the algorithm, no prior knowledge of $d$ or data density is needed. The theoretical results are supported by numerical experiments on simulated data and hand-written digit image data. 
    more » « less
  5. Abstract

    High‐resolution characterization of hydraulic properties and dense nonaqueous phase liquid (DNAPL) contaminant source is crucial to develop efficient remediation strategies. However, DNAPL characterization suffers from a limited number of borehole data in the field, resulting in a low‐resolution estimation. Moreover, high‐resolution DNAPL characterization requires a large number of unknowns to be estimated, presenting a computational bottleneck. In this paper, a low‐cost geophysical approach, the self‐potential method, is used as additional information for hydraulic properties characterization. Joint inversion of hydraulic head and self‐potential measurements is proposed to improve hydraulic conductivity estimation, which is then used to characterize the DNAPL saturation distribution by inverting partitioning tracer measurements. The computational barrier is overcome by (a) solving the inversion by the principal component geostatistical approach, in which the covariance matrix is replaced by a low‐rank approximation, thus reducing the number of forward model runs; (b) using temporal moments of concentrations instead of individual concentration data points for faster forward simulations. To assess the ability of the proposed approach, numerical experiments are conducted in a 3‐D aquifer with 104unknown hydraulic conductivities and DNAPL saturations. Results show that with realistic DNAPL sources and a limited number of hydraulic heads, the traditional hydraulic/partitioning tracer tomography roughly reconstructs subsurface heterogeneity but fails to resolve the DNAPL distribution. By adding self‐potential data, the error is reduced by 24% in hydraulic conductivity estimation and 68% in DNAPL saturation characterization. The proposed sequential inversion framework utilizes the complementary information from multi‐source hydrogeophysical data sets, and can provide high‐resolution characterizations for realistic DNAPL sources.

    more » « less