Kernel methods offer the flexibility to learn complex relationships in modern, large data sets while enjoying strong theoretical guarantees on quality. Unfortunately, these methods typically require cubic running time in the data set size, a prohibitive cost in the large-data setting. Random feature maps (RFMs) and the Nyström method both consider low-rank approximations to the kernel matrix as a potential solution. But, in order to achieve desirable theoretical guarantees, the former may require a prohibitively large number of features $J_+$, and the latter may be prohibitively expensive for high-dimensional problems. We propose to combine the simplicity and generality of RFMs with a data-dependent feature selection scheme to achieve the desirable theoretical approximation properties of Nyström with just $O(\log J_+)$ features. Our key insight is to begin with a large set of random features, then reduce them to a small number of weighted features in a data-dependent, computationally efficient way, while preserving the statistical guarantees of using the original large set of features. We demonstrate the efficacy of our method with theory and experiments, including on a data set with over 50 million observations. In particular, we show that our method achieves small kernel matrix approximation error and better test set accuracy with …
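A minimal NumPy sketch of the recipe described above: generate a large pool of random Fourier features, then reduce it to a small set of weighted features in a data-dependent way. The energy-based importance-sampling rule here is only an illustrative stand-in for the paper's selection scheme, and the sizes and bandwidth are arbitrary.

import numpy as np

def random_fourier_features(X, n_features, gamma, rng):
    """Random Fourier features whose inner products approximate exp(-gamma ||x - y||^2)."""
    d = X.shape[1]
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(d, n_features))
    b = rng.uniform(0.0, 2 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))

# Step 1: a large pool of random features.
Z_big = random_fourier_features(X, n_features=2000, gamma=0.1, rng=rng)

# Step 2: data-dependent reduction.  Score each feature by its empirical energy,
# importance-sample a small subset, and reweight so that Z_small @ Z_small.T stays
# an unbiased estimate of Z_big @ Z_big.T.  (Illustrative heuristic only.)
energy = np.mean(Z_big ** 2, axis=0)
p = energy / energy.sum()
m = 100
idx = rng.choice(Z_big.shape[1], size=m, replace=True, p=p)
Z_small = Z_big[:, idx] / np.sqrt(m * p[idx])

K_big = Z_big @ Z_big.T        # approximation with 2000 features
K_small = Z_small @ Z_small.T  # approximation with 100 weighted features
print(np.linalg.norm(K_big - K_small) / np.linalg.norm(K_big))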
Solving Interpretable Kernel Dimensionality Reduction
Kernel dimensionality reduction (KDR) algorithms find a low-dimensional representation of the original data by optimizing kernel dependency measures that are capable of capturing nonlinear relationships. The standard strategy is to first map the data into a high-dimensional feature space using kernels prior to a projection onto a low-dimensional space. While KDR methods can be easily solved by keeping the most dominant eigenvectors of the kernel matrix, the resulting features are no longer easy to interpret. Interpretable KDR (IKDR) differs in that it projects onto a subspace \textit{before} the kernel feature mapping; therefore, the projection matrix indicates how the original features linearly combine to form the new features. Unfortunately, the IKDR objective requires a nonconvex manifold optimization that can no longer be solved by eigendecomposition and is difficult to optimize. Recently, an efficient iterative spectral (eigendecomposition) method (ISM) was proposed for this objective in the context of alternative clustering. However, ISM only provides theoretical guarantees for the Gaussian kernel, which greatly constrains its usage: any kernel method using ISM is limited to a single kernel. This work extends the theoretical guarantees of ISM to an entire family of kernels, thereby empowering ISM to …
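A schematic NumPy sketch of the iterative spectral (ISM-style) recipe for the Gaussian-kernel case of an IKDR objective of the form max_W Tr(Gamma K_W) subject to W^T W = I: alternate between forming a weighted scatter matrix Phi from the current projection and resetting W to the eigenvectors of Phi with the smallest eigenvalues. The Phi used here follows the Gaussian-kernel fixed point only; the function names, the HSIC-style Gamma in the usage example, and all hyperparameters are illustrative assumptions, and the ISM papers should be consulted for the Phi matrices of other kernels and for convergence conditions.

import numpy as np

def ism_gaussian(X, Gamma, q, sigma, n_iter=50, tol=1e-6):
    """Schematic iterative spectral update for max_W Tr(Gamma K_W), W^T W = I,
    with the Gaussian kernel K_W[i, j] = exp(-||W^T (x_i - x_j)||^2 / (2 sigma^2)).
    Gamma is assumed symmetric (e.g. a centered label kernel)."""
    n, d = X.shape
    W = np.linalg.qr(np.random.default_rng(0).normal(size=(d, q)))[0]
    for _ in range(n_iter):
        P = X @ W                                   # projected data, shape (n, q)
        D2 = np.sum((P[:, None, :] - P[None, :, :]) ** 2, axis=-1)
        K = np.exp(-D2 / (2 * sigma ** 2))
        C = Gamma * K                               # elementwise weights
        S = np.diag(C.sum(axis=1)) - C              # graph-Laplacian form
        Phi = X.T @ (S + S.T) @ X                   # sum_ij C_ij (x_i - x_j)(x_i - x_j)^T
        vals, vecs = np.linalg.eigh(Phi)
        W_new = vecs[:, :q]                         # eigenvectors of the q smallest eigenvalues
        if np.linalg.norm(W_new @ W_new.T - W @ W.T) < tol:
            return W_new
        W = W_new
    return W

# Hypothetical usage: supervised IKDR with an HSIC-style centered label kernel.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 8))
labels = (X[:, 0] + X[:, 1] > 0).astype(float)
Y = np.c_[labels, 1 - labels]                       # one-hot labels
H = np.eye(300) - np.ones((300, 300)) / 300
Gamma = H @ (Y @ Y.T) @ H
W = ism_gaussian(X, Gamma, q=2, sigma=1.0)
print(W.shape)  # (8, 2): each row shows how an original feature enters the projection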
 Award ID(s):
 1638234
 Publication Date:
 NSF-PAR ID:
 10176072
 Journal Name:
 Advances in Neural Information Processing Systems 32 (NIPS 2019)
 Volume:
 32
 Page Range or eLocation-ID:
 7915-7925
 Sponsoring Org:
 National Science Foundation
More Like this


Nyström approximation is a fast randomized method that rapidly solves kernel ridge regression (KRR) problems by subsampling the n-by-n empirical kernel matrix appearing in the objective function. However, the performance of such a subsampling method heavily relies on correctly estimating the statistical leverage scores for forming the sampling distribution, which can be as costly as solving the original KRR. In this work, we propose a linear time (modulo polylog terms) algorithm to accurately approximate the statistical leverage scores in stationary-kernel-based KRR with theoretical guarantees. In particular, by analyzing the first-order condition of the KRR objective, we derive an analytic formula, which depends on both the input distribution and the spectral density of stationary kernels, for capturing the nonuniformity of the statistical leverage scores. Numerical experiments demonstrate that with the same prediction accuracy our method is orders of magnitude more efficient than existing methods in selecting the representative subsamples in the Nyström approximation.
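A small NumPy sketch of the pipeline this abstract targets, using exact ridge leverage scores; the exact scores cost O(n^3), which is precisely the quantity that fast approximations aim to estimate cheaply. The kernel, regularization, and sample sizes below are arbitrary illustrative choices.

import numpy as np

def rbf_kernel(A, B, gamma):
    d2 = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
n, d, m, lam, gamma = 1000, 5, 50, 1e-2, 0.5
X = rng.normal(size=(n, d))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)

K = rbf_kernel(X, X, gamma)

# Exact ridge leverage scores l_i = [K (K + n*lam*I)^{-1}]_{ii}; this O(n^3)
# computation is what fast leverage-score approximations try to avoid.
lev = np.diag(np.linalg.solve(K + n * lam * np.eye(n), K))

# Nystrom KRR: sample m landmark points with probability proportional to the
# leverage scores, then solve the reduced (m x m) system.
p = lev / lev.sum()
idx = rng.choice(n, size=m, replace=False, p=p)
K_nm = rbf_kernel(X, X[idx], gamma)
K_mm = rbf_kernel(X[idx], X[idx], gamma)
alpha = np.linalg.solve(K_nm.T @ K_nm + n * lam * K_mm, K_nm.T @ y)
print("train MSE:", np.mean((K_nm @ alpha - y) ** 2))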

Gaussian processes (GPs) provide flexible distributions over functions, with inductive biases controlled by a kernel. However, in many applications Gaussian processes can struggle with even moderate input dimensionality. Learning a low-dimensional projection can help alleviate this curse of dimensionality, but introduces many trainable hyperparameters, which can be cumbersome, especially in the small data regime. We use additive sums of kernels for GP regression, where each kernel operates on a different random projection of its inputs. Surprisingly, we find that as the number of random projections increases, the predictive performance of this approach quickly converges to the performance of a kernel operating on the original full-dimensional inputs, over a wide range of data sets, even if we are projecting into a single dimension. As a consequence, many problems can remarkably be reduced to one-dimensional input spaces, without learning a transformation. We prove this convergence and its rate, and additionally propose a deterministic approach that converges more quickly than purely random projections. Moreover, we demonstrate our approach can achieve faster inference and improved predictive accuracy for high-dimensional inputs compared to kernels in the original input space.
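A NumPy sketch of the additive construction described above: an average of RBF kernels, each acting on a different random one-dimensional projection of the inputs, plugged into a standard GP posterior-mean computation. Everything here (projection count J, lengthscale, noise level, synthetic data) is an arbitrary illustrative assumption.

import numpy as np

def rbf_1d(a, b, lengthscale):
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * lengthscale ** 2))

def additive_projection_kernel(A, B, W, lengthscale):
    """Average of RBF kernels, each applied to one random 1-D projection w^T x."""
    K = np.zeros((A.shape[0], B.shape[0]))
    for w in W.T:
        K += rbf_1d(A @ w, B @ w, lengthscale)
    return K / W.shape[1]

rng = np.random.default_rng(0)
n, d, J, lengthscale, noise = 200, 20, 10, 1.0, 0.1
X = rng.normal(size=(n, d))
y = np.sin(X @ rng.normal(size=d) / np.sqrt(d)) + noise * rng.normal(size=n)
X_test = rng.normal(size=(50, d))

W = rng.normal(size=(d, J)) / np.sqrt(d)    # J random projection directions

K = additive_projection_kernel(X, X, W, lengthscale) + noise ** 2 * np.eye(n)
K_star = additive_projection_kernel(X_test, X, W, lengthscale)
posterior_mean = K_star @ np.linalg.solve(K, y)   # GP predictive mean at the test inputs
print(posterior_mean.shape)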

The Sylvester equation offers a powerful and unifying primitive for a variety of important graph mining tasks, including network alignment, graph kernels, node similarity, subgraph matching, etc. A major bottleneck of the Sylvester equation lies in its high computational complexity. Despite tremendous effort, state-of-the-art methods still require a complexity that is at least \textit{quadratic} in the number of nodes of the graphs, even with approximations. In this paper, we propose a family of Krylov subspace based algorithms (FASTEN) to speed up and scale up the computation of the Sylvester equation for graph mining. The key idea of the proposed methods is to project the original equivalent linear system onto a Kronecker Krylov subspace. We further exploit (1) the implicit representation of the solution matrix as well as the associated computation, and (2) the decomposition of the original Sylvester equation into a set of inter-correlated Sylvester equations of smaller size. The proposed algorithms bear two distinctive features. First, they provide the \textit{exact} solutions without any approximation error. Second, they significantly reduce the time and space complexity for solving the Sylvester equation, with two of the proposed algorithms having a \textit{linear} complexity in both time and space. Experimental evaluations on a diverse set …
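A minimal SciPy sketch of the underlying idea of treating the Sylvester equation AX + XB = C as a Kronecker-structured linear system and handing it to a Krylov solver without ever materializing the n^2-by-n^2 matrix; only the matrix-vector product vec(X) -> vec(AX + XB) is exposed. This is a generic Krylov baseline, not the paper's FASTEN algorithms, and the well-conditioned random A and B are illustrative assumptions.

import numpy as np
from scipy.sparse.linalg import LinearOperator, gmres

rng = np.random.default_rng(0)
n = 200
A = np.eye(n) + 0.1 * rng.normal(size=(n, n))   # well-conditioned, so a unique solution exists
B = np.eye(n) + 0.1 * rng.normal(size=(n, n))
C = rng.normal(size=(n, n))

def sylvester_matvec(v):
    """Apply the Kronecker-structured operator vec(X) -> vec(A X + X B)."""
    X = v.reshape(n, n)
    return (A @ X + X @ B).ravel()

op = LinearOperator((n * n, n * n), matvec=sylvester_matvec)
vec_x, info = gmres(op, C.ravel(), atol=1e-10)  # Krylov solve; no n^2-by-n^2 matrix is ever formed
X = vec_x.reshape(n, n)

print("converged:", info == 0)
print("residual:", np.linalg.norm(A @ X + X @ B - C))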

The kernelized Gram matrix $W$ constructed from data points $\{x_i\}_{i=1}^N$ as $W_{ij} = k_0\big( \frac{\|x_i - x_j\|^2}{\sigma^2} \big)$ is widely used in graph-based geometric data analysis and unsupervised learning. An important question is how to choose the kernel bandwidth $\sigma$, and a common practice called the self-tuned kernel adaptively sets a $\sigma_i$ at each point $x_i$ by the $k$-nearest neighbor (kNN) distance. When the $x_i$'s are sampled from a $d$-dimensional manifold embedded in a possibly high-dimensional space, unlike with fixed-bandwidth kernels, theoretical results on graph Laplacian convergence with self-tuned kernels have been incomplete. This paper proves the convergence of the graph Laplacian operator $L_N$ to the manifold (weighted) Laplacian for a new family of kNN self-tuned kernels $W^{(\alpha)}_{ij} = k_0\big( \frac{\|x_i - x_j\|^2}{\epsilon \hat{\rho}(x_i) \hat{\rho}(x_j)} \big) / \big( \hat{\rho}(x_i)^\alpha \hat{\rho}(x_j)^\alpha \big)$, where $\hat{\rho}$ is the bandwidth function estimated by kNN and the limiting operator is also parametrized by $\alpha$. When $\alpha = 1$, the limiting operator is the weighted manifold Laplacian $\varDelta_p$. Specifically, we prove the pointwise convergence of $L_N f$ and convergence of the graph Dirichlet form with rates. Our analysis is based on first establishing a $C^0$ …
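A short NumPy/SciPy sketch of the self-tuned construction described above, assuming $k_0(r) = e^{-r/2}$, absorbing $\epsilon$ into the kNN bandwidths, and using the random-walk normalization for the graph Laplacian; the choices of k, alpha, and the circle data are illustrative only.

import numpy as np
from scipy.spatial.distance import cdist

def self_tuned_laplacian(X, k=7, alpha=1.0):
    """Self-tuned kernel W^(alpha) with kNN bandwidths and its random-walk graph Laplacian."""
    D2 = cdist(X, X, 'sqeuclidean')
    rho = np.sqrt(np.sort(D2, axis=1)[:, k])       # bandwidth = distance to the k-th nearest neighbor
    W = np.exp(-D2 / (2 * np.outer(rho, rho)))     # k0(||x_i - x_j||^2 / (rho_i rho_j)), k0(r) = exp(-r/2)
    W = W / np.outer(rho, rho) ** alpha            # divide by rho_i^alpha rho_j^alpha
    deg = W.sum(axis=1)
    L = np.eye(len(X)) - W / deg[:, None]          # random-walk normalized graph Laplacian
    return W, L

rng = np.random.default_rng(0)
t = rng.uniform(0, 2 * np.pi, size=400)            # points on a circle (a 1-D manifold) in R^3
X = np.c_[np.cos(t), np.sin(t), np.zeros_like(t)] + 0.01 * rng.normal(size=(400, 3))
W, L = self_tuned_laplacian(X, k=7, alpha=1.0)
print(W.shape, L.shape)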