skip to main content

Title: Convergence of graph Laplacian with kNN self-tuned kernels
Abstract Kernelized Gram matrix $W$ constructed from data points $\{x_i\}_{i=1}^N$ as $W_{ij}= k_0( \frac{ \| x_i - x_j \|^2} {\sigma ^2} ) $ is widely used in graph-based geometric data analysis and unsupervised learning. An important question is how to choose the kernel bandwidth $\sigma $, and a common practice called self-tuned kernel adaptively sets a $\sigma _i$ at each point $x_i$ by the $k$-nearest neighbor (kNN) distance. When $x_i$s are sampled from a $d$-dimensional manifold embedded in a possibly high-dimensional space, unlike with fixed-bandwidth kernels, theoretical results of graph Laplacian convergence with self-tuned kernels have been incomplete. This paper proves the convergence of graph Laplacian operator $L_N$ to manifold (weighted-)Laplacian for a new family of kNN self-tuned kernels $W^{(\alpha )}_{ij} = k_0( \frac{ \| x_i - x_j \|^2}{ \epsilon \hat{\rho }(x_i) \hat{\rho }(x_j)})/\hat{\rho }(x_i)^\alpha \hat{\rho }(x_j)^\alpha $, where $\hat{\rho }$ is the estimated bandwidth function by kNN and the limiting operator is also parametrized by $\alpha $. When $\alpha = 1$, the limiting operator is the weighted manifold Laplacian $\varDelta _p$. Specifically, we prove the point-wise convergence of $L_N f $ and convergence of the graph Dirichlet form with rates. Our analysis is based on first establishing a $C^0$ consistency for $\hat{\rho }$ which bounds the relative estimation error $|\hat{\rho } - \bar{\rho }|/\bar{\rho }$ uniformly with high probability, where $\bar{\rho } = p^{-1/d}$ and $p$ is the data density function. Our theoretical results reveal the advantage of the self-tuned kernel over the fixed-bandwidth kernel via smaller variance error in low-density regions. In the algorithm, no prior knowledge of $d$ or data density is needed. The theoretical results are supported by numerical experiments on simulated data and hand-written digit image data.  more » « less
Award ID(s):
Author(s) / Creator(s):
Date Published:
Journal Name:
Information and Inference: A Journal of the IMA
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. In this paper, we propose a new class of operator factorization methods to discretize the integral fractional Laplacian \begin{document}$ (- \Delta)^\frac{{ \alpha}}{{2}} $\end{document} for \begin{document}$ \alpha \in (0, 2) $\end{document}. One main advantage is that our method can easily increase numerical accuracy by using high-degree Lagrange basis functions, but remain its scheme structure and computer implementation unchanged. Moreover, it results in a symmetric (multilevel) Toeplitz differentiation matrix, enabling efficient computation via the fast Fourier transforms. If constant or linear basis functions are used, our method has an accuracy of \begin{document}$ {\mathcal O}(h^2) $\end{document}, while \begin{document}$ {\mathcal O}(h^4) $\end{document} for quadratic basis functions with \begin{document}$ h $\end{document} a small mesh size. This accuracy can be achieved for any \begin{document}$ \alpha \in (0, 2) $\end{document} and can be further increased if higher-degree basis functions are chosen. Numerical experiments are provided to approximate the fractional Laplacian and solve the fractional Poisson problems. It shows that if the solution of fractional Poisson problem satisfies \begin{document}$ u \in C^{m, l}(\bar{ \Omega}) $\end{document} for \begin{document}$ m \in {\mathbb N} $\end{document} and \begin{document}$ 0 < l < 1 $\end{document}, our method has an accuracy of \begin{document}$ {\mathcal O}(h^{\min\{m+l, \, 2\}}) $\end{document} for constant and linear basis functions, while \begin{document}$ {\mathcal O}(h^{\min\{m+l, \, 4\}}) $\end{document} for quadratic basis functions. Additionally, our method can be readily applied to approximate the generalized fractional Laplacians with symmetric kernel function, and numerical study on the tempered fractional Poisson problem demonstrates its efficiency.

    more » « less
  2. Given only a finite collection of points sampled from a Riemannian manifold embedded in a Euclidean space, in this paper we propose a new method to numerically solve elliptic and parabolic partial differential equations (PDEs) supplemented with boundary conditions. Since the construction of triangulations on unknown manifolds can be both difficult and expensive, both in terms of computational and data requirements, our goal is to solve these problems without a triangulation. Instead, we rely only on using the sample points to define quadrature formulas on the unknown manifold. Our main tool is the diffusion maps algorithm. We re-analyze this well-known method in a variational sense for manifolds with boundary. Our main result is that the variational diffusion maps graph Laplacian is a consistent estimator of the Dirichlet energy on the manifold. This improves upon previous results and provides a rigorous justification of the well-known relationship between diffusion maps and the Neumann eigenvalue problem. Moreover, using semigeodesic coordinates we derive the first uniform asymptotic expansion of the diffusion maps kernel integral operator for manifolds with boundary. This expansion relies on a novel lemma which relates the extrinsic Euclidean distance to the coordinate norm in a normal collar of the boundary. We then use a recently developed method of estimating the distance to boundary function (notice that the boundary location is assumed to be unknown) to construct a consistent estimator for boundary integrals. Finally, by combining these various estimators, we illustrate how to impose Dirichlet and Neumann conditions for some common PDEs based on the Laplacian. Several numerical examples illustrate our theoretical findings. 
    more » « less
  3. Nie, Qing (Ed.)
    The analysis of single-cell genomics data presents several statistical challenges, and extensive efforts have been made to produce methods for the analysis of this data that impute missing values, address sampling issues and quantify and correct for noise. In spite of such efforts, no consensus on best practices has been established and all current approaches vary substantially based on the available data and empirical tests. The k-Nearest Neighbor Graph (kNN-G) is often used to infer the identities of, and relationships between, cells and is the basis of many widely used dimensionality-reduction and projection methods. The kNN-G has also been the basis for imputation methods using, e.g ., neighbor averaging and graph diffusion. However, due to the lack of an agreed-upon optimal objective function for choosing hyperparameters, these methods tend to oversmooth data, thereby resulting in a loss of information with regard to cell identity and the specific gene-to-gene patterns underlying regulatory mechanisms. In this paper, we investigate the tuning of kNN- and diffusion-based denoising methods with a novel non-stochastic method for optimally preserving biologically relevant informative variance in single-cell data. The framework, Denoising Expression data with a Weighted Affinity Kernel and Self-Supervision (DEWÄKSS), uses a self-supervised technique to tune its parameters. We demonstrate that denoising with optimal parameters selected by our objective function (i) is robust to preprocessing methods using data from established benchmarks, (ii) disentangles cellular identity and maintains robust clusters over dimension-reduction methods, (iii) maintains variance along several expression dimensions, unlike previous heuristic-based methods that tend to oversmooth data variance, and (iv) rarely involves diffusion but rather uses a fixed weighted kNN graph for denoising. Together, these findings provide a new understanding of kNN- and diffusion-based denoising methods. Code and example data for DEWÄKSS is available at . 
    more » « less
  4. null (Ed.)
    ABSTRACT We report three-dimensional hydrodynamical simulations of shocks (${\cal M_{\rm shock}}\ge 4$) interacting with fractal multicloud layers. The evolution of shock–multicloud systems consists of four stages: a shock-splitting phase in which reflected and refracted shocks are generated, a compression phase in which the forward shock compresses cloud material, an expansion phase triggered by internal heating and shock re-acceleration, and a mixing phase in which shear instabilities generate turbulence. We compare multicloud layers with narrow ($\sigma _{\rho }=1.9\bar{\rho }$) and wide ($\sigma _{\rho }=5.9\bar{\rho }$) lognormal density distributions characteristic of Mach ≈ 5 supersonic turbulence driven by solenoidal and compressive modes. Our simulations show that outflowing cloud material contains imprints of the density structure of their native environments. The dynamics and disruption of multicloud systems depend on the porosity and the number of cloudlets in the layers. ‘Solenoidal’ layers mix less, generate less turbulence, accelerate faster, and form a more coherent mixed-gas shell than the more porous ‘compressive’ layers. Similarly, multicloud systems with more cloudlets quench mixing via a shielding effect and enhance momentum transfer. Mass loading of diffuse mixed gas is efficient in all models, but direct dense gas entrainment is highly inefficient. Dense gas only survives in compressive clouds, but has low speeds. If normalized with respect to the shock-passage time, the evolution shows invariance for shock Mach numbers ≥10 and different cloud-generating seeds, and slightly weaker scaling for lower Mach numbers and thinner cloud layers. Multicloud systems also have better convergence properties than single-cloud systems, with a resolution of eight cells per cloud radius being sufficient to capture their overall dynamics. 
    more » « less
  5. Abstract

    In the (special) smoothing spline problem one considers a variational problem with a quadratic data fidelity penalty and Laplacian regularization. Higher order regularity can be obtained via replacing the Laplacian regulariser with a poly-Laplacian regulariser. The methodology is readily adapted to graphs and here we consider graph poly-Laplacian regularization in a fully supervised, non-parametric, noise corrupted, regression problem. In particular, given a dataset$$\{x_i\}_{i=1}^n$${xi}i=1nand a set of noisy labels$$\{y_i\}_{i=1}^n\subset \mathbb {R}$${yi}i=1nRwe let$$u_n{:}\{x_i\}_{i=1}^n\rightarrow \mathbb {R}$$un:{xi}i=1nRbe the minimizer of an energy which consists of a data fidelity term and an appropriately scaled graph poly-Laplacian term. When$$y_i = g(x_i)+\xi _i$$yi=g(xi)+ξi, for iid noise$$\xi _i$$ξi, and using the geometric random graph, we identify (with high probability) the rate of convergence of$$u_n$$untogin the large data limit$$n\rightarrow \infty $$n. Furthermore, our rate is close to the known rate of convergence in the usual smoothing spline model.

    more » « less