skip to main content

Title: Optimal tuning of weighted kNN- and diffusion-based methods for denoising single cell genomics data
The analysis of single-cell genomics data presents several statistical challenges, and extensive efforts have been made to produce methods for the analysis of this data that impute missing values, address sampling issues and quantify and correct for noise. In spite of such efforts, no consensus on best practices has been established and all current approaches vary substantially based on the available data and empirical tests. The k-Nearest Neighbor Graph (kNN-G) is often used to infer the identities of, and relationships between, cells and is the basis of many widely used dimensionality-reduction and projection methods. The kNN-G has also been the basis for imputation methods using, e.g ., neighbor averaging and graph diffusion. However, due to the lack of an agreed-upon optimal objective function for choosing hyperparameters, these methods tend to oversmooth data, thereby resulting in a loss of information with regard to cell identity and the specific gene-to-gene patterns underlying regulatory mechanisms. In this paper, we investigate the tuning of kNN- and diffusion-based denoising methods with a novel non-stochastic method for optimally preserving biologically relevant informative variance in single-cell data. The framework, Denoising Expression data with a Weighted Affinity Kernel and Self-Supervision (DEWÄKSS), uses a self-supervised technique to tune its parameters. We demonstrate that denoising with optimal parameters selected by our objective function (i) is robust to preprocessing methods using data from established benchmarks, (ii) disentangles cellular identity and maintains robust clusters over dimension-reduction methods, (iii) maintains variance along several expression dimensions, unlike previous heuristic-based methods that tend to oversmooth data variance, and (iv) rarely involves diffusion but rather uses a fixed weighted kNN graph for denoising. Together, these findings provide a new understanding of kNN- and diffusion-based denoising methods. Code and example data for DEWÄKSS is available at .  more » « less
Award ID(s):
Author(s) / Creator(s):
; ; ; ; ; ;
Nie, Qing
Date Published:
Journal Name:
PLOS Computational Biology
Page Range / eLocation ID:
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract Background

    A cell exhibits a variety of responses to internal and external cues. These responses are possible, in part, due to the presence of an elaborate gene regulatory network (GRN) in every single cell. In the past 20 years, many groups worked on reconstructing the topological structure of GRNs from large-scale gene expression data using a variety of inference algorithms. Insights gained about participating players in GRNs may ultimately lead to therapeutic benefits. Mutual information (MI) is a widely used metric within this inference/reconstruction pipeline as it can detect any correlation (linear and non-linear) between any number of variables (n-dimensions). However, the use of MI with continuous data (for example, normalized fluorescence intensity measurement of gene expression levels) is sensitive to data size, correlation strength and underlying distributions, and often requires laborious and, at times, ad hoc optimization.


    In this work, we first show that estimating MI of a bi- and tri-variate Gaussian distribution usingk-nearest neighbor (kNN) MI estimation results in significant error reduction as compared to commonly used methods based on fixed binning. Second, we demonstrate that implementing the MI-based kNN Kraskov–Stoögbauer–Grassberger (KSG) algorithm leads to a significant improvement in GRN reconstruction for popular inference algorithms, such as Context Likelihood of Relatedness (CLR). Finally, through extensive in-silico benchmarking we show that a new inference algorithm CMIA (Conditional Mutual Information Augmentation), inspired by CLR, in combination with the KSG-MI estimator, outperforms commonly used methods.


    Using three canonical datasets containing 15 synthetic networks, the newly developed method for GRN reconstruction—which combines CMIA, and the KSG-MI estimator—achieves an improvement of 20–35% in precision-recall measures over the current gold standard in the field. This new method will enable researchers to discover new gene interactions or better choose gene candidates for experimental validations.

    more » « less
  2. Abstract

    With the aim of analyzing large-sized multidimensional single-cell datasets, we are describing a method for Cosine-based Tanimoto similarity-refined graph for community detection using Leiden’s algorithm (CosTaL). As a graph-based clustering method, CosTaL transforms the cells with high-dimensional features into a weighted k-nearest-neighbor (kNN) graph. The cells are represented by the vertices of the graph, while an edge between two vertices in the graph represents the close relatedness between the two cells. Specifically, CosTaL builds an exact kNN graph using cosine similarity and uses the Tanimoto coefficient as the refining strategy to re-weight the edges in order to improve the effectiveness of clustering. We demonstrate that CosTaL generally achieves equivalent or higher effectiveness scores on seven benchmark cytometry datasets and six single-cell RNA-sequencing datasets using six different evaluation metrics, compared with other state-of-the-art graph-based clustering methods, including PhenoGraph, Scanpy and PARC. As indicated by the combined evaluation metrics, Costal has high efficiency with small datasets and acceptable scalability for large datasets, which is beneficial for large-scale analysis.

    more » « less
  3. Abstract Kernelized Gram matrix $W$ constructed from data points $\{x_i\}_{i=1}^N$ as $W_{ij}= k_0( \frac{ \| x_i - x_j \|^2} {\sigma ^2} ) $ is widely used in graph-based geometric data analysis and unsupervised learning. An important question is how to choose the kernel bandwidth $\sigma $, and a common practice called self-tuned kernel adaptively sets a $\sigma _i$ at each point $x_i$ by the $k$-nearest neighbor (kNN) distance. When $x_i$s are sampled from a $d$-dimensional manifold embedded in a possibly high-dimensional space, unlike with fixed-bandwidth kernels, theoretical results of graph Laplacian convergence with self-tuned kernels have been incomplete. This paper proves the convergence of graph Laplacian operator $L_N$ to manifold (weighted-)Laplacian for a new family of kNN self-tuned kernels $W^{(\alpha )}_{ij} = k_0( \frac{ \| x_i - x_j \|^2}{ \epsilon \hat{\rho }(x_i) \hat{\rho }(x_j)})/\hat{\rho }(x_i)^\alpha \hat{\rho }(x_j)^\alpha $, where $\hat{\rho }$ is the estimated bandwidth function by kNN and the limiting operator is also parametrized by $\alpha $. When $\alpha = 1$, the limiting operator is the weighted manifold Laplacian $\varDelta _p$. Specifically, we prove the point-wise convergence of $L_N f $ and convergence of the graph Dirichlet form with rates. Our analysis is based on first establishing a $C^0$ consistency for $\hat{\rho }$ which bounds the relative estimation error $|\hat{\rho } - \bar{\rho }|/\bar{\rho }$ uniformly with high probability, where $\bar{\rho } = p^{-1/d}$ and $p$ is the data density function. Our theoretical results reveal the advantage of the self-tuned kernel over the fixed-bandwidth kernel via smaller variance error in low-density regions. In the algorithm, no prior knowledge of $d$ or data density is needed. The theoretical results are supported by numerical experiments on simulated data and hand-written digit image data. 
    more » « less
  4. Background Cellulolytic, hemicellulolytic, and amylolytic (CHA) enzyme-producing halophiles are understudied. The recently defined taxon Iocasia fonsfrigidae consists of one well-described anaerobic bacterial strain: NS-1 T . Prior to characterization of strain NS-1 T , an isolate designated Halocella sp. SP3-1 was isolated and its genome was published. Based on physiological and genetic comparisons, it was suggested that Halocella sp. SP3-1 may be another isolate of I. fronsfrigidae . Despite being geographic variants of the same species, data indicate that strain SP3-1 exhibits genetic, genomic, and physiological characteristics that distinguish it from strain NS-1 T . In this study, we examine the halophilic and alkaliphilic nature of strain SP3-1 and the genetic substrates underlying phenotypic differences between strains SP3-1 and NS-1 T with focus on sugar metabolism and CHA enzyme expression. Methods Standard methods in anaerobic cell culture were used to grow strains SP3-1 as well as other comparator species. Morphological characterization was done via electron microscopy and Schaeffer-Fulton staining. Data for sequence comparisons ( e.g. , 16S rRNA) were retrieved via BLAST and EzBioCloud. Alignments and phylogenetic trees were generated via CLUTAL_X and neighbor joining functions in MEGA (version 11). Genomes were assembled/annotated via the Prokka annotation pipeline. Clusters of Orthologous Groups (COGs) were defined by eegNOG 4.5. DNA-DNA hybridization calculations were performed by the ANI Calculator web service. Results Cells of strain SP3-1 are rods. SP3-1 cells grow at NaCl concentrations of 5-30% (w/v). Optimal growth occurs at 37 °C, pH 8.0, and 20% NaCl (w/v). Although phylogenetic analysis based on 16S rRNA gene indicates that strain SP3-1 belongs to the genus Iocasia with 99.58% average nucleotide sequence identity to Iocasia fonsfrigida NS-1 T , strain SP3-1 is uniquely an extreme haloalkaliphile. Moreover, strain SP3-1 ferments D-glucose to acetate, butyrate, carbon dioxide, hydrogen, ethanol, and butanol and will grow on L-arabinose, D-fructose, D-galactose, D-glucose, D-mannose, D-raffinose, D-xylose, cellobiose, lactose, maltose, sucrose, starch, xylan and phosphoric acid swollen cellulose (PASC). D-rhamnose, alginate, and lignin do not serve as suitable culture substrates for strain SP3-1. Thus, the carbon utilization profile of strain SP3-1 differs from that of I. fronsfrigidae strain NS-1 T . Differences between these two strains are also noted in their lipid composition. Genomic data reveal key differences between the genetic profiles of strain SP3-1 and NS-1 T that likely account for differences in morphology, sugar metabolism, and CHA-enzyme potential. Important to this study, I. fonsfrigidae SP3-1 produces and extracellularly secretes CHA enzymes at different levels and composition than type strain NS-1 T . The high salt tolerance and pH range of SP3-1 makes it an ideal candidate for salt and pH tolerant enzyme discovery. 
    more » « less
  5. Simulation of flow and transport in petroleum reservoirs involves solving coupled systems of advection-diffusion-reaction equations with nonlinear flux functions, diffusion coefficients, and reactions/wells. It is important to develop numerical schemes that can approximate all three processes at once, and to high order, so that the physics can be well resolved. In this paper, we propose an approach based on high order, finite volume, implicit, Weighted Essentially NonOscillatory (iWENO) schemes. The resulting schemes are locally mass conservative and, being implicit, suited to systems of advection-diffusion-reaction equations. Moreover, our approach gives unconditionally L-stable schemes for smooth solutions to the linear advection-diffusion-reaction equation in the sense of a von Neumann stability analysis. To illustrate our approach, we develop a third order iWENO scheme for the saturation equation of two-phase flow in porous media in two space dimensions. The keys to high order accuracy are to use WENO reconstruction in space (which handles shocks and steep fronts) combined with a two-stage Radau-IIA Runge-Kutta time integrator. The saturation is approximated by its averages over the mesh elements at the current time level and at two future time levels; therefore, the scheme uses two unknowns per grid block per variable, independent of the spatial dimension. This makes the scheme fairly computationally efficient, both because reconstructions make use of local information that can fit in cache memory, and because the global system has about as small a number of degrees of freedom as possible. The scheme is relatively simple to implement, high order accurate, maintains local mass conservation, applies to general computational meshes, and appears to be robust. Preliminary computational tests show the potential of the scheme to handle advection-diffusion-reaction processes on meshes of quadrilateral gridblocks, and to do so to high order accuracy using relatively long time steps. The new scheme can be viewed as a generalization of standard cell-centered finite volume (or finite difference) methods. It achieves high order in both space and time, and it incorporates WENO slope limiting. 
    more » « less