

Title: Supervised capacity preserving mapping: a clustering guided visualization method for scRNA-seq data
Abstract Motivation

The rapid development of scRNA-seq technologies enables us to explore the transcriptome at the single-cell level on a large scale. Recently, various computational methods have been developed to analyze scRNA-seq data, for tasks such as clustering and visualization. However, current visualization methods, including t-SNE and UMAP, render the geometric relationships between populations with distinct functional states with limited accuracy. Most visualization methods are unsupervised and leave out information from clustering results or given labels, which leads to an inaccurate depiction of the distances between the bona fide functional states. In particular, UMAP and t-SNE are not optimal for preserving the global geometric structure: clusters that lie close together in the embedded dimensions may in fact be far apart in the original dimensions. In addition, UMAP and t-SNE cannot track the variance of clusters. In t-SNE and UMAP embeddings, the apparent variance of a cluster reflects not only its true variance but is also proportional to its sample size.

Results

We present supCPM, a robust supervised visualization method, which separates different clusters, preserves the global structure and tracks the cluster variance. In comparisons with six visualization methods on synthetic and real datasets, supCPM outperforms the other methods in preserving the global geometric structure and data variance. Overall, supCPM provides an enhanced visualization pipeline that assists the interpretation of functional transitions and accurately depicts population segregation.
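
As a rough illustration of what "preserving the global geometric structure" can mean in practice, the sketch below compares distances between cluster centroids in the original space with the same distances in a 2D embedding. It is not the supCPM algorithm and not the evaluation metric used in the paper; the function names `centroid_distances` and `global_structure_score`, the toy data, and the choice of Pearson correlation are all assumptions made for this example.

```python
import numpy as np

def centroid_distances(points, labels):
    """Euclidean distances between all pairs of cluster centroids."""
    classes = np.unique(labels)
    centroids = np.array([points[labels == c].mean(axis=0) for c in classes])
    diff = centroids[:, None, :] - centroids[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Keep each unordered pair once (upper triangle, diagonal excluded).
    upper = np.triu_indices(len(classes), k=1)
    return dist[upper]

def global_structure_score(high_dim, embedding, labels):
    """Pearson correlation of centroid distances in the two spaces (hypothetical score)."""
    d_high = centroid_distances(high_dim, labels)
    d_low = centroid_distances(embedding, labels)
    return float(np.corrcoef(d_high, d_low)[0, 1])

# Toy data (assumed): three clusters in 10 dimensions, one far from the other two.
rng = np.random.default_rng(0)
centers = np.zeros((3, 10))
centers[1, 0] = 5.0
centers[2, 0] = 50.0
labels = np.repeat([0, 1, 2], 50)
high_dim = centers[labels] + rng.normal(scale=0.5, size=(150, 10))

# An embedding that keeps the layout: the first two coordinates.
faithful = high_dim[:, :2]

# An embedding that pulls the distant cluster in between the other two.
distorted = faithful.copy()
distorted[labels == 2, 0] -= 47.0

faithful_score = global_structure_score(high_dim, faithful, labels)
distorted_score = global_structure_score(high_dim, distorted, labels)
print(f"faithful:  {faithful_score:.3f}")
print(f"distorted: {distorted_score:.3f}")
```

In this toy setting the layout-preserving embedding scores close to 1, while the distorted embedding, in which a distant cluster is drawn next to the others, scores much lower. A score of this kind only uses cluster labels to define centroids; it says nothing about within-cluster variance, which the abstract treats as a separate concern.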

Availability and implementation

The R package and source code are available at https://zenodo.org/record/5975977#.YgqR1PXMJjM.

Supplementary information

Supplementary data are available at Bioinformatics online.

 
Award ID(s):
2107215
NSF-PAR ID:
10366634
Publisher / Repository:
Oxford University Press
Date Published:
Journal Name:
Bioinformatics
Volume:
38
Issue:
9
ISSN:
1367-4803
Format(s):
Medium: X Size: p. 2496-2503
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract The t-distributed stochastic neighbor embedding (t-SNE) method is one of the leading techniques for data visualization and clustering. This method finds a lower-dimensional embedding of data points while minimizing distortions in distances between neighboring data points. By construction, t-SNE discards information about the large-scale structure of the data. We show that adding a global cost function to the t-SNE cost function makes it possible to cluster the data while preserving global intercluster data structure. We test the new global t-SNE (g-SNE) method on one synthetic and two real data sets on flower shapes and human brain cells. We find that significant and meaningful global structure exists in both the plant and human brain data sets. In all cases, g-SNE outperforms t-SNE and UMAP in preserving the global structure. Topological analysis of the clustering result makes it possible to find an appropriate trade-off of data distribution across scales. We find differences in how data are distributed across scales between the two subjects that were part of the human brain data set. Thus, by striving to produce both accurate clustering and positioning between clusters, the g-SNE method can identify new aspects of data organization across scales.
  2. Abstract Motivation

    Single-molecule localization microscopy (SMLM) can play an important role in integrated structural biology approaches to identify, localize and determine the 3D structure of cellular structures. While many tools exist for the 3D analysis and visualization of crystal or cryo-EM structures, few exist for 3D SMLM data, which can provide unique insights but are particularly challenging to analyze in three dimensions, especially in a dense cellular context.

    Results

    We developed 3DClusterViSu, a method based on 3D Voronoi tessellations that allows local density estimation, segmentation and quantification of 3D SMLM data and visualization of protein clusters within a 3D tool. We show its robust performance on microtubules and histone proteins H2B and CENP-A with distinct spatial distributions. 3DClusterViSu will favor multi-scale and multi-resolution synergies to allow integrating molecular and cellular levels in the analysis of macromolecular complexes.

    Availability and implementation

    3DClusterViSu is available under http://cbi-dev.igbmc.fr/cbi/voronoi3D.

    Supplementary information

    Supplementary figures are available at Bioinformatics online.

     
  3. Abstract Motivation

    Cancer phylogenies are key to studying tumorigenesis and have clinical implications. Due to the heterogeneous nature of cancer and limitations in current sequencing technology, current cancer phylogeny inference methods identify a large solution space of plausible phylogenies. To facilitate further downstream analyses, methods that accurately summarize such a set T of cancer phylogenies are imperative. However, current summary methods are limited to a single consensus tree or graph and may miss important topological features that are present in different subsets of candidate trees.

    Results

    We introduce the Multiple Consensus Tree (MCT) problem to simultaneously cluster T and infer a consensus tree for each cluster. We show that MCT is NP-hard, and present an exact algorithm based on mixed integer linear programming (MILP). In addition, we introduce a heuristic algorithm that efficiently identifies high-quality consensus trees, recovering all optimal solutions identified by the MILP in simulated data at a fraction of the time. We demonstrate the applicability of our methods on both simulated and real data, showing that our approach selects the number of clusters depending on the complexity of the solution space T.

    Availability and implementation

    https://github.com/elkebir-group/MCT.

    Supplementary information

    Supplementary data are available at Bioinformatics online.

     
  4. Abstract Motivation

    As an increasing amount of protein–protein interaction (PPI) data becomes available, their computational interpretation has become an important problem in bioinformatics. The alignment of PPI networks from different species provides valuable information about conserved subnetworks, evolutionary pathways and functional orthologs. Although several methods have been proposed for global network alignment, there is a pressing need for methods that produce more accurate alignments in terms of both topological and functional consistency.

    Results

    In this work, we present a novel global network alignment algorithm, named ModuleAlign, which makes use of local topology information to define a module-based homology score. Based on a hierarchical clustering of functionally coherent proteins involved in the same module, ModuleAlign employs a novel iterative scheme to find the alignment between two networks. Evaluated on a diverse set of benchmarks, ModuleAlign outperforms state-of-the-art methods in producing functionally consistent alignments. By aligning Pathogen–Human PPI networks, ModuleAlign also detects a novel set of conserved human genes that pathogens preferentially target to cause pathogenesis.

    Availability

    http://ttic.uchicago.edu/~hashemifar/ModuleAlign.html

    Contact

    canzar@ttic.edu or j3xu@ttic.edu

    Supplementary information

    Supplementary data are available at Bioinformatics online.

     
  5. Abstract Motivation

    Multistate protein design addresses real-world challenges, such as multi-specificity design and backbone flexibility, by considering both positive and negative protein states with an ensemble of substates for each. It also presents an enormous challenge to exact algorithms that guarantee the optimal solutions and enable a direct test of mechanistic hypotheses behind models. However, efficient exact algorithms are lacking for multistate protein design.

    Results

    We have developed an efficient exact algorithm called interconnected cost function networks (iCFN) for multistate protein design. Its generic formulation allows for a wide array of applications such as stability, affinity and specificity designs while addressing concerns such as global flexibility of protein backbones. iCFN treats each substate design as a weighted constraint satisfaction problem (WCSP) modeled through a CFN; and it solves the coupled WCSPs using novel bounds and a depth-first branch-and-bound search over a tree structure of sequences, substates, and conformations. When iCFN is applied to specificity design of a T-cell receptor, a problem of unprecedented size to exact methods, it drastically reduces search space and running time to make the problem tractable. Moreover, iCFN generates experimentally-agreeing receptor designs with improved accuracy compared with state-of-the-art methods, highlights the importance of modeling backbone flexibility in protein design, and reveals molecular mechanisms underlying binding specificity.

    Availability and implementation

    https://shen-lab.github.io/software/iCFN

    Supplementary information

    Supplementary data are available at Bioinformatics online.

     