Title: JIND: joint integration and discrimination for automated single-cell annotation
Abstract Motivation

An important step in the transcriptomic analysis of individual cells is the manual determination of cellular identities. To ease this labor-intensive annotation of cell types, there has been a growing interest in automated cell annotation, which can be achieved by training classification algorithms on previously annotated datasets. Existing pipelines employ dataset integration methods to remove potential batch effects between source (annotated) and target (unannotated) datasets. However, the integration and classification steps are usually independent of each other and performed by different tools. We propose JIND (joint integration and discrimination for automated single-cell annotation), a neural-network-based framework for automated cell-type identification that performs integration in a space suitably chosen to facilitate cell classification. To account for batch effects, JIND performs a novel asymmetric alignment in which unseen cells are mapped onto the previously learned latent space, avoiding the need to retrain the classification model for new datasets. JIND also learns cell-type-specific confidence thresholds to identify cells that cannot be reliably classified.

Results

We show on several batched datasets that JIND's joint approach to integration and classification outperforms existing pipelines in accuracy, and that its cell-type-specific confidence thresholds reject a smaller fraction of cells as unlabeled. Moreover, we investigate cells misclassified by JIND and provide evidence suggesting that these errors stem from outliers in the annotated datasets or from mistakes in the original annotation of the target batch.

Availability and implementation

Implementation for JIND is available at https://github.com/mohit1997/JIND and the data underlying this article can be accessed at https://doi.org/10.5281/zenodo.6246322.

Supplementary information

Supplementary data are available at Bioinformatics online.
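
As a rough illustration of the classify-and-reject idea described above, the sketch below trains a classifier directly on a learned latent representation and derives a per-cell-type rejection threshold from correctly classified source cells. This is a minimal sketch under assumed layer sizes and an assumed quantile-based threshold rule, not the released JIND implementation (see the GitHub repository for that); the asymmetric alignment step for new batches is omitted.

```python
import torch
import torch.nn as nn

class LatentClassifier(nn.Module):
    """Encoder and classifier trained jointly, so the latent space is
    shaped for discrimination rather than chosen independently."""
    def __init__(self, n_genes, n_latent, n_types):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_genes, n_latent), nn.ReLU())
        self.classifier = nn.Linear(n_latent, n_types)

    def forward(self, x):
        return self.classifier(self.encoder(x))

def fit_type_thresholds(probs, labels, q=0.05):
    """Assumed stand-in for JIND's rule: for each cell type, take a low
    quantile of the predicted probability among correctly classified
    source cells as that type's rejection threshold."""
    thresholds = {}
    for t in labels.unique():
        mask = (labels == t) & (probs.argmax(dim=1) == t)
        if mask.any():
            thresholds[int(t)] = torch.quantile(probs[mask, t], q).item()
    return thresholds

def predict_or_reject(model, x, thresholds):
    probs = torch.softmax(model(x), dim=1)
    conf, pred = probs.max(dim=1)
    # A cell is left "unlabeled" when its confidence falls below the
    # threshold learned for its predicted type.
    rejected = torch.tensor([bool(conf[i] < thresholds.get(int(pred[i]), 1.0))
                             for i in range(len(pred))])
    return pred, rejected
```
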
Award ID(s):
2046991
NSF-PAR ID:
10324461
Editor(s):
Mathelier, Anthony
Date Published:
Journal Name:
Bioinformatics
Volume:
38
Issue:
9
ISSN:
1367-4803
Page Range / eLocation ID:
2488 to 2495
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Birol, Inanc (Ed.)
    Abstract Motivation

    Single-cell RNA sequencing (scRNA-seq) is widely used for analyzing gene expression in multi-cellular systems and provides unprecedented access to cellular heterogeneity. scRNA-seq experiments aim to identify and quantify all cell types present in a sample. Measured single-cell transcriptomes are grouped by similarity and the resulting clusters are mapped to cell types based on cluster-specific gene expression patterns. While the process of generating clusters has become largely automated, annotation remains a laborious ad hoc effort that requires expert biological knowledge.

    Results

    Here, we introduce CellMeSH, a new automated approach to identifying cell types for clusters based on prior literature. CellMeSH combines a database of gene–cell-type associations with a probabilistic method for database querying. The database is constructed by automatically linking gene and cell-type information from millions of publications using existing indexed literature resources. Compared to manually constructed databases, CellMeSH is more comprehensive and is easily updated with new data. The probabilistic query method enables reliable information retrieval even though the gene–cell-type associations extracted from the literature are noisy. CellMeSH can also optionally utilize prior knowledge about tissues or cells to further improve annotation. On a number of mouse and human datasets, CellMeSH achieves top-one and top-three accuracies that are consistently better than those of existing approaches.
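
    To make the probabilistic-query idea concrete, here is a toy sketch: given a cluster's marker genes, cell types are ranked by a smoothed log-likelihood against a gene–cell-type co-occurrence table. The scoring rule (naive-Bayes-style with additive smoothing) and the miniature database are assumptions for illustration, not CellMeSH's actual model or data.

    ```python
    import math

    # counts[cell_type][gene] = number of publications linking gene and cell type
    counts = {
        "T cell":    {"CD3D": 950, "CD3E": 900, "INS": 2},
        "B cell":    {"CD19": 800, "MS4A1": 700, "CD3D": 10},
        "Beta cell": {"INS": 1200, "IAPP": 400},
    }

    def score_cell_types(query_genes, counts, alpha=1.0):
        """Rank cell types for a cluster's marker genes. Additive smoothing
        (alpha) keeps noisy or missing literature links from zeroing scores."""
        totals = {ct: sum(g.values()) for ct, g in counts.items()}
        vocab = {g for genes in counts.values() for g in genes}
        scores = {}
        for ct, genes in counts.items():
            s = 0.0
            for g in query_genes:
                s += math.log((genes.get(g, 0) + alpha) /
                              (totals[ct] + alpha * len(vocab)))
            scores[ct] = s
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

    print(score_cell_types(["CD3D", "CD3E"], counts))  # T cell ranks first
    ```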

    Availability and implementation

    Web server at https://uncurl.cs.washington.edu/db_query and API at https://github.com/shunfumao/cellmesh.

    Supplementary information

    Supplementary data are available at Bioinformatics online.

     
  2. Elofsson, Arne (Ed.)
    Abstract Motivation

    Cryoelectron tomography (cryo-ET) visualizes the structure and spatial organization of macromolecules and their interactions with other subcellular components inside single cells, in a close-to-native state and at submolecular resolution. Such information is critical for the accurate understanding of cellular processes. However, subtomogram classification remains one of the major challenges for the systematic recognition and recovery of macromolecule structures in cryo-ET because of imaging limits and data quantity. Recently, deep learning has significantly improved the throughput and accuracy of large-scale subtomogram classification. However, it is often difficult to get enough high-quality annotated subtomogram data for supervised training due to the enormous expense of labeling. To tackle this problem, it is beneficial to utilize another, already annotated dataset to assist the training process. However, due to the discrepancy in image intensity distribution between the source and target domains, a model trained on subtomograms in the source domain may perform poorly at predicting subtomogram classes in the target domain.

    Results

    In this article, we adapt a few-shot domain adaptation method for deep learning-based cross-domain subtomogram classification. The essential idea of our method consists of two parts: (i) take full advantage of the distribution of plentiful unlabeled target-domain data, and (ii) exploit the correlation between the whole source-domain dataset and the few labeled target-domain data. Experiments conducted on simulated and real datasets show that our method achieves significant improvement in cross-domain subtomogram classification compared with baseline methods.

    Availability and implementation

    Software is available online at https://github.com/xulabs/aitom.

    Supplementary information

    Supplementary data are available at Bioinformatics online.
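
    The two ingredients named above, supervision from the few labeled target subtomograms and distributional use of the plentiful unlabeled ones, can be sketched as a single training step. The feature-mean alignment penalty below (a linear-kernel MMD) and all sizes are assumptions for illustration; the paper's actual adaptation loss may differ.

    ```python
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    feat = nn.Sequential(nn.Flatten(), nn.Linear(32**3, 128), nn.ReLU())  # toy 32^3 volumes
    head = nn.Linear(128, 5)                                              # 5 macromolecule classes
    opt = torch.optim.Adam(list(feat.parameters()) + list(head.parameters()), lr=1e-3)

    def train_step(x_src, y_src, x_tgt_few, y_tgt_few, x_tgt_unlab, lam=0.1):
        f_src, f_few, f_unlab = feat(x_src), feat(x_tgt_few), feat(x_tgt_unlab)
        # (i) supervised loss on source plus the few labeled target subtomograms
        loss_cls = F.cross_entropy(head(f_src), y_src) + \
                   F.cross_entropy(head(f_few), y_tgt_few)
        # (ii) pull source and unlabeled-target feature distributions together
        loss_align = (f_src.mean(0) - f_unlab.mean(0)).pow(2).sum()
        loss = loss_cls + lam * loss_align
        opt.zero_grad(); loss.backward(); opt.step()
        return loss.item()
    ```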
  3. Abstract Motivation

    Single-cell RNA sequencing (scRNAseq) technologies allow for measurements of gene expression at a single-cell resolution. This provides researchers with a tremendous advantage for detecting heterogeneity, delineating cellular maps or identifying rare subpopulations. However, a critical complication remains: the low number of single-cell observations, due to limitations imposed by subpopulation rarity, tissue degradation or cost. This lack of sufficient data can make downstream analysis inaccurate or irreproducible. In this work, we present the Automated Cell-Type-informed Introspective Variational Autoencoder (ACTIVA): a novel framework for generating realistic synthetic data using a single-stream adversarial variational autoencoder conditioned with cell-type information. Within a single framework, ACTIVA can enlarge existing datasets and generate specific subpopulations on demand, as opposed to requiring two separate models [such as single-cell GAN (scGAN) and conditional scGAN (cscGAN)]. Data generation and augmentation with ACTIVA can enhance scRNAseq pipelines and analysis, such as benchmarking new algorithms, studying the accuracy of classifiers and detecting marker genes. ACTIVA will facilitate analysis of smaller datasets, potentially reducing the number of patients and animals necessary in initial studies.
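
    A bare-bones sketch of the conditioning idea described above: a VAE whose encoder and decoder both receive a one-hot cell-type label, so one model can augment a whole dataset or generate a chosen subpopulation on demand. ACTIVA itself is an introspective adversarial VAE; the adversarial component is omitted here, and all sizes and names are assumptions.

    ```python
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CondVAE(nn.Module):
        def __init__(self, n_genes, n_types, n_latent=32):
            super().__init__()
            self.enc = nn.Linear(n_genes + n_types, 2 * n_latent)  # -> (mu, logvar)
            self.dec = nn.Linear(n_latent + n_types, n_genes)
            self.n_latent = n_latent

        def forward(self, x, y_onehot):
            mu, logvar = self.enc(torch.cat([x, y_onehot], 1)).chunk(2, dim=1)
            z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterize
            recon = self.dec(torch.cat([z, y_onehot], 1))
            kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(1).mean()
            return F.mse_loss(recon, x) + kl                        # training loss

        @torch.no_grad()
        def sample(self, y_onehot):
            """Generate synthetic cells of the requested type(s)."""
            z = torch.randn(y_onehot.size(0), self.n_latent)
            return self.dec(torch.cat([z, y_onehot], 1))
    ```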

    Results

    We train and evaluate models on multiple public scRNAseq datasets. In comparison to GAN-based models (scGAN and cscGAN), we demonstrate that ACTIVA generates cells that are more realistic, harder for classifiers to identify as synthetic, and better at preserving pair-wise correlations between genes. Data augmentation with ACTIVA significantly improves classification of rare subtypes (more than 45% improvement compared with not augmenting, and 4% better than cscGAN), all while reducing run-time by an order of magnitude in comparison to both models.

    Availability and implementation

    The codes and datasets are hosted on Zenodo (https://doi.org/10.5281/zenodo.5879639). Tutorials are available at https://github.com/SindiLab/ACTIVA.

    Supplementary information

    Supplementary data are available at Bioinformatics online.

     
  4. Abstract

    Large-scale scRNA-seq studies typically generate data in batches, which often induce nontrivial batch effects that need to be corrected. Given the global effort to build cell atlases and the growing number of annotated scRNA-seq datasets, we propose a supervised strategy for scRNA-seq data integration called SIDA (Supervised Integration using Domain Adaptation), which uses cell-type annotations to guide the integration of diverse batches. The supervised strategy is based on domain adaptation, initially proposed in the computer vision field. We demonstrate that SIDA is able to generate comprehensive reference datasets that lead to improved accuracy in automated cell-type mapping analyses.
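
    A minimal sketch of the computer-vision technique the abstract points to, domain-adversarial training (Ganin et al.), recast for batches: the encoder learns to predict the annotated cell type while a gradient-reversed head tries to predict the batch of origin. Whether SIDA uses exactly this adversarial form is an assumption here, as are all sizes and names.

    ```python
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class GradReverse(torch.autograd.Function):
        @staticmethod
        def forward(ctx, x):
            return x
        @staticmethod
        def backward(ctx, g):
            return -g  # flip gradients so the encoder *confuses* the batch critic

    enc = nn.Sequential(nn.Linear(2000, 64), nn.ReLU())  # 2000 genes (toy)
    type_head = nn.Linear(64, 10)    # supervised: predict annotated cell type
    batch_head = nn.Linear(64, 4)    # adversarial: predict batch of origin
    params = (list(enc.parameters()) + list(type_head.parameters())
              + list(batch_head.parameters()))
    opt = torch.optim.Adam(params, lr=1e-3)

    def step(x, cell_type, batch_id):
        h = enc(x)
        loss = F.cross_entropy(type_head(h), cell_type) \
             + F.cross_entropy(batch_head(GradReverse.apply(h)), batch_id)
        opt.zero_grad(); loss.backward(); opt.step()
        return h  # batch-mixed, cell-type-separated embedding
    ```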

     
    This work investigates how different forms of input elicitation obtained from crowdsourcing can be utilized to improve the quality of inferred labels for image classification tasks, where an image must be labeled as either positive or negative depending on the presence/absence of a specified object. Five types of input elicitation methods are tested: binary classification (positive or negative); the (x, y)-coordinate of the position where participants believe a target object is located; level of confidence in the binary response (on a scale from 0 to 100%); what participants believe the majority of the other participants' binary classification is; and participants' perceived difficulty level of the task (on a discrete scale). We design two crowdsourcing studies to test the performance of a variety of input elicitation methods and utilize data from over 300 participants. Various existing voting and machine learning (ML) methods are applied to make the best use of these inputs. To assess their performance on classification tasks of varying difficulty, a systematic synthetic image generation process is developed. Each generated image combines items from the MPEG-7 Core Experiment CE-Shape-1 Test Set into a single image using multiple parameters (e.g., density, transparency) and may or may not contain a target object. The difficulty of these images is validated by the performance of an automated image classification method. Experimental results suggest that more accurate results can be achieved with smaller training datasets when both the crowdsourced binary classification labels and the average of the self-reported confidence values in these labels are used as features for the ML classifiers. Moreover, when a relatively larger, properly annotated dataset is available, augmenting these ML algorithms with the results (i.e., probability of outcome) from an automated classifier can in some cases achieve even higher performance than any one of the individual classifiers. Lastly, supplementary analysis of the collected data demonstrates that other performance metrics of interest, namely reduced false-negative rates, can be prioritized through special modifications of the proposed aggregation methods.
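
    As a toy illustration of the feature idea above, the simulation below combines each image's mean binary vote with the mean self-reported confidence and feeds both to a standard classifier. The synthetic worker model and the feature choice are invented for illustration; they are not the study's data or pipeline.

    ```python
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    n_images, n_workers = 200, 7
    truth = rng.integers(0, 2, n_images)
    # Simulated crowd: workers are right ~80% of the time, and report
    # higher confidence (0-100 scale) when they happen to be correct.
    votes = np.where(rng.random((n_images, n_workers)) < 0.8,
                     truth[:, None], 1 - truth[:, None])
    conf = np.where(votes == truth[:, None],
                    rng.uniform(60, 100, (n_images, n_workers)),
                    rng.uniform(20, 70, (n_images, n_workers)))

    # Per-image features: mean binary vote and mean self-reported confidence.
    X = np.column_stack([votes.mean(1), conf.mean(1) / 100.0])
    clf = LogisticRegression().fit(X[:100], truth[:100])  # small training set
    print("held-out accuracy:", clf.score(X[100:], truth[100:]))
    ```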