Title: All-in-SAM: from Weak Annotation to Pixel-wise Nuclei Segmentation with Prompt-based Finetuning
The Segment Anything Model (SAM) is a recently proposed prompt-based model for generic zero-shot segmentation. With its zero-shot capacity, SAM has achieved impressive flexibility and precision on various segmentation tasks. However, the current pipeline requires manual prompts during the inference stage, which is still resource intensive for biomedical image segmentation. In this paper, we introduce a pipeline, called all-in-SAM, that utilizes SAM throughout the entire AI development workflow (from annotation generation to model finetuning) without requiring manual prompts during the inference stage. Specifically, SAM is first employed to generate pixel-level annotations from weak prompts (e.g., points, bounding boxes). Then, the pixel-level annotations are used to finetune the SAM segmentation model rather than training from scratch. Our experimental results reveal two key findings: 1) the proposed pipeline surpasses the state-of-the-art (SOTA) methods in a nuclei segmentation task on the public MoNuSeg dataset, and 2) the utilization of weak and few annotations for SAM finetuning achieves competitive performance compared to using strong pixel-wise annotated data.
Award ID(s):
2040462
NSF-PAR ID:
10447850
Journal Name:
Asia Conference on Computers and Communications, ACCC
Sponsoring Org:
National Science Foundation
More Like this
  1. The Segment Anything Model (SAM) was released as a foundation model for image segmentation. The promptable segmentation model was trained on over 1 billion masks from 11M licensed and privacy-respecting images. The model supports zero-shot image segmentation with various segmentation prompts (e.g., points, boxes, masks). This makes SAM attractive for medical image analysis, especially for digital pathology, where training data are rare. In this study, we evaluate the zero-shot segmentation performance of the SAM model on representative segmentation tasks in whole slide imaging (WSI), including (1) tumor segmentation, (2) non-tumor tissue segmentation, and (3) cell nuclei segmentation. Core Results: The results suggest that the zero-shot SAM model achieves remarkable segmentation performance for large connected objects. However, it does not consistently achieve satisfactory performance for dense instance object segmentation, even with 20 prompts (clicks/boxes) on each image. We also summarize the identified limitations for digital pathology: (1) image resolution, (2) multiple scales, (3) prompt selection, and (4) model fine-tuning. In the future, few-shot fine-tuning with images from downstream pathological segmentation tasks might help the model achieve better performance in dense object segmentation.
  2. Nuclei segmentation is a fundamental task in histopathological image analysis. Typically, such segmentation tasks require significant effort to manually generate pixel-wise annotations for fully supervised training. To alleviate the manual effort, in this paper we propose a novel approach using points-only annotation. Two types of coarse labels with complementary information are derived from the point annotations and are then utilized to train a deep neural network. The fully-connected conditional random field loss is utilized to further refine the model without introducing extra computational complexity during inference. Experimental results on two nuclei segmentation datasets reveal that the proposed method is able to achieve competitive performance compared to the fully supervised counterpart and the state-of-the-art methods while requiring significantly less annotation effort. Our code is publicly available.
  3. We introduce VISOR, a new dataset of pixel annotations and a benchmark suite for segmenting hands and active objects in egocentric video. VISOR annotates videos from EPIC-KITCHENS, which comes with a new set of challenges not encountered in current video segmentation datasets. Specifically, we need to ensure both short- and long-term consistency of pixel-level annotations as objects undergo transformative interactions, e.g. an onion is peeled, diced and cooked - where we aim to obtain accurate pixel-level annotations of the peel, onion pieces, chopping board, knife, pan, as well as the acting hands. VISOR introduces an annotation pipeline, AI-powered in parts, for scalability and quality. In total, we publicly release 272K manual semantic masks of 257 object classes, 9.9M interpolated dense masks, 67K hand-object relations, covering 36 hours of 179 untrimmed videos. Along with the annotations, we introduce three challenges in video object segmentation, interaction understanding and long-term reasoning. For data, code and leaderboards: http://epic-kitchens.github.io/VISOR 
  4. This paper presents a semi-supervised framework for multi-level description learning, aiming for robust and accurate camera relocalization across large perception variations. Our proposed network, namely DLSSNet, simultaneously learns weakly-supervised semantic segmentation and local feature description in the hierarchy. Therefore, the augmented descriptors, trained in an end-to-end manner, provide a more stable high-level representation for local feature disambiguation. To facilitate end-to-end semantic description learning, the descriptor segmentation module is proposed to jointly learn semantic descriptors and cluster centers using a standard semantic segmentation loss. We show that our model can be easily fine-tuned for domain-specific usage without any further semantic annotations, requiring instead only 2D-2D pixel correspondences. The learned descriptors, trained with our proposed pipeline, can boost cross-season localization performance compared with other state-of-the-art methods.
  5. Fine-scale sea ice conditions are key to our efforts to understand and model climate change. We propose the first deep learning pipeline to extract fine-scale sea ice layers from high-resolution satellite imagery (WorldView-3). Extracting sea ice from imagery is often challenging due to the potentially complex texture of older ice floes (i.e., floating chunks of sea ice) and surrounding slush ice, which makes ice floes less distinct from the surrounding water. We propose a pipeline using a U-Net variant with a ResNet encoder to retrieve ice floe pixel masks from very-high-resolution multispectral satellite imagery. Even with a modest-sized hand-labeled training set and the most basic hyperparameter choices, our CNN-based approach attains an out-of-sample F1 score of 0.698, a nearly 60% improvement over a watershed segmentation baseline. We then supplement our training set with a much larger sample of images weakly labeled by a watershed segmentation algorithm. To ensure that the watershed-derived pack-ice masks were a good representation of the underlying images, we created a synthetic version of each weakly labeled image, in which areas outside the mask are replaced by open water scenery. Adding our synthetic image dataset, obtained at minimal effort compared with hand-labeling, further improves the out-of-sample F1 score to 0.734. Finally, during model selection we use an ensemble of four test metrics, evaluated after mosaicking outputs for entire scenes to mimic the production setting, reaching an out-of-sample F1 score of 0.753. Our fully automated pipeline is capable of detecting, monitoring, and segmenting ice floes at a very fine level of detail, and provides a roadmap for other use cases where partial results can be obtained with threshold-based methods but a context-robust segmentation pipeline is desired.