skip to main content

This content will become publicly available on June 21, 2023

Title: Instance Segmentation with Mask-supervised Polygonal Regression Transformers
In this paper, we present an end-to-end instance segmentation method that regresses a polygonal boundary for each object instance. This sparse, vectorized boundary representation for objects, while attractive in many downstream computer vision tasks, quickly runs into issues of parity that need to be addressed: parity in supervision and parity in performance when compared to existing pixel-based methods. This is due in part to object instances being annotated with ground-truth in the form of polygonal boundaries or segmentation masks, yet being evaluated in a convenient manner using only segmentation masks. Our method, BoundaryFormer, is a Transformer based architecture that directly predicts polygons yet uses instance mask segmentations as the ground-truth supervision for computing the loss. We achieve this by developing an end-to-end differentiable model that solely relies on supervision within the mask space through differentiable rasterization. BoundaryFormer matches or surpasses the Mask R-CNN method in terms of instance segmentation quality on both COCO and Cityscapes while exhibiting significantly better transferability across datasets.
Authors:
; ;
Award ID(s):
2127544
Publication Date:
NSF-PAR ID:
10350146
Journal Name:
IEEE Computer Society Conference on Computer Vision and Pattern Recognition
ISSN:
2332-564X
Sponsoring Org:
National Science Foundation
More Like this
  1. Monocular 3D object parsing is highly desirable in various scenarios including occlusion reasoning and holistic scene interpretation. We present a deep convolutional neural network (CNN) architecture to localize semantic parts in 2D image and 3D space while inferring their visibility states, given a single RGB image. Our key insight is to exploit domain knowledge to regularize the network by deeply supervising its hidden layers, in order to sequentially infer intermediate concepts associated with the final task. To acquire training data in desired quantities with ground truth 3D shape and relevant concepts, we render 3D object CAD models to generate large-scale synthetic data and simulate challenging occlusion configurations between objects. We train the network only on synthetic data and demonstrate state-of-the-art performances on real image benchmarks including an extended version of KITTI, PASCAL VOC, PASCAL3D+ and IKEA for 2D and 3D keypoint localization and instance segmentation. The empirical results substantiate the utility of our deep supervision scheme by demonstrating effective transfer of knowledge from synthetic data to real images, resulting in less overfitting compared to standard end-to-end training.
  2. Panoptic segmentation requires segments of both “things” (countable object instances) and “stuff” (uncountable and amorphous regions) within a single output. A common approach involves the fusion of instance segmentation (for “things”) and semantic segmentation (for “stuff”) into a non-overlapping placement of segments, and resolves overlaps. However, instance ordering with detection confidence do not correlate well with natural occlusion relationship. To resolve this issue, we propose a branch that is tasked with modeling how two instance masks should overlap one another as a binary relation. Our method, named OCFusion, is lightweight but particularly effective in the instance fusion process. OCFusion is trained with the ground truth relation derived automatically from the existing dataset annotations. We obtain state-of-the-art results on COCO and show competitive results on the Cityscapes panoptic segmentation benchmark.
  3. ABSTRACT

    We apply a new deep learning technique to detect, classify, and deblend sources in multiband astronomical images. We train and evaluate the performance of an artificial neural network built on the Mask Region-based Convolutional Neural Network image processing framework, a general code for efficient object detection, classification, and instance segmentation. After evaluating the performance of our network against simulated ground truth images for star and galaxy classes, we find a precision of 92 per cent at 80 per cent recall for stars and a precision of 98 per cent at 80 per cent recall for galaxies in a typical field with ∼30 galaxies arcmin−2. We investigate the deblending capability of our code, and find that clean deblends are handled robustly during object masking, even for significantly blended sources. This technique, or extensions using similar network architectures, may be applied to current and future deep imaging surveys such as Large Synoptic Survey Telescope and Wide-Field Infrared Survey Telescope. Our code, astro r-cnn, is publicly available at https://github.com/burke86/astro_rcnn.

  4. Time lapse microscopy is essential for quantifying the dynamics of cells, subcellular organelles and biomolecules. Biologists use different fluorescent tags to label and track the subcellular structures and biomolecules within cells. However, not all of them are compatible with time lapse imaging, and the labeling itself can perturb the cells in undesirable ways. We hypothesized that phase image has the requisite information to identify and track nuclei within cells. By utilizing both traditional blob detection to generate binary mask labels from the stained channel images and the deep learning Mask RCNN model to train a detection and segmentation model, we managed to segment nuclei based only on phase images. The detection average precision is 0.82 when the IoU threshold is to be set 0.5. And the mean IoU for masks generated from phase images and ground truth masks from experts is 0.735. Without any ground truth mask labels during the training time, this is good enough to prove our hypothesis. This result enables the ability to detect nuclei without the need for exogenous labeling.
  5. In this paper, we introduce a practical system for interactive video object mask annotation, which can support multiple back-end methods. To demonstrate the generalization of our system, we introduce a novel approach for video object annotation. Our proposed system takes scribbles at a chosen key-frame from the end-users via a user-friendly interface and produces masks of corresponding objects at the key-frame via the Control-Point-based Scribbles-to-Mask (CPSM) module. The object masks at the key-frame are then propagated to other frames and refined through the Multi-Referenced Guided Segmentation (MRGS) module. Last but not least, the user can correct wrong segmentation at some frames, and the corrected mask is continuously propagated to other frames in the video via the MRGS to produce the object masks at all video frames.