skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: A High-Resolution Dataset for Instance Detection with Multi-View Instance Capture
Instance detection (InsDet) is a long-lasting problem in robotics and computer vision, aiming to detect object instances (predefined by some visual examples) in a cluttered scene. Despite its practical significance, its advancement is overshadowed by Object Detection, which aims to detect objects belonging to some predefined classes. One major reason is that current InsDet datasets are too small in scale by today's standards. For example, the popular InsDet dataset GMU (published in 2016) has only 23 instances, far less than COCO (80 classes), a well-known object detection dataset published in 2014. We are motivated to introduce a new InsDet dataset and protocol. First, we define a realistic setup for InsDet: training data consists of multi-view instance captures, along with diverse scene images allowing synthesizing training images by pasting instance images on them with free box annotations. Second, we release a real-world database, which contains multi-view capture of 100 object instances, and high-resolution (6k\texttimes{} 8k) testing images. Third, we extensively study baseline methods for InsDet on our dataset, analyze their performance and suggest future work. Somewhat surprisingly, using the off-the-shelf class-agnostic segmentation model (Segment Anything Model, SAM) and the self-supervised feature representation DINOv2 performs the best, achieving >10 AP better than end-to-end trained InsDet models that repurpose object detectors (e.g., FasterRCNN and RetinaNet).  more » « less
Award ID(s):
2213842
PAR ID:
10533485
Author(s) / Creator(s):
; ; ; ; ;
Publisher / Repository:
Neural Information Processing Systems Foundation, Inc. (NeurIPS)
Date Published:
ISSN:
1049-5258
ISBN:
9781713871088
Format(s):
Medium: X
Location:
New Orleans, LA, USA
Sponsoring Org:
National Science Foundation
More Like this
  1. One fundamental challenge in building an instance segmen- tation model for a large number of classes in complex scenes is the lack of training examples, especially for rare objects. In this paper, we ex- plore the possibility to increase the training examples without laborious data collection and annotation. We find that an abundance of instance segments can potentially be obtained freely from object-centric images, according to two insights: (i) an object-centric image usually contains one salient object in a simple background; (ii) objects from the same class often share similar appearances or similar contrasts to the background. Motivated by these insights, we propose a simple and scalable frame- work FreeSeg for extracting and leveraging these “free” object fore- ground segments to facilitate model training in long-tailed instance seg- mentation. Concretely, we investigate the similarity among object-centric images of the same class to propose candidate segments of foreground instances, followed by a novel ranking of segment quality. The resulting high-quality object segments can then be used to augment the exist- ing long-tailed datasets, e.g., by copying and pasting the segments onto the original training images. Extensive experiments show that FreeSeg yields substantial improvements on top of strong baselines and achieves state-of-the-art accuracy for segmenting rare object categories. Our code is publicly available at https://github.com/czhang0528/FreeSeg. 
    more » « less
  2. Vanilla models for object detection and instance segmentation suffer from the heavy bias toward detecting frequent objects in the long-tailed setting. Existing methods address this issue mostly during training, e.g., by re-sampling or re- weighting. In this paper, we investigate a largely overlooked approach — post- processing calibration of confidence scores. We propose NORCAL, Normalized Calibration for long-tailed object detection and instance segmentation, a simple and straightforward recipe that reweighs the predicted scores of each class by its training sample size. We show that separately handling the background class and normalizing the scores over classes for each proposal are keys to achieving superior performance. On the LVIS dataset, NORCAL can effectively improve nearly all the baseline models not only on rare classes but also on common and frequent classes. Finally, we conduct extensive analysis and ablation studies to offer insights into various modeling choices and mechanisms of our approach. Our code is publicly available at https://github.com/tydpan/NorCal. 
    more » « less
  3. This paper presents a unified framework to learn to quantify perceptual attributes (e.g., safety, attractiveness) of physical urban environments using crowd-sourced street-view photos without human annotations. The efforts of this work include two folds. First, we collect a large-scale urban image dataset in multiple major cities in U.S.A., which consists of multiple street-view photos for every place. Instead of using subjective annotations as in previous works, which are neither accurate nor consistent, we collect for every place the safety score from government’s crime event records as objective safety indicators. Second, we observe that the place-centric perception task is by nature a multi-instance regression problem since the labels are only available for places (bags), rather than images or image regions (instances). We thus introduce a deep convolutional neural network (CNN) to parameterize the instance-level scoring function, and develop an EM algorithm to alternatively estimate the primary instances (images or image regions) which affect the safety scores and train the proposed network. Our method is capable of localizing interesting images and image regions for each place.We evaluate the proposed method on a newly created dataset and a public dataset. Results with comparisons showed that our method can clearly outperform the alternative perception methods and more importantly, is capable of generating region-level safety scores to facilitate interpretations of the perception process. 
    more » « less
  4. In this paper, we explore the possibility to increase the training examples without laborious data collection and annotation for long-tailed instance segmentation. We find that an abundance of instance segments can potentially be obtained freely from object-centric images, according to two insights: (i) an object-centric image usually contains one salient object in a simple background; (ii) objects from the same class often share similar appearances or similar contrasts to the background. Motivated by these insights, we propose a simple and scalable framework FREESEG for extracting and leveraging these “free” object segments to facilitate model training. Concretely, we investigate the similarity among object-centric images of the same class to propose candidate segments of foreground instances, followed by a novel ranking of segment quality. The resulting high quality object segments can then be used to augment the existing long-tailed datasets, e.g., by copying and pasting the segments onto the original training images. Extensive experiments show that FREESEG yields substantial improvements on top of strong baselines and achieves state-of-the-art accuracy for segmenting rare object categories. 
    more » « less
  5. Instance segmentation of neural cells plays an important role in brain study. However, this task is challenging due to the special shapes and behaviors of neural cells. Existing methods are not precise enough to capture their tiny structures, e.g., filopodia and lamellipodia, which are critical to the understanding of cell interaction and behavior. To this end, we propose a novel deep multi-task learning model to jointly detect and segment neural cells instance-wise. Our method is built upon SSD, with ResNet101 as the backbone to achieve both high detection accuracy and fast speed. Furthermore, unlike existing works which tend to produce wavy and inaccurate boundaries, we embed a deconvolution module into SSD to better capture details. Experiments on a dataset of neural cell microscopic images show that our method is able to achieve better per- formance in terms of accuracy and efficiency, comparing favorably with current state-of-the-art methods. 
    more » « less