Object detection in high-resolution aerial images is a challenging task because of (1) the large variation in object size and (2) the non-uniform distribution of objects. A common solution is to divide the large aerial image into small (uniform) crops and then apply object detection to each crop. In this paper, we investigate the image cropping strategy to address these challenges. Specifically, we propose a Density-Map guided object detection Network (DMNet), inspired by the observation that the object density map of an image encodes how objects are distributed through the pixel intensities of the map. Because pixel intensity varies with object density, the map indicates whether a region contains objects, which in turn provides statistical guidance for cropping. DMNet has three key components: a density map generation module, an image cropping module, and an object detector. DMNet generates a density map and learns scale information from the density intensities to form cropping regions. Extensive experiments show that DMNet achieves state-of-the-art performance on two popular aerial image datasets, i.e., VisDrone and UAVDT.
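To make the cropping idea concrete, here is a minimal, hypothetical sketch (not the authors' code) of how a predicted density map could be turned into crop regions: threshold the map, group the remaining pixels into connected components, and take their bounding boxes as crop windows. The threshold and minimum-area values are illustrative assumptions.

    import numpy as np
    from scipy import ndimage

    def density_guided_crops(density_map, threshold=0.05, min_area=64):
        """Derive crop windows from a 2D object-density map (illustrative only)."""
        mask = density_map > threshold        # pixels likely to contain objects
        labeled, _ = ndimage.label(mask)      # connected components of the mask
        crops = []
        for ys, xs in ndimage.find_objects(labeled):
            h, w = ys.stop - ys.start, xs.stop - xs.start
            if h * w >= min_area:             # discard spurious tiny regions
                crops.append((xs.start, ys.start, xs.stop, ys.stop))
        return crops                          # (x1, y1, x2, y2) windows

Each returned window would then be resized and passed to the detector, with detections mapped back to full-image coordinates.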
Clustered Object Detection in Aerial Images
Detecting objects in aerial images is challenging for at least two reasons: (1) target objects such as pedestrians are very small in pixels, making them hard to distinguish from the surrounding background; and (2) targets are in general sparsely and non-uniformly distributed, making detection very inefficient. In this paper, we address both issues, inspired by the observation that these targets are often clustered. In particular, we propose a Clustered Detection (ClusDet) network that unifies object clustering and detection in an end-to-end framework. The key components of ClusDet include a cluster proposal sub-network (CPNet), a scale estimation sub-network (ScaleNet), and a dedicated detection network (DetecNet). Given an input image, CPNet produces object cluster regions and ScaleNet estimates object scales for these regions. Then, each scale-normalized cluster region, together with its features, is fed into DetecNet for object detection. ClusDet has several advantages over previous solutions: (1) it greatly reduces the number of chips for final object detection and hence achieves high run-time efficiency; (2) the cluster-based scale estimation is more accurate than previously used single-object-based ones, and hence effectively improves the detection of small objects; and (3) the final DetecNet is dedicated to clustered regions and implicitly models prior context information so as to boost detection accuracy. The proposed method is tested on three popular aerial image datasets: VisDrone, UAVDT, and DOTA. In all experiments, ClusDet achieves promising performance in comparison with state-of-the-art detectors.
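The following hypothetical sketch (assumed names, not the released implementation) illustrates the three-stage inference flow described above: cluster proposals, per-cluster scale normalization, per-chip detection, and a final merge.

    def clusdet_inference(image, cpnet, scalenet, detecnet, merge_nms):
        """Illustrative ClusDet-style pipeline; all callables are assumed."""
        detections = []
        for region in cpnet(image):                       # proposed cluster regions
            scale = scalenet(image, region)               # estimated object scale
            chip = crop_and_resize(image, region, scale)  # scale-normalized chip
            detections.extend(detecnet(chip))             # detect within the chip
        return merge_nms(detections)                      # fuse chip-level results

Here crop_and_resize and merge_nms are placeholders standing in for the chip extraction and fusion steps implied by the abstract.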
- Award ID(s): 1814745
- NSF-PAR ID: 10109443
- Date Published:
- Journal Name: IEEE International Conference on Computer Vision Workshops
- ISSN: 2473-9936
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
Long-range target detection in thermal infrared imagery is a challenging research problem due to the low resolution and limited detail captured by thermal sensors. The limited size and variability of thermal image datasets for small target detection is also a major constraint on the development of accurate and robust detection algorithms. To address both the sensor and data constraints, we propose a novel convolutional neural network (CNN) feature extraction architecture designed for small object detection in data-limited settings. More specifically, we focus on long-range ground-based thermal vehicle detection, but also show the effectiveness of the proposed algorithm on drone and satellite aerial imagery. The design of the proposed architecture is inspired by an analysis of popular object detectors as well as custom-designed networks. We find that restricted receptive fields (rather than more globalized features, as is the trend), along with less downsampling of feature maps and attenuated processing of fine-grained features, lead to greatly improved detection rates while mitigating the model's capacity to overfit on small or poorly varied datasets. Our approach achieves state-of-the-art results on the Defense Systems Information Analysis Center (DSIAC) automated target recognition (ATR) and the Tiny Object Detection in Aerial Images (AI-TOD) datasets.
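As a rough illustration of the "restricted receptive field, less downsampling" idea (our assumption of what such a backbone looks like, not the paper's architecture), the PyTorch sketch below stacks small 3x3 convolutions with a single pooling stage, so feature maps stay high-resolution and the receptive field stays local.

    import torch.nn as nn

    class SmallTargetBackbone(nn.Module):
        """Illustrative backbone: local receptive field, minimal downsampling."""
        def __init__(self, in_ch=1, width=64):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(in_ch, width, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(width, width, 3, padding=1), nn.ReLU(inplace=True),
                nn.MaxPool2d(2),  # the only downsampling: output stride stays 2
                nn.Conv2d(width, width, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(width, width, 3, padding=1), nn.ReLU(inplace=True),
            )

        def forward(self, x):
            return self.features(x)  # high-resolution features for tiny targets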
UAVs (unmanned aerial vehicles), or drones, are promising instruments for video-based surveillance. Various applications of aerial surveillance use object detection programs to detect target objects. In such applications, three parameters influence a drone deployment strategy: the area covered by the drone, the latency of target (object) detection, and the quality of the detection output by the object detector. Previous works have focused on improving Pareto optimality along the area-latency frontier or the area-quality frontier, but not on the combined area-latency-quality frontier, which makes those solutions sub-optimal for drone-based surveillance. We explore a three-way tradeoff between area, latency, and quality in the context of autonomous aerial surveillance of targets in an area using drones with cameras and an object detection program. We propose Vega, a drone deployment framework that captures these tradeoffs to deploy drones efficiently. We make three contributions with Vega. First, we characterize the ability of the state-of-the-art mobile object detector, EfficientDet [CVPR '20], to detect objects from varying drone altitudes using confidence and IoU curves versus drone altitude. Second, based on these characteristics of the detector, we propose a set of two algorithmic primitives for drone-based maneuvers, namely DroneZoom and DroneCycle. Using these two primitives, we obtain an improved Pareto frontier between our three target parameters (coverage area, detection latency, and detection quality) for a single-drone system. Third, we scale out our findings to a swarm deployment using higher-order Voronoi tessellations, where we control the swarm's spatial density using the Voronoi order to further lower the detection latency while maintaining detection quality.
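As a toy illustration of the area-quality side of this tradeoff (our sketch, not part of Vega), one could use the measured IoU-versus-altitude curve to fly as high (and thus cover as much area) as a quality floor allows:

    def pick_altitude(altitudes, iou_at, coverage_at, min_iou=0.5):
        """Choose the highest-coverage altitude meeting a quality floor.

        iou_at and coverage_at stand for empirical curves measured per
        detector and camera; all names here are illustrative assumptions.
        """
        feasible = [a for a in altitudes if iou_at(a) >= min_iou]
        return max(feasible, key=coverage_at) if feasible else min(altitudes)

A DroneZoom-style maneuver could then move between such altitudes, trading coverage against detection quality on demand.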
Detecting small objects (e.g., manhole covers, license plates, and roadside milestones) in urban images is a long-standing challenge, mainly due to the small scale of the objects and background clutter. Although convolutional neural network (CNN)-based methods have made significant progress and achieved impressive results in generic object detection, the problem of small object detection remains unsolved. To address this challenge, in this study we developed an end-to-end network architecture that has three significant characteristics compared to previous works. First, we designed a backbone network module, namely the Reduced Downsampling Network (RD-Net), to extract informative feature representations with high spatial resolution and preserve local information for small objects. Second, we introduced an Adjustable Sample Selection (ADSS) module, which frees the detector from Intersection-over-Union (IoU) threshold hyperparameters and defines positive and negative training samples based on statistical characteristics of the generated anchors and ground-reference bounding boxes. Third, we incorporated the generalized Intersection-over-Union (GIoU) loss for bounding box regression, which efficiently bridges the gap between distance-based optimization losses and area-based evaluation metrics. We demonstrated the effectiveness of our method through extensive experiments on the public Urban Element Detection (UED) dataset acquired by Mobile Mapping Systems (MMS). The Average Precision (AP) of the proposed method was 81.71%, an improvement of 1.2% over the popular detection framework Faster R-CNN.
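For reference, the GIoU loss mentioned above has a standard closed form: with C the smallest box enclosing predicted box A and reference box B, GIoU = IoU(A, B) - |C \ (A U B)| / |C|, and the loss is 1 - GIoU. A minimal standalone version (ours, for illustration):

    def giou_loss(box_a, box_b):
        """1 - GIoU for axis-aligned boxes given as (x1, y1, x2, y2)."""
        ax1, ay1, ax2, ay2 = box_a
        bx1, by1, bx2, by2 = box_b
        inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
        inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
        inter = inter_w * inter_h
        union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
        iou = inter / union
        # smallest enclosing box C
        c_area = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
        giou = iou - (c_area - union) / c_area
        return 1.0 - giou  # in [0, 2]

Unlike plain IoU, the enclosing-box term yields a nonzero gradient even when the boxes do not overlap, which is what bridges distance-based optimization and area-based evaluation.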
Monocular 3D object parsing is highly desirable in various scenarios, including occlusion reasoning and holistic scene interpretation. We present a deep convolutional neural network (CNN) architecture that localizes semantic parts in 2D images and 3D space while inferring their visibility states, given a single RGB image. Our key insight is to exploit domain knowledge to regularize the network by deeply supervising its hidden layers, in order to sequentially infer intermediate concepts associated with the final task. To acquire training data in the desired quantities, with ground-truth 3D shape and relevant concepts, we render 3D object CAD models to generate large-scale synthetic data and simulate challenging occlusion configurations between objects. We train the network only on synthetic data and demonstrate state-of-the-art performance on real image benchmarks, including an extended version of KITTI, PASCAL VOC, PASCAL3D+, and IKEA, for 2D and 3D keypoint localization and instance segmentation. The empirical results substantiate the utility of our deep supervision scheme, demonstrating effective transfer of knowledge from synthetic data to real images and less overfitting compared to standard end-to-end training.
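A minimal sketch of the deep supervision scheme described here (structure assumed, not the paper's exact network): auxiliary heads attached to hidden stages each predict an intermediate concept and receive their own loss term, so supervision is injected sequentially along the network rather than only at the output.

    import torch.nn as nn

    class DeeplySupervisedNet(nn.Module):
        """Illustrative: one auxiliary prediction head per hidden stage."""
        def __init__(self, stages, heads):
            super().__init__()
            self.stages = nn.ModuleList(stages)  # backbone chunks
            self.heads = nn.ModuleList(heads)    # concept head per stage

        def forward(self, x):
            outputs = []
            for stage, head in zip(self.stages, self.heads):
                x = stage(x)
                outputs.append(head(x))  # e.g., 2D keypoints, then visibility
            return outputs  # training sums a weighted loss over every output

The total training objective would be a weighted sum of per-stage losses, each matching one intermediate concept plus the final task.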