skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Improved DeepSORT-Based Object Tracking in Foggy Weather for AVs Using Sematic Labels and Fused Appearance Feature Network
The presence of fog in the background can prevent small and distant objects from being detected, let alone tracked. Under safety-critical conditions, multi-object tracking models require faster tracking speed while maintaining high object-tracking accuracy. The original DeepSORT algorithm used YOLOv4 for the detection phase and a simple neural network for the deep appearance descriptor. Consequently, the feature map generated loses relevant details about the track being matched with a given detection in fog. Targets with a high degree of appearance similarity on the detection frame are more likely to be mismatched, resulting in identity switches or track failures in heavy fog. We propose an improved multi-object tracking model based on the DeepSORT algorithm to improve tracking accuracy and speed under foggy weather conditions. First, we employed our camera-radar fusion network (CR-YOLOnet) in the detection phase for faster and more accurate object detection. We proposed an appearance feature network to replace the basic convolutional neural network. We incorporated GhostNet to take the place of the traditional convolutional layers to generate more features and reduce computational complexities and costs. We adopted a segmentation module and fed the semantic labels of the corresponding input frame to add rich semantic information to the low-level appearance feature maps. Our proposed method outperformed YOLOv5 + DeepSORT with a 35.15% increase in multi-object tracking accuracy, a 32.65% increase in multi-object tracking precision, a speed increase by 37.56%, and identity switches decreased by 46.81%.  more » « less
Award ID(s):
1824267
PAR ID:
10579926
Author(s) / Creator(s):
;
Publisher / Repository:
Sensors
Date Published:
Journal Name:
Sensors
Volume:
24
Issue:
14
ISSN:
1424-8220
Page Range / eLocation ID:
4692
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. AVs are affected by reduced maneuverability and performance due to the degradation of sensor performances in fog. Such degradation can cause significant object detection errors in AVs’ safety-critical conditions. For instance, YOLOv5 performs well under favorable weather but is affected by mis-detections and false positives due to atmospheric scattering caused by fog particles. The existing deep object detection techniques often exhibit a high degree of accuracy. Their drawback is being sluggish in object detection in fog. Object detection methods with a fast detection speed have been obtained using deep learning at the expense of accuracy. The problem of the lack of balance between detection speed and accuracy in fog persists. This paper presents an improved YOLOv5-based multi-sensor fusion network that combines radar object detection with a camera image bounding box. We transformed radar detection by mapping the radar detections into a two-dimensional image coordinate and projected the resultant radar image onto the camera image. Using the attention mechanism, we emphasized and improved the important feature representation used for object detection while reducing high-level feature information loss. We trained and tested our multi-sensor fusion network on clear and multi-fog weather datasets obtained from the CARLA simulator. Our results show that the proposed method significantly enhances the detection of small and distant objects. Our small CR-YOLOnet model best strikes a balance between accuracy and speed, with an accuracy of 0.849 at 69 fps. 
    more » « less
  2. Object detection is a crucial task for autonomous driving. In addition to requiring high accuracy to ensure safety, object detection for autonomous driving also requires realtime inference speed to guarantee prompt vehicle control, as well as small model size and energy efficiency to enable embedded system deployment. In this work, we propose SqueezeDet, a fully convolutional neural network for object detection that aims to simultaneously satisfy all of the above constraints. In our network we use convolutional layers not only to extract feature maps, but also as the output layer to compute bounding boxes and class probabilities. The detection pipeline of our model only contains a single forward pass of a neural network, thus it is extremely fast. Our model is fully convolutional, which leads to small model size and better energy efficiency. Finally, our experiments show that our model is very accurate, achieving state-of-the-art accuracy on the KITTI [9] benchmark. The source code of SqueezeDet is open-source released1. 
    more » « less
  3. Camera-based systems are increasingly used for collecting information on intersections and arterials. Unlike loop controllers that can generally be only used for detection and movement of vehicles, cameras can provide rich information about the traffic behavior. Vision-based frameworks for multiple-object detection, object tracking, and near-miss detection have been developed to derive this information. However, much of this work currently addresses processing videos offline. In this article, we propose an integrated two-stream convolutional networks architecture that performs real-time detection, tracking, and near-accident detection of road users in traffic video data. The two-stream model consists of a spatial stream network for object detection and a temporal stream network to leverage motion features for multiple-object tracking. We detect near-accidents by incorporating appearance features and motion features from these two networks. Further, we demonstrate that our approaches can be executed in real-time and at a frame rate that is higher than the video frame rate on a variety of videos collected from fisheye and overhead cameras. 
    more » « less
  4. The rapid progress in intelligent vehicle technology has led to a significant reliance on computer vision and deep neural networks (DNNs) to improve road safety and driving experience. However, the image signal processing (ISP) steps required for these networks, including demosaicing, color correction, and noise reduction, increase the overall processing time and computational resources. To address this, our paper proposes an improved version of the Faster R-CNN algorithm that integrates camera parameters into raw image input, reducing dependence on complex ISP steps while enhancing object detection accuracy. Specifically, we introduce additional camera parameters, such as ISO speed rating, exposure time, focal length, and F-number, through a custom layer into the neural network. Further, we modify the traditional Faster R-CNN model by adding a new fully connected layer, combining these parameters with the original feature maps from the backbone network. Our proposed new model, which incorporates camera parameters, has a 4.2% improvement in mAP@[0.5,0.95] compared to the traditional Faster RCNN model for object detection tasks on raw image data. 
    more » « less
  5. Vision transformers (ViTs) have dominated computer vision in recent years. However, ViTs are computationally expensive and not well suited for mobile devices; this led to the prevalence of convolutional neural network (CNN) and ViT-based hybrid models for mobile vision applications. Recently, Vision GNN (ViG) and CNN hybrid models have also been proposed for mobile vision tasks. However, all of these methods remain slower compared to pure CNN-based models. In this work, we propose Multi-Level Dilated Convolutions to devise a purely CNN-based mobile backbone. Using Multi-Level Dilated Convolutions allows for a larger theoretical receptive field than standard convolutions. Different levels of dilation also allow for interactions between the short-range and long-range features in an image. Experiments show that our proposed model outperforms state-of-the-art (SOTA) mobile CNN, ViT, ViG, and hybrid architectures in terms of accuracy and/or speed on image classification, object detection, instance segmentation, and semantic segmentation. Our fastest model, RapidNet-Ti, achieves 76.3% top-1 accuracy on ImageNet-1K with 0.9 ms inference latency on an iPhone 13 mini NPU, which is faster and more accurate than MobileNetV2x1.4 (74.7% top-1 with 1.0 ms latency). Our work shows that pure CNN architectures can beat SOTA hybrid and ViT models in terms of accuracy and speed when designed properly 
    more » « less