skip to main content

Title: Smart City Traffic Intersection: Impact of Video Quality and Scene Complexity on Precision and Inference
Traffic intersections are prime locations for deployment of infrastructure sensors and edge computing nodes to realize the vision of a smart city. It is expected that the needs of a smart city, in regards to traffic and pedestrian traffic systems monitored by cameras/video, can be met by using stateof-the-art artificial-intelligence (AI) based object detectors and trackers. A critical component in designing an effective real-time object detection/tracking pipeline is the understanding of how object density, i.e., the number of objects in a scene, and imageresolution and frame rate influence the performance metrics. This study explores the accuracy and speed metrics with the goal of supporting pipelines that meet the precision and latency needs of a real-time environment. We examine the impact of varying image-resolution, frame rate and object-density on the object detection performance metrics. The experiments on the COSMOS testbed dataset show that varying the frame width from 416 pixels to 832 pixels, and cropping the images to a square resolution, result in the increase in average precision for all object classes. Decreasing the frame rate from 15 fps to 5 fps preserves more than 90% of the highest F1 score achieved for all object classes. The results inform the choice of video preprocessing stages, modifications to established AI-based object detection/tracking methods, and suggest optimal hyper-parameter values. Index Terms—Object Detection, Smart more » City, Video Resolution, Deep Learning Models. « less
; ; ; ; ; ; ; ; ;
Award ID(s):
2029295 2038984
Publication Date:
Journal Name:
in Proc. 19th IEEE Int. Conf. on Smart City, 2021
Page Range or eLocation-ID:
1521 to 1528
Sponsoring Org:
National Science Foundation
More Like this
  1. The traffic congestion hits most big cities in the world - threatening long delays and serious reductions in air quality. City and local government officials continue to face challenges in optimizing crowd flow, synchronizing traffic and mitigating threats or dangerous situations. One of the major challenges faced by city planners and traffic engineers is developing a robust traffic controller that eliminates traffic congestion and imbalanced traffic flow at intersections. Ensuring that traffic moves smoothly and minimizing the waiting time in intersections requires automated vehicle detection techniques for controlling the traffic light automatically, which are still challenging problems. In this paper, we propose an intelligent traffic pattern collection and analysis model, named TPCAM, based on traffic cameras to help in smooth vehicular movement on junctions and set to reduce the traffic congestion. Our traffic detection and pattern analysis model aims at detecting and calculating the traffic flux of vehicles and pedestrians at intersections in real-time. Our system can utilize one camera to capture all the traffic flows in one intersection instead of multiple cameras, which will reduce the infrastructure requirement and potential for easy deployment. We propose a new deep learning model based on YOLOv2 and adapt the model for themore »traffic detection scenarios. To reduce the network burdens and eliminate the deployment of network backbone at the intersections, we propose to process the traffic video data at the network edge without transmitting the big data back to the cloud. To improve the processing frame rate at the edge, we further propose deep object tracking algorithm leveraging adaptive multi-modal models and make it robust to object occlusions and varying lighting conditions. Based on the deep learning based detection and tracking, we can achieve pseudo-30FPS via adaptive key frame selection.« less
  2. The density and complexity of urban environments present significant challenges for autonomous vehicles. Moreover, ensuring pedestrians’ safety and protecting personal privacy are crucial considerations in these environments. Smart city intersections and AI-powered traffic management systems will be essential for addressing these challenges. Therefore, our research focuses on creating an experimental framework for the design of applications that support the secure and efficient management of traffic intersections in urban areas. We integrated two cameras (street-level and bird’s eye view), both viewing an intersection, and a programmable edge computing node, deployed within the COSMOS testbed in New York City, with a central management platform provided by Kentyou. We designed a pipeline to collect and analyze the video streams from both cameras and obtain real-time traffic/pedestrian-related information to support smart city applications. The obtained information from both cameras is merged, and the results are sent to a dedicated dashboard for real-time visualization and further assessment (e.g., accident prevention). The process does not require sending the raw videos in order to avoid violating pedestrians’ privacy. In this demo, we present the designed video analytic pipelines and their integration with Kentyou central management platform. Index Terms—object detection and tracking, camera networks, smart intersection, real-time visualization
  3. Camera-based systems are increasingly used for collecting information on intersections and arterials. Unlike loop controllers that can generally be only used for detection and movement of vehicles, cameras can provide rich information about the traffic behavior. Vision-based frameworks for multiple-object detection, object tracking, and near-miss detection have been developed to derive this information. However, much of this work currently addresses processing videos offline. In this article, we propose an integrated two-stream convolutional networks architecture that performs real-time detection, tracking, and near-accident detection of road users in traffic video data. The two-stream model consists of a spatial stream network for object detection and a temporal stream network to leverage motion features for multiple-object tracking. We detect near-accidents by incorporating appearance features and motion features from these two networks. Further, we demonstrate that our approaches can be executed in real-time and at a frame rate that is higher than the video frame rate on a variety of videos collected from fisheye and overhead cameras.
  4. Social distancing can reduce the infection rates in respiratory pandemics such as COVID-19. Traffic intersections are particularly suitable for monitoring and evaluation of social distancing behavior in metropolises. Hence, in this paper, we propose and evaluate a real-time privacy-preserving social distancing analysis system (B-SDA), which uses bird’s-eye view video recordings of pedestrians who cross traffic intersections. We devise algorithms for video pre-processing, object detection, and tracking which are rooted in the known computer-vision and deep learning techniques, but modified to address the problem of detecting very small objects/pedestrians captured by a highly elevated camera. We propose a method for incorporating pedestrian grouping for detection of social distancing violations, which achieves 0.92 F1 score. B-SDA is used to compare pedestrian behavior in pre-pandemic and during-pandemic videos in uptown Manhattan, showing that the social distancing violation rate of 15.6% during the pandemic is notably lower than 31.4% prepandemic baseline. Keywords—Social distancing, Object detection, Smart city, Testbeds
  5. Abstract

    Advances in visual perceptual tasks have been mainly driven by the amount, and types, of annotations of large-scale datasets. Researchers have focused on fully-supervised settings to train models using offline epoch-based schemes. Despite the evident advancements, limitations and cost of manually annotated datasets have hindered further development for event perceptual tasks, such as detection and localization of objects and events in videos. The problem is more apparent in zoological applications due to the scarcity of annotations and length of videos-most videos are at most ten minutes long. Inspired by cognitive theories, we present a self-supervised perceptual prediction framework to tackle the problem of temporal event segmentation by building a stable representation of event-related objects. The approach is simple but effective. We rely on LSTM predictions of high-level features computed by a standard deep learning backbone. For spatial segmentation, the stable representation of the object is used by an attention mechanism to filter the input features before the prediction step. The self-learned attention maps effectively localize the object as a side effect of perceptual prediction. We demonstrate our approach on long videos from continuous wildlife video monitoring, spanning multiple days at 25 FPS. We aim to facilitate automated ethogramming bymore »detecting and localizing events without the need for labels. Our approach is trained in an online manner on streaming input and requires only a single pass through the video, with no separate training set. Given the lack of long and realistic (includes real-world challenges) datasets, we introduce a new wildlife video dataset–nest monitoring of the Kagu (a flightless bird from New Caledonia)–to benchmark our approach. Our dataset features a video from 10 days (over 23 million frames) of continuous monitoring of the Kagu in its natural habitat. We annotate every frame with bounding boxes and event labels. Additionally, each frame is annotated with time-of-day and illumination conditions. We will make the dataset, which is the first of its kind, and the code available to the research community. We find that the approach significantly outperforms other self-supervised, traditional (e.g., Optical Flow, Background Subtraction) and NN-based (e.g., PA-DPC, DINO, iBOT), baselines and performs on par with supervised boundary detection approaches (i.e., PC). At a recall rate of 80%, our best performing model detects one false positive activity every 50 min of training. On average, we at least double the performance of self-supervised approaches for spatial segmentation. Additionally, we show that our approach is robust to various environmental conditions (e.g., moving shadows). We also benchmark the framework on other datasets (i.e., Kinetics-GEBD, TAPOS) from different domains to demonstrate its generalizability. The data and code are available on our project page:

    « less