skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


This content will become publicly available on June 23, 2026

Title: Improving Multi-Camera View Recommendation with Temporal and Camera Embedding
Multi-camera systems are essential in movies, live broadcasts, and other media. The selection of the appropriate camera for every moment has a decisive impact on production quality and audience preferences. Learning-based multi-camera view recommendation frameworks have been explored to assist professionals in decision making. This work explores how two standard cinematography practices could be incorporated into the learning pipeline: (1) not staying on the same camera for too long and (2) introducing a scene from a wider shot and gradually progressing to narrower ones. In these regards, we incorporate (1) the duration of the displaying camera and (2) camera identity as temporal and camera embedding in a transformer architecture, thereby implicitly guiding the model to learn the two practices from professional-labeled data. Experiments show that the proposed framework outperforms the baseline by 14.68% in six-way classification accuracy. Ablation studies on different approaches to embedding the temporal and camera information further verify the efficacy of the framework.  more » « less
Award ID(s):
1900875 2106592
PAR ID:
10635554
Author(s) / Creator(s):
; ;
Publisher / Repository:
IEEE
Date Published:
ISBN:
979-8-3315-2358-9
Page Range / eLocation ID:
1 to 5
Subject(s) / Keyword(s):
Transformer Temporal Information Filming Learning-based Framework Multi-camera System Audience Preferences Heuristic Cross-entropy Loss Film Production Binary Cross Entropy Binary Cross-entropy Loss Embedding Learning Video Editing Position Embedding Multilayer Perception Past Frames
Format(s):
Medium: X
Location:
Darmstadt, Germany
Sponsoring Org:
National Science Foundation
More Like this
  1. Travel-time estimation of traffic flow is an important problem with critical implications for traffic congestion analysis. We developed techniques for using intersection videos to identify vehicle trajectories across multiple cameras and analyze corridor travel time. Our approach consists of (1) multi-object single-camera tracking, (2) vehicle re-identification among different cameras, (3) multi-object multi-camera tracking, and (4) travel-time estimation. We evaluated the proposed framework on real intersections in Florida with pan and fisheye cameras. The experimental results demonstrate the viability and effectiveness of our method. 
    more » « less
  2. null (Ed.)
    Cameras are deployed at scale with the purpose of searching and tracking objects of interest (e.g., a suspected person) through the camera network on live videos. Such cross-camera analytics is data and compute intensive, whose costs grow with the number of cameras and time. We present Spatula, a cost-efficient system that enables scaling cross-camera analytics on edge compute boxes to large camera networks by leveraging the spatial and temporal cross-camera correlations. While such correlations have been used in computer vision community, Spatula uses them to drastically reduce the communication and computation costs by pruning search space of a query identity (e.g., ignoring frames not correlated with the query identity’s current position). Spatula provides the first system substrate on which cross-camera analytics applications can be built to efficiently harness the cross-camera correlations that are abundant in large camera deployments. Spatula reduces compute load by $$8.3\times$$ on an 8-camera dataset, and by $$23\times-86\times$$ on two datasets with hundreds of cameras (simulated from real vehicle/pedestrian traces). We have also implemented Spatula on a testbed of 5 AWS DeepLens cameras. 
    more » « less
  3. Multi-camera systems are indispensable in movies, TV shows, and other media. Selecting the appropriate camera at every timestamp has a decisive impact on production quality and audience preferences. Learning-based view recommendation frameworks can assist professionals in decision-making. However, they often struggle outside of their training domains. The scarcity of labeled multi-camera view recommendation datasets exacerbates the issue. Based on the insight that many videos are edited from the original multi-camera videos, we propose transforming regular videos into pseudo-labeled multi-camera view recommendation datasets. Promisingly, by training the model on pseudo-labeled datasets stemming from videos in the target domain, we achieve a 68% relative improvement in the model’s accuracy in the target domain and bridge the accuracy gap between in-domain and never-before-seen domains. 
    more » « less
  4. ABSTRACT In Smart City and Vehicle-to-Everything (V2X) systems, acquiring pedestrians’ accurate locations is crucial to traffic and pedestrian safety. Current systems adopt cameras and wireless sensors to estimate people’s locations via sensor fusion. Standard fusion algorithms, however, become inapplicable when multi-modal data is not associated. For example, pedestrians are out of the camera field of view, or data from the camera modality is missing. To address this challenge and produce more accurate location estimations for pedestrians, we propose a localization solution based on a Generative Adversarial Network (GAN) architecture. During training, it learns the underlying linkage between pedestrians’ camera-phone data correspondences. During inference, it generates refined position estimations based only on pedestrians’ phone data that consists of GPS, IMU, and FTM. Results show that our GAN produces 3D coordinates at 1 to 2 meters localization error across 5 different outdoor scenes. We further show that the proposed model supports self-learning. The generated coordinates can be associated with pedestrians’ bounding box coordinates to obtain additional camera-phone data correspondences. This allows automatic data collection during inference. Results show that after fine-tuning the GAN model on the expanded 
    more » « less
  5. Driven by advances in computer vision and the falling costs of camera hardware, organizations are deploying video cameras en masse for the spatial monitoring of their physical premises. Scaling video analytics to massive camera deployments, however, presents a new and mounting challenge, as compute cost grows proportionally to the number of camera feeds. This paper is driven by a simple question: can we scale video analytics in such a way that cost grows sublinearly, or even remains constant, as we deploy more cameras, while inference accuracy remains stable, or even improves. We believe the answer is yes. Our key observation is that video feeds from wide-area camera deployments demonstrate significant content correlations (e.g. to other geographically proximate feeds), both in space and over time. These spatio-temporal correlations can be harnessed to dramatically reduce the size of the inference search space, decreasing both workload and false positive rates in multi-camera video analytics. By discussing use-cases and technical challenges, we propose a roadmap for scaling video analytics to large camera networks, and outline a plan for its realization. 
    more » « less