Title: Maze: A Cost-Efficient Video Deduplication System at Web-scale
As Internet video has grown into a dominant service, content-based video deduplication has become essential infrastructure that Internet video services depend on. However, the explosively growing volume of video data on the Internet challenges the scalability of the system's design and implementation in several ways. (1) Although quantization-based indexing techniques are effective for searching visual features at large scale, costly re-training over the complete dataset must be done periodically. (2) The high-dimensional vectors for visual features demand increasingly large SSD space, degrading I/O performance. (3) Videos crawled from the Internet are diverse, and visually similar videos are not necessarily duplicates, increasing deduplication complexity. (4) Most videos are edited. Duplicate content is more likely to be found as clips inside videos, demanding processing techniques that pay close attention to detail. To address the above-mentioned issues, we propose Maze, a full-fledged video deduplication system. Maze has an ANNS layer that indexes and searches the high-dimensional feature vectors. The architecture of the ANNS layer supports efficient reads and writes and eliminates the data migration caused by re-training. Maze adopts a CNN-based feature and the ORB feature as its visual features, which are optimized for the specific video deduplication task. The features are compact and reside entirely in memory. Acoustic features are also incorporated in Maze so that visually similar videos with different audio tracks can be told apart. A clip-based matching algorithm is developed to discover duplicate content at a fine granularity. Maze has been deployed as a production system for two years. It has indexed 1.3 billion videos and is indexing ~800 thousand videos per day. For the ANNS layer, the average read latency is 4 seconds and the average write latency is at most 4.84 seconds. Re-training over the complete dataset is no longer required no matter how many new datasets are added, eliminating the costly data migration between nodes. Maze recognizes duplicate live streaming videos with both similar appearance and similar audio at a recall of 98%. Most importantly, Maze is also cost-effective. For example, the compact feature design helps save 5800 SSDs, and the computation resources devoted to running the whole system decrease to 250K standard cores per billion videos.
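The abstract does not spell out the clip-based matching algorithm, so the following is only an illustrative sketch of the general idea of finding a shared clip at frame granularity, not Maze's method. It assumes each video is already reduced to a list of per-frame feature vectors; the function names `cosine` and `longest_shared_clip`, the cosine-similarity measure, and the `threshold` and `min_len` parameters are all hypothetical choices made for this example.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def longest_shared_clip(query, reference, threshold=0.9, min_len=2):
    """Find the longest run of consecutive, temporally aligned matching frames.

    Returns (query_start, reference_start, length), or None when no run
    reaches min_len frames. Hypothetical illustration only.
    """
    best = None
    # prev[j] holds the length of the matching run ending at the previous
    # query frame and reference frame j - 1 (1-based bookkeeping).
    prev = [0] * (len(reference) + 1)
    for i, q_frame in enumerate(query, start=1):
        cur = [0] * (len(reference) + 1)
        for j, r_frame in enumerate(reference, start=1):
            if cosine(q_frame, r_frame) >= threshold:
                cur[j] = prev[j - 1] + 1  # extend the diagonal run
                if best is None or cur[j] > best[2]:
                    best = (i - cur[j], j - cur[j], cur[j])
        prev = cur
    if best is None or best[2] < min_len:
        return None
    return best


if __name__ == "__main__":
    # Toy one-hot "frame features" standing in for real visual features.
    A, B, C, D = [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]
    print(longest_shared_clip([A, B, C, D], [D, B, C, A]))  # (1, 1, 2)
```

The sketch compares every frame pair, which is quadratic in video length; a system at the scale described above would presumably retrieve candidate frames through its ANNS layer first rather than scan exhaustively.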
Award ID(s):
2005884 1718450 2210753
PAR ID:
10418826
Journal Name:
MM '22: Proceedings of the 30th ACM International Conference on Multimedia
Page Range / eLocation ID:
3163 to 3172
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Video and animation are common ways of delivering concepts that cannot be easily communicated through text. This visual information is often inaccessible to blind and visually impaired people, and alternative representations such as Braille and audio may leave out important details. Audio-haptic displays with supplemental descriptions allow for the presentation of complex spatial information. We introduce the Haptic Video Player, a system for authoring and presenting audio-haptic content from videos. The Haptic Video Player presents video using mobile robots that can be touched as they move over a touch screen. We describe the design of the Haptic Video Player system, and present user studies with educators and blind individuals that demonstrate the ability of this system to render dynamic visual content non-visually.
  2. Visual contents, including images and videos, are dominant on the Internet today. The conventional search engine is mainly designed for textual documents, which must be extended to process and manage increasingly high volumes of visual data objects. In this paper, we present Mixer, an effective system to identify and analyze visual contents and to extract their features for data retrievals, aiming at addressing two critical issues: (1) efficiently and timely understanding visual contents, (2) retrieving them at high precision and recall rates without impairing the performance. In Mixer, the visual objects are categorized into different classes, each of which has representative visual features. Subsystems for model production and model execution are developed. Two retrieval layers are designed and implemented for images and videos, respectively. In this way, we are able to perform aggregation retrievals of the two types in efficient ways. The experiments with Baidu's production workloads and systems show that Mixer halves the model production time and raises the feature production throughput by 9.14x. Mixer also achieves the precision and recall of video retrievals at 95% and 97%, respectively. Mixer has been in its daily operations, which makes the search engine highly scalable for visual contents at a low cost. Having observed productivity improvement of upper-level applications in the search engine, we believe our system framework would generally benefit other data processing applications. 
  3. Visual contents, including images and videos, are dominant on the Internet today. The conventional search engine is mainly designed for textual documents, which must be extended to process and manage increasingly high volumes of visual data objects. In this paper, we present Mixer, an effective system to identify and analyze visual contents and to extract their features for data retrievals, aiming at addressing two critical issues: (1) efficiently and timely understanding visual contents, (2) retrieving them at high precision and recall rates without impairing the performance. In Mixer, the visual objects are categorized into different classes, each of which has representative visual features. Subsystems for model production and model execution are developed. Two retrieval layers are designed and implemented for images and videos, respectively. In this way, we are able to perform aggregation retrievals of the two types in efficient ways. The experiments with Baidu’s production workloads and systems show that Mixer halves the model production time and raises the feature production throughput by 9.14x. Mixer also achieves the precision and recall of video retrievals at 95% and 97%, respectively. Mixer has been in its daily operations, which makes the search engine highly scalable for visual contents at a low cost. Having observed productivity improvement of upper-level applications in the search engine, we believe our system framework would generally benefit other data processing applications.
  4. In this paper we leverage the existence of a property in the duplicate data, named duplicate locality, that reveals the fact that multiple duplicate chunks are likely to occur together. In other words, one duplicate chunk is likely to be immediately followed by a sequence of contiguous duplicate chunks. The longer the sequence, the stronger the locality is. After a quantitative analysis of duplicate locality in real-world data, we propose a suite of chunking techniques that exploit the locality to remove almost all chunking cost for deduplicatable chunks in CDC-based deduplication systems. The resulting deduplication method, named RapidCDC, has two salient features. One is that its efficiency is positively correlated to the deduplication ratio. RapidCDC can be as fast as a fixed-size chunking method when applied on data sets with high data redundancy. The other feature is that its high efficiency does not rely on high duplicate locality strength. These attractive features make RapidCDC’s effectiveness almost guaranteed for datasets with high deduplication ratio. Our experimental results with synthetic and real-world datasets show that RapidCDC’s chunking speedup can be up to 33× higher than regular CDC. Meanwhile, it maintains (nearly) the same deduplication ratio. 
  5. Generating realistic audio for human actions is critical for applications such as film sound effects and virtual reality games. Existing methods assume complete correspondence between video and audio during training, but in real-world settings, many sounds occur off-screen or weakly correspond to visuals, leading to uncontrolled ambient sounds or hallucinations at test time. This paper introduces AV-LDM, a novel ambient-aware audio generation model that disentangles foreground action sounds from ambient background noise in in-the-wild training videos. The approach leverages a retrieval-augmented generation framework to synthesize audio that aligns both semantically and temporally with the visual input. Trained and evaluated on Ego4D and EPIC-KITCHENS datasets, along with the newly introduced Ego4D-Sounds dataset (1.2M curated clips with action-audio correspondence), the model outperforms prior methods, enables controllable ambient sound generation, and shows promise for generalization to synthetic video game clips. This work is the first to emphasize faithful video-to-audio generation focused on observed visual content despite noisy, uncurated training data. 