Streaming video algorithms dynamically select between different versions of a video to deliver the highest quality version that can be viewed without buffering over the client’s connection. To improve the quality for viewers, the backing video service can generate more and/or better versions, but at a significant computational overhead. Processing all videos uploaded to Facebook in the most intensive way would require a prohibitively large cluster. Facebook’s video popularity distribution is highly skewed, however, with analysis on sampled videos showing 1% of them accounting for 83% of the total watch time by users. Thus, if we can predict the future popularity of videos, we can focus the intensive processing on those videos that improve the quality of the most watch time. To address this challenge, we designed CHESS, the first popularity prediction algorithm that is both scalable and accurate. CHESS is scalable because, unlike state-of-the-art approaches, it requires only constant space per video, enabling it to handle Facebook’s video workload. CHESS is accurate because it delivers superior predictions by combining historical access patterns with social signals in a unified online learning framework. We have built a video prediction service, CHESSVPS, using our new algorithm that can handle Facebook’s workload with only four machines. We find that re-encoding popular videos predicted by CHESSVPS enables a higher percentage of total user watch time to benefit from intensive encoding, with less overhead than a recent production heuristic, e.g., 80% of watch time with one-third as much overhead.
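The constant-space property described above can be illustrated with an exponentially decayed access counter, which summarizes a video's entire access history in two numbers. This is only a minimal sketch of the idea: the class name, the half-life parameter, and the update rule are illustrative assumptions, not details taken from the CHESS paper.

```python
import math

class DecayedCounter:
    """Constant-space popularity estimate: an exponentially decayed
    access count. Recent accesses count fully; older ones fade with a
    configurable half-life. (Illustrative sketch, not CHESS itself.)"""

    def __init__(self, half_life_s=3600.0):
        self.decay = math.log(2.0) / half_life_s
        self.value = 0.0    # decayed access count
        self.last_t = 0.0   # timestamp of the last update

    def record(self, t, weight=1.0):
        # Decay the running count forward to time t, then add the access.
        self.value *= math.exp(-self.decay * (t - self.last_t))
        self.last_t = t
        self.value += weight

    def estimate(self, t):
        # Popularity estimate at time t, without mutating state.
        return self.value * math.exp(-self.decay * (t - self.last_t))
```

However many accesses a video receives, the per-video state is just `(value, last_t)`, which is the kind of constant-space footprint the abstract contrasts with history-keeping approaches.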
How Local Is the Local Diversity? Reinforcing Sequential Determinantal Point Processes with Dynamic Ground Sets for Supervised Video Summarization
The large volume of video content and high viewing frequency demand automatic video summarization algorithms, of which a key property is the capability of modeling diversity. If videos are lengthy, like hours-long egocentric videos, it is necessary to track the temporal structures of the videos and enforce local diversity. Local diversity means that shots selected from a short time duration are diverse, while visually similar shots are allowed to co-exist in the summary if they appear far apart in the video. In this project, we propose a novel probabilistic model, built upon SeqDPP [1], to dynamically control the time span of a video segment upon which the local diversity is imposed. In particular, we enable SeqDPP to learn to automatically infer how local the local diversity is supposed to be from the input video. The resulting model is very involved to train by maximum likelihood estimation (MLE), which further suffers from exposure bias and non-differentiable evaluation metrics. To tackle these problems, we instead devise a reinforcement learning algorithm for training the proposed model. Extensive experiments verify the advantages of our model and the new learning algorithm over MLE-based methods.
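The diversity modeling that SeqDPP builds on comes from determinantal point processes (DPPs), where the probability of selecting a subset of shots is proportional to the determinant of the corresponding submatrix of a similarity kernel: near-duplicate shots make the submatrix nearly singular, so diverse subsets score higher. A minimal sketch of that mechanism, assuming a precomputed kernel `L`; the greedy selection below is an illustration, not the sequential inference used in the paper:

```python
import numpy as np

def dpp_unnorm_prob(L, subset):
    """Unnormalized DPP probability of `subset`: det(L_S), where L is a
    positive semi-definite similarity kernel over shots. Redundant
    subsets yield near-zero determinants."""
    if not subset:
        return 1.0
    idx = np.ix_(subset, subset)
    return float(np.linalg.det(L[idx]))

def greedy_diverse(L, k):
    """Greedily pick k shots, each time adding the shot that maximizes
    the determinant of the selected submatrix (MAP-style heuristic)."""
    selected = []
    remaining = list(range(L.shape[0]))
    for _ in range(k):
        best = max(remaining,
                   key=lambda i: dpp_unnorm_prob(L, selected + [i]))
        selected.append(best)
        remaining.remove(best)
    return selected
```

With two identical shots and one distinct shot, the greedy pass selects one copy of the duplicate pair plus the distinct shot, since the duplicate pair alone has determinant zero; restricting the kernel to a short time window is the "local" part of local diversity.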
- Award ID(s):
- 1741431
- PAR ID:
- 10111175
- Date Published:
- Journal Name:
- Computer Vision - ECCV 2018
- Volume:
- 11212
- Page Range / eLocation ID:
- 156-174
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
Advancements in smartphone applications have empowered even non-technical users to perform sophisticated operations such as morphing faces in a few tap operations. While such enablements have positive effects, on the negative side, anyone can now digitally attack face (biometric) recognition systems. For example, the face-swapping application of Snapchat can easily create “swapped” identities and circumvent a face recognition system. This research presents a novel database, termed SWAPPED - Digital Attack Video Face Database, prepared using Snapchat’s application, which swaps/stitches two faces and creates videos. The database contains bonafide face videos and face-swapped videos of multiple subjects. Baseline face recognition experiments using a commercial system show over 90% rank-1 accuracy when attack videos are used as probes. As a second contribution, this research also presents a novel Weighted Local Magnitude Pattern feature descriptor based presentation attack detection algorithm which outperforms several existing approaches.
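The Weighted Local Magnitude Pattern descriptor named above belongs to the family of local-pattern texture features. The sketch below shows the classic local binary pattern (LBP) encoding from which that family derives; it does not reproduce the paper's magnitude weighting, and the function name and neighbour ordering are illustrative choices.

```python
def lbp_code(patch):
    """8-neighbour local binary pattern code for a 3x3 intensity patch:
    each neighbour contributes a bit set when its value is >= the
    center pixel. Swap/stitch artifacts perturb these codes, which is
    what local-pattern presentation attack detectors exploit."""
    center = patch[1][1]
    neighbours = [patch[0][0], patch[0][1], patch[0][2], patch[1][2],
                  patch[2][2], patch[2][1], patch[2][0], patch[1][0]]
    code = 0
    for bit, value in enumerate(neighbours):
        if value >= center:
            code |= 1 << bit
    return code
```

A detector would histogram such codes over face regions and feed the histograms to a classifier; a flat patch yields the all-ones code 255, while an isolated bright center yields 0.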
-
In the era of generative AI, integrating video generation models into robotics opens new possibilities for the general-purpose robot agent. This paper introduces imitation learning with latent video planning (VILP). We propose a latent video diffusion model to generate predictive robot videos that adhere to temporal consistency to a good degree. Our method is able to generate highly time-aligned videos from multiple views, which is crucial for robot policy learning. Our video generation model is highly time-efficient. For example, it can generate videos from two distinct perspectives, each consisting of six frames with a resolution of 96x160 pixels, at a rate of 5 Hz. In the experiments, we demonstrate that VILP outperforms the existing video generation robot policy across several metrics: training costs, inference speed, temporal consistency of generated videos, and the performance of the policy. We also compared our method with other imitation learning methods. Our findings indicate that VILP can rely less on extensive high-quality task-specific robot action data while still maintaining robust performance. In addition, VILP possesses robust capabilities in representing multi-modal action distributions. Our paper provides a practical example of how to effectively integrate video generation models into robot policies, potentially offering insights for related fields and directions. For more details, please refer to our open-source repository https://github.com/ZhengtongXu/VILP.
-
Krause, Andreas (Ed.)
We provide a theoretical framework for Reinforcement Learning with Human Feedback (RLHF). We show that when the underlying true reward is linear, under both the Bradley-Terry-Luce (BTL) model (pairwise comparison) and the Plackett-Luce (PL) model ($$K$$-wise comparison), MLE converges under a certain semi-norm for the family of linear rewards. On the other hand, when training a policy based on the learned reward model, we show that MLE fails while a pessimistic MLE provides policies with good performance under a certain coverage assumption. We also show that under the PL model, both the true MLE and a different MLE which splits the $$K$$-wise comparison into pairwise comparisons converge, while the true MLE is asymptotically more efficient. Our results validate the empirical success of the existing RLHF algorithms and provide new insights for algorithm design. Our analysis can also be applied to the problem of online RLHF and inverse reinforcement learning.
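The BTL model analyzed above says a comparator prefers item $i$ over item $j$ with probability $\sigma(r_i - r_j)$, where $\sigma$ is the logistic function, and MLE fits the scores $r$ to observed comparisons. A minimal sketch, assuming scalar per-item scores rather than the paper's linear-reward setting; the function name and optimizer settings are illustrative:

```python
import math

def btl_mle(n_items, comparisons, lr=0.5, iters=2000):
    """Fit Bradley-Terry-Luce scores by maximum likelihood with plain
    gradient ascent on the log-likelihood. `comparisons` is a list of
    (winner, loser) index pairs. Scores are identifiable only up to an
    additive constant, so the result is centered to mean zero."""
    r = [0.0] * n_items
    for _ in range(iters):
        grad = [0.0] * n_items
        for w, l in comparisons:
            # P(w beats l) = sigmoid(r_w - r_l)
            p = 1.0 / (1.0 + math.exp(r[l] - r[w]))
            grad[w] += 1.0 - p   # push winner up by the surprise
            grad[l] -= 1.0 - p   # push loser down symmetrically
        r = [ri + lr * g for ri, g in zip(r, grad)]
    mean = sum(r) / n_items
    return [ri - mean for ri in r]
```

With item 0 winning three of four comparisons against item 1, the fitted gap converges to $\log 3$, matching the closed-form MLE for a single pair.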
-
In educational research, user-simulation interaction is gaining importance as it provides key insights into the effectiveness of simulation-based learning and immersive technologies. A common approach to study user-simulation interaction involves manually analyzing participant interaction in real-time or via video recordings, which is a tedious process. Surveys/questionnaires are also commonly used but are open to subjectivity and only provide qualitative data. The tool proposed in this paper, which we call Environmental Detection for User-Simulation Interaction Measurement (EDUSIM), is a publicly available video analytics tool that receives screen-recorded video input from participants interacting with a simulated environment and outputs statistical data related to time spent in pre-defined areas of interest within the simulation model. The proposed tool utilizes machine learning, namely multi-classification Convolutional Neural Networks, to provide an efficient, automated process for extracting such navigation data. EDUSIM also implements a binary classification model to flag imperfect input video data such as video frames that are outside the specified simulation environment. To assess the efficacy of the tool, we implement a set of immersive simulation-based learning (ISBL) modules in an undergraduate database course, where learners record their screens as they interact with a simulation to complete their ISBL assignments. We then use the EDUSIM tool to analyze the videos collected and compare the tool’s outputs with the expected results obtained by manually analyzing the videos.
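Once per-frame area-of-interest labels exist (from the CNN classifier) and out-of-simulation frames are flagged (by the binary classifier), turning them into time-spent statistics is a simple aggregation. A sketch of that output stage, assuming a label of `None` marks a flagged frame; the function name and label scheme are illustrative, not EDUSIM's actual interface:

```python
def time_per_area(frame_labels, fps):
    """Aggregate per-frame area-of-interest predictions into seconds
    spent in each area. Frames flagged as outside the simulation
    (label None) are dropped rather than counted toward any area."""
    seconds = {}
    for label in frame_labels:
        if label is None:  # binary out-of-simulation flag fired
            continue
        seconds[label] = seconds.get(label, 0.0) + 1.0 / fps
    return seconds
```

At 2 frames per second, two frames labeled with one area and one frame with another yield 1.0 s and 0.5 s respectively, with flagged frames excluded from every total.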