
Title: Mixer: efficiently understanding and retrieving visual content at web-scale
Visual contents, including images and videos, are dominant on the Internet today. Conventional search engines are mainly designed for textual documents and must be extended to process and manage the increasingly high volume of visual data objects. In this paper, we present Mixer, an effective system that identifies and analyzes visual contents and extracts their features for data retrieval, aiming to address two critical issues: (1) understanding visual contents efficiently and in a timely manner, and (2) retrieving them at high precision and recall without impairing performance. In Mixer, visual objects are categorized into different classes, each of which has representative visual features. Subsystems for model production and model execution are developed. Two retrieval layers are designed and implemented for images and videos, respectively, allowing us to perform aggregation retrievals across the two types efficiently. Experiments with Baidu's production workloads and systems show that Mixer halves the model production time and raises the feature production throughput by 9.14x. Mixer also achieves precision and recall of video retrievals at 95% and 97%, respectively. Mixer has been in daily operation, making the search engine highly scalable for visual contents at a low cost. Having observed productivity improvements in upper-level applications of the search engine, we believe our system framework would generally benefit other data processing applications.
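The abstract above describes an image-level retrieval layer and a video-level retrieval layer combined through aggregation. As a rough, hedged illustration only (the abstract does not disclose Mixer's internals), the sketch below aggregates frame-level nearest-neighbor matches into video-level results; the function names, cosine scoring, and sum-based aggregation are assumptions made for the sketch, not Mixer's actual design.

```python
import numpy as np
from collections import defaultdict

# Illustrative sketch only: aggregate image (frame) level nearest-neighbor
# matches into video-level retrieval results. Mixer's real retrieval layers
# are not described in detail in the abstract; the scoring below is an assumption.

def cosine_topk(query, index_vectors, k=10):
    """Return (row index, similarity) pairs for the k most similar feature vectors."""
    q = query / (np.linalg.norm(query) + 1e-12)
    m = index_vectors / (np.linalg.norm(index_vectors, axis=1, keepdims=True) + 1e-12)
    sims = m @ q
    top = np.argsort(-sims)[:k]
    return [(int(i), float(sims[i])) for i in top]

def retrieve_videos(query_frames, frame_index, frame_to_video, k=10):
    """Image layer: match each query frame; video layer: aggregate scores by video id."""
    video_scores = defaultdict(float)
    for frame_feat in query_frames:
        for idx, sim in cosine_topk(frame_feat, frame_index, k):
            video_scores[frame_to_video[idx]] += sim  # simple sum aggregation (assumption)
    return sorted(video_scores.items(), key=lambda kv: -kv[1])

# Toy usage with random features: 100 indexed frames, 10 frames per video.
rng = np.random.default_rng(0)
frame_index = rng.normal(size=(100, 64))
frame_to_video = [i // 10 for i in range(100)]
query_frames = rng.normal(size=(5, 64))
print(retrieve_videos(query_frames, frame_index, frame_to_video)[:3])
```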
Authors:
Award ID(s): 1718450
Publication Date:
NSF-PAR ID: 10294981
Journal Name: Proceedings of the VLDB Endowment
Volume: 14
Issue: 12
Page Range or eLocation-ID: 2906-2917
ISSN: 2150-8097
Sponsoring Org: National Science Foundation
More Like this
  1. Mobile apps are one of the most widely used types of software systems in existence today, and more programmers and students learn how to develop them every day. One of the most popular resources for learning mobile programming is videos hosted on social platforms such as YouTube. While useful, this type of resource also has its limitations, especially when developers are looking for user interface (UI) designs for mobile applications, since these are hard to search for and locate in videos. We propose UIScreens, a web-based analysis and search engine that analyzes the visual contents of mobile programming video tutorials, then identifies and extracts the UI screens displayed in the videos. Our tool offers features such as searching for UI screens in videos, displaying an overview of the UI screens identified in a video under each search result, and navigating to the part of a video where a particular UI screen is being displayed and discussed. In a user study, participants agreed that UIScreens is usable and useful for quickly skimming through videos, while the UI screens it extracts can help developers further determine the relevance of videos to a search topic.
  2. Urban flooding is a major natural disaster that poses a serious threat to the urban environment. There is a strong demand for mapping the flood extent in near real-time to support disaster rescue and relief missions, reconstruction efforts, and financial loss evaluation. Many efforts have been made to identify flooding zones with remote sensing data and image processing techniques. Unfortunately, near real-time production of accurate flood maps over impacted urban areas has not been well investigated, due to three major issues. (1) Satellite imagery with high spatial resolution over urban areas usually has a nonhomogeneous background due to different types of objects such as buildings, moving vehicles, and road networks; as such, classical machine learning approaches can hardly model the spatial relationship between sample pixels in the flooding area. (2) Handcrafted features are usually required as input for conventional flood mapping models, which may not fully exploit the underlying patterns in the large volume of available data. (3) High-resolution optical imagery often has varied pixel digital numbers (DNs) for the same ground objects as a result of highly inconsistent illumination conditions during a flood. Accordingly, traditional flood mapping methods generalize poorly to testing data. To address these issues in urban flood mapping, we developed a patch similarity convolutional neural network (PSNet) using satellite multispectral surface reflectance imagery before and after flooding with a spatial resolution of 3 meters. We used spectral reflectance instead of raw pixel DNs so that the influence of inconsistent illumination caused by varied weather conditions at the time of data collection can be greatly reduced; such consistent spectral reflectance data also enhance the generalization capability of the proposed model. Experiments on high resolution imagery before and after urban flooding events (i.e., the 2017 Hurricane Harvey and the 2018 Hurricane Florence) showed that the developed PSNet can produce urban flood maps with consistently high precision, recall, F1 score, and overall accuracy compared with baseline classification models including support vector machine, decision tree, random forest, and AdaBoost, which were often poor in either precision or recall. The study paves the way to fusing bi-temporal remote sensing images for near real-time precision damage mapping associated with other types of natural hazards (e.g., wildfires and earthquakes). (An illustrative sketch of a bi-temporal patch classifier follows after this list.)
  3. Lecture videos are rapidly becoming an invaluable source of information for students across the globe. Given the large number of online courses currently available, it is important to condense the information within these videos into a compact yet representative summary that can be used for search-based applications. We propose a framework to summarize whiteboard lecture videos by finding feature representations of detected handwritten content regions to determine unique content. We investigate multi-scale histograms of gradients and embeddings from deep metric learning for feature representation. We explicitly handle occluded, growing, and disappearing handwritten content. Our method is capable of producing two kinds of lecture video summaries: the unique regions themselves, or so-called key content, and keyframes (which contain all unique content in a video segment). We use weighted spatio-temporal conflict minimization to segment the lecture and produce keyframes from the detected regions and features. We evaluate both types of summaries and find that we obtain state-of-the-art performance in terms of the number of summary keyframes, while our unique content recall and precision are comparable to the state of the art.
  4. To fill a gap in online educational tools, we are working to support search in lecture videos using formulas from lecture notes and vice versa. We use an existing system to convert single-shot lecture videos to keyframe images that capture whiteboard contents along with the times they appear. We train classifiers for handwritten symbols using the CROHME dataset, and for LaTeX symbols using generated images. Symbols detected in video keyframes and LaTeX formula images are indexed using Line-of-Sight graphs. For search, we look up pairs of symbols that can 'see' each other, and connected pairs are merged to identify the largest match within each indexed image. We rank matches using symbol class probabilities and angles between symbol pairs (see the toy indexing sketch after this list). We demonstrate how our method effectively locates formulas between typeset and handwritten images using a set of linear algebra lectures. By combining our search engine (Tangent-V) with temporal keyframe metadata, we are able to navigate to where a query formula in LaTeX is first handwritten in a lecture video. Our system is available as open source. For other domains, only the OCR modules require updating.
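Item 2 above describes PSNet, a patch-similarity convolutional network over pre- and post-flood surface reflectance imagery. The following is a minimal sketch, not the published PSNet architecture: the band count, layer sizes, and the choice to fuse the two acquisitions by channel concatenation are all assumptions made only to illustrate a bi-temporal patch classifier.

```python
import torch
import torch.nn as nn

# Hedged illustration of a bi-temporal patch classifier (NOT the published PSNet).
class BiTemporalPatchNet(nn.Module):
    def __init__(self, bands=4):
        super().__init__()
        # Pre- and post-flood patches are concatenated along the channel axis (assumption).
        self.features = nn.Sequential(
            nn.Conv2d(2 * bands, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(64, 2)  # flooded vs. not flooded

    def forward(self, pre_patch, post_patch):
        x = torch.cat([pre_patch, post_patch], dim=1)
        x = self.features(x).flatten(1)
        return self.classifier(x)

# Toy usage: a batch of 8 reflectance patch pairs, 4 bands each, 16x16 pixels.
model = BiTemporalPatchNet(bands=4)
pre = torch.rand(8, 4, 16, 16)
post = torch.rand(8, 4, 16, 16)
logits = model(pre, post)  # shape: (8, 2)
```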
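Item 4 above indexes symbol pairs from Line-of-Sight graphs and ranks matches using symbol class probabilities and the angles between symbol pairs. The toy sketch below captures only the pair-indexing and angle-matching idea: it skips Line-of-Sight graph construction, match merging, and class probabilities, and every name and threshold in it is an illustrative assumption rather than Tangent-V's implementation.

```python
import math
from collections import defaultdict

# Toy pair-based formula index in the spirit of item 4 (not Tangent-V's code).

def canonical_pair(sym_a, sym_b):
    """Order two ((x, y), label) symbols by label so the pair key and angle are consistent."""
    (pa, la), (pb, lb) = sorted((sym_a, sym_b), key=lambda s: s[1])
    angle = math.atan2(pb[1] - pa[1], pb[0] - pa[0])
    return (la, lb), angle

def build_index(images):
    """images: {image_id: [((x, y), label), ...]} -> {pair key: [(image_id, angle), ...]}."""
    index = defaultdict(list)
    for image_id, symbols in images.items():
        for i, sym_a in enumerate(symbols):
            for sym_b in symbols[i + 1:]:
                key, angle = canonical_pair(sym_a, sym_b)
                index[key].append((image_id, angle))
    return index

def search(query_symbols, index, angle_tol=0.3):
    """Score indexed images by symbol pairs whose labels and relative angles match the query."""
    scores = defaultdict(float)
    for i, sym_a in enumerate(query_symbols):
        for sym_b in query_symbols[i + 1:]:
            key, q_angle = canonical_pair(sym_a, sym_b)
            for image_id, angle in index.get(key, []):
                if abs(q_angle - angle) < angle_tol:
                    scores[image_id] += 1.0
    return sorted(scores.items(), key=lambda kv: -kv[1])

# Toy usage: index two hypothetical keyframe/notes images, then run a small query.
images = {
    "keyframe_1": [((0, 0), "x"), ((1, 0), "+"), ((2, 0), "y")],
    "notes_7":    [((0, 0), "a"), ((1, 0), "="), ((2, 0), "b")],
}
print(search([((0, 0), "x"), ((1, 0), "+")], build_index(images)))
```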