skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.

Attention:

The DOI auto-population feature in the Public Access Repository (PAR) will be unavailable from 4:00 PM ET on Tuesday, July 8 until 4:00 PM ET on Wednesday, July 9 due to scheduled maintenance. We apologize for the inconvenience caused.


Title: Image Retrieval with Similar Object Detection and Local Similarity to Detected Objects
Commercial image search applications like eBay and Pinterest allow users to select the focused area as bounding box over the query images, which improves the retrieval accuracy. The focused area image retrieval strategy motivated our research, but our system has three main advantages over the existing works. (1) Given a query focus area, our approach localizes the most similar region in the database image and only this region is used for computing image similarity. This is done in a unified network whose weights are adjusted both for localization and similarity learning in an end-to-end manner. (2) This is achieved using fewer than five proposals extracted from a saliency map, which speedups the pairwise similarity computation. Usually hundreds or even thousands of proposals are used for localization. (3) For users, our system explains the relevance of the retrieved results by locating the regions in database images most similar to query object. Our method achieves significantly better retrieval performance than the off-the-shelf object localization-based retrieval methods and end-to-end trained triplet method with a region proposal network. Our experimental results demonstrate 86% retrieval rate as compared to 73% achieved by the existing methods on PASCAL VOC07 and VOC12 datasets. Extensive experiments are also conducted on the instance retrieval databases Oxford5k and INSTRE, wherewe exhibit competitive performance. Finally, we provide both quantitative and qualitative results of our retrieval method demonstrating its superiority over commercial image search systems.  more » « less
Award ID(s):
1814745
PAR ID:
10109410
Author(s) / Creator(s):
; ; ;
Date Published:
Journal Name:
Pacific Rim Int. Conf. on Artificial Intelligence
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. In this paper, we will present SketchQL, a video database management system (VDBMS) for retrieving video moments with a sketch-based query interface. This novel interface allows users to specify object trajectory events with simple mouse drag-and-drop operations. Users can use trajectories of single objects as building blocks to compose complex events. Using a pre-trained model that encodes trajectory similarity, SketchQL achieves zero-shot video moments retrieval by performing similarity searches over the video to identify clips that are the most similar to the visual query. In this demonstration, we introduce the graphic user interface of SketchQL and detail its functionalities and interaction mechanisms. We also demonstrate the end-to-end usage of SketchQL from query composition to video moments retrieval using real-world scenarios. 
    more » « less
  2. This work presents OVIR-3D, a straightforward yet effective method for open-vocabulary 3D object instance retrieval without using any 3D data for training. Given a language query, the proposed method is able to return a ranked set of 3D object instance segments based on the feature similarity of the instance and the text query. This is achieved by a multi-view fusion of text-aligned 2D region proposals into 3D space, where the 2D region proposal network could leverage 2D datasets, which are more accessible and typically larger than 3D datasets. The proposed fusion process is efficient as it can be performed in real-time for most indoor 3D scenes and does not require additional training in 3D space. Experiments on public datasets and a real robot show the effectiveness of the method and its potential for applications in robot navigation and manipulation. 
    more » « less
  3. Vision-based localization approaches now underpin newly emerging navigation pipelines for myriad use cases, from robotics to assistive technologies. Compared to sensor-based solutions, vision-based localization does not require pre-installed sensor infrastructure, which is costly, time-consuming, and/or often infeasible at scale. Herein, we propose a novel vision-based localization pipeline for a specific use case: navigation support for end users with blindness and low vision. Given a query image taken by an end user on a mobile application, the pipeline leverages a visual place recognition (VPR) algorithm to find similar images in a reference image database of the target space. The geolocations of these similar images are utilized in a downstream task that employs a weighted-average method to estimate the end user’s location. Another downstream task utilizes the perspective-n-point (PnP) algorithm to estimate the end user’s direction by exploiting the 2D–3D point correspondences between the query image and the 3D environment, as extracted from matched images in the database. Additionally, this system implements Dijkstra’s algorithm to calculate a shortest path based on a navigable map that includes the trip origin and destination. The topometric map used for localization and navigation is built using a customized graphical user interface that projects a 3D reconstructed sparse map, built from a sequence of images, to the corresponding a priori 2D floor plan. Sequential images used for map construction can be collected in a pre-mapping step or scavenged through public databases/citizen science. The end-to-end system can be installed on any internet-accessible device with a camera that hosts a custom mobile application. For evaluation purposes, mapping and localization were tested in a complex hospital environment. The evaluation results demonstrate that our system can achieve localization with an average error of less than 1 m without knowledge of the camera’s intrinsic parameters, such as focal length. 
    more » « less
  4. Recently, contrastive learning has achieved great results in self-supervised learning, where the main idea is to pull two augmentations of an image (positive pairs) closer compared to other random images (negative pairs). We argue that not all negative images are equally negative. Hence, we introduce a self-supervised learning algorithm where we use a soft similarity for the negative images rather than a binary distinction between positive and negative pairs. We iteratively distill a slowly evolving teacher model to the student model by capturing the similarity of a query image to some random images and transferring that knowledge to the student. Specifically, our method should handle unbalanced and unlabeled data better than existing contrastive learning methods, because the randomly chosen negative set might include many samples that are semantically similar to the query image. In this case, our method labels them as highly similar while standard contrastive methods label them as negatives. Our method achieves comparable results to the state-of-the-art models. 
    more » « less
  5. Localizing video moments based on the movement patterns of objects is an important task in video analytics. Existing video analytics systems offer two types of querying interfaces based on natural language and SQL, respectively. However, both types of interfaces have major limitations. SQL-based systems require high query specification time, whereas natural language-based systems require large training datasets to achieve satisfactory retrieval accuracy. To address these limitations, we present SketchQL, a video database management system (VDBMS) for offline, exploratory video moment retrieval that is both easy to use and generalizes well across multiple video moment datasets. To improve ease-of-use, SketchQL features avisual query interfacethat enables users to sketch complex visual queries through intuitive drag-and-drop actions. To improve generalizability, SketchQL operates on object-tracking primitives that are reliably extracted across various datasets using pre-trained models. We present a learned similarity search algorithm for retrieving video moments closely matching the user's visual query based on object trajectories. SketchQL trains the model on a diverse dataset generated with a novel simulator, that enhances its accuracy across a wide array of datasets and queries. We evaluate SketchQL on four real-world datasets with nine queries, demonstrating its superior usability and retrieval accuracy over state-of-the-art VDBMSs. 
    more » « less