Title: Similarity Search for Efficient Active Learning and Search of Rare Concepts
Many active learning and search approaches are intractable for large-scale industrial settings with billions of unlabeled examples. Existing approaches search globally for the optimal examples to label, scaling linearly or even quadratically with the unlabeled data. In this paper, we improve the computational efficiency of active learning and search methods by restricting the candidate pool for labeling to the nearest neighbors of the currently labeled set instead of scanning over all of the unlabeled data. We evaluate several selection strategies in this setting on three large-scale computer vision datasets: ImageNet, OpenImages, and a de-identified and aggregated dataset of 10 billion publicly shared images provided by a large internet company. Our approach achieved mean average precision and recall similar to those of the traditional global approach while reducing the computational cost of selection by up to three orders of magnitude, enabling web-scale active learning.
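The central idea of the abstract, restricting the candidate pool to the nearest neighbors of the labeled set, can be illustrated with a minimal Python sketch. It is not the authors' implementation. All function names, the brute-force neighbor search, and the "closest to the positive centroid" scoring rule are assumptions made for illustration; a large-scale system would presumably query a precomputed similarity index rather than compute all pairwise distances.

```python
import numpy as np


def nearest_neighbors(embeddings, query_ids, k):
    """Return the k nearest points (Euclidean) for each query id.

    Brute force for illustration only; it stands in for a similarity index.
    """
    queries = embeddings[query_ids]
    sq_dists = ((queries[:, None, :] - embeddings[None, :, :]) ** 2).sum(axis=-1)
    return np.argsort(sq_dists, axis=1)[:, :k]


def expand_candidates(candidates, labeled, embeddings, new_ids, k):
    """Add the neighbors of newly labeled points to the pool, minus labeled ids."""
    # k + 1 because each query point is normally its own nearest neighbor.
    neighbors = nearest_neighbors(embeddings, list(new_ids), k + 1)
    return (candidates | set(neighbors.ravel().tolist())) - labeled


def select_batch(score_fn, candidates, batch_size):
    """Score only the candidate pool and return the highest-scoring ids."""
    pool = np.array(sorted(candidates), dtype=int)
    if pool.size == 0:
        return []
    order = np.argsort(-score_fn(pool), kind="stable")[:batch_size]
    return pool[order].tolist()


def run_rounds(embeddings, is_positive, seed_ids, k=5, batch_size=3, rounds=4):
    """Hypothetical labeling loop; seed_ids are assumed to be known positives."""
    labeled = set(seed_ids)
    candidates = expand_candidates(set(), labeled, embeddings, seed_ids, k)
    for _ in range(rounds):
        positives = [i for i in labeled if is_positive[i]]
        centroid = embeddings[positives].mean(axis=0)

        def score_fn(ids):
            # Placeholder selection strategy: prefer points near known positives.
            return -np.linalg.norm(embeddings[ids] - centroid, axis=1)

        batch = select_batch(score_fn, candidates, batch_size)
        if not batch:
            break
        labeled |= set(batch)  # the "oracle" labels come from is_positive
        candidates = expand_candidates(candidates, labeled, embeddings, batch, k)
    return labeled, candidates


# Toy data: 200 random 8-dimensional embeddings with a synthetic rare concept.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 8))
is_positive = embeddings[:, 0] > 1.0
seed_ids = [int(np.flatnonzero(is_positive)[0])]
labeled, candidates = run_rounds(embeddings, is_positive, seed_ids)
print(len(labeled), "labeled;", len(candidates), "candidates out of", len(embeddings))
```

In this sketch each round scores only the candidate pool, whose size grows with the number of labeled examples times `k`, rather than with the size of the full unlabeled set.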
Journal Name:
Proceedings of the AAAI Conference on Artificial Intelligence
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. A common problem practitioners face is to select rare events in a large dataset. Unfortunately, standard techniques ranging from pre-trained models to active learning do not leverage proximity structure present in many datasets and can lead to worse-than-random results. To address this, we propose EZMODE, an algorithm for iterative selection of rare events in large, unlabeled datasets. EZMODE leverages active learning to iteratively train classifiers, but chooses the easiest positive examples to label in contrast to standard uncertainty techniques. EZMODE also leverages proximity structure (e.g., temporal sampling) to find difficult positive examples. We show that EZMODE can outperform baselines by up to 130× on a novel, real-world, 9,000 GB video dataset. 
  2. There are many applications where positive instances are rare but important to identify. For example, in NLP, positive sentences for a given relation are rare in a large corpus. Positive data are more informative for learning in these applications, but before one labels a certain amount of data, it is unknown where to find the rare positives. Since random sampling can lead to significant waste in labeling effort, previous "active search" methods use a single bandit model to learn about the data distribution (exploration) while sampling from the regions potentially containing more positives (exploitation). Many bandit models are possible and a sub-optimal model reduces labeling efficiency, but the optimal model is unknown before any data are labeled. We propose Meta-AS (Meta Active Search) that uses a meta-bandit to evaluate a set of base bandits and aims to label positive examples efficiently, compared with the optimal base bandit in hindsight. The meta-bandit estimates the mean and variance of the performance of the base bandits, and selects a base bandit to propose what data to label next for exploration or exploitation. The feedback in the labels updates both the base bandits and the meta-bandit for the next round. Meta-AS can accommodate a diverse set of base bandits to explore assumptions about the dataset, without over-committing to a single model before labeling starts. Experiments on five datasets for relation extraction demonstrate that Meta-AS labels positives more efficiently than the base bandits and other bandit selection strategies.
  3. Sensor fusion approaches combine data from a suite of sensors into an integrated solution that represents the target environment more accurately than that produced by an individual sensor. Deep learning (DL) based approaches can address challenges with sensor fusion more accurately than classical approaches. However, the accuracy of the selected approach can change when sensors are modified, upgraded or swapped out within the system of sensors. Historically, this can require an expensive manual refactor of the sensor fusion solution. This paper develops 12 DL-based sensor fusion approaches and proposes a systematic and iterative methodology for selecting an optimal DL approach and hyperparameter settings simultaneously. The Gradient Descent Multi-Algorithm Grid Search (GD-MAGS) methodology is an iterative grid search technique enhanced by gradient descent predictions and expanded to exchange performance measure information across concurrently running DL-based approaches. Additionally, at each iteration, the worst two performing DL approaches are pruned to reduce the resource usage as computational expense increases from hyperparameter tuning. We evaluate this methodology using an open source, time-series aircraft data set trained on the aircraft’s altitude using multi-modal sensors that measure variables such as velocities, accelerations, pressures, temperatures, and aircraft orientation and position. We demonstrate the selection of an optimal DL model and an increase of 88% in model accuracy compared to the other 11 DL approaches analyzed. Verification of the model selected shows that it outperforms pruned models on data from other aircraft with the same system of sensors.
  4. We study a special paradigm of active learning, called cost effective active search, where the goal is to find a given number of positive points from a large unlabeled pool with minimum labeling cost. Most existing methods solve this problem heuristically, and few theoretical results have been established. We adopt a principled Bayesian approach for the first time. We first derive the Bayesian optimal policy and establish a strong hardness result: the optimal policy is hard to approximate, with the best-possible approximation ratio lower bounded by Ω(n^0.16). We then propose an efficient and nonmyopic policy using the negative Poisson binomial distribution. We propose simple and fast approximations for computing its expectation, which plays an essential role in our proposed policy. We conduct comprehensive experiments on various domains such as drug and materials discovery, and demonstrate that our proposed search procedure is superior to the widely used greedy baseline.
  5. Inspirational stimuli are known to be effective in supporting ideation during early-stage design. However, prior work has predominantly constrained designers to using text-only queries when searching for stimuli, which is not consistent with real-world design behavior where fluidity across modalities (e.g., visual, semantic, etc.) is standard practice. In the current work, we introduce a multi-modal search platform that retrieves inspirational stimuli in the form of 3D-model parts using text, appearance, and function-based search inputs. Computational methods leveraging a deep-learning approach are presented for designing and supporting this platform, which relies on deep-neural networks trained on a large dataset of 3D-model parts. This work further presents the results of a cognitive study (n = 21) where the aforementioned search platform was used to find parts to inspire solutions to a design challenge. Participants engaged with three different search modalities: by keywords, 3D parts, and user-assembled 3D parts in their workspace. When searching by parts that are selected or in their workspace, participants had additional control over the similarity of appearance and function of results relative to the input. The results of this study demonstrate that the modality used impacts search behavior, such as in search frequency, how retrieved search results are engaged with, and how broadly the search space is covered. Specific results link interactions with the interface to search strategies participants may have used during the task. Findings suggest that when searching for inspirational stimuli, desired results can be achieved both by direct search inputs (e.g., by keyword) as well as by more randomly discovered examples, where a specific goal was not defined. Both search processes are found to be important to enable when designing search platforms for inspirational stimuli retrieval.