A common problem practitioners face is to select rare events in a large dataset. Unfortunately, standard techniques ranging from pre-trained models to active learning do not leverage proximity structure present in many datasets and can lead to worse-than-random results. To address this, we propose EZMODE, an algorithm for iterative selection of rare events in large, unlabeled datasets. EZMODE leverages active learning to iteratively train classifiers, but chooses the easiest positive examples to label in contrast to standard uncertainty techniques. EZMODE also leverages proximity structure (e.g., temporal sampling) to find difficult positive examples. We show that EZMODE can outperform baselines by up to 130× on a novel, real-world, 9,000 GB video dataset.
more »
« less
Similarity Search for Efficient Active Learning and Search of Rare Concepts
Many active learning and search approaches are intractable for large-scale industrial settings with billions of unlabeled examples. Existing approaches search globally for the optimal examples to label, scaling linearly or even quadratically with the unlabeled data. In this paper, we improve the computational efficiency of active learning and search methods by restricting the candidate pool for labeling to the nearest neighbors of the currently labeled set instead of scanning over all of the unlabeled data. We evaluate several selection strategies in this setting on three large-scale computer vision datasets: ImageNet, OpenImages, and a de-identified and aggregated dataset of 10 billion publicly shared images provided by a large internet company. Our approach achieved similar mean average precision and recall as the traditional global approach while reducing the computational cost of selection by up to three orders of magnitude, enabling web-scale active learning.
more »
« less
- Award ID(s):
- 2023239
- PAR ID:
- 10333348
- Date Published:
- Journal Name:
- Proceedings of the AAAI Conference on Artificial Intelligence
- ISSN:
- 2159-5399
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
We study a special paradigm of active learning, called cost effective active search, where the goal is to find a given number of positive points from a large unlabeled pool with minimum labeling cost. Most existing methods solve this problem heuristically, and few theoretical results have been established. We adopt a principled Bayesian approach for the first time. We first derive the Bayesian optimal policy and establish a strong hardness result: the optimal policy is hard to approximate, with the best-possible approximation ratio lower bounded by Ω(n^0.16). We then propose an efficient and nonmyopic policy using the negative Poisson binomial distribution. We propose simple and fast approximations for computing its expectation, which serves as an essential role in our proposed policy. We conduct comprehensive experiments on various domains such as drug and materials discovery, and demonstrate that our proposed search procedure is superior to the widely used greedy baseline.more » « less
-
In-context learning (ICL) empowers large language models (LLMs) to tackle new tasks by using a series of training instances as prompts. Since generating the prompts needs to sample from a vast pool of instances and annotate them (e.g., add labels in classification task), existing methods have proposed to select a subset of unlabeled examples for annotation, thus enhancing the quality of prompts and concurrently mitigating annotation costs. However, these methods often require a long time to select instances due to their complexity, hindering their practical viability. To address this limitation, we propose a graph-based selection method, FastGAS, designed to efficiently identify high-quality instances while minimizing computational overhead. Initially, we construct a data similarity graph based on instance similarities. Subsequently, employing a graph partitioning algorithm, we partition the graph into pieces. Within each piece (i.e., subgraph), we adopt a greedy approach to pick the most representative nodes. By aggregating nodes from diverse pieces and annotating the corresponding instances, we identify a set of diverse and representative instances for ICL. Compared to prior approaches, our method not only exhibits superior performance on different tasks but also significantly reduces selection time. In addition, we demonstrate the efficacy of our approach in LLMs of larger sizes.more » « less
-
null (Ed.)We propose a semi-supervised learning approach for video classification, VideoSSL, using convolutional neural networks (CNN). Like other computer vision tasks, existing supervised video classification methods demand a large amount of labeled data to attain good performance. However, annotation of a large dataset is expensive and time consuming. To minimize the dependence on a large annotated dataset, our proposed semi-supervised method trains from a small number of labeled examples and exploits two regulatory signals from unlabeled data. The first signal is the pseudo-labels of unlabeled examples computed from the confidences of the CNN being trained. The other is the normalized probabilities, as predicted by an image classifier CNN, that captures the information about appearances of the interesting objects in the video. We show that, under the supervision of these guiding signals from unlabeled examples, a video classification CNN can achieve impressive performances utilizing a small fraction of annotated examples on three publicly available datasets: UCF101, HMDB51, and Kinetics.more » « less
-
Sensor fusion approaches combine data from a suite of sensors into an integrated solution that represents the target environment more accurately than that produced by an individual sensor. Deep learning (DL) based approaches can address challenges with sensor fusion more accurately than classical approaches. However, the accuracy of the selected approach can change when sensors are modified, upgraded or swapped out within the system of sensors. Historically, this can require an expensive manual refactor of the sensor fusion solution.This paper develops 12 DL-based sensor fusion approaches and proposes a systematic and iterative methodology for selecting an optimal DL approach and hyperparameter settings simultaneously. The Gradient Descent Multi-Algorithm Grid Search (GD-MAGS) methodology is an iterative grid search technique enhanced by gradient descent predictions and expanded to exchange performance measure information across concurrently running DL-based approaches. Additionally, at each iteration, the worst two performing DL approaches are pruned to reduce the resource usage as computational expense increases from hyperparameter tuning. We evaluate this methodology using an open source, time-series aircraft data set trained on the aircraft’s altitude using multi-modal sensors that measure variables such as velocities, accelerations, pressures, temperatures, and aircraft orientation and position. We demonstrate the selection of an optimal DL model and an increase of 88% in model accuracy compared to the other 11 DL approaches analyzed. Verification of the model selected shows that it outperforms pruned models on data from other aircraft with the same system of sensors.more » « less
An official website of the United States government

