skip to main content

Title: On The Effectiveness of Active Learning by Uncertainty Sampling in Classification of High-Dimensional Gaussian Mixture Data
Active learning aims to reduce the cost of labeling through selective sampling. Despite reported empirical success over passive learning, many popular active learning heuristics such as uncertainty sampling still lack satisfying theoretical guarantees. Towards closing the gap between practical use and theoretical understanding in active learning, we propose to characterize the exact behavior of uncertainty sampling for high-dimensional Gaussian mixture data, in a modern regime of big data where the numbers of samples and features are commensurately large. Through a sharp characterization of the learning results, our analysis sheds light on the important question of when uncertainty sampling works better than passive learning. Our results show that the effectiveness of uncertainty sampling is not always ensured. In fact it depends crucially on the choice of i) an adequate initial classifier used to start the active sampling process and ii) a proper loss function that allows an adaptive treatment of samples queried at various steps.  more » « less
Award ID(s):
1813877 1846369
Author(s) / Creator(s):
; ; ;
Publisher / Repository:
Date Published:
Page Range / eLocation ID:
4238 to 4242
Medium: X
Singapore, Singapore
Sponsoring Org:
National Science Foundation
More Like this
  1. We consider the problem of active learning for single neuron models, also sometimes called “ridge functions”, in the agnostic setting (under adversarial label noise). Such models have been shown to be broadly effective in modeling physical phenomena, and for constructing surrogate data-driven models for partial differential equations. Surprisingly, we show that for a single neuron model with any Lipschitz non-linearity (such as the ReLU, sigmoid, absolute value, low-degree polynomial, among others), strong provable approximation guarantees can be obtained using a well-known active learning strategy for fitting linear functions in the agnostic setting. Namely, we can collect samples via statistical leverage score sampling, which has been shown to be nearoptimal in other active learning scenarios. We support our theoretical results with empirical simulations showing that our proposed active learning strategy based on leverage score sampling outperforms (ordinary) uniform sampling when fitting single neuron models. 
    more » « less
  2. Within machine learning, active learning studies the gains in performance made possible by adaptively selecting data points to label. In this work, we show through upper and lower bounds, that for a simple benign setting of well-specified logistic regression on a uniform distribution over a sphere, the expected excess error of both active learning and random sampling have the same inverse proportional dependence on the number of samples. Importantly, due to the nature of lower bounds, any more general setting does not allow a better dependence on the number of samples. Additionally, we show a variant of uncertainty sampling can achieve a faster rate of convergence than random sampling by a factor of the Bayes error, a recent empirical observation made by other work. Qualitatively, this work is pessimistic with respect to the asymptotic dependence on the number of samples, but optimistic with respect to finding performance gains in the constants. 
    more » « less
  3. Xu, Jinbo (Ed.)
    Abstract Motivation Cryo-Electron Tomography (cryo-ET) is a 3D bioimaging tool that visualizes the structural and spatial organization of macromolecules at a near-native state in single cells, which has broad applications in life science. However, the systematic structural recognition and recovery of macromolecules captured by cryo-ET are difficult due to high structural complexity and imaging limits. Deep learning-based subtomogram classification has played critical roles for such tasks. As supervised approaches, however, their performance relies on sufficient and laborious annotation on a large training dataset. Results To alleviate this major labeling burden, we proposed a Hybrid Active Learning (HAL) framework for querying subtomograms for labeling from a large unlabeled subtomogram pool. Firstly, HAL adopts uncertainty sampling to select the subtomograms that have the most uncertain predictions. This strategy enforces the model to be aware of the inductive bias during classification and subtomogram selection, which satisfies the discriminativeness principle in AL literature. Moreover, to mitigate the sampling bias caused by such strategy, a discriminator is introduced to judge if a certain subtomogram is labeled or unlabeled and subsequently the model queries the subtomogram that have higher probabilities to be unlabeled. Such query strategy encourages to match the data distribution between the labeled and unlabeled subtomogram samples, which essentially encodes the representativeness criterion into the subtomogram selection process. Additionally, HAL introduces a subset sampling strategy to improve the diversity of the query set, so that the information overlap is decreased between the queried batches and the algorithmic efficiency is improved. Our experiments on subtomogram classification tasks using both simulated and real data demonstrate that we can achieve comparable testing performance (on average only 3% accuracy drop) by using less than 30% of the labeled subtomograms, which shows a very promising result for subtomogram classification task with limited labeling resources. Availability and implementation Supplementary information Supplementary data are available at Bioinformatics online. 
    more » « less
  4. Abstract

    Inference is crucial in modern astronomical research, where hidden astrophysical features and patterns are often estimated from indirect and noisy measurements. Inferring the posterior of hidden features, conditioned on the observed measurements, is essential for understanding the uncertainty of results and downstream scientific interpretations. Traditional approaches for posterior estimation include sampling-based methods and variational inference (VI). However, sampling-based methods are typically slow for high-dimensional inverse problems, while VI often lacks estimation accuracy. In this paper, we proposeα-deep probabilistic inference, a deep learning framework that first learns an approximate posterior usingα-divergence VI paired with a generative neural network, and then produces more accurate posterior samples through importance reweighting of the network samples. It inherits strengths from both sampling and VI methods: it is fast, accurate, and more scalable to high-dimensional problems than conventional sampling-based approaches. We apply our approach to two high-impact astronomical inference problems using real data: exoplanet astrometry and black hole feature extraction.

    more » « less
  5. In multi-robot systems, robots often gather data to improve the performance of their deep neural networks (DNNs) for perception and planning. Ideally, these robots should select the most informative samples from their local data distributions by employing active learning approaches. However, when the data collection is distributed among multiple robots, redundancy becomes an issue as different robots may select similar data points. To overcome this challenge, we propose a fleet active learning (FAL) framework in which robots collectively select informative data samples to enhance their DNN models. Our framework leverages submodular maximization techniques to prioritize the selection of samples with high information gain. Through an iterative algorithm, the robots coordinate their efforts to collectively select the most valuable samples while minimizing communication between robots. We provide a theoretical analysis of the performance of our proposed framework and show that it is able to approximate the NP-hard optimal solution. We demonstrate the effectiveness of our framework through experiments on real-world perception and classification datasets, which include autonomous driving datasets such as Berkeley DeepDrive. Our results show an improvement by up to 25.0% in classification accuracy, 9.2% in mean average precision and 48.5% in the submodular objective value compared to a completely distributed baseline. 
    more » « less