

Title: Conformalized Semi-supervised Random Forest for Classification and Abnormality Detection
The Random Forest classifier, a widely used off-the-shelf classification tool, assumes, like other standard classifiers, that training and test samples come from the same distribution. However, in safety-critical scenarios such as medical diagnosis and network attack detection, discrepancies between the training and test sets, including the potential presence of novel outlier samples not seen during training, can pose significant challenges. To address this problem, we introduce the Conformalized Semi-Supervised Random Forest (CSForest), which couples the conformalization technique Jackknife+aB with semi-supervised tree ensembles to construct a set-valued prediction C(x). Instead of optimizing over the training distribution, CSForest employs unlabeled test samples to enhance accuracy and to flag unseen outliers by generating an empty prediction set. Theoretically, we establish that CSForest covers the true labels of previously observed inlier classes under arbitrary label shift in the test data. We compare CSForest with state-of-the-art methods on synthetic examples and various real-world datasets, under different types of distribution change in the test domain. Our results highlight CSForest's effective prediction of inliers and its ability to detect outlier samples unique to the test data. In addition, CSForest shows persistently good performance as the sizes of the training and test sets vary. Code for CSForest is available at https://github.com/yujinhan98/CSForest.
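The empty-set outlier signal described in the abstract comes from conformal prediction: a candidate label enters C(x) only if its nonconformity score is no larger than a calibrated quantile, so a test point far from every inlier class receives no labels at all. Below is a minimal split-conformal sketch, not the paper's Jackknife+aB-with-trees method; the distance-to-class-mean score and all toy data are illustrative assumptions.

```python
import numpy as np

def conformal_sets(cal_scores, test_scores, alpha=0.1):
    """cal_scores: (n,) nonconformity of the true label on calibration data.
    test_scores: (m, K) nonconformity of each candidate label per test point.
    Returns one label set per test point; an empty set flags a likely outlier."""
    n = len(cal_scores)
    # Conformal quantile with the usual finite-sample correction.
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    q = np.quantile(cal_scores, level)
    return [set(np.flatnonzero(s <= q)) for s in test_scores]

# Toy data: two inlier classes, score = distance to the class mean.
rng = np.random.default_rng(0)
means = np.array([[0.0, 0.0], [5.0, 5.0]])
cal_y = np.repeat([0, 1], 50)
cal_x = means[cal_y] + rng.normal(0.0, 0.5, (100, 2))
cal_scores = np.linalg.norm(cal_x - means[cal_y], axis=1)

test_x = np.array([[0.1, -0.2],      # near class 0: should get {0}
                   [20.0, -20.0]])   # far from both classes: empty set
test_scores = np.linalg.norm(test_x[:, None, :] - means[None], axis=2)
print(conformal_sets(cal_scores, test_scores))
```

The guarantee is marginal coverage of inlier labels at level 1 - alpha; the empty set for the second point is exactly the outlier flag the abstract describes.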
Award ID(s):
2310836
PAR ID:
10511464
Author(s) / Creator(s):
; ;
Publisher / Repository:
PMLR
Date Published:
Journal Name:
Proceedings of The 27th International Conference on Artificial Intelligence and Statistics
ISSN:
2640-3498
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Smart grids are facing many challenges, including cyber-attacks, which can cause devastating damage to the grid. Existing machine-learning-based approaches for detecting cyber-attacks in smart grids are mainly based on supervised learning, which needs representative instances of various attack types to obtain good detection models. In this paper, we investigated semi-supervised outlier detection algorithms for this problem, which use only instances of normal events for model training. Data collected by phasor measurement units (PMUs) were used for training the detection model. The semi-supervised outlier detection algorithms were augmented with deep feature extraction for enhanced detection performance. Our results show that semi-supervised outlier detection algorithms can perform better than popular supervised algorithms. Deep feature extraction can significantly improve the performance of semi-supervised algorithms for detecting cyber-attacks in smart grids.
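The one-class recipe in the abstract above (train only on normal events, flag anything that scores as unusual) can be sketched in a few lines of numpy. This is an illustrative kNN-distance detector with a calibrated threshold, not the detectors or the deep feature extraction used in the study; the sizes and threshold are assumptions.

```python
import numpy as np

def knn_scores(train, x, k=5):
    """Mean distance from each row of x to its k nearest rows in train."""
    d = np.linalg.norm(x[:, None, :] - train[None, :, :], axis=2)
    return np.sort(d, axis=1)[:, :k].mean(axis=1)

rng = np.random.default_rng(1)
normal = rng.normal(0.0, 1.0, (200, 3))      # only normal events are seen
fit, cal = normal[:150], normal[150:]        # calibrate on held-out normals
thresh = np.quantile(knn_scores(fit, cal), 0.99)

test = np.vstack([rng.normal(0.0, 1.0, (5, 3)),   # normal test events
                  rng.normal(8.0, 1.0, (5, 3))])  # unseen attack events
flags = knn_scores(fit, test) > thresh
print(flags)
```

No attack examples are needed at training time, which is the point of the semi-supervised setting: anything far from the learned normal region is flagged.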
  2. Chaudhuri, Kamalika; Jegelka, Stefanie; Song, Le; Szepesvari, Csaba; Niu, Gang; Sabato, Sivan (Ed.)
    Few-shot classification (FSC) requires training models using a few (typically one to five) data points per class. Meta-learning has proven to be able to learn a parametrized model for FSC by training on various other classification tasks. In this work, we propose PLATINUM (semi-suPervised modeL Agnostic meTa learnIng usiNg sUbmodular Mutual information), a novel semi-supervised model agnostic meta learning framework that uses the submodular mutual information (SMI) functions to boost the performance of FSC. PLATINUM leverages unlabeled data in the inner and outer loop using SMI functions during meta-training and obtains richer meta-learned parameterizations. We study the performance of PLATINUM in two scenarios: 1) where the unlabeled data points belong to the same set of classes as the labeled set of a certain episode, and 2) where there exist out-of-distribution classes that do not belong to the labeled set. We evaluate our method on various settings on the miniImageNet, tieredImageNet and CIFAR-FS datasets. Our experiments show that PLATINUM outperforms MAML and semi-supervised approaches like pseudo-labeling for semi-supervised FSC, especially for small ratios of labeled to unlabeled samples.
  3. Few-shot classification (FSC) requires training models using a few (typically one to five) data points per class. Meta-learning has proven to be able to learn a parametrized model for FSC by training on various other classification tasks. In this work, we propose PLATINUM (semi-suPervised modeL Agnostic meTa-learnIng usiNg sUbmodular Mutual information), a novel semi-supervised model agnostic meta-learning framework that uses the submodular mutual information (SMI) functions to boost the performance of FSC. PLATINUM leverages unlabeled data in the inner and outer loop using SMI functions during meta-training and obtains richer meta-learned parameterizations for meta-test. We study the performance of PLATINUM in two scenarios: 1) where the unlabeled data points belong to the same set of classes as the labeled set of a certain episode, and 2) where there exist out-of-distribution classes that do not belong to the labeled set. We evaluate our method on various settings on the miniImageNet, tieredImageNet and Fewshot-CIFAR100 datasets. Our experiments show that PLATINUM outperforms MAML and semi-supervised approaches like pseudo-labeling for semi-supervised FSC, especially for small ratios of labeled examples per class.
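Pseudo-labeling, the semi-supervised baseline that PLATINUM is compared against in the abstracts above, can be sketched simply: fit a classifier on the labeled few-shot data, assign labels to confidently classified unlabeled points, and refit on the enlarged set. Below is an illustrative numpy sketch with a nearest-class-mean classifier; the margin threshold and toy data are assumptions, not the papers' setup.

```python
import numpy as np

def fit_means(x, y, n_classes):
    """Class means as a minimal stand-in classifier."""
    return np.stack([x[y == c].mean(axis=0) for c in range(n_classes)])

def pseudo_label(means, unlabeled, margin=2.0):
    """Return a confidence mask and the nearest-mean label per point."""
    d = np.linalg.norm(unlabeled[:, None] - means[None], axis=2)
    srt = np.sort(d, axis=1)
    confident = (srt[:, 1] - srt[:, 0]) > margin   # clear winner only
    return confident, d.argmin(axis=1)

rng = np.random.default_rng(2)
true_means = np.array([[0.0, 0.0], [6.0, 0.0]])
lab_y = np.repeat([0, 1], 3)                        # three shots per class
lab_x = true_means[lab_y] + rng.normal(0.0, 0.5, (6, 2))
unl_x = true_means[np.repeat([0, 1], 40)] + rng.normal(0.0, 0.5, (80, 2))

means = fit_means(lab_x, lab_y, 2)
conf, pl = pseudo_label(means, unl_x)
# Refit on labeled data plus the confidently pseudo-labeled points.
means2 = fit_means(np.vstack([lab_x, unl_x[conf]]),
                   np.concatenate([lab_y, pl[conf]]), 2)
print(conf.sum(), means2.round(2))
```

The refit estimates benefit from the unlabeled pool exactly when the pseudo-labels are mostly correct, which is why the method degrades at small labeled-to-unlabeled ratios, the regime where PLATINUM is reported to help most.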
  4. This paper addresses the challenge of learning under arbitrary distribution shift, where models are trained on data from one distribution but evaluated on a potentially adversarially generated test distribution. The authors study two frameworks: PQ learning [Goldwasser, A. Kalai, Y. Kalai, Montasser NeurIPS 2020], which allows abstention on adversarial portions of the test distribution, and TDS learning [Klivans, Stavropoulos, Vasilyan COLT 2024], which permits abstention on the entire test set if a distribution shift is detected. Previous algorithms either relied on computationally hard primitives (even for simple classes) or abstained entirely with even small shifts. This work develops efficient algorithms for natural function classes such as intersections of halfspaces and decision trees, under standard training distributions like Gaussians. The key innovation is an improved analysis of spectral outlier-removal techniques from learning with nasty noise, enabling: (1) handling arbitrarily large fractions of outliers, crucial for arbitrary shifts, and (2) stronger bounds on polynomial moments post outlier-removal, yielding new insights into polynomial regression under distribution shift. The techniques also produce novel results for tolerant testable learning [Rubinfeld and Vasilyan STOC 2023] and learning with nasty noise. 
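Spectral outlier removal, the primitive whose analysis the paper above improves, repeatedly checks the top eigenvalue of the empirical covariance and discards the points most responsible for it when it is too large. The loop below is a simplified illustration under assumed toy data and an assumed variance bound; the paper's contribution concerns what such filtering guarantees, not this exact code.

```python
import numpy as np

def spectral_filter(x, bound):
    """Drop the point with the largest squared projection on the top
    covariance eigenvector until the top eigenvalue falls below `bound`."""
    x = np.asarray(x, dtype=float).copy()
    while len(x) > 1:
        c = x - x.mean(axis=0)
        cov = c.T @ c / len(x)
        w, v = np.linalg.eigh(cov)          # eigenvalues in ascending order
        if w[-1] <= bound:
            break
        proj = (c @ v[:, -1]) ** 2          # energy along the top direction
        x = np.delete(x, proj.argmax(), axis=0)
    return x

rng = np.random.default_rng(3)
inliers = rng.normal(0.0, 1.0, (100, 2))    # clean Gaussian training data
outliers = np.array([[50.0, 0.0]] * 5)      # adversarial contamination
cleaned = spectral_filter(np.vstack([inliers, outliers]), bound=4.0)
print(len(cleaned))
```

Because a few far-away points inflate the variance along one direction, the filter removes them first; for a clean Gaussian the top eigenvalue is already below the bound and nothing is touched.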
  5. We consider a family of problems concerned with making predictions for the majority of unlabeled, graph-structured data samples based on a small proportion of labeled samples. Relational information among the data samples, often encoded in the graph/network structure, is shown to be helpful for these semi-supervised learning tasks. However, conventional graph-based regularization methods and recent graph neural networks do not fully leverage the interrelations between the features, the graph, and the labels. In this work, we propose a flexible generative framework for graph-based semi-supervised learning, which models the joint distribution of the node features, labels, and graph structure. Borrowing insights from random graph models in the network science literature, this joint distribution can be instantiated using various distribution families. For the inference of missing labels, we exploit recent advances in scalable variational inference techniques to approximate the Bayesian posterior. We conduct thorough experiments on benchmark datasets for graph-based semi-supervised learning. Results show that the proposed methods outperform state-of-the-art models in most settings.
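For contrast with the generative framework described above, classic graph-based label propagation (one of the conventional graph-based regularization methods the abstract mentions) spreads clamped labels along row-normalized edges until convergence. A self-contained numpy sketch on an assumed toy graph:

```python
import numpy as np

def label_propagation(adj, labels, mask, n_iter=50):
    """adj: (n, n) adjacency; labels: (n,) ints, valid where mask is True.
    Clamped iterative propagation; returns a hard label for every node."""
    n_classes = labels[mask].max() + 1
    f = np.zeros((len(labels), n_classes))
    f[mask, labels[mask]] = 1.0
    p = adj / adj.sum(axis=1, keepdims=True).clip(min=1)  # row-stochastic
    for _ in range(n_iter):
        f = p @ f
        f[mask] = 0.0
        f[mask, labels[mask]] = 1.0   # keep labeled nodes fixed
    return f.argmax(axis=1)

# Two triangles joined by a single edge, one labeled node in each.
adj = np.zeros((6, 6))
for a, b in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
    adj[a, b] = adj[b, a] = 1.0
labels = np.array([0, -1, -1, -1, -1, 1])
mask = labels >= 0
print(label_propagation(adj, labels, mask))  # -> [0 0 0 1 1 1]
```

This baseline uses only the graph and the labels; the generative framework in the abstract additionally models the node features jointly with both.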