

Title: Detecting Traffic Incidents Using Persistence Diagrams
We introduce a novel methodology for anomaly detection in time-series data. The method uses persistence diagrams and bottleneck distances to identify anomalies. Specifically, we generate multiple predictors by randomly bagging the data (reference bags) and then, for each data point, replacing a randomly chosen point in each bag with that data point (modified bags). The predictors are then the set of bottleneck distances for the reference/modified bag pairs. We prove the stability of the predictors as the number of bags increases. We apply our methodology to traffic data and measure its performance in identifying known incidents.
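The procedure can be made concrete with a short sketch. The following is a minimal illustration rather than the paper's implementation: it assumes the `ripser` and `persim` Python packages, uses a delay embedding to turn the time series into a point cloud, and picks the bag size, number of bags, and homology dimension (H0 with the infinite point dropped) as arbitrary illustrative choices.

```python
# A minimal sketch of the bagged bottleneck-distance predictors described above,
# assuming the `ripser` and `persim` packages. The delay embedding, bag size,
# number of bags, and use of the H0 diagram are illustrative choices, not the
# paper's parameters.
import numpy as np
from ripser import ripser
from persim import bottleneck

def delay_embed(series, dim=3, tau=1):
    """Sliding-window (delay) embedding of a 1-D series into R^dim."""
    n = len(series) - (dim - 1) * tau
    return np.column_stack([series[i * tau:i * tau + n] for i in range(dim)])

def finite_h0(points):
    """H0 persistence diagram with the single infinite-death point removed."""
    dgm = ripser(points, maxdim=0)['dgms'][0]
    return dgm[np.isfinite(dgm[:, 1])]

def bottleneck_predictors(cloud, query, n_bags=20, bag_size=40, seed=None):
    """For each reference bag, replace one randomly chosen point with `query`
    (the modified bag) and record the bottleneck distance between diagrams."""
    rng = np.random.default_rng(seed)
    distances = []
    for _ in range(n_bags):
        idx = rng.choice(len(cloud), size=bag_size, replace=False)
        reference = cloud[idx]
        modified = reference.copy()
        modified[rng.integers(bag_size)] = query   # the modified bag
        distances.append(bottleneck(finite_h0(reference), finite_h0(modified)))
    return np.array(distances)

# Example: score one point of a noisy periodic series against the rest.
series = np.sin(np.linspace(0, 20 * np.pi, 600)) + 0.05 * np.random.default_rng(1).normal(size=600)
cloud = delay_embed(series)
print(bottleneck_predictors(cloud, query=cloud[100], n_bags=10).mean())
```

A point whose replacement consistently perturbs the diagrams (for example, a large mean or maximum distance relative to typical points) would then be flagged as anomalous.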
Award ID(s):
1830254 1934884
NSF-PAR ID:
10275086
Author(s) / Creator(s):
Date Published:
Journal Name:
Algorithms
Volume:
13
Issue:
9
ISSN:
1999-4893
Page Range / eLocation ID:
222
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Prediction of organismal viability upon exposure to a nanoparticle in varying environments, as fully specified at the molecular scale, has emerged as a useful figure of merit in the design of engineered nanoparticles. We build on our earlier finding that a bag of artificial neural networks (ANNs) can provide such a prediction when such machines are trained with a relatively small data set (with ca. 200 examples). Therein, viabilities were predicted by consensus using the weighted means of the predictions from the bags. Here, we confirm the accuracy and precision of the prediction of nanoparticle viabilities using an optimized bag of ANNs over sets of data examples that had not previously been used in the training and validation process. We also introduce the viability strip, rather than a single value, as the prediction and construct it from the viability probability distribution of an ensemble of ANNs compatible with the data set. Specifically, the ensemble consists of the ANNs arising from subsets of the data set corresponding to different splittings between training and validation, and the different bags (k-folds). A (k−1)/k machine uses a single partition (or bag) of k − 1 ANNs, each trained on 1/k of the data, to obtain a consensus prediction, and a k-bag machine quorum samples the k possible (k−1)/k machines available for a given partition. We find that with increasing k in the k-bag or (k−1)/k machines, the viability strips become more normally distributed and their predictions become more precise. Benchmark comparisons between ensembles of 4-bag machines and 3/4 fraction machines suggest that the 3/4 fraction machine has similar accuracy while overcoming some of the challenges arising from divergent ANNs in the 4-bag machines.
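For intuition, a small sketch of the (k−1)/k and k-bag constructions follows. It uses scikit-learn's MLPRegressor as a stand-in ANN, synthetic descriptor/viability data, and a plain (unweighted) mean as the consensus; all of these are assumptions for illustration, not the authors' setup.

```python
# A minimal sketch of (k-1)/k and k-bag machines, with MLPRegressor as a
# stand-in ANN and synthetic data; parameters are illustrative only.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))                              # ~200 examples, as in the abstract
y = 1 / (1 + np.exp(-X[:, 0] + 0.5 * X[:, 1]))             # synthetic "viability" in (0, 1)

k = 4
folds = np.array_split(rng.permutation(len(X)), k)

# One ANN per fold, each trained on 1/k of the data.
fold_anns = [MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000).fit(X[f], y[f])
             for f in folds]

def k_minus_1_over_k_machine(x, held_out):
    """(k-1)/k machine: consensus (here a plain mean) of the k-1 ANNs whose fold is not held out."""
    preds = [ann.predict(x) for i, ann in enumerate(fold_anns) if i != held_out]
    return np.mean(preds, axis=0)

def k_bag_machine(x):
    """k-bag machine: samples the k possible (k-1)/k machines for this partition,
    giving a strip (distribution) of predictions rather than a single value."""
    return np.array([k_minus_1_over_k_machine(x, i) for i in range(k)])

strip = k_bag_machine(X[:5])                               # k predictions per query point
print(strip.mean(axis=0), strip.std(axis=0))
```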
  2. Learning from label proportions (LLP) is a weakly supervised setting for classification in which unlabeled training instances are grouped into bags, and each bag is annotated with the proportion of each class occurring in that bag. Prior work on LLP has yet to establish a consistent learning procedure, nor does there exist a theoretically justified, general purpose training criterion. In this work we address these two issues by posing LLP in terms of mutual contamination models (MCMs), which have recently been applied successfully to study various other weak supervision settings. In the process, we establish several novel technical results for MCMs, including unbiased losses and generalization error bounds under non-iid sampling plans. We also point out the limitations of a common experimental setting for LLP, and propose a new one based on our MCM framework.
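To make the setting concrete, here is a minimal, hypothetical construction of LLP training data; the instances, bag size, and grouping scheme are illustrative and not taken from the paper.

```python
# A minimal sketch of the LLP setting: instances are grouped into bags and only
# per-bag class proportions are revealed to the learner, never the labels.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                                       # unlabeled training instances
y = (X[:, 0] + rng.normal(scale=0.5, size=1000) > 0).astype(int)     # hidden instance-level labels

bag_size = 50
bag_ids = rng.permutation(len(X)).reshape(-1, bag_size)

bags = [X[idx] for idx in bag_ids]                  # the learner sees the bags...
proportions = [y[idx].mean() for idx in bag_ids]    # ...and only the label proportion of each bag
print(proportions[:3])
```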
  3. Learning from label proportions (LLP) is a weakly supervised classification problem where data points are grouped into bags, and the label proportions within each bag are observed instead of the instance-level labels. The task is to learn a classifier to predict the labels of future individual instances. Prior work on LLP for multi-class data has yet to develop a theoretically grounded algorithm. In this work, we propose an approach to LLP based on a reduction to learning with label noise, using the forward correction (FC) loss of Patrini et al. [30]. We establish an excess risk bound and generalization error analysis for our approach, while also extending the theory of the FC loss which may be of independent interest. Our approach demonstrates improved empirical performance in deep learning scenarios across multiple datasets and architectures, compared to the leading methods.
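As a point of reference, the forward-correction loss of Patrini et al. that this reduction builds on can be sketched as below; the transition matrix T here is a made-up example, and how the paper derives the correction from bag proportions is not reproduced in this sketch.

```python
# A minimal sketch of the forward-correction (FC) cross-entropy loss: apply the
# label-noise transition matrix to the predicted clean-class posteriors, then
# take cross-entropy against the noisy labels. T below is hypothetical.
import torch
import torch.nn.functional as F

def forward_corrected_ce(logits, noisy_labels, T):
    """-log( (T^T softmax(logits))[noisy_label] ), with T[i, j] = P(noisy=j | clean=i)."""
    probs = F.softmax(logits, dim=1)        # predicted clean-class posteriors
    corrected = probs @ T                   # implied noisy-label probabilities
    return F.nll_loss(torch.log(corrected + 1e-12), noisy_labels)

# Example with 3 classes and a hypothetical transition matrix.
T = torch.tensor([[0.8, 0.1, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.1, 0.1, 0.8]])
logits = torch.randn(5, 3)
noisy_labels = torch.randint(0, 3, (5,))
print(forward_corrected_ce(logits, noisy_labels, T))
```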