Title: Analysis of Features in a Sliding Threshold of Observation for Numeric Evaluation (STONE) Curve
Abstract: We apply idealized scatterplot distributions to the sliding threshold of observation for numeric evaluation (STONE) curve, a new model assessment metric, to examine the relationship between the STONE curve and the underlying point-spread distribution. The STONE curve is based on the relative operating characteristic (ROC) curve but is developed to work with a continuous-valued set of observations, sweeping both the observed and modeled event identification thresholds simultaneously. This is particularly useful for model predictions of time series data, as is the case for much of terrestrial weather and space weather. The identical sweep of both the model and observational thresholds results in changes to both the modeled and observed event states as the quadrant boundaries shift. The changes in a data-model pair's event status cause nonmonotonic features to appear in the STONE curve when it is compared to an ROC curve for the same observational and model data sets. Such features reveal characteristics in the underlying distributions of the data and model values. Many idealized data sets were created with known distributions, connecting certain scatterplot features to distinct STONE curve signatures. A comprehensive suite of feature-signature combinations is presented, including their relationship to several other metrics. It is shown that nonmonotonic features appear if a local spread is more than 0.2 of the full domain or if a local bias is more than half of the local spread. The example of real-time plasma sheet electron modeling is used to show the usefulness of this technique, especially in combination with other metrics.
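To make the threshold sweep concrete, the minimal Python sketch below (not the authors' code; the idealized data-model distribution, function names, and threshold grid are illustrative assumptions) builds a STONE curve by moving a single threshold over both the observed and modeled values, alongside a conventional ROC curve in which only the model threshold moves against a fixed observed event definition.

```python
import numpy as np

def stone_curve(obs, mod, thresholds):
    """Sweep one threshold over BOTH the observations and the model values.

    For each threshold t, an event is 'observed' when obs >= t and
    'predicted' when mod >= t.  Returns probability of false detection
    (x-axis) and probability of detection (y-axis), one point per t.
    """
    pod, pofd = [], []
    for t in thresholds:
        obs_evt, mod_evt = obs >= t, mod >= t
        hits = np.sum(obs_evt & mod_evt)
        misses = np.sum(obs_evt & ~mod_evt)
        false_alarms = np.sum(~obs_evt & mod_evt)
        correct_neg = np.sum(~obs_evt & ~mod_evt)
        pod.append(hits / max(hits + misses, 1))
        pofd.append(false_alarms / max(false_alarms + correct_neg, 1))
    return np.array(pofd), np.array(pod)

def roc_curve_fixed_obs(obs, mod, obs_threshold, thresholds):
    """Conventional ROC: the observed event definition stays fixed at
    obs_threshold while only the model threshold sweeps."""
    obs_evt = obs >= obs_threshold
    pod, pofd = [], []
    for t in thresholds:
        mod_evt = mod >= t
        hits = np.sum(obs_evt & mod_evt)
        misses = np.sum(obs_evt & ~mod_evt)
        false_alarms = np.sum(~obs_evt & mod_evt)
        correct_neg = np.sum(~obs_evt & ~mod_evt)
        pod.append(hits / max(hits + misses, 1))
        pofd.append(false_alarms / max(false_alarms + correct_neg, 1))
    return np.array(pofd), np.array(pod)

# Idealized scatterplot: model = observation plus a localized bias and random spread.
rng = np.random.default_rng(0)
obs = rng.uniform(0.0, 1.0, 5000)
mod = obs + 0.1 * np.exp(-((obs - 0.6) / 0.1) ** 2) + rng.normal(0.0, 0.05, obs.size)

ts = np.linspace(0.05, 0.95, 91)
stone_pofd, stone_pod = stone_curve(obs, mod, ts)
roc_pofd, roc_pod = roc_curve_fixed_obs(obs, mod, obs_threshold=0.5, thresholds=ts)
```

Because both event definitions move with the threshold, a data-model pair can change quadrants in either direction as the sweep proceeds, so the STONE points need not advance monotonically; a sufficiently large local bias or spread shows up as a bulge or wiggle, which is the nonmonotonic signature described in the abstract.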
Award ID(s): 1663770
PAR ID: 10375347
Author(s) / Creator(s): ; ;
Publisher / Repository: DOI PREFIX: 10.1029
Date Published:
Journal Name: Space Weather
Volume: 20
Issue: 6
ISSN: 1542-7390
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this
  1. Many developers of biometric systems start with modest samples before general deployment. However, they are interested in how their systems will work with much larger samples. To assist them, we evaluated the effect of gallery size on biometric performance. Identification rates describe the performance of biometric identification, whereas ROC-based measures describe the performance of biometric authentication (verification). Therefore, we examined how increases in gallery size affected identification rates (i.e., Rank-1 Identification Rate, or Rank-1 IR) and ROC-based measures such as equal error rate (EER). We studied these phenomena with synthetic data as well as real data from a face recognition study. It is well known that the Rank-1 IR declines with increasing gallery size, and that the relationship is linear against log(gallery size). We have confirmed this with synthetic and real data. We have shown that this decline can be counteracted with the inclusion of additional information (features) for larger gallery sizes. We have also described the curves which can be used to predict how much additional information would be required to stabilize the Rank-1 IR as a function of gallery size. These equations are also linear in log(gallery size). We have also shown that the entire ROC-curve was not systematically affected by gallery size, and so ROC-based scalar performance metrics such as EER are also stable across gallery size. Unsurprisingly, as additional uncorrelated features are added to the model, EER decreases. We were interested in determining the impact of adding more features on the median, spread and shape of similarity score distributions. We present evidence that these decreases in EER are driven primarily by decreases in the spread of the impostor similarity score distribution. 
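As a hedged illustration of the quantities discussed in this abstract (not the study's face-recognition data or matcher; the Gaussian template model, feature dimension, and sample sizes below are assumptions), the sketch estimates the Rank-1 identification rate as the gallery grows and the equal error rate from genuine and impostor score distributions.

```python
import numpy as np

rng = np.random.default_rng(1)

def rank1_ir(n_gallery, n_probes=2000, d=16):
    """Rank-1 identification rate for a toy Gaussian template model: each
    identity is a point in d dimensions, and a probe is assigned to the
    nearest gallery template."""
    gallery = rng.normal(size=(n_gallery, d))
    true_id = rng.integers(0, n_gallery, n_probes)
    probes = gallery[true_id] + rng.normal(scale=1.0, size=(n_probes, d))
    # squared distances via ||p - g||^2 = ||p||^2 - 2 p.g + ||g||^2
    d2 = ((probes**2).sum(1)[:, None] - 2 * probes @ gallery.T
          + (gallery**2).sum(1)[None, :])
    return np.mean(np.argmin(d2, axis=1) == true_id)

def eer(genuine, impostor):
    """Equal error rate: point where false accept and false reject rates cross."""
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    far = np.array([(impostor >= t).mean() for t in thresholds])
    frr = np.array([(genuine < t).mean() for t in thresholds])
    i = np.argmin(np.abs(far - frr))
    return (far[i] + frr[i]) / 2

# Rank-1 IR falls as the gallery grows (roughly linearly in log gallery size here).
for n in (100, 1000, 5000):
    print(f"gallery={n:5d}  rank-1 IR={rank1_ir(n):.3f}")

# EER depends only on the genuine/impostor score distributions, not on gallery size.
genuine = rng.normal(2.0, 1.0, 5000)
impostor = rng.normal(0.0, 1.0, 5000)
print(f"EER={eer(genuine, impostor):.3f}")
```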
  2. The optimal receiver operating characteristic (ROC) curve, giving the maximum probability of detection as a function of the probability of false alarm, is a key information-theoretic indicator of the difficulty of a binary hypothesis testing problem (BHT). It is well known that the optimal ROC curve for a given BHT, corresponding to the likelihood ratio test, is theoretically determined by the probability distribution of the observed data under each of the two hypotheses. In some cases, these two distributions may be unknown or computationally intractable, but independent samples of the likelihood ratio can be observed. This raises the problem of estimating the optimal ROC for a BHT from such samples. The maximum likelihood estimator of the optimal ROC curve is derived, and it is shown to converge to the true optimal ROC curve in the Lévy metric, as the number of observations tends to infinity. A classical empirical estimator, based on estimating the two types of error probabilities from two separate sets of samples, is also considered. The maximum likelihood estimator is observed in simulation experiments to be considerably more accurate than the empirical estimator, especially when the number of samples obtained under one of the two hypotheses is small. The area under the maximum likelihood estimator is derived; it is a consistent estimator of the true area under the optimal ROC curve.
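The paper's maximum likelihood estimator is not reproduced here; the sketch below shows only the classical empirical baseline it is compared against, for a Gaussian-shift hypothesis test in which the likelihood ratio test reduces to thresholding the observation (the distributions, sample sizes, and threshold grid are illustrative assumptions), together with the closed-form optimal ROC for reference.

```python
import numpy as np
from scipy.stats import norm

# Gaussian-shift BHT: H0 ~ N(0,1), H1 ~ N(mu,1).  The likelihood ratio is
# monotone in x, so thresholding x is equivalent to the likelihood ratio test.
rng = np.random.default_rng(2)
mu = 1.5
x0 = rng.normal(0.0, 1.0, 500)   # samples observed under H0
x1 = rng.normal(mu, 1.0, 500)    # samples observed under H1

def empirical_roc(x0, x1, thresholds):
    """Classical empirical estimator: false-alarm and detection probabilities
    estimated by counting threshold exceedances in two separate sample sets."""
    pfa = np.array([(x0 >= t).mean() for t in thresholds])
    pdet = np.array([(x1 >= t).mean() for t in thresholds])
    return pfa, pdet

ts = np.linspace(-4.0, 6.0, 401)
pfa_hat, pdet_hat = empirical_roc(x0, x1, ts)

# Exact optimal ROC for this problem, for reference:
# P_D = Q(Q^{-1}(P_FA) - mu), with Q the standard normal tail function.
pfa_grid = np.linspace(1e-4, 1 - 1e-4, 200)
pdet_true = norm.sf(norm.isf(pfa_grid) - mu)
```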
  3. Self-attention transformers have demonstrated accuracy for image classification with smaller data sets. However, a limitation is that tests to date are based upon single-class image detection with known representation of image populations. For instances where the input image classes may be greater than one and test sets lack full information on the representation of image populations, accuracy calculations must adapt. The Receiver Operating Characteristic (ROC) accuracy threshold can address the instances of multi-class input images. However, this approach is unsuitable in instances where image population representation is unknown. We consider calculating accuracy using the knee method to determine threshold values on an ad-hoc basis. Results of ROC curve and knee thresholds for a multi-class data set, created from CIFAR-10 images, are discussed for multi-class image detection.
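One common way to formalize a "knee" is the point on the ROC curve farthest from the chord joining its endpoints; the paper may use a different variant, and the one-vs-rest scores below are synthetic stand-ins rather than CIFAR-10 results. The sketch picks an ad-hoc threshold at that knee.

```python
import numpy as np

def knee_index(fpr, tpr):
    """Index of the ROC point farthest (perpendicular distance) from the
    chord joining the curve's first and last points."""
    p0 = np.array([fpr[0], tpr[0]])
    chord = np.array([fpr[-1], tpr[-1]]) - p0
    chord = chord / np.linalg.norm(chord)
    pts = np.stack([fpr, tpr], axis=1) - p0
    dist = np.abs(pts[:, 0] * chord[1] - pts[:, 1] * chord[0])
    return int(np.argmax(dist))

# Synthetic one-vs-rest scores for a single class: positives score higher on average.
rng = np.random.default_rng(3)
scores = np.concatenate([rng.normal(1.0, 1.0, 300), rng.normal(-1.0, 1.0, 2700)])
labels = np.concatenate([np.ones(300), np.zeros(2700)])

ts = np.sort(scores)[::-1]
tpr = np.array([(scores[labels == 1] >= t).mean() for t in ts])
fpr = np.array([(scores[labels == 0] >= t).mean() for t in ts])

k = knee_index(fpr, tpr)
print(f"knee threshold={ts[k]:.3f}  (FPR={fpr[k]:.3f}, TPR={tpr[k]:.3f})")
```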
  4. This paper presents a multiscale modeling framework (MMF) to model moist atmospheric limited-area weather. The MMF resolves large-scale convection using a coarse grid while simultaneously resolving local features through numerous fine local grids and coupling them seamlessly. Both large- and small-scale processes are modeled using the compressible Navier-Stokes equations within the Nonhydrostatic Unified Model of the Atmosphere (NUMA), and they are discretized using a continuous element-based Galerkin method (spectral elements) with high-order basis functions. Consequently, the large-scale and small-scale models share the same dynamical core but have the flexibility to be adjusted individually. The proposed MMF method is tested in 2D and 3D idealized limited-area weather problems involving storm clouds produced by squall line and supercell simulations. The MMF numerical results showed enhanced representation of cloud processes compared to the coarse model.
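As a loose analogy only (NUMA's compressible Navier-Stokes equations and spectral-element discretization are far beyond a short sketch), the toy 1D upwind advection code below illustrates the coupling pattern described above: a coarse domain-wide grid supplies boundary values to a refined local patch, and the patch feeds its solution back over the overlap region. Grid sizes, the overlap interval, and the feedback rule are all assumptions made for illustration.

```python
import numpy as np

# Toy stand-in for the coarse/fine coupling idea: 1D linear advection
# (u_t + c u_x = 0) with first-order upwind differences.  This is NOT NUMA's
# spectral-element Navier-Stokes core; it only illustrates a coarse domain-wide
# grid feeding boundary values to a refined local patch and receiving feedback.
c, T = 1.0, 0.3

# Coarse grid over the whole limited-area domain [0, 1].
nx_c = 64
x_c = np.linspace(0.0, 1.0, nx_c)
dx_c = x_c[1] - x_c[0]

# Fine local grid covering a small region of interest.
x0f, x1f, nx_f = 0.4, 0.6, 128
x_f = np.linspace(x0f, x1f, nx_f)
dx_f = x_f[1] - x_f[0]

u_c = np.exp(-200.0 * (x_c - 0.2) ** 2)   # initial pulse on the coarse grid
u_f = np.interp(x_f, x_c, u_c)            # fine grid initialized from the coarse one

dt = 0.4 * dx_f / c                       # one shared time step (fine-grid CFL)

def upwind(u, dx, inflow):
    """One upwind step of u_t + c u_x = 0 with a prescribed inflow value."""
    un = u.copy()
    un[1:] = u[1:] - c * dt / dx * (u[1:] - u[:-1])
    un[0] = inflow
    return un

for _ in range(int(T / dt)):
    u_c = upwind(u_c, dx_c, inflow=0.0)          # large-scale step on the coarse grid
    bc = np.interp(x0f, x_c, u_c)                # coarse -> fine boundary value
    u_f = upwind(u_f, dx_f, inflow=bc)           # local step on the fine grid
    overlap = (x_c >= x0f) & (x_c <= x1f)        # fine -> coarse feedback (two-way coupling)
    u_c[overlap] = np.interp(x_c[overlap], x_f, u_f)
```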
  5. Research has produced many types of authentication systems that use machine learning. However, there is no consistent approach for reporting performance metrics and the reported metrics are inadequate. In this work, we show that several of the common metrics used for reporting performance, such as maximum accuracy (ACC), equal error rate (EER) and area under the ROC curve (AUROC), are inherently flawed. These common metrics hide the details of the inherent trade-offs a system must make when implemented. Our findings show that current metrics give no insight into how system performance degrades outside the ideal conditions in which they were designed. We argue that adequate performance reporting must be provided to enable meaningful evaluation and that current, commonly used approaches fail in this regard. We present the unnormalized frequency count of scores (FCS) to demonstrate the mathematical underpinnings that lead to these failures and show how they can be avoided. The FCS can be used to augment the performance reporting to enable comparison across systems in a visual way. When reported with the Receiver Operating Characteristic (ROC) curve, these two metrics provide a solution to the limitations of currently reported metrics. Finally, we show how to use the FCS and ROC metrics to evaluate and compare different authentication systems.
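A hedged sketch of the kind of reporting this abstract argues for (the score distributions, sample sizes, and bin count are assumptions, not the paper's data): raw, unnormalized frequency counts of genuine and impostor scores alongside an ROC swept from the same scores, with EER and AUROC computed only as the scalar summaries being critiqued.

```python
import numpy as np

rng = np.random.default_rng(4)
genuine = rng.normal(2.0, 1.0, 400)      # scores from legitimate attempts
impostor = rng.normal(0.0, 1.0, 8000)    # scores from impostor attempts (imbalance kept)

# Unnormalized frequency count of scores (FCS): raw histograms on shared bins,
# deliberately NOT normalized, so the class imbalance remains visible.
bins = np.linspace(min(genuine.min(), impostor.min()),
                   max(genuine.max(), impostor.max()), 61)
fcs_genuine, _ = np.histogram(genuine, bins=bins)
fcs_impostor, _ = np.histogram(impostor, bins=bins)

# ROC swept from the same scores.
ts = np.sort(np.concatenate([genuine, impostor]))[::-1]
tpr = np.array([(genuine >= t).mean() for t in ts])
fpr = np.array([(impostor >= t).mean() for t in ts])

# Scalar summaries the paper argues are insufficient on their own.
i = np.argmin(np.abs(fpr - (1.0 - tpr)))
eer = (fpr[i] + 1.0 - tpr[i]) / 2.0
auroc = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0)  # trapezoidal AUROC
print(f"EER~{eer:.3f}  AUROC~{auroc:.3f}")
```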