skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: More Accurate Learning of k-DNF Reference Classes
In machine learning, predictors trained on a given data distribution are usually guaranteed to perform well for further examples from the same distribution on average. This often may involve disregarding or diminishing the predictive power on atypical examples; or, in more extreme cases, a data distribution may be composed of a mixture of individually “atypical” heterogeneous populations, and the kind of simple predictors we can train may find it difficult to fit all of these populations simultaneously. In such cases, we may wish to make predictions for an atypical point by selecting a suitable reference class for that point: a subset of the data that is “more similar” to the given query point in an appropriate sense. Closely related tasks also arise in applications such as diagnosis or explaining the output of classifiers. We present new algorithms for computing k-DNF reference classes and establish much stronger approximation guarantees for their error rates.  more » « less
Award ID(s):
1718380
PAR ID:
10187835
Author(s) / Creator(s):
;
Date Published:
Journal Name:
Proceedings of the AAAI Conference on Artificial Intelligence
Volume:
34
Issue:
04
ISSN:
2159-5399
Page Range / eLocation ID:
4385 to 4393
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Given an algorithmic predictor that is "fair" on some source distribution, will it still be fair on an unknown target distribution that differs from the source within some bound? In this paper, we study the transferability of statistical group fairness for machine learning predictors (i.e., classifiers or regressors) subject to bounded distribution shifts. Such shifts may be introduced by initial training data uncertainties, user adaptation to a deployed predictor, dynamic environments, or the use of pre-trained models in new settings. Herein, we develop a bound that characterizes such transferability, flagging potentially inappropriate deployments of machine learning for socially consequential tasks. We first develop a framework for bounding violations of statistical fairness subject to distribution shift, formulating a generic upper bound for transferred fairness violations as our primary result. We then develop bounds for specific worked examples, focusing on two commonly used fairness definitions (i.e., demographic parity and equalized odds) and two classes of distribution shift (i.e., covariate shift and label shift). Finally, we compare our theoretical bounds to deterministic models of distribution shift and against real-world data, finding that we are able to estimate fairness violation bounds in practice, even when simplifying assumptions are only approximately satisfied. 
    more » « less
  2. How quickly can a given class of concepts be learned from examples? It is common to measure the performance of a supervised machine learning algorithm by plotting its “learning curve”, that is, the decay of the error rate as a function of the number of training examples. However, the classical theoretical framework for understanding learnability, the PAC model of Vapnik-Chervonenkis and Valiant, does not explain the behavior of learning curves: the distribution-free PAC model of learning can only bound the upper envelope of the learning curves over all possible data distributions. This does not match the practice of machine learning, where the data source is typically fixed in any given scenario, while the learner may choose the number of training examples on the basis of factors such as computational resources and desired accuracy. In this paper, we study an alternative learning model that better captures such practical aspects of machine learning, but still gives rise to a complete theory of the learnable in the spirit of the PAC model. More precisely, we consider the problem of universal learning, which aims to understand the performance of learning algorithms on every data distribution, but without requiring uniformity over the distribution. The main result of this paper is a remarkable trichotomy: there are only three possible rates of universal learning. More precisely, we show that the learning curves of any given concept class decay either at an exponential, linear, or arbitrarily slow rates. Moreover, each of these cases is completely characterized by appropriate combinatorial parameters, and we exhibit optimal learning algorithms that achieve the best possible rate in each case. For concreteness, we consider in this paper only the realizable case, though analogous results are expected to extend to more general learning scenarios. 
    more » « less
  3. null (Ed.)
    We introduce a novel methodology for anomaly detection in time-series data. The method uses persistence diagrams and bottleneck distances to identify anomalies. Specifically, we generate multiple predictors by randomly bagging the data (reference bags), then for each data point replacing the data point for a randomly chosen point in each bag (modified bags). The predictors then are the set of bottleneck distances for the reference/modified bag pairs. We prove the stability of the predictors as the number of bags increases. We apply our methodology to traffic data and measure the performance for identifying known incidents. 
    more » « less
  4. We introduce a method for approximating the signed distance function (SDF) of geometry corrupted by holes, noise, or self-intersections. The method implicitly defines a completed version of the shape, rather than explicitly repairing the given input. Our starting point is a modified version of theheat methodfor geodesic distance, which diffuses normal vectors rather than a scalar distribution. This formulation provides robustness akin togeneralized winding numbers (GWN), but provides distance function rather than just an inside/outside classification. Our formulation also offers several features not common to classic distance algorithms, such as the ability to simultaneously fit multiple level sets, a notion of distance for geometry that does not topologically bound any region, and the ability to mix and match signed and unsigned distance. The method can be applied in any dimension and to any spatial discretization, including triangle meshes, tet meshes, point clouds, polygonal meshes, voxelized surfaces, and regular grids. We evaluate the method on several challenging examples, implementing normal offsets and other morphological operations directly on imperfect curve and surface data. In many cases we also obtain an inside/outside classification dramatically more robust than the one obtained provided by GWN. 
    more » « less
  5. ABSTRACT Within the contiguous USA, Florida is unique in having tropical and subtropical climates, a great abundance and diversity of mosquito vectors, and high rates of human travel. These factors contribute to the state being the national ground zero for exotic mosquito-borne diseases, as evidenced by local transmission of viruses spread by Aedes aegypti, including outbreaks of dengue in 2022 and Zika in 2016. Because of limited treatment options, integrated vector management is a key part of mitigating these arboviruses. Practical knowledge of when and where mosquito populations of interest exist is critical for surveillance and control efforts, and habitat predictions at various geographic scales typically rely on ecological niche modeling. However, most of these models, usually created in partnership with academic institutions, demand resources that otherwise may be too time-demanding or difficult for mosquito control programs to replicate and use effectively. Such resources may include intensive computational requirements, high spatiotemporal resolutions of data not regularly available, and/or expert knowledge of statistical analysis. Therefore, our study aims to partner with mosquito control agencies in generating operationally useful mosquito abundance models. Given the increasing threat of mosquito-borne disease transmission in Florida, our analytic approach targets recent Ae. aegypti abundance in the Tampa Bay area. We investigate explanatory variables that: 1) are publicly available, 2) require little to no preprocessing for use, and 3) are known factors associated with Ae. aegypti ecology. Out of our 4 final models, none required more than 5 out of the 36 predictors assessed (13.9%). Similar to previous literature, the strongest predictors were consistently 3- and 4-wk temperature and precipitation lags, followed closely by 1 of 2 environmental predictors: land use/land cover or normalized difference vegetation index. Surprisingly, 3 of our 4 final models included one or more socioeconomic or demographic predictors. In general, larger sample sizes of trap collections and/or citizen science observations should result in greater confidence in model predictions and validation. However, given disparities in trap collections across jurisdictions, individual county models rather than a multicounty conglomerate model would likely yield stronger model fits. Ultimately, we hope that the results of our assessment will enable more accurate and precise mosquito surveillance and control of Ae. aegypti in Florida and beyond. 
    more » « less