skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Encoding of data sets and algorithms
In many high-impact applications, it is important to ensure the quality of the output of a machine learning algorithm as well as its reliability in comparison to the complexity of the algorithm used. In this paper, we have initiated a mathematically rigorous theory to decide which models (algorithms applied on data sets) are close to each other in terms of certain metrics, such as performance and the complexity level of the algorithm. This involves creating a grid on the hypothetical spaces of data sets and algorithms so as to identify a finite set of probability distributions from which the data sets are sampled and a finite set of algorithms. A given threshold metric acting on this grid will express the nearness (or statistical distance) of each algorithm and data set of interest to any given application. A technically difficult part of this project is to estimate the so-called metric entropy of a compact subset of functions of \textbf{infinitely many variables} that arise in the definition of these spaces.  more » « less
Award ID(s):
2012355
PAR ID:
10547035
Author(s) / Creator(s):
; ;
Publisher / Repository:
Elsevier
Date Published:
Journal Name:
Applied Numerical Mathematics
Volume:
200
Issue:
C
ISSN:
0168-9274
Page Range / eLocation ID:
209 to 235
Subject(s) / Keyword(s):
Metric entropy Covering number Analytic functions Entire functions Entropy of class of functionals
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract Graphs in metric spaces appear in a wide range of data sets, and there is a large body of work focused on comparing, matching, or analyzing collections of graphs in different ambient spaces. In this survey, we provide an overview of a diverse collection of distance measures that can be defined on the set of finite graphs immersed (and in some cases, embedded) in a metric space. For each of the distance measures, we recall their definitions and investigate which of the properties of a metric they satisfy. Furthermore we compare the distance measures based on these properties and discuss their computational complexity. 
    more » « less
  2. Given a family of sets (S1, S2,... SM) over a universe Ω, estimating the size of their union in the data streaming model is a fundamental computational problem with a wide variety of applications. The holy grail in the field of streaming is to seek design of algorithms that achieve (ε, δ)-approximation with poly(log |Ω|, ε-1, log δ-1) space and update time complexity. Earlier investigations achieve algorithms with desired space and update time complexity for restricted cases such as singletons (Distinct Elements problem), one-dimensional ranges, arithmetic progressions, and sub-cubes. However, techniques used in these works fail for many other simple structured sets. A prominent example is that of Klee's Measure Problem (KMP), wherein every set Si is represented by an axis-parallel rectangle in d-dimensional spaces. Despite extensive prior work, the best-known streaming algorithms for many of these cases depend on the size of the stream, and therefore the problem of whether there exists a streaming algorithm for estimations of size of the union of sets with poly(log |Ω|, ε-1, log δ-1) space and update time complexity has remained open. In this work, we focus on certain general families of sets called Delphic families (which allows efficient membership, sampling, and cardinality queries). Such families of sets capture several well-known problems, including KMP, test coverage, and hypervolume estimation. The primary contribution of our work is to resolve the above-mentioned open problem for streams over Delphic families. In particular, we design the first streaming algorithm for estimating |⋃i=1M Si| with poly(log |Ω|, ε-1, log δ-1) space and update time complexity (independent of M, the length of the stream) when each Si is a member from a Delphic family of sets. We further generalize our results to larger families of sets, called approximate-Delphic families, for which the size of a set can be known approximately but not exactly. Our results resolve two of the open problems listed in Meel, Vinodchandran, Chakraborty (PODS-21). 
    more » « less
  3. null (Ed.)
    SUMMARY We discuss the focusing inversion of potential field data for the recovery of sparse subsurface structures from surface measurement data on a uniform grid. For the uniform grid, the model sensitivity matrices have a block Toeplitz Toeplitz block structure for each block of columns related to a fixed depth layer of the subsurface. Then, all forward operations with the sensitivity matrix, or its transpose, are performed using the 2-D fast Fourier transform. Simulations are provided to show that the implementation of the focusing inversion algorithm using the fast Fourier transform is efficient, and that the algorithm can be realized on standard desktop computers with sufficient memory for storage of volumes up to size n ≈ 106. The linear systems of equations arising in the focusing inversion algorithm are solved using either Golub–Kahan bidiagonalization or randomized singular value decomposition algorithms. These two algorithms are contrasted for their efficiency when used to solve large-scale problems with respect to the sizes of the projected subspaces adopted for the solutions of the linear systems. The results confirm earlier studies that the randomized algorithms are to be preferred for the inversion of gravity data, and for data sets of size m it is sufficient to use projected spaces of size approximately m/8. For the inversion of magnetic data sets, we show that it is more efficient to use the Golub–Kahan bidiagonalization, and that it is again sufficient to use projected spaces of size approximately m/8. Simulations support the presented conclusions and are verified for the inversion of a magnetic data set obtained over the Wuskwatim Lake region in Manitoba, Canada. 
    more » « less
  4. null (Ed.)
    We design a Local Computation Algorithm (LCA) for the set cover problem. Given a set system where each set has size at most s and each element is contained in at most t sets, the algorithm reports whether a given set is in some fixed set cover whose expected size is O(log s) times the minimum fractional set cover value. Our algorithm requires sO(log s) tO(log s+log log t)) queries. This result improves upon the application of the reduction of [Parnas and Ron, TCS’07] on the result of [Kuhn et al., SODA’06], which leads to a query complexity of (st) O(log s · log t). To obtain this result, we design a parallel set cover algorithm that admits an efficient simulation in the LCA model by using a sparsification technique introduced in [Ghaffari and Uitto, SODA’19] for the maximal independent set problem. The parallel algorithm adds a random subset of the sets to the solution in a style similar to the PRAM algorithm of [Berger et al., FOCS’89]. However, our algorithm differs in the way that it never revokes its decisions, which results in a fewer number of adaptive rounds. This requires a novel approximation analysis which might be of independent interest. 
    more » « less
  5. We develop fast distribution-free conformal prediction algorithms for obtaining multivalid coverage on exchangeable data in the batch setting. Multivalid coverage guarantees are stronger than marginal coverage guarantees in two ways: (1) They hold even conditional on group membership---that is, the target coverage level holds conditionally on membership in each of an arbitrary (potentially intersecting) group in a finite collection of regions in the feature space. (2) They hold even conditional on the value of the threshold used to produce the prediction set on a given example. In fact multivalid coverage guarantees hold even when conditioning on group membership and threshold value simultaneously. We give two algorithms: both take as input an arbitrary non-conformity score and an arbitrary collection of possibly intersecting groups , and then can equip arbitrary black-box predictors with prediction sets. Our first algorithm is a direct extension of quantile regression, needs to solve only a single convex minimization problem, and produces an estimator which has group-conditional guarantees for each group in . Our second algorithm is iterative, and gives the full guarantees of multivalid conformal prediction: prediction sets that are valid conditionally both on group membership and non-conformity threshold. We evaluate the performance of both of our algorithms in an extensive set of experiments. 
    more » « less