â€ś[A]llain Gersten, Hopfen, und Wasserâ€ť â€” 1516 Reinheitsgebot We present Bavarian , a collection of samplingbased algorithms for approximating the Betweenness Centrality (BC) of all vertices in a graph. Our algorithms use MonteCarlo Empirical Rademacher Averages (MCERAs), a concept from statistical learning theory, to efficiently compute tight bounds on the maximum deviation of the estimates from the exact values. The MCERAs provide a sampledependent approximation guarantee much stronger than the stateoftheart, thanks to its use of varianceaware probabilistic tail bounds. The flexibility of the MCERAs allows us to introduce a unifying framework that can be instantiated with existing samplingbased estimators of BC, thus allowing a fair comparison between them, decoupled from the samplecomplexity results with which they were originally introduced. Additionally, we prove novel samplecomplexity results showing that, for all estimators, the sample size sufficient to achieve a desired approximation guarantee depends on the vertexdiameter of the graph, an easytobound characteristic quantity. We also show progressivesampling algorithms and extensions to other centrality measures, such as percolation centrality. Our extensive experimental evaluation of Bavarian shows the improvement over the stateoftheart made possible by the MCERAs (2â€“4Ă— reduction in the error bound), and it allows us to assess the different tradeoffs between sample size and accuracy guarantees offered by the different estimators.
more »
« less
This content will become publicly available on July 3, 2024
Sample Complexity of Probability Divergences under Group Symmetry
We rigorously quantify the improvement in the sample complexity of variational divergence estimations for groupinvariant distributions. In the cases of the Wasserstein1 metric and the Lipschitzregularized đť›Ľ
divergences, the reduction of sample complexity is proportional to an ambientdimensiondependent power of the group size. For the maximum mean discrepancy (MMD), the improvement of sample complexity is more nuanced, as it depends on not only the group size but also the choice of kernel. Numerical simulations verify our theories.
more »
« less
 Award ID(s):
 2008970
 NSFPAR ID:
 10486979
 Publisher / Repository:
 Proceedings of Machine Learning Research, ML Research press
 Date Published:
 Journal Name:
 Proceedings of Machine Learning Research
 ISSN:
 26403498
 Format(s):
 Medium: X
 Sponsoring Org:
 National Science Foundation
More Like this


null (Ed.)This paper revisits the kmismatch shortest unique substring finding problem and demonstrates that a technique recently presented in the context of solving the kmismatch average common substring problem can be adapted and combined with parts of the existing solution, resulting in a new algorithm which has expected time complexity of O(n log^k n), while maintaining a practical space complexity at O(kn), where n is the string length. When , which is the hard case, our new proposal significantly improves the anycase O(n^2) time complexity of the prior best method for kmismatch shortest unique substring finding. Experimental study shows that our new algorithm is practical to implement and demonstrates significant improvements in processing time compared to the prior best solution's implementation when k is small relative to n. For example, our method processes a 200 KB sample DNA sequence with k=1 in just 0.18 seconds compared to 174.37 seconds with the prior best solution. Further, it is observed that significant portions of the adapted technique can be executed in parallel, using two different simple concurrency models, resulting in further significant practical performance improvement. As an example, when using 8 cores, the parallel implementations both achieved processing times that are less than 1/4 of the serial implementation's time cost, when processing a 10 MB sample DNA sequence with k=2. In an age where instances with thousands of gigabytes of RAM are readily available for use through Cloud infrastructure providers, it is likely that the tradeoff of additional memory usage for significantly improved processing times will be desirable and needed by many users. For example, the best prior solution may spend years to finish a DNA sample of 200MB for any , while this new proposal, using 24 cores, can finish processing a sample of this size with k=1 in 206.376 seconds with a peak memory usage of 46 GB, which is both easily available and affordable on Cloud. It is expected that this new efficient and practical algorithm for kmismatch shortest unique substring finding will prove useful to those using the measure on long sequences in fields such as computational biology. We also give a theoretical bound that the kmismatch shortest unique substring finding problem can be solved using O(n log^k n) time and O(n) space, asymptotically much better than the one we implemented, serving as a new discovery of interest.more » « less

Many supervised learning problems involve highdimensional data such as images, text, or graphs. In order to make efficient use of data, it is often useful to leverage certain geometric priors in the problem at hand, such as invariance to translations, permutation subgroups, or stability to small deformations. We study the sample complexity of learning problems where the target function presents such invariance and stability properties, by considering spherical harmonic decompositions of such functions on the sphere. We provide nonparametric rates of convergence for kernel methods, and show improvements in sample complexity by a factor equal to the size of the group when using an invariant kernel over the group, compared to the corresponding noninvariant kernel. These improvements are valid when the sample size is large enough, with an asymptotic behavior that depends on spectral properties of the group. Finally, these gains are extended beyond invariance groups to also cover geometric stability to small deformations, modeled here as subsets (not necessarily subgroups) of permutations.more » « less

null (Ed.)Many supervised learning problems involve highdimensional data such as images, text, or graphs. In order to make efficient use of data, it is often useful to leverage certain geometric priors in the problem at hand, such as invariance to translations, permutation subgroups, or stability to small deformations. We study the sample complexity of learning problems where the target function presents such invariance and stability properties, by considering spherical harmonic decompositions of such functions on the sphere. We provide nonparametric rates of convergence for kernel methods, and show improvements in sample complexity by a factor equal to the size of the group when using an invariant kernel over the group, compared to the corresponding noninvariant kernel. These improvements are valid when the sample size is large enough, with an asymptotic behavior that depends on spectral properties of the group. Finally, these gains are extended beyond invariance groups to also cover geometric stability to small deformations, modeled here as subsets (not necessarily subgroups) of permutations.more » « less

Inferencebased optimization via simulation, which substitutes Gaussian process (GP) learning for the structural properties exploited in mathematical programming, is a powerful paradigm that has been shown to be remarkably effective in problems of modest feasibleregion size and decisionvariable dimension. The limitation to â€śmodestâ€ť problems is a result of the computational overhead and numerical challenges encountered in computing the GP conditional (posterior) distribution on each iteration. In this paper, we substantially expand the size of discretedecisionvariable optimizationviasimulation problems that can be attacked in this way by exploiting a particular GPâ€”discrete Gaussian Markov random fieldsâ€”and carefully tailored computational methods. The result is the rapid Gaussian Markov Improvement Algorithm (rGMIA), an algorithm that delivers both a global convergence guarantee and finitesample optimalitygap inference for significantly larger problems. Between infrequent evaluations of the global conditional distribution, rGMIA applies the full power of GP learning to rapidly search smaller sets of promising feasible solutions that need not be spatially close. We carefully document the computational savings via complexity analysis and an extensive empirical study. Summary of Contribution: The broad topic of the paper is optimization via simulation, which means optimizing some performance measure of a system that may only be estimated by executing a stochastic, discreteevent simulation. Stochastic simulation is a core topic and method of operations research. The focus of this paper is on significantly speedingup the computations underlying an existing method that is based on Gaussian process learning, where the underlying Gaussian process is a discrete Gaussian Markov Random Field. This speedup is accomplished by employing smart computational linear algebra, stateoftheart algorithms, and a careful divideandconquer evaluation strategy. Problems of significantly greater size than any other existing algorithm with similar guarantees can solve are solved as illustrations.more » « less