Title: On Efficient Range-Summability of IID Random Variables in Two or Higher Dimensions
d-dimensional (for d > 1) efficient range-summability (dD-ERS) of random variables (RVs) is a fundamental algorithmic problem that has applications to two important families of database problems, namely, fast approximate wavelet tracking (FAWT) on data streams and approximately answering range-sum queries over a data cube. Whether there are efficient solutions to the dD-ERS problem, or to the latter database problem, have been two long-standing open problems. Both are solved in this work. Specifically, we propose a novel solution framework to dD-ERS on RVs that have Gaussian or Poisson distribution. Our dD-ERS solutions are the first ones that have polylogarithmic time complexities. Furthermore, we develop a novel k-wise independence theory that allows our dD-ERS solutions to have both high computational efficiencies and strong provable independence guarantees. Finally, we show that under a sufficient and likely necessary condition, certain existing solutions for 1D-ERS can be generalized to higher dimensions. more »« less
Jingfan Meng, Huayi Wang
(, 25th International Conference on Database Theory (ICDT 2022))
Dan Olteanu and Nils Vortmeier
(Ed.)
Efficient range-summability (ERS) of a long list of random variables is a fundamental algorithmic problem that has applications to three important database applications, namely, data stream processing, space-efficient histogram maintenance (SEHM), and approximate nearest neighbor searches (ANNS). In this work, we propose a novel dyadic simulation framework and develop three novel ERS solutions, namely Gaussian-dyadic simulation tree (DST), Cauchy-DST and Random Walk-DST, using it. We also propose novel rejection sampling techniques to make these solutions computationally efficient. Furthermore, we develop a novel k-wise independence theory that allows our ERS solutions to have both high computational efficiencies and strong provable independence guarantees.
Meng, Jingfan; Wang, Huayi; Xu, Jun; Ogihara, Mitsunori
(, Proceedings of the VLDB Endowment)
In this work, we formulate and solve a new type of approximate nearest neighbor search (ANNS) problems called ANNS after linear transformation (ALT). In ANNS-ALT, we search for the vector (in a dataset) that, after being linearly transformed by a user-specified query matrix, is closest to a query vector. It is a very general mother problem in the sense that a wide range of baby ANNS problems that have important applications in databases and machine learning can be reduced to and solved as ANNS-ALT, or its dual that we call ANNS-ALTD. We propose a novel and computationally efficient solution, called ONe Index for All Kernels (ONIAK), to ANNS-ALT and all its baby problems when the data dimension d is not too large (say d ≤ 200). In ONIAK, a universal index is built, once and for all, for answering all future ANNS-ALT queries that can have distinct query matrices. We show by experiments that, when d is not too large, ONIAK has better query performance than linear scan on the mother problem (of ANNS-ALT), and has query performances comparable to those of the state-of-the-art solutions on the baby problems. However, the algorithmic technique behind this universal index approach suffers from a so-called dimension blowup problem that can make the indexing time prohibitively long for a large dataset. We propose a novel algorithmic technique, called fast GOE quadratic form (FGoeQF), that completely solves the (prohibitively long indexing time) fallout of the dimension blowup problem. We also propose a Johnson-Lindenstrauss transform (JLT) based ANNS-ALT (and ANNS-ALTD) solution that significantly outperforms any competitor when d is large.
Lee, Dongjin; Rahman, Sharif
(, International Journal for Uncertainty Quantification)
Newly restructured generalized polynomial chaos expansion (GPCE) methods for high-dimensional design optimization in the presence of input random variables with arbitrary, dependent probability distributions are reported. The methods feature a dimensionally decomposed GPCE (DD-GPCE) for statistical moment and reliability analyses associated with a high-dimensional stochastic response; a novel synthesis between the DD-GPCE approximation and score functions for estimating the first-order design sensitivities of the statistical moments and failure probability; and a standard gradient-based optimization algorithm, constructing the single-step DD-GPCE and multipoint single-step DD-GPCE (MPSS-DD-GPCE) methods. In these new design methods, the multivariate orthonormal basis functions are assembled consistent with the chosen degree of interaction between input variables and the polynomial order, thus facilitating to deflate the curse of dimensionality to the extent possible. In addition, when coupled with score functions, the DD-GPCE approximation leads to analytical formulae for calculating the design sensitivities. More importantly, the statistical moments, failure probability, and their design sensitivities are determined concurrently from a single stochastic analysis or simulation. Numerical results affirm that the proposed methods yield accurate and computationally efficient optimal solutions of mathematical problems and design solutions for simple mechanical systems. Finally, the success in conducting stochastic shape optimization of a bogie side frame with 41 random variables demonstrates the power of the MPSS-DD-GPCE method in solving industrial-scale engineering design problems.
Esmailpour, Aryan; Glavic, Boris; Hu, Xiao; Sintos, Stavros
(, Proceedings of the ACM on Management of Data)
Given a self-join-free conjunctive queryQand a set of tuplesS, asynthetic witness Dis a database instance such that the result ofQonDisS. In this work, we are interested in two problems. First, the existence problem ESW decides whether any synthetic witnessDexists. Second, given that a synthetic witness exists, the minimization problem SSW computes a synthetic witness of minimal size. The SSW problem is related to thesmallest witness problemrecently studied by Hu and Sintos [22]; however, the objective and the results are inherently different. More specifically, we show that SSW is poly-time solvable for a wider range of queries. Interestingly, in some cases, SSW is related to optimization problems in other domains, such as therole miningproblem in data mining and theedge concentrationproblem in graph drawing. Solutions to ESW and SSW are of practical interest, e.g., fortest database generationfor applications accessing a database and fordata compressionby encoding a datasetSas a pair of a queryQand databaseD. We prove that ESW is in P, presenting a simple algorithm that, given anyS, decides whether a synthetic witness exists in polynomial time in the size ofS. Next, we focus on the SSW problem. We show an algorithm that computes a minimal synthetic witness in polynomial time with respect to the size ofSfor any queryQthat has thehead-dominationproperty. IfQdoes not have such a property, then SSW is generally hard. More specifically, we show that for the class ofpath queries(of any constant length), SSW cannot be solved in polynomial time unless P = NP. We then extend this hardness result to the class ofBerge-acyclicqueries that do not have the head-domination property, obtaining a full dichotomy of SSW for Berge-acyclic queries. Finally, we investigate the hardness of SSW beyond Berge-acyclic queries by showing that SSW cannot be solved in polynomial time for some cyclic queries unless P = NP.
Under the linear regression framework, we study the variable selection problem when the underlying model is assumed to have a small number of nonzero coefficients. Non-convex penalties in speci c forms are well-studied in the literature for sparse estimation. A recent work, Ahn, Pang, and Xin (2017), has pointed out that nearly all existing non-convex penalties can be represented as difference-of-convex (DC) functions, which are the difference of two convex functions, while itself may not be convex. There is a large existing literature on optimization problems when their objectives and/or constraints involve DC functions. Efficient numerical solutions have been proposed. Under the DC framework, directional-stationary (d-stationary) solutions are considered, and they are usually not unique. In this paper, we show that under some mild conditions, a certain subset of d-stationary solutions in an optimization problem (with a DC objective) has some ideal statistical properties: namely, asymptotic estimation consistency, asymptotic model selection consistency, asymptotic efficiency. Our assumptions are either weaker than or comparable with those conditions that have been adopted in other existing works. This work shows that DC is a nice framework to offer a uni ed approach to these existing works where non-convex penalties are involved. Our work bridges the communities of optimization and statistics.
Jingfan Meng, Huayi Wang. On Efficient Range-Summability of IID Random Variables in Two or Higher Dimensions. Retrieved from https://par.nsf.gov/biblio/10378567. 26th International Conference on Database Theory (ICDT 2023) .
Jingfan Meng, Huayi Wang. On Efficient Range-Summability of IID Random Variables in Two or Higher Dimensions. 26th International Conference on Database Theory (ICDT 2023), (). Retrieved from https://par.nsf.gov/biblio/10378567.
Jingfan Meng, Huayi Wang.
"On Efficient Range-Summability of IID Random Variables in Two or Higher Dimensions". 26th International Conference on Database Theory (ICDT 2023) (). Country unknown/Code not available. https://par.nsf.gov/biblio/10378567.
@article{osti_10378567,
place = {Country unknown/Code not available},
title = {On Efficient Range-Summability of IID Random Variables in Two or Higher Dimensions},
url = {https://par.nsf.gov/biblio/10378567},
abstractNote = {d-dimensional (for d > 1) efficient range-summability (dD-ERS) of random variables (RVs) is a fundamental algorithmic problem that has applications to two important families of database problems, namely, fast approximate wavelet tracking (FAWT) on data streams and approximately answering range-sum queries over a data cube. Whether there are efficient solutions to the dD-ERS problem, or to the latter database problem, have been two long-standing open problems. Both are solved in this work. Specifically, we propose a novel solution framework to dD-ERS on RVs that have Gaussian or Poisson distribution. Our dD-ERS solutions are the first ones that have polylogarithmic time complexities. Furthermore, we develop a novel k-wise independence theory that allows our dD-ERS solutions to have both high computational efficiencies and strong provable independence guarantees. Finally, we show that under a sufficient and likely necessary condition, certain existing solutions for 1D-ERS can be generalized to higher dimensions.},
journal = {26th International Conference on Database Theory (ICDT 2023)},
author = {Jingfan Meng, Huayi Wang},
}
Warning: Leaving National Science Foundation Website
You are now leaving the National Science Foundation website to go to a non-government website.
Website:
NSF takes no responsibility for and exercises no control over the views expressed or the accuracy of
the information contained on this site. Also be aware that NSF's privacy policy does not apply to this site.