skip to main content


Title: ONe Index for All Kernels (ONIAK): A Zero Re-Indexing LSH Solution to ANNS-ALT (After Linear Transformation)
In this work, we formulate and solve a new type of approximate nearest neighbor search (ANNS) problems called ANNS after linear transformation (ALT). In ANNS-ALT, we search for the vector (in a dataset) that, after being linearly transformed by a user-specified query matrix, is closest to a query vector. It is a very general mother problem in the sense that a wide range of baby ANNS problems that have important applications in databases and machine learning can be reduced to and solved as ANNS-ALT, or its dual that we call ANNS-ALTD. We propose a novel and computationally efficient solution, called ONe Index for All Kernels (ONIAK), to ANNS-ALT and all its baby problems when the data dimension d is not too large (say d ≤ 200). In ONIAK, a universal index is built, once and for all, for answering all future ANNS-ALT queries that can have distinct query matrices. We show by experiments that, when d is not too large, ONIAK has better query performance than linear scan on the mother problem (of ANNS-ALT), and has query performances comparable to those of the state-of-the-art solutions on the baby problems. However, the algorithmic technique behind this universal index approach suffers from a so-called dimension blowup problem that can make the indexing time prohibitively long for a large dataset. We propose a novel algorithmic technique, called fast GOE quadratic form (FGoeQF), that completely solves the (prohibitively long indexing time) fallout of the dimension blowup problem. We also propose a Johnson-Lindenstrauss transform (JLT) based ANNS-ALT (and ANNS-ALTD) solution that significantly outperforms any competitor when d is large.  more » « less
Award ID(s):
2051800
NSF-PAR ID:
10400217
Author(s) / Creator(s):
; ; ;
Date Published:
Journal Name:
Proceedings of the VLDB Endowment
Volume:
15
Issue:
13
ISSN:
2150-8097
Page Range / eLocation ID:
3937 to 3949
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. In this work, we formulate and solve a new type of approximate nearest neighbor search (ANNS) problems called ANNS after linear transformation (ALT). In ANNS-ALT, we search for the vector (in a dataset) that, after being linearly transformed by a user-specified query matrix, is closest to a query vector. It is a very general mother problem in the sense that a wide range of baby ANNS problems that have important applications in databases and machine learning can be reduced to and solved as ANNS-ALT, or its dual that we call ANNS-ALTD. We propose a novel and computationally efficient solution, called ONe Index for All Kernels (ONIAK), to ANNS-ALT and all its baby problems when the data dimension đť‘‘ is not too large (say đť‘‘ ≤ 200). In ONIAK, a universal index is built, once and for all, for answering all future ANNS-ALT queries that can have distinct query matrices. We show by experiments that, when đť‘‘ is not too large, ONIAK has better query performance than linear scan on the mother problem (of ANNS-ALT), and has query performances comparable to those of the state-of-the-art solutions on the baby problems. However, the algorithmic technique behind this universal index approach suffers from a so-called dimension blowup problem that can make the indexing time prohibitively long for a large dataset. We propose a novel algorithmic technique, called fast GOE quadratic form (FGoeQF), that completely solves the (prohibitively long indexing time) fallout of the dimension blowup problem. We also propose a Johnson-Lindenstrauss transform (JLT) based ANNS- ALT (and ANNS-ALTD) solution that significantly outperforms any competitor when đť‘‘ is large. 
    more » « less
  2. Given a collection of vectors, the approximate K-nearest-neighbor graph (KGraph for short) connects every vector to its approximate K-nearest-neighbors (KNN for short). KGraph plays an important role in high dimensional data visualization, semantic search, manifold learning, and machine learning. The vectors are typically vector representations of real-world objects (e.g., images and documents), which often come with a few structured attributes, such as times-tamps and locations. In this paper, we study the all-range approximate K-nearest-neighbor graph (ARKGraph) problem. Specifically, given a collection of vectors, each associated with a numerical search key (e.g., a timestamp), we aim to build an index that takes a search key range as the query and returns the KGraph of vectors whose search keys are within the query range. ARKGraph can facilitate interactive high dimensional data visualization, data mining, etc. A key challenge of this problem is the huge index size. This is because, given n vectors, a brute-force index stores a KGraph for every search key range, which results in O (K n 3 ) index size as there are O ( n 2 ) search key ranges and each KGraph takes O (K n ) space. We observe that the KNN of a vector in nearby ranges are often the same, which can be grouped together to save space. Based on this observation, we propose a series of novel techniques that reduce the index size significantly to just O (K n log n ) in the average case. Furthermore, we develop an efficient indexing algorithm that constructs the optimized ARKGraph index directly without exhaustively calculating the distance between every pair of vectors. To process a query, for each vector in the query range, we only need O (log log n + K log K) to restore its KNN in the query range from the optimized ARKGraph index. We conducted extensive experiments on real-world datasets. Experimental results show that our optimized ARKGraph index achieved a small index size, low query latency, and good scalability. Specifically, our approach was 1000x faster than the baseline method that builds a KGraph for all the vectors in the query range on-the-fly. 
    more » « less
  3. In recent years several compressed indexes based on variants of the Burrows-Wheeler transformation have been introduced. Some of these are used to index structures far more complex than a single string, as was originally done with the FM-index [Ferragina and Manzini, J. ACM 2005]. As such, there has been an increasing effort to better understand under which conditions such an indexing scheme is possible. This has led to the introduction of Wheeler graphs [Gagie et al., Theor. Comput. Sci., 2017]. Gagie et al. showed that de Bruijn graphs, generalized compressed suffix arrays, and several other BWT related structures can be represented as Wheeler graphs and that Wheeler graphs can be indexed in a way which is space-efficient. Hence, being able to recognize whether a given graph is a Wheeler graph, or being able to approximate a given graph by a Wheeler graph, could have numerous applications in indexing. Here we resolve the open question of whether there exists an efficient algorithm for recognizing if a given graph is a Wheeler graph. We present - The problem of recognizing whether a given graph G=(V,E) is a Wheeler graph is NP-complete for any edge label alphabet of size sigma >= 2, even when G is a DAG. This holds even on a restricted, subset of graphs called d-NFA's for d >= 5. This is in contrast to recent results demonstrating the problem can be solved in polynomial time for d-NFA's where d <= 2. We also show the recognition problem can be solved in linear time for sigma =1; - There exists an 2^{e log sigma + O(n + e)} time exact algorithm where n = |V| and e = |E|. This algorithm relies on graph isomorphism being computable in strictly sub-exponential time; - We define an optimization variant of the problem called Wheeler Graph Violation, abbreviated WGV, where the aim is to remove the minimum number of edges in order to obtain a Wheeler graph. We show WGV is APX-hard, even when G is a DAG, implying there exists a constant C >= 1 for which there is no C-approximation algorithm (unless P = NP). Also, conditioned on the Unique Games Conjecture, for all C >= 1, it is NP-hard to find a C-approximation; - We define the Wheeler Subgraph problem, abbreviated WS, where the aim is to find the largest subgraph which is a Wheeler Graph (the dual of the WGV). In contrast to WGV, we prove that the WS problem is in APX for sigma=O(1); The above findings suggest that most problems under this theme are computationally difficult. However, we identify a class of graphs for which the recognition problem is polynomial-time solvable, raising the open question of which parameters determine this problem's difficulty. 
    more » « less
  4. The matrix completion problem seeks to recover a $d\times d$ ground truth matrix of low rank $r\ll d$ from observations of its individual elements. Real-world matrix completion is often a huge-scale optimization problem, with $d$ so large that even the simplest full-dimension vector operations with $O(d)$ time complexity become prohibitively expensive. Stochastic gradient descent (SGD) is one of the few algorithms capable of solving matrix completion on a huge scale, and can also naturally handle streaming data over an evolving ground truth. Unfortunately, SGD experiences a dramatic slow-down when the underlying ground truth is ill-conditioned; it requires at least $O(\kappa\log(1/\epsilon))$ iterations to get $\epsilon$-close to ground truth matrix with condition number $\kappa$. In this paper, we propose a preconditioned version of SGD that preserves all the favorable practical qualities of SGD for huge-scale online optimization while also making it agnostic to $\kappa$. For a symmetric ground truth and the Root Mean Square Error (RMSE) loss, we prove that the preconditioned SGD converges to $\epsilon$-accuracy in $O(\log(1/\epsilon))$ iterations, with a rapid linear convergence rate as if the ground truth were perfectly conditioned with $\kappa=1$. In our numerical experiments, we observe a similar acceleration for ill-conditioned matrix completion under the 1-bit cross-entropy loss, as well as pairwise losses such as the Bayesian Personalized Ranking (BPR) loss. 
    more » « less
  5. BACKGROUND Landau’s Fermi liquid theory provides the bedrock on which our understanding of metals has developed over the past 65 years. Its basic premise is that the electrons transporting a current can be treated as “quasiparticles”—electron-like particles whose effective mass has been modified, typically through interactions with the atomic lattice and/or other electrons. For a long time, it seemed as though Landau’s theory could account for all the many-body interactions that exist inside a metal, even in the so-called heavy fermion systems whose quasiparticle mass can be up to three orders of magnitude heavier than the electron’s mass. Fermi liquid theory also lay the foundation for the first successful microscopic theory of superconductivity. In the past few decades, a number of new metallic systems have been discovered that violate this paradigm. The violation is most evident in the way that the electrical resistivity changes with temperature or magnetic field. In normal metals in which electrons are the charge carriers, the resistivity increases with increasing temperature but saturates, both at low temperatures (because the quantized lattice vibrations are frozen out) and at high temperatures (because the electron mean free path dips below the smallest scattering pathway defined by the lattice spacing). In “strange metals,” by contrast, no saturation occurs, implying that the quasiparticle description breaks down and electrons are no longer the primary charge carriers. When the particle picture breaks down, no local entity carries the current. ADVANCES A new classification of metallicity is not a purely academic exercise, however, as strange metals tend to be the high-temperature phase of some of the best superconductors available. Understanding high-temperature superconductivity stands as a grand challenge because its resolution is fundamentally rooted in the physics of strong interactions, a regime where electrons no longer move independently. Precisely what new emergent phenomena one obtains from the interactions that drive the electron dynamics above the temperature where they superconduct is one of the most urgent problems in physics, attracting the attention of condensed matter physicists as well as string theorists. One thing is clear in this regime: The particle picture breaks down. As particles and locality are typically related, the strange metal raises the distinct possibility that its resolution must abandon the basic building blocks of quantum theory. We review the experimental and theoretical studies that have shaped our current understanding of the emergent strongly interacting physics realized in a host of strange metals, with a special focus on their poster-child: the copper oxide high-temperature superconductors. Experiments are highlighted that attempt to link the phenomenon of nonsaturating resistivity to parameter-free universal physics. A key experimental observation in such materials is that removing a single electron affects the spectrum at all energy scales, not just the low-energy sector as in a Fermi liquid. It is observations of this sort that reinforce the breakdown of the single-particle concept. On the theoretical side, the modern accounts that borrow from the conjecture that strongly interacting physics is really about gravity are discussed extensively, as they have been the most successful thus far in describing the range of physics displayed by strange metals. The foray into gravity models is not just a pipe dream because in such constructions, no particle interpretation is given to the charge density. As the breakdown of the independent-particle picture is central to the strange metal, the gravity constructions are a natural tool to make progress on this problem. Possible experimental tests of this conjecture are also outlined. OUTLOOK As more strange metals emerge and their physical properties come under the scrutiny of the vast array of experimental probes now at our disposal, their mysteries will be revealed and their commonalities and differences cataloged. In so doing, we should be able to understand the universality of strange metal physics. At the same time, the anomalous nature of their superconducting state will become apparent, offering us hope that a new paradigm of pairing of non-quasiparticles will also be formalized. The correlation between the strength of the linear-in-temperature resistivity in cuprate strange metals and their corresponding superfluid density, as revealed here, certainly hints at a fundamental link between the nature of strange metallicity and superconductivity in the cuprates. And as the gravity-inspired theories mature and overcome the challenge of projecting their powerful mathematical machinery onto the appropriate crystallographic lattice, so too will we hope to build with confidence a complete theory of strange metals as they emerge from the horizon of a black hole. Curved spacetime with a black hole in its interior and the strange metal arising on the boundary. This picture is based on the string theory gauge-gravity duality conjecture by J. Maldacena, which states that some strongly interacting quantum mechanical systems can be studied by replacing them with classical gravity in a spacetime in one higher dimension. The conjecture was made possible by thinking about some of the fundamental components of string theory, namely D-branes (the horseshoe-shaped object terminating on a flat surface in the interior of the spacetime). A key surprise of this conjecture is that aspects of condensed matter systems in which the electrons interact strongly—such as strange metals—can be studied using gravity. 
    more » « less