skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Compressing Inverted Indexes with Recursive Graph Bisection: A Reproducibility Study
Document reordering is an important but often overlooked preprocessing stage in index construction. Reordering document identifiers in graphs and inverted indexes has been shown to reduce storage costs and improve processing efficiency in the resulting indexes. However, surprisingly few document reordering algorithms are publicly available despite their importance. A new reordering algorithm derived from recursive graph bisection was recently proposed by Dhulipala et al., and shown to be highly effective and efficient when compared against other state-of-the-art reordering strategies. In this work, we present a reproducibility study of this new algorithm. We describe the implementation challenges encountered, and explore the performance characteristics of our clean-room reimplementation. We show that we are able to successfully reproduce the core results of the original paper, and show that the algorithm generalizes to other collections and indexing frameworks. Furthermore, we make our implementation publicly available to help promote further research in this space.  more » « less
Award ID(s):
1718680
PAR ID:
10171634
Author(s) / Creator(s):
; ; ; ;
Date Published:
Journal Name:
European Conference on Information Retrieval
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Aichholzer, Oswin; Wang, Haitao (Ed.)
    We show that a variant of the continuous Fréchet distance between polygonal curves can be computed using essentially the same algorithm used to solve the discrete version. The new variant is not necessarily monotone, but this shortcoming can be easily handled via refinement. Combined with a Dijkstra/Prim type algorithm, this leads to a realization of the Fréchet distance (i.e., a morphing) that is locally optimal (aka locally correct), that is both easy to compute, and in practice, takes near linear time on many inputs. The new morphing has the property that the leash is always as short as possible. These matchings/morphings are more natural, and are better than the ones computed by standard algorithms - in particular, they handle noise more graciously. This should make the Fréchet distance more useful for real world applications. We implemented the new algorithm, and various strategies to obtain fast practical performance. We performed extensive experiments with our new algorithm, and released publicly available (and easily installable and usable) Julia and Python packages. In particular, the Julia implementation, for computing the regular Fréchet distance, seems to be {significantly faster} than other currently available implementations. See Table 2.2. Our algorithms can be used to compute the almost-exact Fréchet distance between polygonal curves. Implementations and numerous examples are available here: https://frechet.xyz. 
    more » « less
  2. Abstract MotivationIn the past few years, researchers have proposed numerous indexing schemes for searching large datasets of raw sequencing experiments. Most of these proposed indexes are approximate (i.e. with one-sided errors) in order to save space. Recently, researchers have published exact indexes—Mantis, VariMerge and Bifrost—that can serve as colored de Bruijn graph representations in addition to serving as k-mer indexes. This new type of index is promising because it has the potential to support more complex analyses than simple searches. However, in order to be useful as indexes for large and growing repositories of raw sequencing data, they must scale to thousands of experiments and support efficient insertion of new data. ResultsIn this paper, we show how to build a scalable and updatable exact raw sequence-search index. Specifically, we extend Mantis using the Bentley–Saxe transformation to support efficient updates, called Dynamic Mantis. We demonstrate Dynamic Mantis’s scalability by constructing an index of ≈40K samples from SRA by adding samples one at a time to an initial index of 10K samples. Compared to VariMerge and Bifrost, Dynamic Mantis is more efficient in terms of index-construction time and memory, query time and memory and index size. In our benchmarks, VariMerge and Bifrost scaled to only 5K and 80 samples, respectively, while Dynamic Mantis scaled to more than 39K samples. Queries were over 24× faster in Mantis than in Bifrost (VariMerge does not immediately support general search queries we require). Dynamic Mantis indexes were about 2.5× smaller than Bifrost’s indexes and about half as big as VariMerge’s indexes. Availability and implementationDynamic Mantis implementation is available at https://github.com/splatlab/mantis/tree/mergeMSTs. Supplementary informationSupplementary data are available at Bioinformatics online. 
    more » « less
  3. Abstract The big graph database provides strong modeling capabilities and efficient querying for complex applications. Subgraph isomorphism which finds exact matches of a query graph in the database efficiently, is a challenging problem. Current subgraph isomorphism approaches mostly are based on the pruning strategy proposed by Ullmann. These techniques have two significant drawbacks- first, they are unable to efficiently handle complex queries, and second, their implementations need the large indexes that require large memory resources. In this paper, we describe a new subgraph isomorphism approach, the HyGraph algorithm, that is efficient both in querying and with memory requirements for index creation. We compare the HyGraph algorithm with two popular existing approaches, GraphQL and Cypher using complexity measures and experimentally using three big graph data sets—(1) a country-level population database, (2) a simulated bank database, and (3) a publicly available World Cup big graph database. It is shown that the HyGraph solution performs significantly better (or equally) than competing algorithms for the query operations on these big databases, making it an excellent candidate for subgraph isomorphism queries in real scenarios. 
    more » « less
  4. The singular value decomposition (SVD) of a reordering of a matrix A can be used to determine an efficient Kronecker product (KP) sum approximation to A. We present the use of an approximate truncated SVD (TSVD) to find the KP approximation, and contrast using a randomized singular value decomposition algorithm (RSVD), a new enlarged Golub Kahan Bidiagonalization algorithm (EGKB) and the exact TSVD. The EGKB algorithm enlarges the Krylov subspace beyond a given rank for the desired approximation. A suitable rank is determined using an automatic stopping test. We also contrast the use of single and double precision arithmetic to find the approximate TSVDs. To illustrate the accuracy and efficiency in terms of memory and computational cost of these approximate KPs, we consider the solution of the total variation regularized image deblurring problem using the split Bregman algorithm implemented in double precision. Together with an efficient implementation for the reordering of A we demonstrate that the approximate KP sum can be obtained using a TSVD, and that the new EGKB algorithm contrasts favorably with the use of the RSVD. These results verify that it is feasible to use single precision when estimating a KP sum from an approximate TSVD. 
    more » « less
  5. Abstract Maize inflorescence is a complex phenotype that involves the physical and developmental interplay of multiple traits. Given the evidence that genes could pleiotropically contribute to several of these traits, we used publicly available maize data to assess the ability of multivariate genome-wide association study (GWAS) approaches to identify pleiotropic quantitative trait loci (pQTL). Our analysis of 23 publicly available inflorescence and leaf-related traits in a diversity panel of n = 281 maize lines genotyped with 376,336 markers revealed that the two multivariate GWAS approaches we tested were capable of identifying pQTL in genomic regions coinciding with similar associations found in previous studies. We then conducted a parallel simulation study on the same individuals, where it was shown that multivariate GWAS approaches yielded a higher true-positive quantitative trait nucleotide (QTN) detection rate than comparable univariate approaches for all evaluated simulation settings except for when the correlated simulated traits had a heritability of 0.9. We therefore conclude that the implementation of state-of-the-art multivariate GWAS approaches is a useful tool for dissecting pleiotropy and their more widespread implementation could facilitate the discovery of genes and other biological mechanisms underlying maize inflorescence. 
    more » « less