skip to main content


Title: Reverse spatial top-k keyword queries
Abstract

We introduce theReverseSpatial Top-kKeyword (RSK)query, which is defined as:given a query term q, an integer k and a neighborhood size find all the neighborhoods of that size where q is in the top-k most frequent terms among the social posts in those neighborhoods. An obvious approach would be to partition the dataset with a uniform grid structure of a given cell size and identify the cells where this term is in the top-k most frequent keywords. However, this answer would be incomplete since it only checks for neighborhoods that are perfectly aligned with the grid. Furthermore, for every neighborhood (square) that is an answer, we can define infinitely more result neighborhoods by minimally shifting the square without including more posts in it. To address that, we need to identify contiguous regions where any point in the region can be the center of a neighborhood that satisfies the query. We propose an algorithm to efficiently answer an RSK query using an index structure consisting of a uniform grid augmented by materialized lists of term frequencies. We apply various optimizations that drastically improve query latency against baseline approaches. We also provide a theoretical model to choose the optimal cell size for the index to minimize query latency. We further examine a restricted version of the problem (RSKR) that limits the scope of the answer and propose efficientapproximatealgorithms. Finally, we examine how parallelism can improve performance by balancing the workload using a smartload slicingtechnique. Extensive experimental performance evaluation of the proposed methods using real Twitter datasets and crime report datasets, shows the efficiency of our optimizations and the accuracy of the proposed theoretical model.

 
more » « less
Award ID(s):
1831615 1838222 1954644
NSF-PAR ID:
10374636
Author(s) / Creator(s):
; ; ;
Publisher / Repository:
Springer Science + Business Media
Date Published:
Journal Name:
The VLDB Journal
ISSN:
1066-8888
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Motivated by an attempt to understand the formation and development of (human) language, we introduce a "distributed compression" problem. In our problem a sequence of pairs of players from a set of K players are chosen and tasked to communicate messages drawn from an unknown distribution Q. Arguably languages are created and evolve to compress frequently occurring messages, and we focus on this aspect. The only knowledge that players have about the distribution Q is from previously drawn samples, but these samples differ from player to player. The only common knowledge between the players is restricted to a common prior distribution P and some constant number of bits of information (such as a learning algorithm). Letting T_eps denote the number of iterations it would take for a typical player to obtain an eps-approximation to Q in total variation distance, we ask whether T_eps iterations suffice to compress the messages down roughly to their entropy and give a partial positive answer. We show that a natural uniform algorithm can compress the communication down to an average cost per message of O(H(Q) + log (D(P || Q) + O(1)) in $\tilde{O}(T_eps)$ iterations while allowing for O(eps)-error, where D(. || .) denotes the KL-divergence between distributions. For large divergences this compares favorably with the static algorithm that ignores all samples and compresses down to H(Q) + D(P || Q) bits, while not requiring (T_eps . K) iterations that it would take players to develop optimal but separate compressions for each pair of players. Along the way we introduce a "data-structural" view of the task of communicating with a natural language and show that our natural algorithm can also be implemented by an efficient data structure, whose storage is comparable to the storage requirements of Q and whose query complexity is comparable to the lengths of the message to be compressed. Our results give a plausible mathematical analogy to the mechanisms by which human languages get created and evolve, and in particular highlights the possibility of coordination towards a joint task (agreeing on a language) while engaging in distributed learning. 
    more » « less
  2. Ever-increasing demands for energy, particularly being environmentally friendly have promoted the transition from fossil fuels to renewable energy.1Lithium-ion batteries (LIBs), arguably the most well-studied energy storage system, have dominated the energy market since their advent in the 1990s.2However, challenging issues regarding safety, supply of lithium, and high price of lithium resources limit the further advancement of LIBs for large-scale energy storage applications.3Therefore, attention is being concentrated on an alternative electrochemical energy storage device that features high safety, low cost, and long cycle life. Rechargeable aqueous zinc-ion batteries (ZIBs) is considered one of the most promising alternative energy storage systems due to the high theoretical energy and power densities where the multiple electrons (Zn2+) . In addition, aqueous ZIBs are safer due to non-flammable electrolyte (e.g., typically aqueous solution) and can be manufactured since they can be assembled in ambient air conditions.4As an essential component in aqueous Zn-based batteries, the Zn metal anode generally suffers from the growth of dendrites, which would affect battery performance in several ways. Second, the led by the loose structure of Zn dendrite may reduce the coulombic efficiency and shorten the battery lifespan.5

    Several approaches were suggested to improve the electrochemical stability of ZIBs, such as implementing an interfacial buffer layer that separates the active Zn from the bulk electrolyte.6However, the and thick thickness of the conventional Zn metal foils remain a critical challenge in this field, which may diminish the energy density of the battery drastically. According to a theretical calculation, the thickness of a Zn metal anode with an areal capacity of 1 mAh cm-2is about 1.7 μm. However, existing extrusion-based fabrication technologies are not capable of downscaling the thickness Zn metal foils below 20 μm.

    Herein, we demonstrate a thickness controllable coating approach to fabricate an ultrathin Zn metal anode as well as a thin dielectric oxide separator. First, a 1.7 μm Zn layer was uniformly thermally evaporated onto a Cu foil. Then, Al2O3, the separator was deposited through sputtering on the Zn layer to a thickness of 10 nm. The inert and high hardness Al2O3layer is expected to lower the polarization and restrain the growth of Zn dendrites. Atomic force microscopy was employed to evaluate the roughness of the surface of the deposited Zn and Al2O3/Zn anode structures. Long-term cycling stability was gauged under the symmetrical cells at 0.5 mA cm-2for 1 mAh cm-2. Then the fabricated Zn anode was paired with MnO2as a full cell for further electrochemical performance testing. To investigate the evolution of the interface between the Zn anode and the electrolyte, a home-developed in-situ optical observation battery cage was employed to record and compare the process of Zn deposition on the anodes of the Al2O3/Zn (demonstrated in this study) and the procured thick Zn anode. The surface morphology of the two Zn anodes after circulation was characterized and compared through scanning electron microscopy. The tunable ultrathin Zn metal anode with enhanced anode stability provides a pathway for future high-energy-density Zn-ion batteries.

    Obama, B., The irreversible momentum of clean energy.Science2017,355(6321), 126-129.

    Goodenough, J. B.; Park, K. S., The Li-ion rechargeable battery: a perspective.J Am Chem Soc2013,135(4), 1167-76.

    Li, C.; Xie, X.; Liang, S.; Zhou, J., Issues and Future Perspective on Zinc Metal Anode for Rechargeable Aqueous Zinc‐ion Batteries.Energy & Environmental Materials2020,3(2), 146-159.

    Jia, H.; Wang, Z.; Tawiah, B.; Wang, Y.; Chan, C.-Y.; Fei, B.; Pan, F., Recent advances in zinc anodes for high-performance aqueous Zn-ion batteries.Nano Energy2020,70.

    Yang, J.; Yin, B.; Sun, Y.; Pan, H.; Sun, W.; Jia, B.; Zhang, S.; Ma, T., Zinc Anode for Mild Aqueous Zinc-Ion Batteries: Challenges, Strategies, and Perspectives.Nanomicro Lett2022,14(1), 42.

    Yang, Q.; Li, Q.; Liu, Z.; Wang, D.; Guo, Y.; Li, X.; Tang, Y.; Li, H.; Dong, B.; Zhi, C., Dendrites in Zn-Based Batteries.Adv Mater2020,32(48), e2001854.

    Acknowledgment

    This work was partially supported by the U.S. National Science Foundation (NSF) Award No. ECCS-1931088. S.L. and H.W.S. acknowledge the support from the Improvement of Measurement Standards and Technology for Mechanical Metrology (Grant No. 22011044) by KRISS.

    Figure 1

     

    more » « less
  3. The ranked (or top-k) document retrieval problem is defined as follows: preprocess a collection{T1,T2,… ,Td}ofdstrings (called documents) of total lengthninto a data structure, such that for any given query(P,k), wherePis a string (called pattern) of lengthp ≥ 1andk ∈ [1,d]is an integer, the identifiers of thosekdocuments that are most relevant toPcan be reported, ideally in the sorted order of their relevance. The seminal work by Hon et al. [FOCS 2009 and Journal of the ACM 2014] presented anO(n)-space (in words) data structure withO(p+klogk)query time. The query time was later improved toO(p+k)[SODA 2012] and further toO(p/logσn+k)[SIAM Journal on Computing 2017] by Navarro and Nekrich, whereσis the alphabet size. We revisit this problem in the external memory model and present three data structures. The first one takesO(n)-space and answer queries inO(p/B+ logBn + k/B+log*(n/B)) I/Os, whereBis the block size. The second one takesO(nlog*(n/B)) space and answer queries in optimalO(p/B+ logBn + k/B)I/Os. In both cases, the answers are reported in the unsorted order of relevance. To handle sorted top-kdocument retrieval, we present anO(nlog(d/B))space data structure with optimal query cost.

     
    more » « less
  4. null (Ed.)
    Consider an algorithm performing a computation on a huge random object (for example a random graph or a "long" random walk). Is it necessary to generate the entire object prior to the computation, or is it possible to provide query access to the object and sample it incrementally "on-the-fly" (as requested by the algorithm)? Such an implementation should emulate the random object by answering queries in a manner consistent with an instance of the random object sampled from the true distribution (or close to it). This paradigm is useful when the algorithm is sub-linear and thus, sampling the entire object up front would ruin its efficiency. Our first set of results focus on undirected graphs with independent edge probabilities, i.e. each edge is chosen as an independent Bernoulli random variable. We provide a general implementation for this model under certain assumptions. Then, we use this to obtain the first efficient local implementations for the Erdös-Rényi G(n,p) model for all values of p, and the Stochastic Block model. As in previous local-access implementations for random graphs, we support Vertex-Pair and Next-Neighbor queries. In addition, we introduce a new Random-Neighbor query. Next, we give the first local-access implementation for All-Neighbors queries in the (sparse and directed) Kleinberg’s Small-World model. Our implementations require no pre-processing time, and answer each query using O(poly(log n)) time, random bits, and additional space. Next, we show how to implement random Catalan objects, specifically focusing on Dyck paths (balanced random walks on the integer line that are always non-negative). Here, we support Height queries to find the location of the walk, and First-Return queries to find the time when the walk returns to a specified location. This in turn can be used to implement Next-Neighbor queries on random rooted ordered trees, and Matching-Bracket queries on random well bracketed expressions (the Dyck language). Finally, we introduce two features to define a new model that: (1) allows multiple independent (and even simultaneous) instantiations of the same implementation, to be consistent with each other without the need for communication, (2) allows us to generate a richer class of random objects that do not have a succinct description. Specifically, we study uniformly random valid q-colorings of an input graph G with maximum degree Δ. This is in contrast to prior work in the area, where the relevant random objects are defined as a distribution with O(1) parameters (for example, n and p in the G(n,p) model). The distribution over valid colorings is instead specified via a "huge" input (the underlying graph G), that is far too large to be read by a sub-linear time algorithm. Instead, our implementation accesses G through local neighborhood probes, and is able to answer queries to the color of any given vertex in sub-linear time for q ≥ 9Δ, in a manner that is consistent with a specific random valid coloring of G. Furthermore, the implementation is memory-less, and can maintain consistency with non-communicating copies of itself. 
    more » « less
  5. Given a collection of vectors, the approximate K-nearest-neighbor graph (KGraph for short) connects every vector to its approximate K-nearest-neighbors (KNN for short). KGraph plays an important role in high dimensional data visualization, semantic search, manifold learning, and machine learning. The vectors are typically vector representations of real-world objects (e.g., images and documents), which often come with a few structured attributes, such as times-tamps and locations. In this paper, we study the all-range approximate K-nearest-neighbor graph (ARKGraph) problem. Specifically, given a collection of vectors, each associated with a numerical search key (e.g., a timestamp), we aim to build an index that takes a search key range as the query and returns the KGraph of vectors whose search keys are within the query range. ARKGraph can facilitate interactive high dimensional data visualization, data mining, etc. A key challenge of this problem is the huge index size. This is because, given n vectors, a brute-force index stores a KGraph for every search key range, which results in O (K n 3 ) index size as there are O ( n 2 ) search key ranges and each KGraph takes O (K n ) space. We observe that the KNN of a vector in nearby ranges are often the same, which can be grouped together to save space. Based on this observation, we propose a series of novel techniques that reduce the index size significantly to just O (K n log n ) in the average case. Furthermore, we develop an efficient indexing algorithm that constructs the optimized ARKGraph index directly without exhaustively calculating the distance between every pair of vectors. To process a query, for each vector in the query range, we only need O (log log n + K log K) to restore its KNN in the query range from the optimized ARKGraph index. We conducted extensive experiments on real-world datasets. Experimental results show that our optimized ARKGraph index achieved a small index size, low query latency, and good scalability. Specifically, our approach was 1000x faster than the baseline method that builds a KGraph for all the vectors in the query range on-the-fly. 
    more » « less