Social media platforms generate massive amounts of data that reveal valuable insights about users and communities at large. Existing techniques have not fully exploited such data to help practitioners perform a deep analysis of large online communities. Lack of scalability hinders analyzing communities of large sizes and requires tremendous system resources and unacceptable runtime. This article proposes a new analytical query that identifies the top-kposts that a given user community has interacted with during a specific time interval and within a spatial range. We propose a novel indexing framework that captures the interactions of users and communities to provide a low query latency. Moreover, we propose exact and approximate algorithms to process the query efficiently and utilize the index content to prune the search space. The extensive experimental evaluation on real data has shown the superiority of our techniques and their scalability to support large online communities. 
                        more » 
                        « less   
                    
                            
                            Reverse spatial top-k keyword queries
                        
                    
    
            Abstract We introduce theReverseSpatial Top-kKeyword (RSK)query, which is defined as:given a query term q, an integer k and a neighborhood size find all the neighborhoods of that size where q is in the top-k most frequent terms among the social posts in those neighborhoods. An obvious approach would be to partition the dataset with a uniform grid structure of a given cell size and identify the cells where this term is in the top-k most frequent keywords. However, this answer would be incomplete since it only checks for neighborhoods that are perfectly aligned with the grid. Furthermore, for every neighborhood (square) that is an answer, we can define infinitely more result neighborhoods by minimally shifting the square without including more posts in it. To address that, we need to identify contiguous regions where any point in the region can be the center of a neighborhood that satisfies the query. We propose an algorithm to efficiently answer an RSK query using an index structure consisting of a uniform grid augmented by materialized lists of term frequencies. We apply various optimizations that drastically improve query latency against baseline approaches. We also provide a theoretical model to choose the optimal cell size for the index to minimize query latency. We further examine a restricted version of the problem (RSKR) that limits the scope of the answer and propose efficientapproximatealgorithms. Finally, we examine how parallelism can improve performance by balancing the workload using a smartload slicingtechnique. Extensive experimental performance evaluation of the proposed methods using real Twitter datasets and crime report datasets, shows the efficiency of our optimizations and the accuracy of the proposed theoretical model. 
        more » 
        « less   
        
    
    
                            - PAR ID:
- 10374636
- Publisher / Repository:
- Springer Science + Business Media
- Date Published:
- Journal Name:
- The VLDB Journal
- ISSN:
- 1066-8888
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
- 
            
- 
            The ranked (or top-k) document retrieval problem is defined as follows: preprocess a collection{T1,T2,… ,Td}ofdstrings (called documents) of total lengthninto a data structure, such that for any given query(P,k), wherePis a string (called pattern) of lengthp ≥ 1andk ∈ [1,d]is an integer, the identifiers of thosekdocuments that are most relevant toPcan be reported, ideally in the sorted order of their relevance. The seminal work by Hon et al. [FOCS 2009 and Journal of the ACM 2014] presented anO(n)-space (in words) data structure withO(p+klogk)query time. The query time was later improved toO(p+k)[SODA 2012] and further toO(p/logσn+k)[SIAM Journal on Computing 2017] by Navarro and Nekrich, whereσis the alphabet size. We revisit this problem in the external memory model and present three data structures. The first one takesO(n)-space and answer queries inO(p/B+ logBn + k/B+log*(n/B)) I/Os, whereBis the block size. The second one takesO(nlog*(n/B)) space and answer queries in optimalO(p/B+ logBn + k/B)I/Os. In both cases, the answers are reported in the unsorted order of relevance. To handle sorted top-kdocument retrieval, we present anO(nlog(d/B))space data structure with optimal query cost.more » « less
- 
            Xu, Gang (Ed.)Recent advances in quantitative tools for examining urban morphology enable the development of morphometrics that can characterize the size, shape, and placement of buildings; the relationships between them; and their association with broader patterns of development. Although these methods have the potential to provide substantial insight into the ways in which neighborhood morphology shapes the socioeconomic and demographic characteristics of neighborhoods and communities, this question is largely unexplored. Using building footprints in five of the ten largest U.S. metropolitan areas (Atlanta, Boston, Chicago, Houston, and Los Angeles) and the open-source R package,foot, we examine how neighborhood morphology differs across U.S. metropolitan areas and across the urban-exurban landscape. Principal components analysis, unsupervised classification (K-means), and Ordinary Least Squares regression analysis are used to develop a morphological typology of neighborhoods and to examine its association with the spatial, socioeconomic, and demographic characteristics of census tracts. Our findings illustrate substantial variation in the morphology of neighborhoods, both across the five metropolitan areas as well as between central cities, suburbs, and the urban fringe within each metropolitan area. We identify five different types of neighborhoods indicative of different stages of development and distributed unevenly across the urban landscape: these include low-density neighborhoods on the urban fringe; mixed use and high-density residential areas in central cities; and uniform residential neighborhoods in suburban cities. Results from regression analysis illustrate that the prevalence of each of these forms is closely associated with variation in socioeconomic and demographic characteristics such as population density, the prevalence of multifamily housing, and income, race/ethnicity, homeownership, and commuting by car. We conclude by discussing the implications of our findings and suggesting avenues for future research on neighborhood morphology, including ways that it might provide insight into issues such as zoning and land use, housing policy, and residential segregation.more » « less
- 
            Given a collection of vectors, the approximate K-nearest-neighbor graph (KGraph for short) connects every vector to its approximate K-nearest-neighbors (KNN for short). KGraph plays an important role in high dimensional data visualization, semantic search, manifold learning, and machine learning. The vectors are typically vector representations of real-world objects (e.g., images and documents), which often come with a few structured attributes, such as times-tamps and locations. In this paper, we study the all-range approximate K-nearest-neighbor graph (ARKGraph) problem. Specifically, given a collection of vectors, each associated with a numerical search key (e.g., a timestamp), we aim to build an index that takes a search key range as the query and returns the KGraph of vectors whose search keys are within the query range. ARKGraph can facilitate interactive high dimensional data visualization, data mining, etc. A key challenge of this problem is the huge index size. This is because, given n vectors, a brute-force index stores a KGraph for every search key range, which results in O (K n 3 ) index size as there are O ( n 2 ) search key ranges and each KGraph takes O (K n ) space. We observe that the KNN of a vector in nearby ranges are often the same, which can be grouped together to save space. Based on this observation, we propose a series of novel techniques that reduce the index size significantly to just O (K n log n ) in the average case. Furthermore, we develop an efficient indexing algorithm that constructs the optimized ARKGraph index directly without exhaustively calculating the distance between every pair of vectors. To process a query, for each vector in the query range, we only need O (log log n + K log K) to restore its KNN in the query range from the optimized ARKGraph index. We conducted extensive experiments on real-world datasets. Experimental results show that our optimized ARKGraph index achieved a small index size, low query latency, and good scalability. Specifically, our approach was 1000x faster than the baseline method that builds a KGraph for all the vectors in the query range on-the-fly.more » « less
- 
            One fundamental question in database theory is the following: Given a Boolean conjunctive queryQ, what is the best complexity for computing the answer to Q in terms of the input database sizeN? When restricted to the class of combinatorial algorithms, it is known that the best known complexity for any queryQis captured by thesubmodular widthofQ. However, beyond combinatorial algorithms, certain queries are known to admit faster algorithms that often involve a clever combination of fast matrix multiplication and data partitioning. Nevertheless, there is no systematic way to derive and analyze the complexity of such algorithms for arbitrary queriesQ. In this work, we introduce a general framework that captures the best complexity for answering any Boolean conjunctive queryQusing matrix multiplication. Our framework unifies both combinatorial and non-combinatorial techniques under the umbrella of information theory. It generalizes the notion of submodular width to a new stronger notion called the ω-submodular widththat naturally incorporates the power of fast matrix multiplication. We describe a matching algorithm that computes the answer to any queryQin time corresponding to the ω-submodularwidth ofQ. We show that our framework recovers the best known complexities for Boolean queries that have been studied in the literature, to the best of our knowledge, and also discovers new algorithms for some classes of queries that improve upon the best known complexities.more » « less
 An official website of the United States government
An official website of the United States government 
				
			 
					 
					
