There has been a growing interest in the graph-streaming setting where
a continuous stream of graph updates is mixed with graph queries. In
principle, purely-functional trees are an ideal fit for this
setting as they enable safe parallelism, lightweight snapshots, and
strict serializability for queries. However, directly using them for
graph processing leads to significant space overhead and poor cache
locality.
This paper presents C-trees, a compressed purely-functional
search tree data structure that significantly improves on the space
usage and locality of purely-functional trees. We design
theoretically-efficient and practical algorithms for performing batch
updates to C-trees, and also show that we can store massive
dynamic real-world graphs using only a few bytes per edge, thereby
achieving space usage close to that of the best static graph
processing frameworks.
To study the applicability of our data structure, we designed Aspen, a
graph-streaming framework that extends the interface of Ligra with
operations for updating graphs. We show that Aspen is faster than two
state-of-the-art graph-streaming systems, Stinger and LLAMA, while
requiring less memory, and is competitive in performance with the
state-of-the-art static graph frameworks, Galois, GAP, and Ligra+.
With Aspen, we are able to efficiently process the largest
publicly-available graph with over two hundred billion edges in the
graph-streaming setting using a single commodity multicore server with
1TB of memory.
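The snapshot property mentioned above can be illustrated with a minimal sketch of a purely-functional search tree that uses path copying. This is not the C-tree itself: the tree below is unbalanced and uncompressed, and the names `Node`, `insert`, and `to_list` are hypothetical. It only shows why an update leaves earlier versions readable.

```python
from typing import NamedTuple, Optional


class Node(NamedTuple):
    """Immutable tree node; an update creates new nodes instead of mutating."""
    key: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def insert(root, key):
    """Return a new tree that contains key; the input tree is never modified."""
    if root is None:
        return Node(key)
    if key < root.key:
        # Copy only the nodes on the search path; the right subtree is shared.
        return root._replace(left=insert(root.left, key))
    if key > root.key:
        return root._replace(right=insert(root.right, key))
    return root  # key already present


def to_list(root):
    """In-order traversal, returning the keys in sorted order."""
    if root is None:
        return []
    return to_list(root.left) + [root.key] + to_list(root.right)


v1 = None
for k in (5, 2, 8):
    v1 = insert(v1, k)

snapshot = v1        # taking a snapshot is just keeping a reference to the root
v2 = insert(v1, 3)   # the update produces a new version; v1 is untouched
```

A query holding `snapshot` keeps seeing the keys 2, 5, 8 while writers build `v2`, which is the behavior that makes concurrent updates and queries safe without locks in this simplified setting.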
Time Lattice: A Data Structure for the Interactive Visual Analysis of Large Time Series
Advances in technology coupled with the availability of low-cost sensors have resulted in the continuous generation of large time series from several sources. In order to visually explore and compare these time series at different scales, analysts need to execute online analytical processing (OLAP) queries that include constraints and group-by's at multiple temporal hierarchies. Effective visual analysis requires these queries to be interactive. However, while existing OLAP cube-based structures can support interactive query rates, the exponential memory requirement to materialize the data cube is often unsuitable for large data sets. Moreover, none of the recent space-efficient cube data structures allow for updates. Thus, the cube must be re-computed whenever there is new data, which makes these structures impractical in a streaming scenario. We propose Time Lattice, a memory-efficient data structure that makes use of the implicit temporal hierarchy to enable interactive OLAP queries over large time series. Time Lattice is a subset of a fully materialized cube and is designed to handle fast updates and streaming data. We perform an experimental evaluation which shows that the space efficiency of the data structure does not hamper its performance when compared to the state of the art. In collaboration with signal processing and acoustics research scientists, we use the Time Lattice data structure to design the Noise Profiler, a web-based visualization framework that supports the analysis of noise from cities. We demonstrate the utility of the Noise Profiler through a set of case studies.
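The abstract does not describe the internal layout of Time Lattice, so the following is only a toy illustration of the general idea of exploiting an implicit temporal hierarchy: aggregates are kept per hour and per day, each new sample updates both levels in place, and a group-by at either resolution avoids rescanning the raw series. The class `ToyTemporalRollup`, its two levels, and its method names are assumptions made for this sketch.

```python
from collections import defaultdict

SECONDS_PER_HOUR = 3600
SECONDS_PER_DAY = 86400


class ToyTemporalRollup:
    """Keeps [sum, count] per hour and per day (hypothetical, not Time Lattice)."""

    def __init__(self):
        self.levels = {
            "hour": (SECONDS_PER_HOUR, defaultdict(lambda: [0.0, 0])),
            "day": (SECONDS_PER_DAY, defaultdict(lambda: [0.0, 0])),
        }

    def append(self, timestamp, value):
        """Streaming update: touch one cell per level, no recomputation."""
        for width, table in self.levels.values():
            cell = table[timestamp // width]
            cell[0] += value
            cell[1] += 1

    def mean(self, level, index):
        """Mean of all samples in the given hour or day, or None if empty."""
        _, table = self.levels[level]
        cell = table.get(index)
        if cell is None:
            return None
        return cell[0] / cell[1]


rollup = ToyTemporalRollup()
rollup.append(0, 10.0)      # hour 0, day 0
rollup.append(1800, 20.0)   # hour 0, day 0
rollup.append(3600, 30.0)   # hour 1, day 0
```

Because the hierarchy (hour within day) is implied by the timestamp, no explicit dimension table is needed; a real structure would add more levels and constraint handling.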
- Award ID(s):
- 1730396
- NSF-PAR ID:
- 10062986
- Date Published:
- Journal Name:
- Computer graphics forum
- ISSN:
- 1467-8659
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
For compute-intensive iterative queries over a streaming graph, it is critical to evaluate the queries continuously and incrementally for best efficiency. However, existing incremental graph processing requires a priori knowledge of the query (e.g., the source vertex of a vertex-specific query); otherwise, it has to fall back to the expensive full evaluation that starts from scratch. To alleviate this restriction, this work presents a principled solution to generalizing incremental graph processing, such that queries can be evaluated incrementally even without a priori knowledge of them. The solution centers around the concept of graph triangle inequalities, an idea inspired by the classical triangle inequality principle in Euclidean space. Interestingly, similar principles can also be derived for many vertex-specific graph problems. These principles can help establish rigorous constraints between the evaluation of one graph query and the results of another, thus enabling reuse of the latter to accelerate the former. Based on this finding, a novel streaming graph system, called Tripoline, is built, which enables incremental evaluation of queries without a priori knowledge of them. Built on top of a state-of-the-art shared-memory streaming graph engine (Aspen), Tripoline natively supports high-throughput low-cost graph updates. A systematic evaluation with a set of eight vertex-specific graph problems and four real-world large graphs confirms both the effectiveness of the proposed techniques and the efficiency of Tripoline.
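To make the triangle-inequality idea concrete, here is a small sketch for one vertex-specific problem, shortest paths on an undirected weighted graph. It assumes the results of a standing query from a root `r` are already available and uses them to bound a new query between `a` and `c`. How Tripoline actually exploits such bounds is not described in the abstract; the graph, the vertex names, and the helper `dijkstra` are illustrative only.

```python
import heapq


def dijkstra(graph, source):
    """Single-source shortest-path distances for non-negative edge weights."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in graph[u]:
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


# Undirected graph stored as symmetric adjacency lists (toy data).
graph = {
    "r": [("a", 2), ("b", 5)],
    "a": [("r", 2), ("b", 1), ("c", 4)],
    "b": [("r", 5), ("a", 1), ("c", 1)],
    "c": [("a", 4), ("b", 1)],
}

from_root = dijkstra(graph, "r")  # results of the standing query

# Bounds on dist(a, c) derived only from the standing query's results:
#   |dist(r, a) - dist(r, c)| <= dist(a, c) <= dist(r, a) + dist(r, c)
upper = from_root["a"] + from_root["c"]
lower = abs(from_root["a"] - from_root["c"])

true_dist = dijkstra(graph, "a")["c"]  # full evaluation, for comparison only
```

In this toy graph the lower bound already equals the true distance; in general the bounds only narrow the search, for example by supplying initial values that let an iterative evaluation converge sooner.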
-
We initiate the study of biologically-inspired spiking neural networks from the perspective of streaming algorithms. Like computers, human brains face memory limitations, which pose a significant obstacle when processing large scale and dynamically changing data. In computer science, these challenges are captured by the well-known streaming model, which can be traced back to Munro and Paterson '78 and has had significant impact in theory and beyond. In the classical streaming setting, one must compute a function f of a stream of updates 𝒮 = {u_1, …, u_m}, given restricted single-pass access to the stream. The primary complexity measure is the space used by the algorithm. In contrast to the large body of work on streaming algorithms, relatively little is known about the computational aspects of data processing in spiking neural networks. In this work, we seek to connect these two models, leveraging techniques developed for streaming algorithms to better understand neural computation. Our primary goal is to design networks for various computational tasks using as few auxiliary (non-input or output) neurons as possible. The number of auxiliary neurons can be thought of as the "space" required by the network. Previous algorithmic work in spiking neural networks has many similarities with streaming algorithms. However, the connection between these two space-limited models has not been formally addressed. We take the first steps towards understanding this connection. On the upper bound side, we design neural algorithms based on known streaming algorithms for fundamental tasks, including distinct elements, approximate median, and heavy hitters. The number of neurons in our solutions almost matches the space bounds of the corresponding streaming algorithms. As a general algorithmic primitive, we show how to implement the important streaming technique of linear sketching efficiently in spiking neural networks.
On the lower bound side, we give a generic reduction, showing that any space-efficient spiking neural network can be simulated by a space-efficient streaming algorithm. This reduction lets us translate streaming-space lower bounds into nearly matching neural-space lower bounds, establishing a close connection between the two models.
-
Ranjit Jhala and Isil Dillig (Ed.) Many modern programming languages are shifting toward a functional style for collection interfaces such as sets, maps, and sequences. Functional interfaces offer many advantages, including being safe for parallelism and providing simple and lightweight snapshots. However, existing high-performance functional interfaces such as PAM, which are based on balanced purely-functional trees, incur large space overheads for large-scale data analysis due to storing every element in a separate node in a tree. This paper presents PaC-trees, a purely-functional data structure supporting functional interfaces for sets, maps, and sequences that provides a significant reduction in space over existing approaches. A PaC-tree is a balanced binary search tree which blocks the leaves and compresses the blocks using arrays. We provide novel techniques for compressing and uncompressing the blocks which yield practical parallel functional algorithms for a broad set of operations on PaC-trees such as union, intersection, filter, reduction, and range queries which are both theoretically and practically efficient. Using PaC-trees we designed CPAM, a C++ library that implements the full functionality of PAM, while offering significant extra functionality for compression. CPAM consistently matches or outperforms PAM on a set of microbenchmarks on sets, maps, and sequences while using about a quarter of the space. On applications including inverted indices, 2D range queries, and 1D interval queries, CPAM is competitive with or faster than PAM, while using 2.1–7.8x less space. For static and streaming graph processing, CPAM offers 1.6x faster batch updates while using 1.3–2.6x less space than the state-of-the-art graph processing system Aspen.
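The abstract says only that leaf blocks are stored as arrays and compressed; it does not name the encoding. As one plausible illustration, the sketch below assumes a block of sorted non-negative integer keys is stored as differences between neighbors, each written as a variable-length byte code. The functions `encode_block` and `decode_block` are hypothetical and are not CPAM's API.

```python
def encode_block(sorted_keys):
    """Difference-encode a sorted block of non-negative ints into bytes.

    Each gap is written 7 bits at a time, low bits first; the high bit of a
    byte is set when more bytes of the same gap follow.
    """
    out = bytearray()
    prev = 0
    for key in sorted_keys:
        delta = key - prev
        prev = key
        while delta >= 0x80:
            out.append((delta & 0x7F) | 0x80)
            delta >>= 7
        out.append(delta)
    return bytes(out)


def decode_block(data):
    """Inverse of encode_block: rebuild the sorted keys from the byte string."""
    keys = []
    prev = 0
    delta = 0
    shift = 0
    for byte in data:
        delta |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7          # gap continues in the next byte
        else:
            prev += delta       # gap complete: recover the key
            keys.append(prev)
            delta = 0
            shift = 0
    return keys


block = [1000000, 1000003, 1000010, 1000200]
encoded = encode_block(block)
```

Here four keys that would occupy 32 bytes as 64-bit integers fit in 7 bytes, since only the first key pays for its full magnitude. Blocking matters because a decoder must scan a block from its start, so small fixed-size blocks keep lookups cheap.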
-
We introduce a new sub-linear space sketch, the Weight-Median Sketch, for learning compressed linear classifiers over data streams while supporting the efficient recovery of large-magnitude weights in the model. This enables memory-limited execution of several statistical analyses over streams, including online feature selection, streaming data explanation, relative deltoid detection, and streaming estimation of pointwise mutual information. Unlike related sketches that capture the most frequently-occurring features (or items) in a data stream, the Weight-Median Sketch captures the features that are most discriminative of one stream (or class) compared to another. The Weight-Median Sketch adopts the core data structure used in the Count-Sketch, but, instead of sketching counts, it captures sketched gradient updates to the model parameters. We provide a theoretical analysis that establishes recovery guarantees for batch and online learning, and demonstrate empirical improvements in memory-accuracy trade-offs over alternative memory-budgeted methods, including count-based sketches and feature hashing.
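Since the abstract builds on the Count-Sketch, a minimal version of that underlying structure may help. The sketch below is a plain Count-Sketch over signed real-valued updates to integer-indexed features, with a median-of-rows estimate. It is not the Weight-Median Sketch: it performs no learning, and the class name, hash construction, and default parameters are assumptions chosen for illustration.

```python
import random
import statistics


class ToyCountSketch:
    """Plain Count-Sketch: depth rows of width counters with random signs."""

    PRIME = (1 << 61) - 1  # Mersenne prime used as the hash modulus

    def __init__(self, depth=5, width=256, seed=0):
        rng = random.Random(seed)
        self.width = width
        self.table = [[0.0] * width for _ in range(depth)]
        # One (bucket hash, sign hash) pair of linear-hash parameters per row.
        self.params = [
            tuple(rng.randrange(1, self.PRIME) for _ in range(4))
            for _ in range(depth)
        ]

    def _locate(self, row, item):
        """Return (bucket, sign) for an integer item in the given row."""
        a, b, c, d = self.params[row]
        bucket = ((a * item + b) % self.PRIME) % self.width
        sign = 1 if ((c * item + d) % self.PRIME) & 1 else -1
        return bucket, sign

    def update(self, item, delta):
        """Add a signed update (a count, or for example a gradient component)."""
        for row in range(len(self.table)):
            bucket, sign = self._locate(row, item)
            self.table[row][bucket] += sign * delta

    def estimate(self, item):
        """Median over rows of the sign-corrected counter for this item."""
        readings = []
        for row in range(len(self.table)):
            bucket, sign = self._locate(row, item)
            readings.append(sign * self.table[row][bucket])
        return statistics.median(readings)


sketch = ToyCountSketch()
sketch.update(42, 3.5)
sketch.update(42, -1.0)
```

With only one item inserted there are no collisions, so `sketch.estimate(42)` returns the exact total; with many items the median across rows limits the influence of colliding features, which is what allows large-magnitude entries to be recovered from sub-linear space.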