skip to main content


Title: NeuralCubes: Deep Representations for Visual Data Exploration
Visual exploration of large multi-dimensional datasets has seen tremendous progress in recent years, allowing users to express rich data queries that produce informative visual summaries, all in real time. Techniques based on data cubes are some of the most promising approaches. However, these techniques usually require a large memory footprint for large datasets. To tackle this problem, we present NeuralCubes: neural networks that predict results for aggregate queries, similar to data cubes. NeuralCubes learns a function that takes as input a given query, for instance, a geographic region and temporal interval, and outputs the result of the query. The learned function serves as a real-time, low-memory approximator for aggregation queries. Our models are small enough to be sent to the client side (e.g. the web browser for a web-based application) for evaluation, enabling data exploration of large datasets without database/network connection. We demonstrate the effectiveness of NeuralCubes through extensive experiments on a variety of datasets and discuss how NeuralCubes opens up opportunities for new types of visualization and interaction.  more » « less
Award ID(s):
1940175
NSF-PAR ID:
10394317
Author(s) / Creator(s):
; ; ; ; ; ; ;
Date Published:
Journal Name:
2021 IEEE International Conference on Big Data (Big Data)
Page Range / eLocation ID:
550 to 561
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Advances in technology coupled with the availability of low-cost sensors have resulted in the continuous generation of large time series from several sources. In order to visually explore and compare these time series at different scales, analysts need to execute online analytical processing (OLAP) queries that include constraints and group-by's at multiple temporal hierarchies. Effective visual analysis requires these queries to be interactive. However, while existing OLAP cube-based structures can support interactive query rates, the exponential memory requirement to materialize the data cube is often unsuitable for large data sets. Moreover, none of the recent space-efficient cube data structures allow for updates. Thus, the cube must be re-computed whenever there is new data, making them impractical in a streaming scenario. We propose Time Lattice, a memory-efficient data structure that makes use of the implicit temporal hierarchy to enable interactive OLAP queries over large time series. Time Lattice is a subset of a fully materialized cube and is designed to handle fast updates and streaming data. We perform an experimental evaluation which shows that the space efficiency of the data structure does not hamper its performance when compared to the state of the art. In collaboration with signal processing and acoustics research scientists, we use the Time Lattice data structure to design the Noise Profiler, a web-based visualization framework that supports the analysis of noise from cities. We demonstrate the utility of Noise Profiler through a set of case studies. 
    more » « less
  2. Aggregating data is fundamental to data analytics, data exploration, and OLAP. Approximate query processing (AQP) techniques are often used to accelerate computation of aggregates using samples, for which confidence intervals (CIs) are widely used to quantify the associated error. CIs used in practice fall into two categories: techniques that are tight but not correct, i.e., they yield tight intervals but only offer asymptoticguarantees,makingthem unreliable, or techniques that are correct but not tight, i.e., they offer rigorous guarantees, but are overly conservative, leading to confidence intervals that are too loose to be useful. In this paper, we develop a CI technique that is both correct and tighter than traditional approaches. Starting from conservative CIs, we identify two issues they often face: pessimistic mass allocation (PMA) and phantom outlier sensitivity (PHOS). By developing a novel range-trimming technique for eliminating PHOS and pairing it with known CI techniques without PMA, we develop a technique for computing CIs with strong guarantees that requires fewer samples for the same width. We implement our techniques underneath a sampling-optimized in-memory column store and show how they accelerate queries involving aggregates on real datasets with typical speedups on the order of 10× over both traditional AQP-with-guarantees and exact methods, all while obeying accuracy constraints. 
    more » « less
  3. Recent advancements in deep learning techniques facilitate intelligent-query support in diverse applications, such as content-based image retrieval and audio texturing. Unlike conventional key-based queries, these intelligent queries lack efficient indexing and require complex compute operations for feature matching. To achieve high-performance intelligent querying against massive datasets, modern computing systems employ GPUs in-conjunction with solid-state drives (SSDs) for fast data access and parallel data processing. However, our characterization with various intelligent-query workloads developed with deep neural networks (DNNs), shows that the storage I/O bandwidth is still the major bottleneck that contributes 56%--90% of the query execution time. To this end, we present DeepStore, an in-storage accelerator architecture for intelligent queries. It consists of (1) energy-efficient in-storage accelerators designed specifically for supporting DNN-based intelligent queries, under the resource constraints in modern SSD controllers; (2) a similarity-based in-storage query cache to exploit the temporal locality of user queries for further performance improvement; and (3) a lightweight in-storage runtime system working as the query engine, which provides a simple software abstraction to support different types of intelligent queries. DeepStore exploits SSD parallelisms with design space exploration for achieving the maximal energy efficiency for in-storage accelerators. We validate DeepStore design with an SSD simulator, and evaluate it with a variety of vision, text, and audio based intelligent queries. Compared with the state-of-the-art GPU+SSD approach, DeepStore improves the query performance by up to 17.7×, and energy-efficiency by up to 78.6×. 
    more » « less
  4. Deep learning has improved state-of-the-art results in many important fields, and has been the subject of much research in recent years, leading to the development of several systems for facilitating deep learning. Current systems, however, mainly focus on model building and training phases, while the issues of data management, model sharing, and lifecycle management are largely ignored. Deep learning modeling lifecycle generates a rich set of data artifacts, e.g., learned parameters and training logs, and it comprises of several frequently conducted tasks, e.g., to understand the model behaviors and to try out new models. Dealing with such artifacts and tasks is cumbersome and largely left to the users. This paper describes our vision and implementation of a data and lifecycle management system for deep learning. First, we generalize model exploration and model enumeration queries from commonly conducted tasks by deep learning modelers, and propose a high-level domain specific language (DSL), inspired by SQL, to raise the abstraction level and thereby accelerate the modeling process. To manage the variety of data artifacts, especially the large amount of checkpointed float parameters, we design a novel model versioning system (dlv), and a read-optimized parameter archival storage system (PAS) that minimizes storage footprint and accelerates query workloads with minimal loss of accuracy. PAS archives versioned models using deltas in a multi-resolution fashion by separately storing the less significant bits, and features a novel progressive query (inference) evaluation algorithm. Third, we develop efficient algorithms for archiving versioned models using deltas under co-retrieval constraints. We conduct extensive experiments over several real datasets from computer vision domain to show the efficiency of the proposed techniques. 
    more » « less
  5. The unprecedented rise of social media platforms, combined with location-aware technologies, has led to continuously producing a significant amount of geo-social data that flows as a user-generated data stream. This data has been exploited in several important use cases in various application domains. This article supports geo-social personalized queries in streaming data environments. We define temporal geo-social queries that provide users with real-time personalized answers based on their social graph. The new queries allow incorporating keyword search to get personalized results that are relevant to certain topics. To efficiently support these queries, we propose an indexing framework that provides lightweight and effective real-time indexing to digest geo-social data in real time. The framework distinguishes highly dynamic data from relatively stable data and uses appropriate data structures and a storage tier for each. Based on this framework, we propose a novel geo-social index and adopt two baseline indexes to support the addressed queries. The query processor then employs different types of pruning to efficiently access the index content and provide a real-time query response. The extensive experimental evaluation based on real datasets has shown the superiority of our proposed techniques to index real-time data and provide low-latency queries compared to existing competitors.

     
    more » « less