skip to main content


Title: Database Benchmarking for Supporting Real-Time Interactive Querying of Large Data
In this paper, we present a new benchmark to validate the suitability of database systems for interactive visualization workloads. While there exist proposals for evaluating database systems on interactive data exploration workloads, none rely on real user traces for database benchmarking. To this end, our long term goal is to collect user traces that represent workloads with different exploration characteristics. In this paper, we present an initial benchmark that focuses on "crossfilter"-style applications, which are a popular interaction type for data exploration and a particularly demanding scenario for testing database system performance. We make our benchmark materials, including input datasets, interaction sequences, corresponding SQL queries, and analysis code, freely available as a community resource, to foster further research in this area: https://osf.io/9xerb/?view_only=81de1a3f99d04529b6b173a3bd5b4d23.  more » « less
Award ID(s):
1850115
PAR ID:
10301347
Author(s) / Creator(s):
; ; ; ; ; ; ; ;
Date Published:
Journal Name:
Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data
Page Range / eLocation ID:
1571 to 1587
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Persistent key-value stores are widely used as building blocks in today’s IT infrastructure for managing and storing large amounts of data. However, studies of characterizing real-world workloads for key-value stores are limited due tothe lack of tracing/analyzing tools and the difficulty of collecting traces in operational environments. In this paper, we first present a detailed characterization of workloads from three typical RocksDB production use cases at Facebook: UDB (a MySQL storage layer for social graph data), ZippyDB (a distributed key-value store), and UP2X (a distributed key-value store for AI/ML services). These characterizations reveal several interesting findings: first, that the distribution of key and value sizes are highly related to the use cases/applications; second, that the accesses to key-value pairs have a good locality and follow certain special patterns; and third, that the collected performance metrics show a strong diurnal pattern in the UDB, but not the other two. We further discover that although the widely used key-value benchmark YCSB provides various workload configurations and key-value pair access distribution models, the YCSB triggered workloads for underlying storage systems are still not close enough to the workloads we collected due to ignorance of key-space localities. To address this issue, we propose a key-range based modeling and develop a benchmark that can better emulate the workloads of real-world key-value stores. This benchmark can synthetically generate more precise key-value queries that represent the reads and writes of key-value stores to the underlying storage system. 
    more » « less
  2. Interactive web-based applications play an important role for both service providers and consumers. However, web applications tend to be complex, produce high-volume data, and are often ripe for attack. Attack analysis and remediation are complicated by adversary obfuscation and the difficulty in assembling and analyzing logs. In this work, we explore the web application analysis task through log file fusion, distillation, and visualization. Our approach consists of visualizing the logs of web and database traffic with detailed function execution traces. We establish causal links between events and their associated behaviors. We evaluate the effectiveness of this process using data volume reduction statistics, user interaction models, and usage scenarios. Across a set of scenarios, we find that our techniques can filter at least 97.5% of log data and reduce analysis time by 93-96%. 
    more » « less
  3. null (Ed.)
    Unlike traditional object stores, Augmented Reality (AR) query workloads possess several unique characteristics, such as spatial and visual information. Such workloads are often keyed on a variety of attributes simultaneously, such as device orientation and position, the scene in view, and spatial anchors. The natural mode of user-interaction in these devices triggers queries implicitly based on the field in the user's view at any instant, generating data queries in excess of the device frame rate. Ensuring a smooth user experience in such a scenario requires a systemic solution exploiting the unique characteristics of the AR workloads. For exploration in such contexts, we are presented with a view-maintenance or cache-prefetching problem; how do we download the smallest subset from the server to the mixed reality device such that latency and device space constraints are met? We present a novel data platform - DreamStore, that considers AR queries as first-class queries, and view-maintenance and large-scale analytics infrastructure around this design choice. Through performance experiments on large-scale and query-intensive AR workloads on DreamStore, we show the advantages and the capabilities of our proposed platform. 
    more » « less
  4. Technology advances and lower equipment costs are enabling non-invasive, convenient recording of brain data outside of clinical settings in more real-world environments, and by non-experts. Despite the growing interest in and availability of brain signal datasets, most analytical tools are made for experts in the specific device technology, and have rigid constraints on the type of analysis available. We developed BrainEx to support interactive exploration and discovery within brain signals datasets. BrainEx takes advantage of algorithms that enable fast exploration of complex, large collections of time series data, while being easy to use and learn. This system enables researchers to perform similarity search, explore feature data and natural clustering, and select sequences of interest for future searches and exploration, while also maintaining the usability of a visual tool. In addition to describing the distributed architecture and visual design for BrainEx, this paper reports on a benchmark experiment showing that it outperforms other existing systems for similarity search. Additionally, we report on a preliminary user study in which domain experts used the visual exploration interface and affirmed that it meets the requirements. Finally, it presents a case study using BrainEx to explore real-world, domain-relevant data. 
    more » « less
  5. Embedded database libraries provide developers with a com- mon and convenient data persistence layer. They have spread to many systems, including interactive devices like smart- phones, appearing in all major mobile systems. Their perfor- mance affects the response times and resource consumption of millions of phone apps and billions of phone users. It is thus critical that we better understand how they work, so they can be used more efficiently, and so developers can make faster libraries. Mobile databases differ significantly from server-class storage in terms of platform, usage, and measurement. Phones are multi-tenant, end-user devices that the database must share with other apps. Contrary to traditional database design goals, workloads on phones are single-app, bursty, and rarely saturate the CPU. We argue that mobile storage design should refocus on what matters on the mobile platform: latency and energy. As accurate per- formance measurement tools are necessary to evaluation of good database design, this uncovers another issue: Tradi- tional database benchmarking methods produce misleading results when applied to mobile devices, due to evaluating performance at saturation. Development of databases and measurements specifically designed for the mobile platform is necessary to optimize user experience of the most common database usage in the world. 
    more » « less