Vector databases have recently gained significant attention due to the emergence of large language models that produce vector embeddings for text. Existing vector databases can be broadly categorized into two types: specialized and generalized. Specialized vector databases are explicitly designed and optimized for managing vector data, while generalized ones support vector data management within a general-purpose database. While specialized vector databases are interesting, there is a substantial customer base interested in generalized vector databases for various reasons, e.g., a reluctance to move data out of relational databases to reduce data silos and costs, the desire to use SQL, and the need for more sophisticated query processing of vector and non-vector data. However, generalized vector databases face two main challenges: performance and interoperability of vector search with SQL, such as combining vector search with filters, joins, or even full-text search. In this paper, we present SingleStore-V, a full-fledged generalized vector database integrated into SingleStore, a modern distributed relational database optimized for both OLAP and OLTP workloads. SingleStore-V achieves high performance and interoperability via a suite of optimizations. Experiments on standard vector benchmarks show that SingleStore-V performs comparably to Milvus, a highly optimized specialized vector database, and significantly outperforms pgvector, a popular generalized vector database in PostgreSQL. We believe this paper will shed light on integrating vector search into relational databases in general, as many design concepts and optimizations apply to other databases.
Accelerating Skewed Workloads With Performance Multipliers in the TurboDB Distributed Database
Distributed databases suffer from performance degradation under skewed workloads. Such workloads cause high contention, which is exacerbated by cross-node network latencies. In contrast, single-machine databases better handle skewed workloads because their centralized nature enables performance optimizations that execute contended requests more efficiently. Based on this insight, we propose a novel hybrid architecture that employs a single-machine database inside a distributed database and present TurboDB, the first distributed database that leverages this hybrid architecture to achieve up to an order of magnitude better performance than representative solutions under skewed workloads. TurboDB introduces two designs to tackle the core challenges unique to its hybrid architecture. First, Hybrid Concurrency Control is a specialized technique that coordinates the single-machine and distributed databases to collectively ensure process-ordered serializability. Second, Phalanx Replication provides fault tolerance for the single-machine database without significantly sacrificing its performance benefits. We implement TurboDB using CockroachDB and Cicada as the distributed and single-machine databases, respectively. Our evaluation shows that TurboDB significantly improves the performance of CockroachDB under skewed workloads.
- PAR ID: 10516823
- Publisher / Repository: 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24)
- Date Published:
- ISBN: 978-1-939133-39-7
- Format(s): Medium: X
- Location: Santa Clara, CA, USA
- Sponsoring Org: National Science Foundation
More Like this
-
Embedded database libraries provide developers with a common and convenient data persistence layer. They have spread to many systems, including interactive devices like smartphones, appearing in all major mobile systems. Their performance affects the response times and resource consumption of millions of phone apps and billions of phone users. It is thus critical that we better understand how they work, so they can be used more efficiently, and so developers can make faster libraries. Mobile databases differ significantly from server-class storage in terms of platform, usage, and measurement. Phones are multi-tenant, end-user devices that the database must share with other apps. Contrary to traditional database design goals, workloads on phones are single-app, bursty, and rarely saturate the CPU. We argue that mobile storage design should refocus on what matters on the mobile platform: latency and energy. As accurate performance measurement tools are necessary for evaluating database designs, this uncovers another issue: traditional database benchmarking methods produce misleading results when applied to mobile devices, because they evaluate performance at saturation. Development of databases and measurements specifically designed for the mobile platform is necessary to optimize user experience of the most common database usage in the world.
-
Key-value (KV) software has proven useful to a wide variety of applications including analytics, time-series databases, and distributed file systems. To satisfy the requirements of diverse workloads, KV stores have been carefully tailored to best match the performance characteristics of underlying solid-state block devices. Emerging KV storage devices are a promising technology for both simplifying the KV software stack and improving the performance of persistent storage-based applications. However, while providing fast, predictable put and get operations, existing KV storage devices don’t natively support range queries, which are critical to all three types of applications described above. In this paper, we present KVRangeDB, a software layer that enables processing range queries for existing hash-based KV solid-state disks (KVSSDs). As an effort to adapt to the performance characteristics of emerging KVSSDs, KVRangeDB implements a log-structured merge-tree key index that reduces compaction I/O, merges keys when possible, and provides separate caches for indexes and values. We evaluated KVRangeDB under a set of representative workloads and compared its performance with two existing database solutions: a RocksDB variant ported to work with the KVSSD, and WiscKey, a key-value database that is carefully tuned for conventional block devices. On filesystem aging workloads, KVRangeDB outperforms WiscKey by 23.7x in terms of throughput and reduces CPU usage and external write amplification by 14.3x and 9.8x, respectively.
-
Storage is the Achilles heel of hybrid cloud deployments of workloads. Accessing persistent state over a WAN link, even a dedicated one, delivers an overwhelming blow to application performance. We propose FAB, a new storage architecture for the hybrid cloud. FAB addresses two major challenges for hybrid cloud storage: performance efficiency and backup efficiency. It does so by creating a new FAB layer in the storage stack that enables fault tolerance, performance acceleration, and backup for FAB storage volumes. A preliminary evaluation of FAB's performance acceleration mechanism when deployed over Ceph's distributed block storage system offers encouragement to pursue this new hybrid cloud storage architecture.
-
We show that it is possible to achieve information-theoretic location privacy for secondary users (SUs) in database-driven cognitive radio networks (CRNs) with an end-to-end delay of less than a second, which is significantly better than that of the existing alternatives offering only computational privacy. This is achieved based on a keen observation that, by the requirement of the Federal Communications Commission (FCC), all certified spectrum databases synchronize their records. Hence, the same copy of the spectrum database is available through multiple (distinct) providers. By exploiting this observation, we harness the synergy between multi-server private information retrieval (PIR) and the database-driven CRN architecture to offer an optimal level of privacy with high efficiency. We demonstrated, analytically and experimentally with deployments on actual cloud systems, that our adaptations of multi-server PIR outperform the (currently) fastest single-server PIR by a magnitude of times, with information-theoretic security, collusion resiliency, and fault-tolerance features. Our analysis indicates that multi-server PIR is an ideal cryptographic tool to provide location privacy in database-driven CRNs, in which the requirement of replicated databases is a natural part of the system architecture, and therefore SUs can enjoy all advantages of multi-server PIR without any additional architectural and deployment costs.

