Title: Data storage architectures to accelerate chemical discovery: data accessibility for individual laboratories and the community
As buzzwords like “big data,” “machine learning,” and “high-throughput” expand through chemistry, chemists need to consider more than ever their data storage, data management, and data accessibility, whether in their own laboratories or with the broader community. While it is commonplace for chemists to use spreadsheets for data storage and analysis, a move towards database architectures ensures that the data can be more readily findable, accessible, interoperable, and reusable (FAIR). However, making this move poses several challenges for those with limited-to-no knowledge of computer programming and databases. This Perspective presents the basics of data management using databases, with a focus on chemical data. We overview database fundamentals by exploring the benefits of database use, introducing terminology, and establishing database design principles. We then detail the extract, transform, and load (ETL) process for database construction, which includes an overview of data parsing and database architectures, spanning Structured Query Language (SQL) and NoSQL structures. We close by cataloging overarching challenges in database design. This Perspective is accompanied by an interactive demonstration available at https://github.com/D3TaLES/databases_demo. We do all of this within the context of chemical data, with the aim of equipping chemists with the knowledge and skills to store, manage, and share their data while abiding by FAIR principles.
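To make the SQL-versus-NoSQL contrast concrete, here is a minimal sketch of the same hypothetical molecule record stored both ways. It is not taken from the paper's demo repository (linked above); the table, field names, and values are illustrative only.

```python
import sqlite3
import json

# A hypothetical chemical record; all fields are made up for illustration.
record = {"smiles": "c1ccccc1", "name": "benzene", "molar_mass": 78.11}

# SQL: a fixed schema makes fields findable and queryable (FAIR-friendly).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE molecules (smiles TEXT PRIMARY KEY, name TEXT, molar_mass REAL)")
con.execute("INSERT INTO molecules VALUES (:smiles, :name, :molar_mass)", record)
print(con.execute("SELECT name FROM molecules WHERE molar_mass < 100").fetchall())

# NoSQL-style: a schemaless document tolerates heterogeneous records,
# e.g., a spectroscopic field that only some molecules carry.
record["uv_vis_lambda_max_nm"] = 254.0
document = json.dumps(record)  # a document store would hold this JSON directly
print(document)
```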
Award ID(s): 2019574
PAR ID: 10416213
Author(s) / Creator(s): ; ;
Date Published:
Journal Name: Chemical Science
Volume: 13
Issue: 46
ISSN: 2041-6520
Page Range / eLocation ID: 13646 to 13656
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this
  1. Vector databases have recently gained significant attention due to the emergence of large language models that produce vector embeddings for text. Existing vector databases can be broadly categorized into two types: specialized and generalized. Specialized vector databases are explicitly designed and optimized for managing vector data, while generalized ones support vector data management within a general-purpose database. While specialized vector databases are interesting, there is a substantial customer base interested in generalized vector databases for various reasons, e.g., a reluctance to move data out of relational databases to reduce data silos and costs, the desire to use SQL, and the need for more sophisticated query processing of vector and non-vector data. However, generalized vector databases face two main challenges: performance and interoperability of vector search with SQL, such as combining vector search with filters, joins, or even full-text search. In this paper, we present SingleStore-V, a full-fledged generalized vector database integrated into SingleStore, a modern distributed relational database optimized for both OLAP and OLTP workloads. SingleStore-V achieves high performance and interoperability via a suite of optimizations. Experiments on standard vector benchmarks show that SingleStore-V performs comparably to Milvus, a highly optimized specialized vector database, and significantly outperforms pgvector, a popular generalized vector database in PostgreSQL. We believe this paper will shed light on integrating vector search into relational databases in general, as many design concepts and optimizations apply to other databases.
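To make the interoperability challenge above concrete, here is a toy, in-memory sketch of combining a relational filter with a vector-distance ranking; all data and names are hypothetical, and real systems like those in the paper push this decision into the query optimizer and use approximate indexes rather than a brute-force scan.

```python
import math

# Hypothetical rows mixing relational attributes with an embedding column.
docs = [
    {"id": 1, "category": "chemistry", "embedding": [0.9, 0.1]},
    {"id": 2, "category": "biology",   "embedding": [0.8, 0.2]},
    {"id": 3, "category": "chemistry", "embedding": [0.1, 0.9]},
]

def l2(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

query = [1.0, 0.0]
# Pre-filter on the relational predicate, then rank survivors by distance.
# A generalized vector database must choose whether to filter before or
# after the (approximate) vector index scan; this is the core design point.
hits = sorted((d for d in docs if d["category"] == "chemistry"),
              key=lambda d: l2(d["embedding"], query))[:2]
print([d["id"] for d in hits])  # -> [1, 3]
```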
  2. A multiverse database transparently presents each application user with a flexible, dynamic, and independent view of shared data. This transformed view of the entire database contains only information allowed by a centralized and easily auditable privacy policy. By enforcing the privacy policy once, in the database, multiverse databases reduce programmer burden and eliminate many frontend bugs that expose sensitive data. Multiverse databases' per-user transformations risk expensive queries if applied dynamically on reads, or impractical storage requirements if the database proactively materializes policy-compliant views. We propose an efficient design based on a joint dataflow across "universes" that combines global, shared computation and cached state with individual, per-user processing and state. This design, which supports arbitrary SQL queries and complex policies, imposes no performance overhead on read queries. Our early prototype supports thousands of parallel universes on a single server.
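A toy sketch of the multiverse idea follows (all names and data hypothetical): one shared table, and a per-user "universe" computed by applying a central, auditable policy in the data layer rather than in every frontend. Note this is the naive read-time version whose query cost motivates the paper's shared-dataflow design.

```python
# Shared backing data; contents are made up for illustration.
messages = [
    {"id": 1, "sender": "alice", "recipient": "bob",   "body": "hi"},
    {"id": 2, "sender": "carol", "recipient": "alice", "body": "secret"},
]

def policy(row, user):
    """Central, auditable policy: a user may see messages they sent or received."""
    return user in (row["sender"], row["recipient"])

def universe(user):
    """The transformed view of the whole database that `user` observes."""
    return [row for row in messages if policy(row, user)]

print(universe("bob"))    # sees message 1 only
print(universe("carol"))  # sees message 2 only
```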
  3. In the post-Moore era, systems and devices with new architectures will arrive at a rapid rate with significant impacts on the software stack. Applications will not be able to fully benefit from new architectures unless they can delegate adapting to new devices to lower layers of the stack. In this paper, we introduce physical design management, which deals with the problem of identifying and executing transformations on physical designs of stored data, i.e., how data is mapped to storage abstractions like files, objects, or blocks, in order to improve performance. Physical design is traditionally placed with applications, access libraries, and databases, using hardwired assumptions about underlying storage systems. Yet storage systems increasingly not only contain multiple kinds of storage devices with vastly different performance profiles but also move data among those storage devices, thereby changing the benefit of a particular physical design. We advocate placing physical design management in storage, identify interesting research challenges, provide a brief description of a prototype implementation in Ceph, and discuss the results of initial experiments at scale that are replicable using CloudLab. These experiments show performance and resource utilization trade-offs associated with choosing different physical designs and choosing to transform between physical designs.
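As a toy illustration of what a physical-design transformation means (hypothetical data, not the paper's Ceph prototype): the same logical records mapped to a row-oriented layout, good for point lookups, versus a column-oriented one, good for scans and aggregation. The paper's argument is that the storage layer, not the application, could choose and switch between such layouts as data migrates across devices.

```python
# Hypothetical row-major records, e.g., sensor readings.
rows = [
    {"ts": 1, "sensor": "a", "value": 0.5},
    {"ts": 2, "sensor": "b", "value": 0.7},
    {"ts": 3, "sensor": "a", "value": 0.6},
]

def to_columnar(records):
    """Transform row-major records into a column-major layout."""
    return {key: [r[key] for r in records] for key in records[0]}

cols = to_columnar(rows)
print(cols["value"])                            # scan one column, skip the rest
print(sum(cols["value"]) / len(cols["value"]))  # cheap aggregate over a column
```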
  4. Storage-compute disaggregation has recently emerged as a novel architecture in modern data centers, particularly in the cloud. By decoupling compute from storage, this new architecture enables independent and elastic scaling of compute and storage resources, potentially increasing resource utilization and reducing overall costs. To best leverage the disaggregated architecture, a new breed of database systems termed storage-disaggregated databases has recently been developed, such as Amazon Aurora, Microsoft Socrates, Google AlloyDB, Alibaba PolarDB, and Huawei Taurus. However, little is known about the effectiveness of the design principles in these databases, since they are typically developed by industry giants and only the overall performance results are presented, without detailing the impact of individual design principles. As a result, many critical research questions remain unclear, such as the performance impact of storage disaggregation, the log-as-the-database design, shared storage, and various log-replay methods. In this paper, we present the first investigation of the performance implications of the design principles widely adopted in storage-disaggregated databases. Because these databases are usually not open-sourced, we have made a significant effort to implement a storage-disaggregated database prototype based on PostgreSQL v13.0. By fully controlling and instrumenting the codebase, we are able to selectively enable and disable individual optimizations and techniques to evaluate their impact on performance in various scenarios. Furthermore, we open-source our storage-disaggregated database prototype for use by the broader database research community, fostering collaboration and innovation in this field.
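To make the "log-as-the-database" and log-replay ideas above concrete, here is a toy sketch (all structures hypothetical, not the paper's prototype): the primary ships only redo-log records, and storage nodes replay them in LSN order to materialize a page on demand instead of receiving full page writes.

```python
# A hypothetical redo log; each record updates one key on one page.
log = [
    {"lsn": 1, "page": "p1", "key": "x", "value": 10},
    {"lsn": 2, "page": "p1", "key": "y", "value": 20},
    {"lsn": 3, "page": "p1", "key": "x", "value": 11},
]

def replay(page_id, upto_lsn, log):
    """Rebuild a page by applying its redo records in LSN order."""
    page = {}
    for rec in log:
        if rec["page"] == page_id and rec["lsn"] <= upto_lsn:
            page[rec["key"]] = rec["value"]
    return page

print(replay("p1", 2, log))  # {'x': 10, 'y': 20}
print(replay("p1", 3, log))  # {'x': 11, 'y': 20}  (latest version wins)
```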
  5. Current barriers hindering data-driven discoveries in deep-time Earth (DE) include: substantial volumes of DE data are not digitized; many DE databases do not adhere to FAIR (findable, accessible, interoperable, and reusable) principles; we lack a systematic knowledge graph for DE; existing DE databases are geographically heterogeneous; a significant fraction of DE data is not in open-access formats; and tailored tools are needed. These challenges motivate the Deep-Time Digital Earth (DDE) program, initiated by the International Union of Geological Sciences and developed in cooperation with national geological surveys, professional associations, academic institutions, and scientists around the world. DDE's mission is to build on previous research to develop a systematic DE knowledge graph, a FAIR data infrastructure that links existing databases and makes dark data visible, and tailored tools for DE data, all universally accessible. DDE aims to harmonize DE data, share global geoscience knowledge, and facilitate data-driven discovery in the understanding of Earth's evolution.