skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: ZKSQL: Verifiable and Efficient Query Evaluation with Zero-Knowledge Proofs
Individuals and organizations are using databases to store personal information at an unprecedented rate. This creates a quandary for data providers. They are responsible for protecting the privacy of individuals described in their database. On the other hand, data providers are sometimes required to provide statistics about their data instead of sharing it wholesale with strong assurances that these answers are correct and complete such as in regulatory filings for the US SEC and other goverment organizations. We introduce a system,ZKSQL, that provides authenticated answers to ad-hoc SQL queries with zero-knowledge proofs. Its proofs show that the answers are correct and sound with respect to the database's contents and they do not divulge any information about its input records. This system constructs proofs over the steps in a query's evaluation and it accelerates this process with authenticated set operations. We validate the efficiency of this approach over a suite of TPC-H queries and our results show that ZKSQL achieves two orders of magnitude speedup over the baseline.  more » « less
Award ID(s):
2016240 1846447 2236819
PAR ID:
10469097
Author(s) / Creator(s):
; ; ; ;
Publisher / Repository:
PVLDB
Date Published:
Journal Name:
Proceedings of the VLDB Endowment
Volume:
16
Issue:
8
ISSN:
2150-8097
Page Range / eLocation ID:
1804 to 1816
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. We consider the task of interorganizational data sharing, in which data owners, data clients, and data subjects have different and sometimes competing privacy concerns. One real-world scenario in which this problem arises concerns law-enforcement use of phone-call metadata: The data owner is a phone company, the data clients are law-enforcement agencies, and the data subjects are individuals who make phone calls. A key challenge in this type of scenario is that each organization uses its own set of proprietary intraorganizational attributes to describe the shared data; such attributes cannot be shared with other organizations. Moreover, data-access policies are determined by multiple parties and may be specified using attributes that are not directly comparable with the ones used by the owner to specify the data. We propose a system architecture and a suite of protocols that facilitate dynamic and efficient interorganizational data sharing, while allowing each party to use its own set of proprietary attributes to describe the shared data and preserving the confidentiality of both data records and proprietary intraorganizational attributes. We introduce the novel technique ofAttribute-Based Encryption with Oblivious Attribute Translation (OTABE), which plays a crucial role in our solution. This extension of attribute-based encryption uses semi-trusted proxies to enable dynamic and oblivious translation between proprietary attributes that belong to different organizations; it supports hidden access policies, direct revocation, and fine-grained, data-centric keys and queries. We prove that our OTABE-based framework is secure in the standard model and provide two real-world use cases. 
    more » « less
  2. The rigid schemas of classical relational databases help users in specifying queries and inform the storage organization of data. However, the advantages of schemas come at a high upfront cost through schema and ETL process design. In this work, we propose a new paradigm where the database system takes a more active role in schema development and data integration. We refer to this approach as adaptive schema databases (ASDs). An ASD ingests semi-structured or unstructured data directly using a pluggable combination of extraction and data integration techniques. Over time it discovers and adapts schemas for the ingested data using information provided by data integration and information extraction techniques, as well as from queries and user-feedback. In contrast to relational databases, ASDs maintain multiple schema workspaces that represent individualized views over the data, which are fine-tuned to the needs of a particular user or group of users. A novel aspect of ASDs is that probabilistic database techniques are used to encode ambiguity in automatically generated data extraction workflows and in generated schemas. ASDs can provide users with context-dependent feedback on the quality of a schema, both in terms of its ability to satisfy a user's queries, and the quality of the resulting answers. We outline our vision for ASDs, and present a proof of concept implementation as part of the Mimir probabilistic data curation system. 
    more » « less
  3. Website privacy policies are often lengthy and intricate. Privacy assistants assist in simplifying policies and making them more accessible and user-friendly. The emergence of generative AI (genAI) offers new opportunities to build privacy assistants that can answer users’ questions about privacy policies. However, genAI’s reliability is a concern due to its potential for producing inaccurate information. This study introduces GenAIPABench, a benchmark for evaluating Generative AI-based Privacy Assistants (GenAIPAs). GenAIPABench includes: 1) A set of curated questions about privacy policies along with annotated answers for various organizations and regulations; 2) Metrics to assess the accuracy, relevance, and consistency of responses; and 3) A tool for generating prompts to introduce privacy policies and paraphrased variants of the curated questions. We evaluated 3 leading genAI systems—ChatGPT-4, Bard, and Bing AI—using GenAIPABench to gauge their effectiveness as GenAIPAs. Our results demonstrate significant promise in genAI capabilities in the privacy domain while also highlighting challenges in managing complex queries, ensuring consistency, and verifying source accuracy. 
    more » « less
  4. Website privacy policies are often lengthy and intricate. Privacy assistants assist in simplifying policies and making them more accessible and user-friendly. The emergence of generative AI (genAI) offers new opportunities to build privacy assistants that can answer users’ questions about privacy policies. However, genAI’s reliability is a concern due to its potential for producing inaccurate information. This study introduces GenAIPABench, a benchmark for evaluating Generative AI-based Privacy Assistants (GenAIPAs). GenAIPABench includes: 1) A set of curated questions about privacy policies along with annotated answers for various organizations and regulations; 2) Metrics to assess the accuracy, relevance, and consistency of responses; and 3) A tool for generating prompts to introduce privacy policies and paraphrased variants of the curated questions. We evaluated three leading genAI systems—ChatGPT-4, Bard, and Bing AI—using GenAIPABench to gauge their effectiveness as GenAIPAs. Our results demonstrate significant promise in genAI capabilities in the privacy domain while also highlighting challenges in managing complex queries, ensuring consistency, and verifying source accuracy. 
    more » « less
  5. Range aggregate queries (RAQs) are an integral part of many real-world applications, where, often, fast and approximate answers for the queries are desired. Recent work has studied answering RAQs using machine learning (ML) models, where a model of the data is learned to answer the queries. However, there is no theoretical understanding of why and when the ML based approaches perform well. Furthermore, since the ML approaches model the data, they fail to capitalize on any query specific information to improve performance in practice. In this paper, we focus on modeling "queries" rather than data and train neural networks to learn the query answers. This change of focus allows us to theoretically study our ML approach to provide a distribution and query dependent error bound for neural networks when answering RAQs. We confirm our theoretical results by developing NeuroSketch, a neural network framework to answer RAQs in practice. Extensive experimental study on real-world, TPC-benchmark and synthetic datasets show that NeuroSketch answers RAQs multiple orders of magnitude faster than state-of-the-art and with better accuracy. 
    more » « less