skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.

Attention:

The NSF Public Access Repository (PAR) system and access will be unavailable from 11:00 PM ET on Thursday, December 11 until 2:00 AM ET on Friday, December 12 due to maintenance. We apologize for the inconvenience.


Title: Generating Succinct Descriptions of Database Schemata for Cost-Efficient Prompting of Large Language Models
Using large language models (LLMs) for tasks like text-to-SQL translation often requires describing the database schema as part of the model input. LLM providers typically charge as a function of the number of tokens read. Hence, reducing the length of the schema description saves money at each model invocation. This paper introduces Schemonic, a system that automatically finds concise text descriptions of relational database schemata. By introducing abbreviations or grouping schema elements with similar properties, Schemonic typically finds descriptions that use significantly fewer tokens than naive schema representations. Internally, Schemonic models schema compression as a combinatorial optimization problem and uses integer linear programming solvers to find guaranteed optimal or near-optimal solutions. It speeds up optimization by starting optimization from heuristic solutions and reducing the search space size via pre-processing. The experiments on TPC-H, SPIDER, and Public-BI demonstrate that Schemonic reduces schema description length significantly, along with fees for reading them, without reducing the accuracy in tasks such as text-to-SQL translation.  more » « less
Award ID(s):
2239326
PAR ID:
10577784
Author(s) / Creator(s):
Publisher / Repository:
VLDB Endowment
Date Published:
Journal Name:
Proceedings of the VLDB Endowment
Volume:
17
Issue:
11
ISSN:
2150-8097
Page Range / eLocation ID:
3511 to 3523
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Text-to-SQL systems empower users to interact with databases using natural language, automatically translating queries into executable SQL code. However, their reliance on database schema information for SQL generation exposes them to significant security vulnerabilities, particularly schema inference attacks that can lead to unauthorized data access or manipulation. In this paper, we introduce a novel zero-knowledge framework for reconstructing the underlying database schema of text-to-SQL models without any prior knowledge of the database. Our approach systematically probes text-to-SQL models with specially crafted questions and leverages a surrogate GPT-4 model to interpret the outputs, effectively uncovering hidden schema elements—including tables, columns, and data types. We demonstrate that our method achieves high accuracy in reconstructing table names, with F1 scores of up to .99 for generative models and .78 for fine-tuned models, underscoring the severity of schema leakage risks. We also show that our attack can steal prompt information in non-text-to-SQL models. Furthermore, we propose a simple protection mechanism for generative models and empirically show its limitations in mitigating these attacks. 
    more » « less
  2. State-of-the-art Text-to-SQL models rely on fine-tuning or few-shot prompting to help LLMs learn from training datasets containing mappings from natural language (NL) queries to SQL statements. Consequently, the quality of the dataset can greatly affect the accuracy of these Text-to-SQL models. Unlike other NL tasks, Text-to-SQL datasets are prone to errors despite extensive manual efforts due to the subtle semantics of SQL. Our study has found a non-negligible (>30%) portion of incorrect NL to SQL mapping cases exists in popular datasets Spider and BIRD. This paper aims to improve the quality of Text-to-SQL training datasets and thereby increase the accuracy of the resulting models. To do so, we propose a necessary correctness condition called execution consistency. For a given database instance, an NL to SQL mapping satisfies execution consistency if the execution result of an NL query matches that of the corresponding SQL. We develop SQLDriller to detect incorrect NL to SQL mappings based on execution consistency in a best-effort manner by crafting database instances that likely result in violations of execution consistency. It generates multiple candidate SQL predictions that differ in their syntax structures. Using a SQL equivalence checker, SQLDriller obtains counterexample database instances that can distinguish non-equivalent candidate SQLs. It then checks the execution consistency of an NL to SQL mapping under this set of counterexamples. The evaluation shows SQLDriller effectively detects and fixes incorrect mappings in the Text-to-SQL dataset, and it improves the model accuracy by up to 13.6%. 
    more » « less
  3. A search engine's ability to retrieve desirable datasets is important for data sharing and reuse. Existing dataset search engines typically rely on matching queries to dataset descriptions. However, a user may not have enough prior knowledge to write a query using terms that match with description text. We propose a novel schema label generation model which generates possible schema labels based on dataset table content. We incorporate the generated schema labels into a mixed ranking model which not only considers the relevance between the query and dataset metadata but also the similarity between the query and generated schema labels. To evaluate our method on real-world datasets, we create a new benchmark specifically for the dataset retrieval task. Experiments show that our approach can effectively improve the precision and NDCG scores of the dataset retrieval task compared with baseline methods. We also test on a collection of Wikipedia tables to show that the features generated from schema labels can improve the unsupervised and supervised web table retrieval task as well. 
    more » « less
  4. Reward signals in reinforcement learning are expensive to design and often require access to the true state which is not available in the real world. Common alternatives are usually demonstrations or goal images which can be labor-intensive to collect. On the other hand, text descriptions provide a general, natural, and low-effort way of communicating the desired task. However, prior works in learning text-conditioned policies still rely on rewards that are defined using either true state or labeled expert demonstrations. We use recent developments in building large-scale visuolanguage models like CLIP to devise a framework that generates the task reward signal just from goal text description and raw pixel observations which is then used to learn the task policy. We evaluate the proposed framework on control and robotic manipulation tasks. Finally, we distill the individual task policies into a single goal text conditioned policy that can generalize in a zero-shot manner to new tasks with unseen objects and unseen goal text descriptions. 
    more » « less
  5. Roy, Sudeepa; Kara, Ahmet (Ed.)
    Decision makers in a broad range of domains, such as finance, transportation, manufacturing, and healthcare, often need to derive optimal decisions given a set of constraints and objectives. Traditional solutions to such constrained optimization problems are typically application-specific, complex, and do not generalize. Further, the usual workflow requires slow, cumbersome, and error-prone data movement between a database, and predictive-modeling and optimization packages. All of these problems are exacerbated by the unprecedented size of modern data-intensive optimization problems. The emerging research area of in-database prescriptive analytics aims to provide seamless domain-independent, declarative, and scalable approaches powered by the system where the data typically resides: the database. Integrating optimization with database technology opens up prescriptive analytics to a much broader community, amplifying its benefits. We discuss how deep integration between the DBMS, predictive models, and optimization software creates opportunities for rich prescriptive-query functionality with good scalability and performance. Summarizing some of our main results and ongoing work in this area, we highlight challenges related to usability, scalability, data uncertainty, and dynamic environments, and argue that perspectives from data management research can drive novel strategies and solutions. 
    more » « less