This content will become publicly available on May 1, 2026

Title: Beyond Text-to-SQL for IoT Defense: A Comprehensive Framework for Querying and Classifying IoT Threats
Recognizing the promise of natural language interfaces to databases, prior studies have emphasized the development of text-to-SQL systems. Existing research has generally focused on generating SQL statements from text queries, whereas the broader challenge lies in inferring new information about the returned data. Our research makes two major contributions to address this gap. First, we introduce a novel Internet-of-Things (IoT) text-to-SQL dataset comprising 10,985 text-SQL pairs and 239,398 rows of network traffic activity. The dataset contains query types that are limited in prior text-to-SQL datasets, notably temporal queries. Our dataset is sourced from a smart building's IoT ecosystem and covers sensor readings and network traffic data. Second, our dataset allows two-stage processing, in which the data returned by a generated SQL query (network traffic) can be classified as malicious or not. Our results show that joint training to query and infer information about the data improves overall text-to-SQL performance, nearly matching that of substantially larger models. We also show that current large language models (e.g., GPT-3.5) struggle to infer new information about returned data (i.e., they perform poorly at tabular data understanding); thus, our dataset provides a novel test bed for integrating complex domain-specific reasoning into LLMs.
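The two-stage processing described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the `traffic` table, its columns, the sample rows, the example question, and the rule-based `classify` function (a toy stand-in for a learned classifier) are hypothetical and are not taken from the paper or its dataset.

```python
import sqlite3

# Hypothetical schema and rows; the abstract does not give the real dataset's schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE traffic (ts TEXT, src_ip TEXT, dst_port INTEGER, bytes_sent INTEGER)"
)
conn.executemany(
    "INSERT INTO traffic VALUES (?, ?, ?, ?)",
    [
        ("2024-01-01 09:00:00", "10.0.0.5", 443, 1200),
        ("2024-01-01 09:05:00", "10.0.0.9", 23, 90000),
        ("2024-01-02 10:00:00", "10.0.0.5", 443, 800),
    ],
)

# Stage 1: SQL that a text-to-SQL model might generate for the (hypothetical)
# question "Show all traffic recorded on January 1, 2024", a temporal query.
generated_sql = (
    "SELECT ts, src_ip, dst_port, bytes_sent FROM traffic "
    "WHERE ts >= '2024-01-01 00:00:00' AND ts < '2024-01-02 00:00:00' "
    "ORDER BY ts"
)
rows = conn.execute(generated_sql).fetchall()


def classify(row):
    """Toy rule standing in for a learned classifier (assumption, not the paper's model)."""
    _, _, dst_port, bytes_sent = row
    # Flag large transfers to the Telnet port as suspicious.
    return "malicious" if dst_port == 23 and bytes_sent > 50000 else "benign"


# Stage 2: label each returned row as malicious or benign.
labels = [classify(row) for row in rows]
for row, label in zip(rows, labels):
    print(row, label)
```

The sketch keeps the two stages separate so that the second stage operates only on the rows returned by the first, which mirrors the query-then-classify order the abstract describes.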
Award ID(s):
2145357
PAR ID:
10587598
Author(s) / Creator(s):
; ; ; ; ; ; ; ; ; ; ; ; ;
Publisher / Repository:
Proceedings of the Workshop on Trustworthy NLP (TrustNLP 2025@NAACL)
Date Published:
Page Range / eLocation ID:
1-12
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Text-to-SQL systems empower users to interact with databases using natural language, automatically translating queries into executable SQL code. However, their reliance on database schema information for SQL generation exposes them to significant security vulnerabilities, particularly schema inference attacks that can lead to unauthorized data access or manipulation. In this paper, we introduce a novel zero-knowledge framework for reconstructing the underlying database schema of text-to-SQL models without any prior knowledge of the database. Our approach systematically probes text-to-SQL models with specially crafted questions and leverages a surrogate GPT-4 model to interpret the outputs, effectively uncovering hidden schema elements—including tables, columns, and data types. We demonstrate that our method achieves high accuracy in reconstructing table names, with F1 scores of up to .99 for generative models and .78 for fine-tuned models, underscoring the severity of schema leakage risks. We also show that our attack can steal prompt information in non-text-to-SQL models. Furthermore, we propose a simple protection mechanism for generative models and empirically show its limitations in mitigating these attacks. 
  2. We introduce ThalamusDB, a novel approximate query processing system that processes complex SQL queries on multi-modal data. ThalamusDB supports SQL queries integrating natural language predicates on visual, audio, and text data. To answer such queries, ThalamusDB exploits a collection of zero-shot models in combination with relational processing. ThalamusDB utilizes deterministic approximate query processing, harnessing the relative efficiency of relational processing to mitigate the computational demands of machine learning inference. For evaluating a natural language predicate, ThalamusDB requests a small number of labels from users. Users can specify their preferences on the performance objective regarding the three relevant metrics: approximation error, computation time, and labeling overheads. The ThalamusDB query optimizer chooses optimized plans according to user preferences, prioritizing data processing and requested labels to maximize impact. Experiments with several real-world data sets, taken from Craigslist, YouTube, and Netflix, show that ThalamusDB achieves an average speedup of 35.0x over MindsDB, an exact processing baseline, and outperforms ABAE, a sampling-based method, in 78.9% of cases.
  3. As IoT device adoption grows, ensuring cybersecurity compliance with IoT standards, such as National Institute of Standards and Technology Interagency Report (NISTIR) 8259A, has become increasingly complex. These standards are typically presented in lengthy, text-based formats that are difficult to process and query automatically. To address this challenge, we built a knowledge graph that represents the key concepts, relationships, and references within NISTIR 8259A. We further integrate this knowledge graph with Retrieval-Augmented Generation (RAG) techniques that can be used by large language models (LLMs) to enhance the accuracy and contextual relevance of information retrieval. Additionally, we evaluate the performance of RAG using both graph-based queries and vector database embeddings. Our framework, implemented in Neo4j, was tested using multiple LLMs, including LLAMA2, Mistral-7B, and GPT-4. Our findings show that combining knowledge graphs with RAG significantly improves query precision and contextual relevance compared to unstructured vector-based retrieval methods. While traditional rule-based compliance tools were not evaluated in this study, our results demonstrate the advantages of structured, graph-driven querying for security standards like NISTIR 8259A.
  4. Recent advances in cyber-physical systems, artificial intelligence, and cloud computing have driven the wide deployment of Internet-of-Things (IoT) devices in smart homes. As IoT devices often interact directly with users and their environments, this paper studies whether and how the collective insights from multiple heterogeneous IoT devices can be used to infer user activities for home safety monitoring and assisted living. Specifically, we develop a new system, namely IoTMosaic, to first profile diverse user activities with distinct IoT device event sequences, which are extracted from smart home network traffic based on their TCP/IP data packet signatures. Given the challenges of missing and out-of-order IoT device events due to device malfunctions or varying network and system latencies, IoTMosaic further develops simple yet effective approximate matching algorithms to identify user activities from real-world IoT network traffic. Our experimental results on thousands of user activities in the smart home environment over two months show that our proposed algorithms can infer different user activities from IoT network traffic in smart homes with an overall accuracy, precision, and recall of 0.99, 0.99, and 1.00, respectively.
  5. Electronic medical records (EMR) contain comprehensive patient information and are typically stored in a relational database with multiple tables. Effective and efficient patient information retrieval from EMR data is a challenging task for medical experts. Question-to-SQL generation methods tackle this problem by first predicting the SQL query for a given question about a database and then executing the query on the database. However, most of the existing approaches have not been adapted to the healthcare domain due to the lack of a healthcare Question-to-SQL dataset for learning models specific to this domain. In addition, the wide use of abbreviated terminology and possible typos in questions introduce additional challenges for accurately generating the corresponding SQL queries. In this paper, we tackle these challenges by developing a deep learning based TRanslate-Edit Model for Question-to-SQL (TREQS) generation, which adapts the widely used sequence-to-sequence model to directly generate the SQL query for a given question, and further performs the required edits using an attentive-copying mechanism and task-specific look-up tables. Based on the widely used publicly available electronic medical database, we create a new large-scale Question-SQL pair dataset, named MIMICSQL, in order to perform the Question-to-SQL generation task in the healthcare domain. An extensive set of experiments is conducted to evaluate the performance of our proposed model on MIMICSQL. Both quantitative and qualitative experimental results indicate the flexibility and efficiency of our proposed method in predicting condition values and its robustness to random questions with abbreviations and typos.