NSF PAR Search | NSF Public Access Repository


Collaborative large language models for automated data extraction in living systematic reviews

https://doi.org/10.1093/jamia/ocae325

Khan, Muhammad Ali; Ayub, Umair; Naqvi, Syed Arsalan Ahmed; Khakwani, Kaneez Zahra Rubab; Sipra, Zaryab bin Riaz; Raina, Ammad; Zhou, Sihan; He, Huan; Saeidi, Amir; Hasan, Bashar; et al (January 2025, Journal of the American Medical Informatics Association)

Objective: Data extraction from the published literature is the most laborious step in conducting living systematic reviews (LSRs). We aim to build a generalizable, automated data extraction workflow leveraging large language models (LLMs) that mimics the real-world 2-reviewer process. Materials and Methods: A dataset of 10 trials (22 publications) from a published LSR was used, focusing on 23 variables related to trial, population, and outcomes data. The dataset was split into prompt development (n = 5) and held-out test sets (n = 17). GPT-4-turbo and Claude-3-Opus were used for data extraction. Responses from the 2 LLMs were considered concordant if they were the same for a given variable. The discordant responses from each LLM were provided to the other LLM for cross-critique. Accuracy, i.e., the total number of correct responses divided by the total number of responses, was computed to assess performance. Results: In the prompt development set, 110 (96%) responses were concordant, achieving an accuracy of 0.99 against the gold standard. In the test set, 342 (87%) responses were concordant. The accuracy of the concordant responses was 0.94. The accuracy of the discordant responses was 0.41 for GPT-4-turbo and 0.50 for Claude-3-Opus. Of the 49 discordant responses, 25 (51%) became concordant after cross-critique, increasing accuracy to 0.76. Discussion: Concordant responses by the LLMs are likely to be accurate. In instances of discordant responses, cross-critique can further increase the accuracy. Conclusion: Large language models, when simulated in a collaborative, 2-reviewer workflow, can extract data with reasonable performance, enabling truly “living” systematic reviews.
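The concordance split and the accuracy metric described in this abstract can be illustrated with a small sketch. It assumes that "the same" means exact string equality, and the variable names and values are invented toy data, not results from the paper. The cross-critique step is omitted because it requires calls to the LLMs themselves.

```python
from typing import Dict, List, Tuple


def split_by_concordance(
    a: Dict[str, str], b: Dict[str, str]
) -> Tuple[List[str], List[str]]:
    """Return (concordant, discordant) variable names for two extractors.

    Assumption: two responses agree only if the strings are identical.
    """
    concordant = [k for k in a if a[k] == b.get(k)]
    discordant = [k for k in a if a[k] != b.get(k)]
    return concordant, discordant


def accuracy(responses: Dict[str, str], gold: Dict[str, str], keys: List[str]) -> float:
    """Correct responses divided by total responses, restricted to `keys`."""
    if not keys:
        return float("nan")
    correct = sum(1 for k in keys if responses[k] == gold[k])
    return correct / len(keys)


# Invented toy data: four variables, one gold standard, two extractors.
gold = {"phase": "3", "n_enrolled": "1200", "follow_up": "34.1", "arm": "A"}
model_a = {"phase": "3", "n_enrolled": "1200", "follow_up": "31.4", "arm": "A"}
model_b = {"phase": "3", "n_enrolled": "1200", "follow_up": "34.1", "arm": "B"}

concordant, discordant = split_by_concordance(model_a, model_b)
acc_concordant = accuracy(model_a, gold, concordant)
acc_a_discordant = accuracy(model_a, gold, discordant)
acc_b_discordant = accuracy(model_b, gold, discordant)
```

In this toy case the two extractors agree on two variables and disagree on two. Only the discordant variables would be passed on to a cross-critique round.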
Free, publicly-accessible full text available January 21, 2026
Privacy and Accuracy-Aware AI/ML Model Deduplication

https://doi.org/10.1145/3725340

Guan, Hong; Yu, Lei; Zhou, Lixi; Xiong, Li; Chowdhury, Kanchan; Xie, Lulu; Xiao, Xusheng; Zou, Jia (June 2025, Proceedings of the ACM on Management of Data)

With the growing adoption of privacy-preserving machine learning algorithms, such as Differentially Private Stochastic Gradient Descent (DP-SGD), training or fine-tuning models on private datasets has become increasingly prevalent. This shift has led to the need for models offering varying privacy guarantees and utility levels to satisfy diverse user requirements. Managing numerous versions of large models introduces significant operational challenges, including increased inference latency, higher resource consumption, and elevated costs. Model deduplication is a technique widely used by many model serving and database systems to support high-performance and low-cost inference queries and model diagnosis queries. However, none of the existing model deduplication works has considered privacy, leading to unbounded aggregation of privacy costs for certain deduplicated models and inefficiencies when applied to deduplicate DP-trained models. We formalize the problem of deduplicating DP-trained models for the first time and propose a novel privacy- and accuracy-aware deduplication mechanism to address the problem. We developed a greedy strategy to select and assign base models to target models to minimize storage and privacy costs. When deduplicating a target model, we dynamically schedule accuracy validations and apply the Sparse Vector Technique to reduce the privacy costs associated with private validation data. Compared to baselines, our approach improved the compression ratio by up to 35× for individual models (including large language models and vision transformers). We also observed up to 43× inference speedup due to the reduction of I/O operations.
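For readers unfamiliar with the Sparse Vector Technique mentioned above, the following is a minimal sketch of its textbook "AboveThreshold" form. It is not the authors' mechanism, whose details the abstract does not give. The function names, the sensitivity parameter, and the example query values are assumptions made for illustration.

```python
import random
from typing import Iterable, Optional


def laplace(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def above_threshold(
    queries: Iterable[float],
    threshold: float,
    epsilon: float,
    sensitivity: float = 1.0,
    rng: Optional[random.Random] = None,
) -> Optional[int]:
    """Return the index of the first query whose noisy value reaches the
    noisy threshold, or None if no query does.

    Textbook AboveThreshold: the threshold is perturbed once, each query
    is perturbed independently, and the procedure halts at the first hit,
    so the privacy cost does not grow with the number of queries below
    the threshold.
    """
    rng = rng or random.Random()
    noisy_threshold = threshold + laplace(2.0 * sensitivity / epsilon, rng)
    for i, q in enumerate(queries):
        if q + laplace(4.0 * sensitivity / epsilon, rng) >= noisy_threshold:
            return i
    return None


# Hypothetical usage: with a very large epsilon the noise is negligible,
# so the result matches the non-private answer.
hit = above_threshold([0.1, 0.2, 0.95, 0.99], threshold=0.9, epsilon=1e6,
                      rng=random.Random(0))
miss = above_threshold([0.1, 0.2], threshold=0.9, epsilon=1e6,
                       rng=random.Random(0))
```

With realistic values of epsilon the returned index is randomized. The large epsilon here only makes the control flow easy to follow.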
Free, publicly-accessible full text available June 17, 2026
Declarative Privacy-Preserving Inference Queries

Guan, H; Tiwari, A; Gautier, S; Ambrish, RH; Zhou, L; Wang, Y; Gupta, D; Yang, Y; Xiao, C; Chowdhury, K; et al (May 2025, DASFAA 2025)

Free, publicly-accessible full text available May 24, 2026
ExBoost: Out-of-Box Co-Optimization of Machine Learning and Join Queries

Chowdhury, Kanchan; Xie, Lulu; Zhou, Lixi; Zou, Jia (May 2025, DASFAA 2025)

Free, publicly-accessible full text available May 24, 2026
DATAMORPHER: Automatic Data Transformation using LLM-Based Zero-Shot Code Generation

https://doi.org/10.1109/ICDE65448.2025.00346

Sharma, Ankita; Tandel, Jaykumar; Li, Xuanmao; Wang, Lanjun; Fariha, Anna; Zhang, Liang; Naqvi, Syed Arsalan Ahmed; Riaz, Irbaz Bin; Cao, Lei; Zou, Jia (May 2025, 2025 IEEE 41st International Conference on Data Engineering (ICDE))

Free, publicly-accessible full text available May 7, 2026
Privacy-Preserving Range Aggregation Queries Using a Learning-Based Approach

https://doi.org/10.1109/PerComWorkshops65533.2025.00102

Guan, Hong; Zou, Jia (March 2025, IEEE)

Free, publicly-accessible full text available March 17, 2026
IDNet: A Novel Identity Document Dataset via Few-Shot and Quality-Driven Synthetic Data Generation

https://doi.org/10.1109/BigData62323.2024.10825017

Xie, Lulu; Wang, Yancheng; Guan, Hong; Nag, Soham; Goel, Rajeev; Swamy, Niranjan; Yang, Yingzhen; Xiao, Chaowei; Prisby, Jonathan; Maciejewski, Ross; et al (December 2024, IEEE)

Free, publicly-accessible full text available December 15, 2025
DeepMapping: Learned Data Mapping for Lossless Compression and Efficient Lookup

https://doi.org/10.1109/ICDE60146.2024.00008

Zhou, L; Candan, K; Zou, J (May 2024, Proceedings of 2024 IEEE 39th International Conference on Data Engineering (ICDE 2024))

Storing tabular data to balance storage and query efficiency is a long-standing research question in the database community. In this work, we argue and show that a novel DeepMapping abstraction, which relies on the impressive memorization capabilities of deep neural networks, can provide better storage cost, better latency, and better run-time memory footprint, all at the same time. Such unique properties may benefit a broad class of use cases in capacity-limited devices. Our proposed DeepMapping abstraction transforms a dataset into multiple key-value mappings and constructs a multi-tasking neural network model that outputs the corresponding values for a given input key. To deal with memorization errors, DeepMapping couples the learned neural network with a lightweight auxiliary data structure capable of correcting mistakes. The auxiliary structure design further enables DeepMapping to efficiently deal with insertions, deletions, and updates even without retraining the mapping. We propose a multi-task search strategy for selecting the hybrid DeepMapping structures (including model architecture and auxiliary structure) with a desirable trade-off among memorization capacity, size, and efficiency. Extensive experiments with a real-world dataset, synthetic and benchmark datasets, including TPC-H and TPC-DS, demonstrated that the DeepMapping approach can better balance the retrieving speed and compression ratio against several cutting-edge competitors.
Full Text Available
DeepMapping: Learned Data Mapping for Lossless Compression and Efficient Lookup

https://doi.org/10.1109/ICDE60146.2024.00008

Zhou, Lixi; Candan, K Selçuk; Zou, Jia (May 2024, IEEE)

Storing tabular data to balance storage and query efficiency is a long-standing research question in the database community. In this work, we argue and show that a novel DeepMapping abstraction, which relies on the impressive memorization capabilities of deep neural networks, can provide better storage cost, better latency, and better run-time memory footprint, all at the same time. Such unique properties may benefit a broad class of use cases in capacity-limited devices. Our proposed DeepMapping abstraction transforms a dataset into multiple key-value mappings and constructs a multi-tasking neural network model that outputs the corresponding values for a given input key. To deal with memorization errors, DeepMapping couples the learned neural network with a lightweight auxiliary data structure capable of correcting mistakes. The auxiliary structure design further enables DeepMapping to efficiently deal with insertions, deletions, and updates even without retraining the mapping. We propose a multi-task search strategy for selecting the hybrid DeepMapping structures (including model architecture and auxiliary structure) with a desirable trade-off among memorization capacity, size, and efficiency. Extensive experiments with a real-world dataset, synthetic and benchmark datasets, including TPC-H and TPC-DS, demonstrated that the DeepMapping approach can better balance the retrieving speed and compression ratio against several cutting-edge competitors.
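The idea of pairing a lossy learned mapping with a small correcting structure can be sketched as follows. This is an illustrative toy, not the DeepMapping implementation: the class name, the use of a plain function in place of a neural network, and the set of live keys used to handle deletions are all assumptions.

```python
from typing import Callable, Dict, Optional, Set


class HybridMapping:
    """Toy key-value store: a fixed predictor plus a table of corrections."""

    def __init__(self, predict: Callable[[int], int], data: Dict[int, int]):
        self._predict = predict              # stand-in for a trained model
        self._keys: Set[int] = set(data)     # assumed existence tracking
        # Store only the entries the predictor gets wrong.
        self._corrections: Dict[int, int] = {
            k: v for k, v in data.items() if predict(k) != v
        }

    def get(self, key: int) -> Optional[int]:
        if key not in self._keys:
            return None
        if key in self._corrections:
            return self._corrections[key]
        return self._predict(key)

    def put(self, key: int, value: int) -> None:
        """Insert or update without retraining the predictor."""
        self._keys.add(key)
        if self._predict(key) == value:
            self._corrections.pop(key, None)
        else:
            self._corrections[key] = value

    def delete(self, key: int) -> None:
        self._keys.discard(key)
        self._corrections.pop(key, None)

    def correction_count(self) -> int:
        return len(self._corrections)


# Hypothetical data: the predictor "memorizes" k % 3 and misses one entry.
data = {k: k % 3 for k in range(10)}
data[4] = 9
store = HybridMapping(lambda k: k % 3, data)
store.put(100, 1)   # predictor is already right: no correction stored
store.put(7, 0)     # predictor is wrong: a correction is stored
store.delete(5)
```

Lookups stay lossless because every mismatch between the predictor and the data is recorded, and the storage overhead grows only with the number of mispredicted keys.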
Full Text Available
Serving Deep Learning Models from Relational Databases

https://doi.org/10.48786/edbt.2024.61

Zhou, Lixi; Lin, Qi; Chowdhury, Kanchan; Masood, Saif; Eichenberger, Alexandre; Min, Hong; Sim, Alexander; Wang, Jie; Wang, Yida; Wu, Kesheng; et al (January 2024, OpenProceedings.org)

Serving deep learning (DL) models on relational data has become a critical requirement across diverse commercial and scientific domains, sparking growing interest recently. In this visionary paper, we embark on a comprehensive exploration of representative architectures to address the requirement. We highlight three pivotal paradigms. The state-of-the-art DL-centric architecture offloads DL computations to dedicated DL frameworks. The potential UDF-centric architecture encapsulates one or more tensor computations into User Defined Functions (UDFs) within the relational database management system (RDBMS). The potential relation-centric architecture aims to represent a large-scale tensor computation through relational operators. While each of these architectures demonstrates promise in specific use scenarios, we identify urgent requirements for seamless integration of these architectures and of the middle ground between them. We delve into the gaps that impede the integration and explore innovative strategies to close them. We present a pathway to establish a novel RDBMS for enabling a broad class of data-intensive DL inference applications.
