NSF PAR Search | NSF Public Access Repository

Protecting Privacy in Multimodal Large Language Models with MLLMU-Bench

Liu, Zheyuan; Dou, Guangyao; Jia, Mengzhao; Tan, Zhaoxuan; Zeng, Qingkai; Yuan, Yongle; Jiang, Meng (April 2025, Association for Computational Linguistics)

Generative models such as Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) trained on massive web corpora can memorize and disclose individuals’ confidential and private data, raising legal and ethical concerns. While many previous works have addressed this issue in LLMs via machine unlearning, it remains largely unexplored for MLLMs. To tackle this challenge, we introduce the Multimodal Large Language Model Unlearning Benchmark (MLLMU-Bench), a novel benchmark aimed at advancing the understanding of multimodal machine unlearning. MLLMU-Bench consists of 500 fictitious profiles and 153 profiles for public celebrities, with each profile featuring over 14 customized question-answer pairs, evaluated from both multimodal (image+text) and unimodal (text) perspectives. The benchmark is divided into four sets to assess unlearning algorithms in terms of efficacy, generalizability, and model utility. Finally, we provide baseline results using existing generative model unlearning algorithms. Surprisingly, our experiments show that unimodal unlearning algorithms excel in generation tasks, while multimodal unlearning approaches perform better in classification with multimodal inputs.
Free, publicly-accessible full text available April 27, 2026
IHEval: Evaluating Language Models on Following the Instruction Hierarchy

Zhang, Zhihan; Li, Shiyang; Zhang, Zixuan; Liu, Xin; Jiang, Haoming; Tang, Xianfeng; Gao, Yifan; Li, Zheng; Wang, Haodong; Tan, Zhaoxuan; et al (April 2025, Association for Computational Linguistics)
Chiruzzo, Luis; Ritter, Alan; Wang, Lu (Ed.)
The instruction hierarchy, which establishes a priority order from system messages to user messages, conversation history, and tool outputs, is essential for ensuring consistent and safe behavior in language models (LMs). Despite its importance, this topic receives limited attention, and there is a lack of comprehensive benchmarks for evaluating models’ ability to follow the instruction hierarchy. We bridge this gap by introducing IHEval, a novel benchmark comprising 3,538 examples across nine tasks, covering cases where instructions in different priorities either align or conflict. Our evaluation of popular LMs highlights their struggle to recognize instruction priorities. All evaluated models experience a sharp performance decline when facing conflicting instructions, compared to their original instruction-following performance. Moreover, the most competitive open-source model only achieves 48% accuracy in resolving such conflicts. Our results underscore the need for targeted optimization in the future development of LMs.
Free, publicly-accessible full text available April 27, 2026
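The priority order named in the IHEval abstract above (system messages, then user messages, then conversation history, then tool outputs) can be illustrated with a minimal sketch. This is not the benchmark's code or evaluation procedure. The role names, the `Instruction` type, and the `resolve` and `has_conflict` helpers are hypothetical and exist only to show what "following the hierarchy" would mean when instructions conflict.

```python
from dataclasses import dataclass

# Priority order taken from the abstract: system > user > conversation history > tool output.
# The role names below are illustrative assumptions, not identifiers from IHEval.
PRIORITY = {"system": 0, "user": 1, "history": 2, "tool": 3}


@dataclass
class Instruction:
    role: str     # one of the keys in PRIORITY
    setting: str  # what the instruction controls, e.g. "language"
    value: str    # the requested behavior, e.g. "English"


def has_conflict(instructions):
    """Return True if any setting receives two different requested values."""
    values_by_setting = {}
    for ins in instructions:
        values_by_setting.setdefault(ins.setting, set()).add(ins.value)
    return any(len(values) > 1 for values in values_by_setting.values())


def resolve(instructions):
    """For each setting, keep the value requested by the highest-priority role."""
    winners = {}
    for ins in instructions:
        current = winners.get(ins.setting)
        if current is None or PRIORITY[ins.role] < PRIORITY[current.role]:
            winners[ins.setting] = ins
    return {setting: ins.value for setting, ins in winners.items()}


example = [
    Instruction("tool", "language", "French"),     # lower priority, conflicts
    Instruction("system", "language", "English"),  # higher priority, should win
    Instruction("user", "format", "bullet list"),  # no conflict
]
resolved = resolve(example)
```

Here `resolved` maps `"language"` to the system message's value because the system role outranks the tool output; a model that followed the tool output instead would be failing the kind of conflict case the abstract describes.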
Can LLM Graph Reasoning Generalize beyond Pattern Memorization?

Zhang, Yizhuo; Wang, Heng; Feng, Shangbin; Tan, Zhaoxuan; Han, Xiaochuang; He, Tianxing; Tsvetkov, Yulia (December 2024, EMNLP)

Full Text Available
Chain-of-Layer: Iteratively Prompting Large Language Models for Taxonomy Induction from Limited Examples

https://doi.org/10.1145/3627673.3679608

Zeng, Qingkai; Bai, Yuyang; Tan, Zhaoxuan; Feng, Shangbin; Liang, Zhenwen; Zhang, Zhihan; Jiang, Meng (October 2024, ACM)

Full Text Available
CodeTaxo: Enhancing Taxonomy Expansion with Limited Examples via Code Language Prompts

Zeng, Qingkai; Bai, Yuyang; Tan, Zhaoxuan; Wu, Zhenyu; Feng, Shangbin; Jiang, Meng (August 2024, arXiv)

Full Text Available
KGQuiz: Evaluating the Generalization of Encoded Knowledge in Large Language Models

https://doi.org/10.1145/3589334.3645623

Bai, Yuyang; Feng, Shangbin; Balachandran, Vidhisha; Tan, Zhaoxuan; Lou, Shiqi; He, Tianxing; Tsvetkov, Yulia (May 2024, ACM)

Large language models (LLMs) demonstrate remarkable performance on knowledge-intensive tasks, suggesting that real-world knowledge is encoded in their model parameters. However, besides explorations on a few probing tasks in limited knowledge domains, it is not well understood how to evaluate LLMs' knowledge systematically and how well their knowledge abilities generalize, across a spectrum of knowledge domains and progressively complex task formats. To this end, we propose KGQuiz, a knowledge-intensive benchmark to comprehensively investigate the knowledge generalization abilities of LLMs. KGQuiz is a scalable framework constructed from triplet-based knowledge, which covers three knowledge domains and consists of five tasks with increasing complexity: true-or-false, multiple-choice QA, blank filling, factual editing, and open-ended knowledge generation. To gain a better understanding of LLMs' knowledge abilities and their generalization, we evaluate 10 open-source and black-box LLMs on the KGQuiz benchmark across the five knowledge-intensive tasks and knowledge domains. Extensive experiments demonstrate that LLMs achieve impressive performance in straightforward knowledge QA tasks, while settings and contexts requiring more complex reasoning or employing domain-specific facts still present significant challenges. We envision KGQuiz as a testbed to analyze such nuanced variations in performance across domains and task formats, and ultimately to understand, evaluate, and improve LLMs' knowledge abilities across a wide spectrum of knowledge domains and tasks.
Full Text Available
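The KGQuiz abstract above says the tasks are constructed from triplet-based knowledge. As a rough sketch of that idea only, the following shows how a (head, relation, tail) triple could be turned into true-or-false and blank-filling items. The function names, the sentence template, and the sample triples are assumptions for illustration and are not taken from the KGQuiz framework.

```python
def true_false_items(triples):
    """Pair each true statement with a false one that borrows another triple's tail.

    Assumption: swapping in a different tail for the same relation yields a false
    statement. That holds for the one-to-one relation in the sample data but not
    for relations that admit several correct tails.
    """
    items = []
    for head, relation, tail in triples:
        items.append((f"{head} {relation} {tail}.", True))
        for _, other_relation, other_tail in triples:
            if other_relation == relation and other_tail != tail:
                items.append((f"{head} {relation} {other_tail}.", False))
                break
    return items


def blank_filling_item(triple):
    """Hide the tail entity and return the prompt together with the expected answer."""
    head, relation, tail = triple
    return f"{head} {relation} ____.", tail


# Hypothetical sample triples, not drawn from the benchmark.
triples = [
    ("Paris", "is the capital of", "France"),
    ("Tokyo", "is the capital of", "Japan"),
]
statements = true_false_items(triples)
prompt, answer = blank_filling_item(triples[0])
```

The remaining task formats named in the abstract (multiple-choice QA, factual editing, open-ended generation) would need distractor selection or free-text scoring and are left out of this sketch.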
Personalized Pieces: Efficient Personalized Large Language Models through Collaborative Efforts

https://doi.org/10.18653/v1/2024.emnlp-main.371

Tan, Zhaoxuan; Liu, Zheyuan; Jiang, Meng (January 2024, Association for Computational Linguistics)

Full Text Available
Towards Safer Large Language Models through Machine Unlearning

https://doi.org/10.18653/v1/2024.findings-acl.107

Liu, Zheyuan; Dou, Guangyao; Tan, Zhaoxuan; Tian, Yijun; Jiang, Meng (January 2024, Association for Computational Linguistics)

Full Text Available
Democratizing Large Language Models via Personalized Parameter-Efficient Fine-tuning

https://doi.org/10.18653/v1/2024.emnlp-main.372

Tan, Zhaoxuan; Zeng, Qingkai; Tian, Yijun; Liu, Zheyuan; Yin, Bing; Jiang, Meng (January 2024, Association for Computational Linguistics)

Full Text Available
Can Language Models Solve Graph Problems in Natural Language?

Wang, Heng; Feng, Shangbin; He, Tianxing; Tan, Zhaoxuan; Han, Xiaochuang; Tsvetkov, Yulia (December 2023, Conference on Neural Information Processing Systems)

Large language models (LLMs) are increasingly adopted for a variety of tasks with implicit graphical structures, such as planning in robotics, multi-hop question answering or knowledge probing, structured commonsense reasoning, and more. While LLMs have advanced the state-of-the-art on these tasks with structure implications, whether LLMs could explicitly process textual descriptions of graphs and structures, map them to grounded conceptual spaces, and perform structured operations remains underexplored. To this end, we propose NLGraph (Natural Language Graph), a comprehensive benchmark of graph-based problem solving designed in natural language. NLGraph contains 29,370 problems, covering eight graph reasoning tasks with varying complexity from simple tasks such as connectivity and shortest path up to complex problems such as maximum flow and simulating graph neural networks. We evaluate LLMs (GPT-3/4) with various prompting approaches on the NLGraph benchmark and find that 1) language models do demonstrate preliminary graph reasoning abilities, 2) the benefit of advanced prompting and in-context learning diminishes on more complex graph problems, while 3) LLMs are also (un)surprisingly brittle in the face of spurious correlations in graph and problem settings. We then propose Build-a-Graph Prompting and Algorithmic Prompting, two instruction-based approaches to enhance LLMs in solving natural language graph problems. Build-a-Graph and Algorithmic prompting improve the performance of LLMs on NLGraph by 3.07% to 16.85% across multiple tasks and settings, while how to solve the most complicated graph reasoning tasks in our setup with language models remains an open research question.
Full Text Available
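To make the idea of a graph problem "designed in natural language" from the NLGraph abstract above concrete, here is a minimal sketch that writes an edge list as an English description and computes a reference answer for connectivity and shortest path with breadth-first search. The wording template and function names are hypothetical, and the graph is assumed to be unweighted and undirected for simplicity; none of this reproduces the benchmark's actual prompts or generator.

```python
from collections import deque


def describe_graph(edges):
    """Render an undirected edge list as a plain-English description."""
    parts = [f"an edge between node {u} and node {v}" for u, v in edges]
    return "In an undirected graph, there is " + ", ".join(parts) + "."


def shortest_path_length(edges, source, target):
    """Breadth-first search; returns the hop count, or None if target is unreachable."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, []).append(v)
        adjacency.setdefault(v, []).append(u)
    distance = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return distance[node]
        for neighbor in adjacency.get(node, []):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return None


edges = [(0, 1), (1, 2), (3, 4)]
prompt = describe_graph(edges) + " Is there a path between node 0 and node 2?"
# Reference answer for the connectivity question, derived from the search result.
answer = "yes" if shortest_path_length(edges, 0, 2) is not None else "no"
```

A language model's reply to `prompt` could then be compared against `answer`; harder tasks mentioned in the abstract, such as maximum flow, would need their own reference solvers.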