NSF PAR Search | NSF Public Access Repository

Which of These Best Describes Multiple Choice Evaluation with LLMs? A) Forced B) Flawed C) Fixable D) All of the Above

Balepur, Nishant; Rudinger, Rachel; Boyd-Graber, Jordan (July 2025, Association for Computational Linguistics)

Most people dislike taking multiple-choice tests, so why are they the default way we evaluate NLP systems? This position paper argues that, despite its simplicity and popularity, multiple-choice evaluation is flawed, both in its format and the datasets it relies on. Drawing from educational testing theory, we propose practical fixes for these issues, helping us build evaluations that better test knowledge and reflect how humans use NLP systems.
Free, publicly-accessible full text available July 27, 2026
GRACE: A Granular Benchmark for Evaluating Model Calibration against Human Calibration

Sung, Yoo Yeon; Fleisig, Eve; Hope, Yu; Upadhyay, Ishan; Boyd-Graber, Jordan (July 2025, Association for Computational Linguistics)

As AI use becomes more common, it's important to measure not just whether the systems are correct but whether they know when they're incorrect. We propose a new metric to measure this mismatch between correctness and confidence, compare computer ability with human ability, and show that computers have a long way to go before they're well-calibrated.
Free, publicly-accessible full text available July 27, 2026
Whose Boat Does it Float? Improving Personalization in Preference Tuning via Inferred User Personas

Balepur, Nishant; Padmakumar, Vishakh; Yang, Fumeng; Feng, Shi; Rudinger, Rachel; Boyd-Graber, Jordan (July 2025, Association for Computational Linguistics)

Language models are optimized to learn which responses you prefer, but they don't learn why you preferred a particular response. This limits their ability to tailor to personalized requests (e.g., "What should I eat for dinner? I'm vegetarian"), so we introduce a simple fix: have models infer personas that explain why users could prefer responses. We show training on these inferred personas leads to responses that are significantly more personalized for user needs.
Free, publicly-accessible full text available July 27, 2026
Is your benchmark truly adversarial? AdvScore: Evaluating Human-Grounded Adversarialness

https://doi.org/10.18653/v1/2025.naacl-long.27

Sung, Yoo Yeon; Gor, Maharshi; Fleisig, Eve; Mondal, Ishani; Boyd-Graber, Jordan Lee (January 2025, Association for Computational Linguistics)

Full Text Available
Reverse Question Answering: Can an LLM Write a Question so Hard (or Bad) that it Can’t Answer?

https://doi.org/10.18653/v1/2025.naacl-short.5

Balepur, Nishant; Gu, Feng; Ravichander, Abhilasha; Feng, Shi; Boyd-Graber, Jordan; Rudinger, Rachel (January 2025, Association for Computational Linguistics)

Language models like ChatGPT are pretty good at answering questions (e.g. "What is 12 * 12?"), but we show they can surprisingly struggle when asked to do the reverse task: generating questions for answers (e.g. "Give me a question with the answer 144"). We study when these errors happen, what might be causing them, and how they can be addressed.
Full Text Available
Do great minds think alike? Investigating Human-AI Complementarity in Question Answering with CAIMIRA

https://doi.org/10.18653/v1/2024.emnlp-main.1201

Gor, Maharshi; Daumé III, Hal; Zhou, Tianyi; Boyd-Graber, Jordan Lee (January 2024, Association for Computational Linguistics)

CAIMIRA discovers the skills that humans and AIs use to answer questions. By scraping websites where trivia nerds answer really difficult questions and posing those questions to AI models like GPT-4 and LLaMA-3-70B, we find that while humans excel in knowledge-based abductive reasoning, AI outperforms on fact-based historical recall. This research suggests future challenges should focus on more complex reasoning and nuanced language tasks to better align AI development with human cognitive strengths.
Full Text Available
PEDANTS: Cheap but Effective and Interpretable Answer Equivalence

https://doi.org/10.18653/v1/2024.findings-emnlp.548

Li, Zongxia; Mondal, Ishani; Nghiem, Huy; Liang, Yijun; Boyd-Graber, Jordan Lee (January 2024, Association for Computational Linguistics)

Full Text Available
You Make me Feel like a Natural Question: Training QA Systems on Transformed Trivia Questions

https://doi.org/10.18653/v1/2024.emnlp-main.1140

Kabir, Tasnim; Sung, Yoo Yeon; Bandyopadhyay, Saptarashmi; Zou, Hao; Chandra, Abhranil; Boyd-Graber, Jordan Lee (January 2024, Association for Computational Linguistics)

Many of the questions used to train AIs to answer questions come from the queries users type into search engines (like Google's Natural Questions). Is there a cheaper, perhaps even better, way? We propose a "naturalization" technique to turn high-quality, rigorously edited trivia questions into examples that resemble Natural Questions. Training on our naturalized questions and testing on Natural Questions comes close to the results of training on Natural Questions itself, and we can improve results on MMLU (a standard modern evaluation set) by using our data.
Full Text Available
A SMART Mnemonic Sounds like “Glue Tonic”: Mixing LLMs with Student Feedback to Make Mnemonic Learning Stick

https://doi.org/10.18653/v1/2024.emnlp-main.786

Balepur, Nishant; Shu, Matthew; Hoyle, Alexander; Robey, Alison; Feng, Shi; Goldfarb-Tarrant, Seraphina; Boyd-Graber, Jordan Lee (January 2024, Association for Computational Linguistics)

Learning vocabulary (e.g., benevolent) can be tedious, but using mnemonics (e.g., benevolent sounds like "benefits," and a kind boss gives benefits) makes it more engaging and effective. This paper introduces SMART, a large language model trained to produce mnemonics based on feedback from flashcard learners. Students struggle to predict which mnemonics will help them most. Still, by training SMART on both student preferences and learning outcomes, we can generate mnemonics as effectively as GPT-4, but at a much lower cost.
Full Text Available
Not all Fake News is Written: A Dataset and Analysis of Misleading Video Headlines

https://doi.org/10.18653/v1/2023.emnlp-main.1010

Sung, Yoo; Boyd-Graber, Jordan; Hassan, Naeemul (January 2023, Association for Computational Linguistics)

Full Text Available
