This content will become publicly available on July 27, 2026

Title: Which of These Best Describes Multiple Choice Evaluation with LLMs? A) Forced B) Flawed C) Fixable D) All of the Above
Most people dislike taking multiple-choice tests, so why are they the default way we evaluate NLP systems? This position paper argues that, despite its simplicity and popularity, multiple-choice evaluation is flawed, both in its format and the datasets it relies on. Drawing from educational testing theory, we propose practical fixes for these issues, helping us build evaluations that better test knowledge and reflect how humans use NLP systems.
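One well-known format flaw the abstract alludes to can be made concrete with a toy experiment (a hypothetical sketch, not the paper's own method): a position-biased "model" that always answers A collapses to chance accuracy once answer options are shuffled across trials, while a content-sensitive model is unaffected.

```python
import random

def shuffle_options(options, gold_index, rng):
    """Return shuffled options plus the gold answer's new index."""
    order = list(range(len(options)))
    rng.shuffle(order)
    return [options[i] for i in order], order.index(gold_index)

def accuracy_under_shuffles(model, question, options, gold_index,
                            n_trials=100, seed=0):
    """Accuracy of `model` on one item across random option orderings."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_trials):
        shuffled, new_gold = shuffle_options(options, gold_index, rng)
        if model(question, shuffled) == new_gold:
            correct += 1
    return correct / n_trials

# Two stub "models" (placeholders for a real LLM call):
always_a = lambda q, opts: 0                     # position-biased
picks_paris = lambda q, opts: opts.index("Paris")  # content-sensitive

item = ("Capital of France?", ["Paris", "Rome", "Oslo", "Lima"], 0)
```

A robust evaluation would report the shuffled accuracy, not the single fixed-order score; the position-biased stub lands near 1/4 while the content-sensitive stub stays at 1.0.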
Award ID(s):
2403436
PAR ID:
10608230
Author(s) / Creator(s):
; ;
Publisher / Repository:
Association for Computational Linguistics
Date Published:
ISSN:
0736-587X
Format(s):
Medium: X
Location:
Vienna, Austria
Sponsoring Org:
National Science Foundation
More Like this
  1. Hoadley, C; Wang, XC (Ed.)
    Helping students learn how to write is essential. However, students have few opportunities to develop this skill, since giving timely feedback is difficult for teachers. AI applications can provide quick feedback on students’ writing. But ensuring accurate assessment can be challenging, since students’ writing quality can vary. We examined the impact of students’ writing quality on the error rate of our natural language processing (NLP) system when assessing scientific content in initial and revised design essays. We also explored whether aspects of writing quality were linked to the number of NLP errors. Despite finding that students’ revised essays differed significantly from their initial essays in a few ways, our NLP system’s accuracy was similar. Further, our multiple regression analyses showed, overall, that students’ writing quality did not impact our NLP system’s accuracy. This is promising in terms of ensuring students with different writing skills get similarly accurate feedback.
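The regression analysis described above can be sketched in miniature (a hypothetical single-predictor illustration; the study used multiple regression with its own features): fit a line relating a writing-quality score to an NLP error rate and inspect whether the slope is near zero.

```python
def linregress(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

# Toy data: writing-quality scores vs. hypothetical NLP error rates.
quality = [1.0, 2.0, 3.0, 4.0]
errors  = [0.10, 0.11, 0.09, 0.10]   # roughly flat => quality has little effect
slope, intercept = linregress(quality, errors)
```

A slope close to zero is the "writing quality did not impact accuracy" pattern the abstract reports; real analyses would also test the slope's significance.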
  2. Recent NLP literature has seen growing interest in improving model interpretability. Along this direction, we propose a trainable neural network layer that learns a global interaction graph between words and then selects more informative words using the learned word interactions. Our layer, which we call WIGRAPH, can plug into any neural network-based NLP text classifier right after its word embedding layer. Across multiple SOTA NLP models and various NLP datasets, we demonstrate that adding the WIGRAPH layer substantially improves NLP models' interpretability and enhances models' prediction performance at the same time.
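The interaction-graph idea can be sketched in plain Python (a simplified, untrained stand-in: real WIGRAPH learns its interaction weights end-to-end, whereas this sketch just uses embedding dot products): score pairwise word interactions, turn each word's total interaction into a softmax importance weight, and reweight the embeddings before the classifier.

```python
import math

def interaction_scores(emb):
    """Pairwise word-interaction scores (dot products) for one sentence."""
    n = len(emb)
    return [[sum(a * b for a, b in zip(emb[i], emb[j])) for j in range(n)]
            for i in range(n)]

def word_importance(scores):
    """Each word's importance: softmax over its summed interactions."""
    totals = [sum(row) for row in scores]
    m = max(totals)
    exps = [math.exp(t - m) for t in totals]
    z = sum(exps)
    return [e / z for e in exps]

def reweight(emb, weights):
    """Scale each word's embedding by its importance weight."""
    return [[w * x for x in vec] for vec, w in zip(emb, weights)]

# Three toy word embeddings; the second interacts most strongly.
emb = [[1.0, 0.0], [3.0, 0.0], [0.0, 1.0]]
weights = word_importance(interaction_scores(emb))
```

The importance weights double as an interpretability signal: they say which words the layer considers informative for the prediction.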
  3. We address the problem of entity extraction with very few examples and approach it with an information retrieval method. Existing extraction approaches consider millions of features extracted from a large number of training data cases. Generally, these data cases are generated by a distant supervision approach using entities in a knowledge base; a model is then learned and entities are extracted. However, with extremely limited data, a ranked list of relevant entities can be helpful for obtaining user feedback to get more training data. As Information Retrieval (IR) is a natural choice for ranked-list generation, we explore its effectiveness in such a limited-data setting. To this end, we propose SearchIE, a hybrid IR and NLP approach that indexes documents represented using handcrafted NLP features. At query time, SearchIE samples terms from a Logistic Regression model trained with extremely limited data. We show that SearchIE outperforms state-of-the-art NLP models at finding civilians killed by US police officers, even with a single civilian name as an example.
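The query-time idea can be illustrated with a toy sketch (hypothetical: SearchIE samples terms probabilistically from its model and uses handcrafted NLP features, whereas this sketch deterministically takes the top-weighted terms and scores documents by simple term overlap): pull high-weight features out of a trained classifier and use them as an IR query.

```python
def top_terms(weights, k=3):
    """Pick the k highest-weight features from a (toy) trained model."""
    return [t for t, _ in sorted(weights.items(), key=lambda kv: -kv[1])[:k]]

def rank_documents(docs, query_terms):
    """Rank documents by how many query terms each one contains."""
    def score(doc):
        tokens = set(doc.lower().split())
        return sum(1 for t in query_terms if t in tokens)
    return sorted(docs, key=score, reverse=True)

# Toy logistic-regression weights and a toy corpus.
weights = {"shot": 2.0, "police": 1.5, "officer": 1.2, "the": 0.1}
docs = ["Police shot a man", "Weather today", "The officer arrived"]
ranked = rank_documents(docs, top_terms(weights, k=2))
```

The ranked list is exactly what the abstract argues is useful with limited data: a human can skim it and label new training cases.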
  4. Transformer-based language models such as BERT and its variants have found widespread use in natural language processing (NLP). A common way of using these models is to fine-tune them to improve their performance on a specific task. However, it is currently unclear how the fine-tuning process affects the underlying structure of the word embeddings from these models. We present TopoBERT, a visual analytics system for interactively exploring the fine-tuning process of various transformer-based models – across multiple fine-tuning batch updates, subsequent layers of the model, and different NLP tasks – from a topological perspective. The system uses the mapper algorithm from topological data analysis (TDA) to generate a graph that approximates the shape of a model’s embedding space for an input dataset. TopoBERT enables its users (e.g. experts in NLP and linguistics) to (1) interactively explore the fine-tuning process across different model-task pairs, (2) visualize the shape of embedding spaces at multiple scales and layers, and (3) connect linguistic and contextual information about the input dataset with the topology of the embedding space. Using TopoBERT, we provide various use cases to exemplify its applications in exploring fine-tuned word embeddings. We further demonstrate the utility of TopoBERT, which enables users to generate insights about the fine-tuning process and provides support for empirical validation of these insights. 
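The mapper algorithm at the core of TopoBERT can be sketched in one dimension (a heavily simplified toy: real mapper covers a high-dimensional filter and clusters the points inside each cover element, while this sketch treats each overlapping interval's points as a single node): build overlapping bins over filter values, make a node per non-empty bin, and connect nodes that share points.

```python
def cover(values, n_bins, overlap):
    """Overlapping intervals covering the range of the filter values."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [(lo + i * width - overlap * width,
             lo + (i + 1) * width + overlap * width)
            for i in range(n_bins)]

def mapper_graph(points, filt, n_bins=4, overlap=0.25):
    """Toy 1-D mapper: nodes are point sets per interval,
    edges join nodes whose point sets intersect."""
    values = [filt(p) for p in points]
    nodes = [frozenset(i for i, v in enumerate(values) if a <= v <= b)
             for a, b in cover(values, n_bins, overlap)]
    nodes = [n for n in nodes if n]
    edges = [(i, j) for i in range(len(nodes)) for j in range(i + 1, len(nodes))
             if nodes[i] & nodes[j]]
    return nodes, edges

nodes, edges = mapper_graph(list(range(10)), lambda x: x, n_bins=2)
```

Applied to a model's embedding space, the resulting graph approximates the space's shape, which is what TopoBERT visualizes across layers and fine-tuning steps.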
  5. Artificial Intelligence (AI) and Natural Language Processing (NLP) have become increasingly relevant across multiple fields, creating a necessity for young learners to understand these concepts. However, resources enabling learners to apply AI and NLP, particularly in middle school science, remain limited. To address this gap, we present the early development of NLP4Science, an interactive visualization application facilitating the integration of NLP concepts such as sentiment analysis and keyword extraction into middle school science. We adopted an iterative co-design process starting with a professional development workshop with four teachers, followed by a 2-day pilot study with 48 eighth graders, and concluding with a 5-day study involving 50 sixth graders. This poster presents an overview of NLP4Science, highlighting its key features, and sharing insights gained from the iterative design process, demonstrating the potential of NLP4Science to transform AI and NLP learning within middle school science classrooms. 
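The two NLP concepts the tool teaches, sentiment analysis and keyword extraction, can each be sketched in a few lines (a classroom-style toy, not NLP4Science's actual implementation; the lexicons and stopword list here are made up): score sentiment against small word lists and extract keywords by frequency after dropping stopwords.

```python
from collections import Counter

POSITIVE = {"good", "great", "clear", "helpful"}
NEGATIVE = {"bad", "unclear", "confusing", "wrong"}
STOPWORDS = {"the", "a", "an", "is", "was", "and", "of", "to", "in"}

def sentiment(text):
    """Lexicon-based sentiment: positive minus negative word counts."""
    tokens = text.lower().split()
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

def keywords(text, k=3):
    """Frequency-based keyword extraction after stopword removal."""
    tokens = [t.strip(".,!?").lower() for t in text.split()]
    counts = Counter(t for t in tokens if t and t not in STOPWORDS)
    return [w for w, _ in counts.most_common(k)]
```

Toy versions like these make the underlying ideas visible to middle schoolers before they see the tool apply them to real science writing.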