This content will become publicly available on July 27, 2026

Title: Which of These Best Describes Multiple Choice Evaluation with LLMs? A) Forced B) Flawed C) Fixable D) All of the Above
Most people dislike taking multiple-choice tests, so why are they the default way we evaluate NLP systems? This position paper argues that, despite its simplicity and popularity, multiple-choice evaluation is flawed, both in its format and the datasets it relies on. Drawing from educational testing theory, we propose practical fixes for these issues, helping us build evaluations that better test knowledge and reflect how humans use NLP systems.
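One well-known format flaw the abstract alludes to can be made concrete with a toy experiment (a hypothetical sketch, not the paper's own method): a position-biased "model" that always answers A collapses to chance accuracy once answer options are shuffled across trials, while a content-sensitive model is unaffected.

```python
import random

def shuffle_options(options, gold_index, rng):
    """Return shuffled options plus the gold answer's new index."""
    order = list(range(len(options)))
    rng.shuffle(order)
    return [options[i] for i in order], order.index(gold_index)

def accuracy_under_shuffles(model, question, options, gold_index,
                            n_trials=100, seed=0):
    """Accuracy of `model` on one item across random option orderings."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_trials):
        shuffled, new_gold = shuffle_options(options, gold_index, rng)
        if model(question, shuffled) == new_gold:
            correct += 1
    return correct / n_trials

# Two stub "models" (placeholders for a real LLM call):
always_a = lambda q, opts: 0                     # position-biased
picks_paris = lambda q, opts: opts.index("Paris")  # content-sensitive

item = ("Capital of France?", ["Paris", "Rome", "Oslo", "Lima"], 0)
```

A robust evaluation would report the shuffled accuracy, not the single fixed-order score; the position-biased stub lands near 1/4 while the content-sensitive stub stays at 1.0.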
Award ID(s):
2403436
PAR ID:
10608230
Author(s) / Creator(s):
; ;
Publisher / Repository:
Association for Computational Linguistics
Date Published:
ISSN:
0736-587X
Format(s):
Medium: X
Location:
Vienna, Austria
Sponsoring Org:
National Science Foundation
More Like this
  1. Hoadley, C; Wang, XC (Ed.)
    Helping students learn how to write is essential. However, students have few opportunities to develop this skill, since giving timely feedback is difficult for teachers. AI applications can provide quick feedback on students’ writing. But ensuring accurate assessment can be challenging, since students’ writing quality can vary. We examined the impact of students’ writing quality on the error rate of our natural language processing (NLP) system when assessing scientific content in initial and revised design essays. We also explored whether aspects of writing quality were linked to the number of NLP errors. Despite finding that students’ revised essays differed significantly from their initial essays in a few ways, our NLP system’s accuracy was similar. Further, our multiple regression analyses showed, overall, that students’ writing quality did not impact our NLP system’s accuracy. This is promising in terms of ensuring students with different writing skills get similarly accurate feedback.
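The regression analysis described above can be sketched in miniature (a hypothetical single-predictor illustration; the study used multiple regression with its own features): fit a line relating a writing-quality score to an NLP error rate and inspect whether the slope is near zero.

```python
def linregress(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

# Toy data: writing-quality scores vs. hypothetical NLP error rates.
quality = [1.0, 2.0, 3.0, 4.0]
errors  = [0.10, 0.11, 0.09, 0.10]   # roughly flat => quality has little effect
slope, intercept = linregress(quality, errors)
```

A slope close to zero is the "writing quality did not impact accuracy" pattern the abstract reports; real analyses would also test the slope's significance.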
  2. Recent NLP literature has seen growing interest in improving model interpretability. Along this direction, we propose a trainable neural network layer that learns a global interaction graph between words and then selects more informative words using the learned word interactions. Our layer, which we call WIGRAPH, can plug into any neural network-based NLP text classifier right after its word embedding layer. Across multiple SOTA NLP models and various NLP datasets, we demonstrate that adding the WIGRAPH layer substantially improves NLP models' interpretability and enhances models' prediction performance at the same time.
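The interaction-graph idea can be sketched in plain Python (a simplified, untrained stand-in: real WIGRAPH learns its interaction weights end-to-end, whereas this sketch just uses embedding dot products): score pairwise word interactions, turn each word's total interaction into a softmax importance weight, and reweight the embeddings before the classifier.

```python
import math

def interaction_scores(emb):
    """Pairwise word-interaction scores (dot products) for one sentence."""
    n = len(emb)
    return [[sum(a * b for a, b in zip(emb[i], emb[j])) for j in range(n)]
            for i in range(n)]

def word_importance(scores):
    """Each word's importance: softmax over its summed interactions."""
    totals = [sum(row) for row in scores]
    m = max(totals)
    exps = [math.exp(t - m) for t in totals]
    z = sum(exps)
    return [e / z for e in exps]

def reweight(emb, weights):
    """Scale each word's embedding by its importance weight."""
    return [[w * x for x in vec] for vec, w in zip(emb, weights)]

# Three toy word embeddings; the second interacts most strongly.
emb = [[1.0, 0.0], [3.0, 0.0], [0.0, 1.0]]
weights = word_importance(interaction_scores(emb))
```

The importance weights double as an interpretability signal: they say which words the layer considers informative for the prediction.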
  3. We address the problem of entity extraction with very few examples and approach it with an information retrieval method. Existing extraction approaches consider millions of features extracted from a large number of training data cases. Generally, these data cases are generated by a distant supervision approach using entities in a knowledge base; a model is then learned and entities are extracted. However, with extremely limited data, a ranked list of relevant entities can be helpful for obtaining user feedback to get more training data. As Information Retrieval (IR) is a natural choice for ranked-list generation, we explore its effectiveness in such a limited-data setting. To this end, we propose SearchIE, a hybrid IR and NLP approach that indexes documents represented using handcrafted NLP features. At query time, SearchIE samples terms from a Logistic Regression model trained with extremely limited data. We show that SearchIE outperforms state-of-the-art NLP models at finding civilians killed by US police officers, even with a single civilian name as an example.
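The query-time idea can be illustrated with a toy sketch (hypothetical: SearchIE samples terms probabilistically from its model and uses handcrafted NLP features, whereas this sketch deterministically takes the top-weighted terms and scores documents by simple term overlap): pull high-weight features out of a trained classifier and use them as an IR query.

```python
def top_terms(weights, k=3):
    """Pick the k highest-weight features from a (toy) trained model."""
    return [t for t, _ in sorted(weights.items(), key=lambda kv: -kv[1])[:k]]

def rank_documents(docs, query_terms):
    """Rank documents by how many query terms each one contains."""
    def score(doc):
        tokens = set(doc.lower().split())
        return sum(1 for t in query_terms if t in tokens)
    return sorted(docs, key=score, reverse=True)

# Toy logistic-regression weights and a toy corpus.
weights = {"shot": 2.0, "police": 1.5, "officer": 1.2, "the": 0.1}
docs = ["Police shot a man", "Weather today", "The officer arrived"]
ranked = rank_documents(docs, top_terms(weights, k=2))
```

The ranked list is exactly what the abstract argues is useful with limited data: a human can skim it and label new training cases.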
  4. Transformer-based language models such as BERT and its variants have found widespread use in natural language processing (NLP). A common way of using these models is to fine-tune them to improve their performance on a specific task. However, it is currently unclear how the fine-tuning process affects the underlying structure of the word embeddings from these models. We present TopoBERT, a visual analytics system for interactively exploring the fine-tuning process of various transformer-based models – across multiple fine-tuning batch updates, subsequent layers of the model, and different NLP tasks – from a topological perspective. The system uses the mapper algorithm from topological data analysis (TDA) to generate a graph that approximates the shape of a model’s embedding space for an input dataset. TopoBERT enables its users (e.g. experts in NLP and linguistics) to (1) interactively explore the fine-tuning process across different model-task pairs, (2) visualize the shape of embedding spaces at multiple scales and layers, and (3) connect linguistic and contextual information about the input dataset with the topology of the embedding space. Using TopoBERT, we provide various use cases to exemplify its applications in exploring fine-tuned word embeddings. We further demonstrate the utility of TopoBERT, which enables users to generate insights about the fine-tuning process and provides support for empirical validation of these insights. 
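The mapper algorithm at the core of TopoBERT can be sketched in one dimension (a heavily simplified toy: real mapper covers a high-dimensional filter and clusters the points inside each cover element, while this sketch treats each overlapping interval's points as a single node): build overlapping bins over filter values, make a node per non-empty bin, and connect nodes that share points.

```python
def cover(values, n_bins, overlap):
    """Overlapping intervals covering the range of the filter values."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [(lo + i * width - overlap * width,
             lo + (i + 1) * width + overlap * width)
            for i in range(n_bins)]

def mapper_graph(points, filt, n_bins=4, overlap=0.25):
    """Toy 1-D mapper: nodes are point sets per interval,
    edges join nodes whose point sets intersect."""
    values = [filt(p) for p in points]
    nodes = [frozenset(i for i, v in enumerate(values) if a <= v <= b)
             for a, b in cover(values, n_bins, overlap)]
    nodes = [n for n in nodes if n]
    edges = [(i, j) for i in range(len(nodes)) for j in range(i + 1, len(nodes))
             if nodes[i] & nodes[j]]
    return nodes, edges

nodes, edges = mapper_graph(list(range(10)), lambda x: x, n_bins=2)
```

Applied to a model's embedding space, the resulting graph approximates the space's shape, which is what TopoBERT visualizes across layers and fine-tuning steps.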
  5. Artificial Intelligence (AI) and Natural Language Processing (NLP) have become increasingly relevant across multiple fields, creating a necessity for young learners to understand these concepts. However, resources enabling learners to apply AI and NLP, particularly in middle school science, remain limited. To address this gap, we present the early development of NLP4Science, an interactive visualization application facilitating the integration of NLP concepts such as sentiment analysis and keyword extraction into middle school science. We adopted an iterative co-design process starting with a professional development workshop with four teachers, followed by a 2-day pilot study with 48 eighth graders, and concluding with a 5-day study involving 50 sixth graders. This poster presents an overview of NLP4Science, highlighting its key features, and sharing insights gained from the iterative design process, demonstrating the potential of NLP4Science to transform AI and NLP learning within middle school science classrooms. 
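The two NLP concepts the tool teaches, sentiment analysis and keyword extraction, can each be sketched in a few lines (a classroom-style toy, not NLP4Science's actual implementation; the lexicons and stopword list here are made up): score sentiment against small word lists and extract keywords by frequency after dropping stopwords.

```python
from collections import Counter

POSITIVE = {"good", "great", "clear", "helpful"}
NEGATIVE = {"bad", "unclear", "confusing", "wrong"}
STOPWORDS = {"the", "a", "an", "is", "was", "and", "of", "to", "in"}

def sentiment(text):
    """Lexicon-based sentiment: positive minus negative word counts."""
    tokens = text.lower().split()
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

def keywords(text, k=3):
    """Frequency-based keyword extraction after stopword removal."""
    tokens = [t.strip(".,!?").lower() for t in text.split()]
    counts = Counter(t for t in tokens if t and t not in STOPWORDS)
    return [w for w, _ in counts.most_common(k)]
```

Toy versions like these make the underlying ideas visible to middle schoolers before they see the tool apply them to real science writing.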