Title: Are Red Roses Red? Evaluating Consistency of Question-Answering Models
Although current evaluation of question-answering systems treats predictions in isolation, we need to consider the relationship between predictions to measure true understanding. A model should be penalized for answering “no” to “Is the rose red?” if it answers “red” to “What color is the rose?”. We propose a method to automatically extract such implications for instances from two QA datasets, VQA and SQuAD, which we then use to evaluate the consistency of models. Human evaluation shows these generated implications are well formed and valid. Consistency evaluation provides crucial insights into gaps in existing models, while retraining with implication-augmented data improves consistency on both synthetic and human-generated implications.
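
The abstract describes scoring a model not only on individual answers but on whether its answers agree with automatically generated implication questions. Below is a minimal sketch of such a consistency check in Python, assuming a hypothetical answer_fn(context, question) interface; the function names, data layout, and scoring rule are illustrative assumptions, not the paper's implementation.

# Minimal sketch of a consistency check over (original QA, implied QA) pairs.
# `answer_fn(context, question) -> str` is an assumed model interface.

def is_consistent(answer_fn, context, original_qa, implication_qa):
    """Score the implied question only when the original question is answered
    correctly, so consistency is measured on top of ordinary accuracy."""
    orig_q, orig_gold = original_qa
    impl_q, impl_expected = implication_qa
    if answer_fn(context, orig_q).strip().lower() != orig_gold.strip().lower():
        return None  # original answer wrong: implication not scored
    return answer_fn(context, impl_q).strip().lower() == impl_expected.strip().lower()

def consistency(answer_fn, examples):
    """Fraction of generated implications answered correctly, among instances
    whose original question was answered correctly."""
    results = [is_consistent(answer_fn, ctx, orig, impl) for ctx, orig, impl in examples]
    scored = [r for r in results if r is not None]
    return sum(scored) / len(scored) if scored else 0.0

# Example in the spirit of the title: for an image or passage showing a red rose,
# original_qa = ("What color is the rose?", "red") and
# implication_qa = ("Is the rose red?", "yes").
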
Award ID(s): 1756023
PAR ID: 10100480
Author(s) / Creator(s): ; ;
Date Published:
Journal Name: Association for Computational Linguistics (ACL)
Page Range / eLocation ID: 6174 to 6184
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this
  1. Rambow, Owen; Wanner, Leo; Apidianaki, Marianna; Khalifa, Hend; Eugenio, Barbara; Schockaert, Steven (Ed.)
    We propose a novel framework that leverages Visual Question Answering (VQA) models to automate the evaluation of LLM-generated data visualizations. Traditional evaluation methods often rely on human judgment, which is costly and unscalable, or focus solely on data accuracy, neglecting the effectiveness of visual communication. By employing VQA models, we assess data representation quality and the general communicative clarity of charts. Experiments were conducted using two leading VQA benchmark datasets, ChartQA and PlotQA, with visualizations generated by OpenAI’s GPT-3.5 Turbo and Meta’s Llama 3.1 70B-Instruct models. Our results indicate that LLM-generated charts do not match the accuracy of the original non-LLM-generated charts based on VQA performance measures. Moreover, while our results demonstrate that few-shot prompting significantly boosts the accuracy of chart generation, considerable progress remains to be made before LLMs can fully match the precision of human-generated graphs. This underscores the importance of our work, which expedites the research process by enabling rapid iteration without the need for human annotation, thus accelerating advancements in this field. 
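    (A sketch of this evaluation loop, under stated assumptions, appears after the final item in this list.)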
  2. Adversarial evaluation stress-tests a model’s understanding of natural language. Because past approaches expose superficial patterns, the resulting adversarial examples are limited in complexity and diversity. We propose human-in-the-loop adversarial generation, where human authors are guided to break models. We aid the authors with interpretations of model predictions through an interactive user interface. We apply this generation framework to a question answering task called Quizbowl, where trivia enthusiasts craft adversarial questions. The resulting questions are validated via live human–computer matches: although the questions appear ordinary to humans, they systematically stump neural and information retrieval models. The adversarial questions cover diverse phenomena from multi-hop reasoning to entity type distractors, exposing open challenges in robust question answering.
  3. Our work with teams funded through the National Science Foundation REvolutionizing Engineering and Computer Science Departments (RED) program began in 2015. Our project—funded first by an NSF EAGER grant, and then by an NSF RFE grant—focuses on understanding how the RED teams make change on their campuses and how this information about change can be captured and communicated to other STEM programs that seek to make change happen. Because our RED Participatory Action Research (REDPAR) Project is a collaboration between researchers (Center for Evaluation & Research for STEM Equity at the University of Washington) and practitioners (Making Academic Change Happen Workshop at Rose-Hulman Institute of Technology), we have challenged ourselves to develop means of communication that allow for both aspects of the work—both research and practice—to be treated equitably. As a result, we have created a new dissemination channel—the RED Participatory Action Project Tipsheet. The tipsheet format accomplishes several important goals. First, the content is drawn from both the research conducted with the RED teams and the practitioners’ work with the teams. Each tipsheet takes up a single theme and grounds the theme in the research literature while offering practical tips for applying the information. Second, the format is accessible to a wide spectrum of potential users, remaining free of jargon and applicable to multiple program and departmental contexts. Third, by publishing the tipsheets ourselves, rather than submitting them to an engineering education research journal, we make the information timely and freely available. We can make a tipsheet as soon as a theme emerges from the intersection of research data and observations of practice. Permalink: https://peer.asee.org/32275.
  4. Our work with teams funded through the National Science Foundation REvolutionizing Engineering and Computer Science Departments (RED) program began in 2015. Our project—funded first by an NSF EAGER grant, and then by an NSF RFE grant—focuses on understanding how the RED teams make change on their campuses and how this information about change can be captured and communicated to other STEM programs that seek to make change happen. Because our RED Participatory Action Research (REDPAR) Project is a collaboration between researchers (Center for Evaluation & Research for STEM Equity at the University of Washington) and practitioners (Making Academic Change Happen Workshop at Rose-Hulman Institute of Technology), we have challenged ourselves to develop means of communication that allow for both aspects of the work—both research and practice—to be treated equitably. As a result, we have created a new dissemination channel—the RED Participatory Action Project Tipsheet. The tipsheet format accomplishes several important goals. First, the content is drawn from both the research conducted with the RED teams and the practitioners’ work with the teams. Each tipsheet takes up a single theme and grounds the theme in the research literature while offering practical tips for applying the information. Second, the format is accessible to a wide spectrum of potential users, remaining free of jargon and applicable to multiple program and departmental contexts. Third, by publishing the tipsheets ourselves, rather than submitting them to an engineering education research journal, we make the information timely and freely available. We can make a tipsheet as soon as a theme emerges from the intersection of research data and observations of practice. During the poster session at ASEE 2019, we will share the three REDPAR Tipsheets that have been produced thus far: Creating Strategic Partnerships, Communicating Change, and Shared Vision. We will also work with attendees to demonstrate how the tipsheet content is adaptable to the attendees’ specific academic context. Our goal for the poster session is to provide attendees with tipsheet resources that are useful to their specific change project.
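
The first record in the list above evaluates LLM-generated charts by asking a VQA model benchmark questions about them and scoring agreement with the gold answers. The sketch below illustrates that kind of evaluation loop under stated assumptions: generate_chart and vqa_answer are hypothetical stand-ins for an LLM chart generator (such as GPT-3.5 Turbo or Llama 3.1 70B-Instruct) and a ChartQA/PlotQA-style VQA model, and none of these names come from the cited work.

from typing import Any, Callable, Iterable, Tuple

def vqa_chart_accuracy(
    generate_chart: Callable[[dict], Any],       # data table/spec -> chart image (assumed)
    vqa_answer: Callable[[Any, str], str],       # (chart image, question) -> answer (assumed)
    benchmark: Iterable[Tuple[dict, str, str]],  # (table/spec, question, gold answer) triples
) -> float:
    """Fraction of benchmark questions the VQA model answers correctly when it
    reads the LLM-generated chart instead of the original chart."""
    correct = total = 0
    for table, question, gold in benchmark:
        chart = generate_chart(table)             # LLM renders the data as a chart
        prediction = vqa_answer(chart, question)  # VQA model answers from the chart alone
        correct += int(prediction.strip().lower() == gold.strip().lower())
        total += 1
    return correct / total if total else 0.0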