- Award ID(s):
- 1812660
- NSF-PAR ID:
- 10286230
- Date Published:
- Journal Name:
- New York Academy of Sciences Natural Language, Dialog and Speech Symposium
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
Models for automated scoring of content in educational applications continue to demonstrate improvements in human-machine agreement, but it remains to be demonstrated that the models achieve gains for the “right” reasons. For providing reliable scoring and feedback, both high accuracy and connecting scoring decisions to scoring rubrics are crucial. We provide a quantitative and qualitative analysis of automated scoring models for science explanations of middle school students in an online learning environment that leverages saliency maps to explore the reasons for individual model score predictions. Our analysis reveals that top-performing models can arrive at the same predictions for very different reasons, and that current model architectures have difficulty detecting ideas in student responses beyond keywords.
-
Models for automated scoring of content in educational applications continue to demonstrate improvements in human-machine agreement, but it remains to be demonstrated that the models achieve gains for the “right” reasons. For providing reliable scoring and feedback, both high accuracy and construct coverage are crucial. In this work, we provide an in-depth quantitative and qualitative analysis of automated scoring models for science explanations of middle school students in an online learning environment that leverages saliency maps to explore the reasons for individual model score predictions. Our analysis reveals that top-performing models can arrive at the same predictions for very different reasons, and that current model architectures have difficulty detecting ideas in student responses beyond keywords.
-
Social signal processing algorithms have become increasingly better at solving well-defined prediction and estimation problems in audiovisual recordings of group discussion. However, much human behavior and communication is less structured and more subtle. In this paper, we address the problem of generic question answering from diverse audiovisual recordings of human interaction. The goal is to select the correct free-text answer to a free-text question about human interaction in a video. We propose an RNN-based model with two novel ideas: a temporal attention module that highlights key words and phrases in the question and candidate answers, and a consistency measurement module that scores the similarity between the multimodal data, the question, and the candidate answers. This small set of consistency scores forms the input to the final question-answering stage, resulting in a lightweight model. We demonstrate that our model achieves state-of-the-art accuracy on the Social-IQ dataset containing hundreds of videos and question/answer pairs.
-
The core concept of genetic information flow was identified in recent calls to improve undergraduate biology education. Previous work shows that students have difficulty differentiating between the three processes of the Central Dogma (CD; replication, transcription, and translation). We built upon this work by developing and applying an analytic coding rubric to 1050 student written responses to a three‐question item about the CD. Each response was previously coded only for correctness using a holistic rubric. Our rubric captures subtleties of student conceptual understanding of each process that previous work has not yet captured at a large scale. Regardless of holistic correctness scores, student responses included five or six distinct ideas. By analyzing common co‐occurring rubric categories in student responses, we found a common pair representing two normative ideas about the molecules produced by each CD process. By applying analytic coding to student responses preinstruction and postinstruction, we found student thinking about the processes involved was most prone to change. The combined strengths of analytic and holistic rubrics allow us to reveal mixed ideas about the CD processes and provide a detailed picture of which conceptual ideas students draw upon when explaining each CD process.
-
Gradient-based analysis methods, such as saliency map visualizations and adversarial input perturbations, have found widespread use in interpreting neural NLP models due to their simplicity, flexibility, and most importantly, the fact that they directly reflect the model internals. In this paper, however, we demonstrate that the gradients of a model are easily manipulable, and thus bring into question the reliability of gradient-based analyses. In particular, we merge the layers of a target model with a Facade Model that overwhelms the gradients without affecting the predictions. This Facade Model can be trained to have gradients that are misleading and irrelevant to the task, such as focusing only on the stop words in the input. On a variety of NLP tasks (sentiment analysis, NLI, and QA), we show that the merged model effectively fools different analysis tools: saliency maps differ significantly from the original model’s, input reduction keeps more irrelevant input tokens, and adversarial perturbations identify unimportant tokens as being highly important.