Enhancing Auto-scoring of Student Open Responses in the Presence of Mathematical Terms and Expressions
Abstract. With the greater application of machine learning models in educational contexts, it is important to understand where such methods perform well and how they may be improved. As such, it is important to identify the factors that contribute to prediction error in order to develop targeted methods that enhance model accuracy and mitigate risks of algorithmic bias and unfairness. Prior works have led to the development and application of automated assessment methods that leverage machine learning and natural language processing. The performance of these methods has often been reported as positive, but other prior works have identified aspects in which they may be improved. Particularly in the context of mathematics, the presence of non-linguistic characters and expressions has been identified as a contributor to observed model error. In this paper, we build upon this prior work by examining a previously developed automated assessment model for open-response questions in mathematics. We develop a new approach, which we call the “Math Term Frequency” (MTF) model, to address the error caused by the presence of non-linguistic terms, and we ensemble it with the previously developed assessment model. We observe that the inclusion of this approach notably improves model performance. Finally, we observe how well this ensembled method extrapolates to student responses in the context of Algorithms, a domain similarly characterized by a large number of non-linguistic terms and expressions. This work represents an example of practice of how error analyses can be leveraged to address model limitations.
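The abstract does not describe how the MTF model is built. As a rough illustration only, this sketch assumes that "math terms" are whitespace-delimited tokens containing a digit or an operator symbol, represents each response by the normalized frequency of those terms over a training vocabulary, and ensembles two models by a weighted average of their predicted scores. The token rule, the function names (`math_terms`, `build_vocab`, `mtf_vector`, `ensemble`), and the equal weighting are hypothetical, not the authors' implementation.

```python
import re
from collections import Counter

# Hypothetical rule (an assumption, not from the paper): a token is a "math term"
# if it contains a digit or one of these operator/relation/grouping characters.
MATH_TOKEN = re.compile(r"[0-9=+\-*/^<>()]")


def math_terms(response: str) -> list[str]:
    """Split on whitespace and keep only tokens that look mathematical."""
    return [tok.lower() for tok in response.split() if MATH_TOKEN.search(tok)]


def build_vocab(responses: list[str]) -> list[str]:
    """Sorted set of math terms observed in the training responses."""
    return sorted({term for r in responses for term in math_terms(r)})


def mtf_vector(response: str, vocab: list[str]) -> list[float]:
    """Normalized frequency of each vocabulary math term in one response."""
    counts = Counter(math_terms(response))
    total = sum(counts.values()) or 1  # avoid division by zero for text-only answers
    return [counts[term] / total for term in vocab]


def ensemble(p_text: float, p_mtf: float, w: float = 0.5) -> float:
    """Weighted average of two models' predicted scores (w = 0.5 is an assumption)."""
    return w * p_text + (1 - w) * p_mtf
```

In this sketch the MTF vectors would feed some downstream scorer whose prediction is passed to `ensemble` alongside the existing model's score; that scorer is left unspecified, as it is in the abstract. The token rule is deliberately crude: a hyphenated word such as "well-known" would also be counted as a math term.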
- Award ID(s): 1822830
- PAR ID: 10386534
- Date Published:
- Journal Name: AIED 2022: The 23rd International Conference on Artificial Intelligence in Education
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
-
Advancements in online learning platforms have revolutionized education in many ways, transforming both learning experiences and instructional practices. The development of natural language processing and machine learning methods has helped researchers understand and process student language, comprehend students' learning states, and build automated supports for teachers. With this, there has been a growing body of research on automated methods for assessing students’ work in both mathematical and non-mathematical domains. These automated methods address two categories of questions: closed-ended questions (which have a limited set of correct answers) and open-ended questions (which are often subjective and have multiple correct answers). Teachers mostly use open-ended questions to learn about their students’ understanding of a particular concept. Manually assessing and providing feedback on these open-ended questions is often arduous and time-consuming for teachers. For this reason, several works have sought to understand student responses to open-ended questions in order to automate assessment and provide constructive feedback to students. In this research, we seek to improve one such prior method for assessment and feedback suggestion on students' open-ended work in mathematics. To do so, we present an error analysis of the prior auto-scoring method “SBERT-Canberra”, explore various factors that contribute to its error, and propose solutions that address these error factors. We further intend to extend this approach by improving the feedback suggestions that teachers can give on their students’ open-ended work.
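“SBERT-Canberra” pairs sentence embeddings with the Canberra distance, d(u, v) = Σ |u_i − v_i| / (|u_i| + |v_i|). A minimal sketch, assuming the embeddings are already computed (for example with a Sentence-BERT model) and assuming a nearest-neighbor rule in which a new response receives the score of its closest previously graded response; that rule and the helper names `canberra` and `nearest_neighbor_score` are illustrative assumptions, not the published method.

```python
def canberra(u: list[float], v: list[float]) -> float:
    """Canberra distance: sum of |u_i - v_i| / (|u_i| + |v_i|).

    Terms where both coordinates are zero (0/0) contribute nothing.
    """
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    total = 0.0
    for a, b in zip(u, v):
        denom = abs(a) + abs(b)
        if denom > 0:
            total += abs(a - b) / denom
    return total


def nearest_neighbor_score(query: list[float], graded: list[tuple[list[float], int]]) -> int:
    """Assumed scoring rule: copy the score of the graded response whose
    embedding has the smallest Canberra distance to the query embedding."""
    _, best_score = min(graded, key=lambda pair: canberra(query, pair[0]))
    return best_score
```

Because each coordinate's difference is divided by that coordinate's magnitude, Canberra distance weights small-valued dimensions relatively heavily, unlike Euclidean distance. SciPy's `scipy.spatial.distance.canberra` computes the same quantity if a library implementation is preferred.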
-
*Uncertainty expressions* such as ‘probably’ or ‘highly unlikely’ are pervasive in human language. While prior work has established that there is population-level agreement in terms of how humans quantitatively interpret these expressions, there has been little inquiry into the abilities of language models in the same context. In this paper, we investigate how language models map linguistic expressions of uncertainty to numerical responses. Our approach assesses whether language models can employ theory of mind in this setting: understanding the uncertainty of another agent about a particular statement, independently of the model’s own certainty about that statement. We find that 7 out of 10 models are able to map uncertainty expressions to probabilistic responses in a human-like manner. However, we observe systematically different behavior depending on whether a statement is actually true or false. This sensitivity indicates that language models are substantially more susceptible to bias based on their prior knowledge (as compared to humans). These findings raise important questions and have broad implications for human-AI and AI-AI communication.
-
Abstract
Background: Teachers often rely on open‐ended questions to assess students' conceptual understanding of assigned content. Particularly in mathematics, teachers use these types of questions to gain insight into the processes and strategies students adopt in solving mathematical problems, beyond what is possible with more close‐ended problem types. While these problems are valuable to teachers, the variation in student responses makes them difficult and time‐consuming to evaluate and to provide directed feedback on. It is well documented that feedback, both as a numeric score and, more importantly, as teacher‐authored comments, can guide students on how to improve, leading to increased learning. For this reason, teachers need better support not only for assessing students' work but also for providing meaningful and directed feedback to students.
Objectives: In this paper, we seek to develop, evaluate, and examine machine learning models that support automated open response assessment and feedback.
Methods: We build upon prior research in the automatic assessment of student responses to open‐ended problems and introduce a novel approach that leverages student log data combined with machine learning and natural language processing methods. Using sentence‐level semantic representations of student responses to open‐ended questions, we propose a collaborative filtering‐based approach both to predict student scores and to recommend appropriate feedback messages for teachers to send to their students.
Results and Conclusion: We find that our method outperforms previously published benchmarks across three different metrics for the task of predicting student performance. Through an error analysis, we identify several areas where future works may be able to improve upon our approach.
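The abstract does not specify the collaborative filtering model. As a hedged illustration of the general idea only, this sketch uses the neighborhood-prediction step of memory-based collaborative filtering: a new response's score is a similarity-weighted average of the scores given to its k most similar graded responses, and the feedback message attached to the single most similar response is recommended. Cosine similarity, `k = 2`, and the names `cosine` and `predict_and_recommend` are assumptions; the vectors stand in for the sentence-level representations mentioned above, and the student log data the authors also use is not modeled here.

```python
import math

# Hypothetical record layout: (embedding, teacher score, teacher feedback message).
Graded = tuple[list[float], float, str]


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def predict_and_recommend(query: list[float], graded: list[Graded], k: int = 2) -> tuple[float, str]:
    """Similarity-weighted score over the k nearest graded responses, plus the
    feedback message attached to the most similar one."""
    if not graded:
        raise ValueError("need at least one graded response")
    ranked = sorted(graded, key=lambda g: cosine(query, g[0]), reverse=True)[:k]
    # Clip negative similarities so dissimilar neighbors cannot pull the score around.
    weights = [max(cosine(query, g[0]), 0.0) for g in ranked]
    if sum(weights) == 0.0:
        score = sum(g[1] for g in ranked) / len(ranked)  # fall back to a plain mean
    else:
        score = sum(w * g[1] for w, g in zip(weights, ranked)) / sum(weights)
    return score, ranked[0][2]
```

Weighting by similarity rather than copying a single neighbor's score lets close-but-not-identical responses contribute, which is the usual motivation for the neighborhood step in memory-based collaborative filtering.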
-
Abstract
As the use of artificial intelligence (AI) has increased, concerns about AI bias and discrimination have grown. This paper discusses an application called PyrEval, in which natural language processing (NLP) was used to automate assessment and provide feedback on middle school science writing without linguistic discrimination. Linguistic discrimination in this study was operationalized as unfair assessment of scientific essays based on writing features that are not considered normative, such as subject‐verb disagreement. Such unfair assessment is especially problematic when the purpose of assessment is not to assess English writing but rather the content of scientific explanations. PyrEval was implemented in middle school science classrooms. Students explained their roller coaster designs by stating relationships among science concepts such as potential energy, kinetic energy, and the law of conservation of energy. Initial and revised versions of scientific essays written by 307 eighth‐grade students were analyzed. Our comparison of manual and NLP assessments showed that PyrEval did not penalize student essays that contained non‐normative writing features. Repeated-measures ANOVA and GLMM results revealed that essay quality significantly improved from initial to revised essays after students received the NLP feedback, regardless of non‐normative writing features. Findings and implications are discussed.
Practitioner notes
What is already known about this topic
- Advancement in AI has created a variety of opportunities in education, including automated assessment, but AI is not bias‐free.
- Automated writing assessment designed to improve students' scientific explanations has been studied.
- While limited, some studies reported biased performance of automated writing assessment tools, but without looking into the actual linguistic features on which the tools may have discriminated.
What this paper adds
- This study examined non‐normative linguistic features in essays written by middle school students to uncover how our NLP tool, PyrEval, assessed them.
- PyrEval did not penalize essays containing non‐normative linguistic features.
- Regardless of non‐normative linguistic features, students' essay quality scores significantly improved from initial to revised essays after receiving feedback from PyrEval. Essay quality improvement was observed regardless of students' prior knowledge, school district and teacher variables.
Implications for practice and/or policy
- This paper encourages practitioners to attend to linguistic discrimination (re)produced by AI.
- This paper offers possibilities for using PyrEval as a reflection tool against which human assessors can compare their assessments and discover implicit bias against non‐normative linguistic features.
- PyrEval is available for use on github.com/psunlpgroup/PyrEvalv2.

