Title: Applying machine learning to automatically assess scientific models
Involving students in scientific modeling practice is one of the most effective approaches to achieving the next generation science education learning goals. Given the complexity and multirepresentational features of scientific models, scoring student-developed models is time- and cost-intensive, and it remains one of the most challenging assessment practices in science education. More importantly, teachers who rely on timely feedback to plan and adjust instruction are reluctant to use modeling tasks because they cannot provide timely feedback to learners. This study utilized machine learning (ML), the most advanced form of artificial intelligence (AI), to develop an approach that automatically scores student-drawn models and their written descriptions of those models. We developed six modeling assessment tasks for middle school students that integrate disciplinary core ideas and crosscutting concepts with the modeling practice. For each task, we asked students to draw a model and write a description of that model, which gave students with diverse backgrounds an opportunity to represent their understanding in multiple ways. We then collected student responses to the six tasks and had human experts score a subset of those responses. We used the human-scored responses to develop ML algorithmic models (AMs) and to train the computer. Validation with new data suggests that the machine-assigned scores achieved robust agreement with human consensus scores. Qualitative analysis of student-drawn models further revealed five characteristics that might affect machine scoring accuracy: alternative expression, confusing labels, inconsistent size, inconsistent position, and redundant information. We argue that these five characteristics should be considered when developing machine-scorable modeling tasks.
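The workflow described in this abstract, in which human experts score a subset of responses, the machine is trained on those scores, and machine-assigned scores are then validated against human consensus scores, can be illustrated with a minimal sketch. The study's actual algorithmic models (AMs), features, and score scales are not specified here, so the pipeline below (TF-IDF features with logistic regression over the written descriptions only, with Cohen's kappa as the agreement measure) is an assumed stand-in rather than the authors' method.

    # Minimal sketch (assumed, not the study's AMs): train a text classifier on a
    # human-scored subset of written model descriptions, then check machine-human
    # agreement on held-out responses with Cohen's kappa.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import cohen_kappa_score
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline

    def train_and_validate(descriptions, human_scores):
        """descriptions: student-written model descriptions (list of str);
        human_scores: consensus scores assigned by expert raters (list of int)."""
        train_x, test_x, train_y, test_y = train_test_split(
            descriptions, human_scores, test_size=0.2, random_state=0)
        model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                              LogisticRegression(max_iter=1000))
        model.fit(train_x, train_y)
        machine_scores = model.predict(test_x)
        # Agreement between machine-assigned and human consensus scores.
        return model, cohen_kappa_score(test_y, machine_scores)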
Authors:
Award ID(s):
2101104 2100964
Publication Date:
NSF-PAR ID:
10348406
Journal Name:
Journal of Research in Science Teaching
ISSN:
0022-4308
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract We systematically compared two coding approaches for generating machine learning (ML) training datasets: (i) a holistic approach based on learning progression levels and (ii) a dichotomous, analytic approach coding multiple concepts in student reasoning, deconstructed from the holistic rubrics. We evaluated four constructed-response assessment items for undergraduate physiology, each targeting five levels of a developing flux learning progression in an ion context. Human-coded datasets were used to train two ML models: (i) an ensemble of eight classification algorithms implemented in the Constructed Response Classifier (CRC), and (ii) a single classification algorithm implemented in LightSide Researcher’s Workbench. Human coding agreement on approximately 700 student responses per item was high for both approaches, with Cohen’s kappas ranging from 0.75 to 0.87 for holistic scoring and from 0.78 to 0.89 for analytic composite scoring. ML model performance varied across items and rubric type. For two items, training sets from both coding approaches produced similarly accurate ML models, with differences in Cohen’s kappa between machine and human scores of 0.002 and 0.041. For the other items, ML models trained on analytically coded responses and used to produce a composite score performed better than models trained on holistic scores, with increases in Cohen’s kappa of 0.043 and 0.117. These items used a more complex scenario involving the movement of two ions, and it may be that analytic coding helps unpack this additional complexity. (A minimal sketch of the machine-human agreement calculation behind these comparisons appears after this list.)
  2. Recent work on automated scoring of student responses in educational applications has shown gains in human-machine agreement from neural models, particularly recurrent neural networks (RNNs) and pre-trained transformer (PT) models. However, prior research has not investigated the reasons for these gains; in particular, whether models achieve them for the “right” reasons. Through expert analysis of saliency maps, we analyze the extent to which models attribute importance to words and phrases in student responses that align with question rubrics. We focus on responses to questions embedded in science units for middle school students and accessed via an online classroom system. RNN and PT models were trained to predict an ordinal score from each response’s text, and experts analyzed the generated saliency maps for each response. Our analysis shows that RNN- and PT-based models can produce substantially different saliency profiles while often predicting the same scores for the same student responses. While there is some indication that PT models are better able to avoid spurious correlations between high-frequency words and scores, the results indicate that both model types focus on learning statistical correlations between scores and words and do not demonstrate an ability to learn key phrases or longer linguistic units corresponding to the ideas targeted by question rubrics. These results point to a need for models that better capture student ideas in educational applications. (A simple occlusion-based saliency sketch illustrating this kind of word-level analysis appears after this list.)
  3. Models for automated scoring of content in educational applications continue to show improvements in human-machine agreement, but it remains to be demonstrated that these models achieve their gains for the “right” reasons. For reliable scoring and feedback, both high accuracy and a clear connection between scoring decisions and scoring rubrics are crucial. We provide a quantitative and qualitative analysis of automated scoring models for science explanations written by middle school students in an online learning environment, leveraging saliency maps to explore the reasons behind individual model score predictions. Our analysis reveals that top-performing models can arrive at the same predictions for very different reasons, and that current model architectures have difficulty detecting ideas in student responses beyond keywords.
  4. In teaching mechanics, we use multiple representations of vectors to develop concepts and analysis techniques. These representations include pictorials, diagrams, symbols, numbers, and narrative language. Through years of study as students, researchers, and teachers, we develop a fluency rooted in a deep conceptual understanding of what each representation communicates. Many novice learners, however, struggle to gain such understanding and rely on superficial mimicry of the problem-solving procedures we demonstrate in examples. The term representational competence refers to the ability to interpret, switch between, and use multiple representations of a concept as appropriate for learning, communication, and analysis. In engineering statics, an understanding of what each vector representation communicates and how to use different representations in problem solving is important to the development of both conceptual and procedural knowledge. The science education literature identifies representational competence as a marker of true conceptual understanding. This paper presents development work for a new assessment instrument designed to measure representational competence with vectors in an engineering mechanics context. We developed the assessment over two successive terms in statics courses at a community college, a medium-sized regional university, and a large state university. We started with twelve multiple-choice questions that survey the vector representations commonly employed in statics. Each question requires the student to interpret and/or use two or more different representations of vectors and requires no calculation beyond single-digit integer arithmetic. Distractor answer choices include common student mistakes and misconceptions drawn from the literature and from our teaching experience. We piloted these twelve questions as a timed section of the first exam in fall 2018 statics courses at both Whatcom Community College (WCC) and Western Washington University. Analysis of students’ unprompted use of vector representations on the open-ended problem-solving section of the same exam provides evidence of the assessment’s validity as a measurement instrument for representational competence. We found a positive correlation between students’ accurate and effective use of representations and their score on the multiple-choice test. We gathered additional validity evidence by reviewing student responses on an exam wrapper reflection. We used item difficulty and item discrimination scores (point-biserial correlation) to eliminate two questions and revised the remaining questions to improve clarity and discriminatory power (a sketch of this item analysis appears after this list). We administered the revised version in two contexts: (1) again as part of the first exam in the winter 2019 statics course at WCC, and (2) as an extra credit opportunity for statics students at Utah State University. This paper includes sample questions from the assessment to illustrate the approach. The full assessment is available to interested instructors and researchers through an online tool.
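The first related item compares machine-human agreement, measured with Cohen's kappa, for models trained on holistic learning-progression codes versus dichotomous analytic concept codes rolled up into a composite score. A minimal sketch of that agreement calculation follows; the score arrays are invented and the concept-count composite rule is an assumption, not the study's procedure.

    # Hypothetical illustration of the agreement comparison in item 1: Cohen's
    # kappa between human and machine scores for (i) holistic learning-progression
    # levels and (ii) a composite built from dichotomous analytic concept codes.
    # All data below are made up.
    from sklearn.metrics import cohen_kappa_score

    # Holistic scores: one learning-progression level (1-5) per response.
    human_holistic   = [1, 2, 2, 3, 4, 5, 3, 2]
    machine_holistic = [1, 2, 3, 3, 4, 5, 3, 1]

    # Analytic codes: one 0/1 decision per concept per response.
    human_analytic   = [[1, 0, 1], [1, 1, 1], [0, 0, 1], [1, 1, 0]]
    machine_analytic = [[1, 0, 1], [1, 1, 0], [0, 0, 1], [1, 1, 0]]

    def composite(codes):
        # One simple way to roll analytic codes into a composite score:
        # count the concepts detected in each response.
        return [sum(response) for response in codes]

    print("holistic kappa :", cohen_kappa_score(human_holistic, machine_holistic))
    print("composite kappa:", cohen_kappa_score(composite(human_analytic),
                                                composite(machine_analytic)))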
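Items 2 and 3 both rely on saliency maps that attribute a model's score prediction to words in a student response. These abstracts do not state which saliency method was used, so the sketch below uses leave-one-out occlusion, one common model-agnostic way to produce a word-level importance profile; it works with any fitted classifier that exposes predict_proba over raw text, such as the pipeline sketched earlier.

    # Word-level saliency via leave-one-out occlusion (an assumed stand-in for the
    # saliency maps discussed in items 2 and 3): the importance of each word is the
    # drop in the predicted-class probability when that word is removed.
    import numpy as np

    def occlusion_saliency(model, response):
        """model: a fitted classifier with predict_proba over raw text
        (e.g., a scikit-learn pipeline); response: one student response (str)."""
        words = response.split()
        base = model.predict_proba([response])[0]
        predicted = int(np.argmax(base))
        saliency = []
        for i in range(len(words)):
            occluded = " ".join(words[:i] + words[i + 1:])
            prob = model.predict_proba([occluded])[0][predicted]
            saliency.append(base[predicted] - prob)
        # Pair each word with its importance for the predicted score.
        return list(zip(words, saliency))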
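Item 4 reports using item difficulty and item discrimination (point-biserial correlation) to refine the twelve multiple-choice questions. The sketch below runs that item analysis on an invented 0/1 response matrix; the uncorrected item-total correlation shown here is a simplification of standard practice.

    # Hypothetical item analysis for item 4: item difficulty (proportion correct)
    # and item discrimination (point-biserial correlation between an item's 0/1
    # score and the total test score). The response matrix is invented.
    import numpy as np
    from scipy.stats import pointbiserialr

    # Rows = students, columns = multiple-choice items (1 = correct, 0 = incorrect).
    responses = np.random.default_rng(0).integers(0, 2, size=(40, 12))
    totals = responses.sum(axis=1)

    for item in range(responses.shape[1]):
        difficulty = responses[:, item].mean()
        discrimination, _ = pointbiserialr(responses[:, item], totals)
        print(f"item {item + 1:2d}: difficulty={difficulty:.2f}  "
              f"discrimination={discrimination:.2f}")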