

Title: Applying machine learning to automatically assess scientific models
Involving students in scientific modeling practice is one of the most effective approaches to achieving the next generation science education learning goals. Given the complexity and multirepresentational features of scientific models, scoring student-developed models is time- and cost-intensive, remaining one of the most challenging assessment practices for science education. More importantly, teachers who rely on timely feedback to plan and adjust instruction are reluctant to use modeling tasks because they cannot provide timely feedback to learners. This study utilized machine learning (ML), an advanced form of artificial intelligence (AI), to develop an approach to automatically score student-drawn models and their written descriptions of those models. We developed six modeling assessment tasks for middle school students that integrate disciplinary core ideas and crosscutting concepts with the modeling practice. For each task, we asked students to draw a model and write a description of that model, which gave students with diverse backgrounds an opportunity to represent their understanding in multiple ways. We then collected student responses to the six tasks and had human experts score a subset of those responses. We used the human-scored student responses to develop ML algorithmic models (AMs) and to train the computer. Validation using new data suggests that the machine-assigned scores achieved robust agreement with human consensus scores. Qualitative analysis of student-drawn models further revealed five characteristics that might impact machine scoring accuracy: alternative expressions, confusing labels, inconsistent size, inconsistent position, and redundant information. We argue that these five characteristics should be considered when developing machine-scorable modeling tasks.
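A minimal sketch of the text half of the pipeline described above: fit a classifier on human-scored written model descriptions, then check human-machine agreement on held-out responses with Cohen's kappa. The toy responses, TF-IDF features, and logistic-regression scorer are illustrative assumptions, not the authors' actual tasks or algorithms.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import cohen_kappa_score
from sklearn.pipeline import make_pipeline

# Placeholder data: (written model description, human consensus score 0-2).
train = [
    ("the sun heats the water and it turns into vapor", 2),
    ("water goes up into the clouds", 1),
    ("it rains because clouds are heavy", 1),
    ("the water disappears", 0),
    ("heat from the sun makes water evaporate and form clouds", 2),
    ("the puddle gets smaller", 0),
]
held_out = [
    ("sunlight evaporates the water which condenses into clouds", 2),
    ("the water just goes away", 0),
]

# Bag-of-words features feeding a multinomial logistic regression scorer.
texts, human_scores = zip(*train)
scorer = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                       LogisticRegression(max_iter=1000))
scorer.fit(texts, human_scores)

# Agreement between machine-assigned and human consensus scores on new data;
# quadratic weighting is common for ordinal rubric levels.
new_texts, new_human = zip(*held_out)
machine = scorer.predict(new_texts)
print(cohen_kappa_score(new_human, machine, weights="quadratic"))
```

In practice the training set would be the human-scored subset of responses and validation would use newly collected responses, as the abstract describes; drawn models would require a separate image-based scoring pipeline not sketched here.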
Award ID(s):
2101104 2100964
NSF-PAR ID:
10348406
Date Published:
Journal Name:
Journal of Research in Science Teaching
ISSN:
0022-4308
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    Argumentation, a key scientific practice presented in the Framework for K-12 Science Education, requires students to construct and critique arguments, but timely evaluation of arguments in large-scale classrooms is challenging. Recent work has shown the potential of automated scoring systems for open response assessments, leveraging machine learning (ML) and artificial intelligence (AI) to aid the scoring of written arguments in complex assessments. Moreover, research has emphasized that the features (i.e., complexity, diversity, and structure) of the assessment construct are critical to ML scoring accuracy, yet how the assessment construct may be associated with machine scoring accuracy remains unknown. This study investigated how the features associated with the assessment construct of a scientific argumentation assessment item affected machine scoring performance. Specifically, we conceptualized the construct in three dimensions: complexity, diversity, and structure. We employed human experts to code characteristics of the assessment tasks and score middle school student responses to 17 argumentation tasks aligned to three levels of a validated learning progression of scientific argumentation. We randomly selected 361 responses to use as training sets to build machine-learning scoring models for each item. The scoring models yielded a range of agreements with human consensus scores, measured by Cohen's kappa (mean = 0.60; range 0.38–0.89), indicating good to almost perfect performance. We found that higher levels of Complexity and Diversity of the assessment task were associated with decreased model performance; similarly, the relationship between levels of Structure and model performance showed a somewhat negative linear trend. These findings highlight the importance of considering these construct characteristics when developing ML models for scoring assessments, particularly for higher-complexity items and multidimensional assessments.
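    For reference, Cohen's kappa (the agreement statistic reported above and in several of the studies below) corrects raw human-machine agreement for the agreement expected by chance. The notation below is a standard formulation, not anything specific to this study.

```latex
% Cohen's kappa for an item scored by a human rater and a machine model.
\[
\kappa = \frac{p_o - p_e}{1 - p_e},
\qquad
p_o = \frac{1}{N}\sum_{i} n_{ii},
\qquad
p_e = \frac{1}{N^{2}}\sum_{i} n_{i+}\, n_{+i}
\]
```

    Here \(n_{ij}\) counts responses assigned score \(i\) by the human rater and score \(j\) by the machine, \(n_{i+}\) and \(n_{+i}\) are the row and column totals, and \(N\) is the total number of double-scored responses.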

     
  2. Abstract

    Argumentation is fundamental to science education, both as a prominent feature of scientific reasoning and as an effective mode of learning—a perspective reflected in contemporary frameworks and standards. The successful implementation of argumentation in school science, however, requires a paradigm shift in science assessment from the measurement of knowledge and understanding to the measurement of performance and knowledge in use. Performance tasks requiring argumentation must capture the many ways students can construct and evaluate arguments in science, yet such tasks are both expensive and resource‐intensive to score. In this study we explore how machine learning text classification techniques can be applied to develop efficient, valid, and accurate constructed‐response measures of students' competency with written scientific argumentation that are aligned with a validated argumentation learning progression. Data come from 933 middle school students in the San Francisco Bay Area and are based on three sets of argumentation items in three different science contexts. The findings demonstrate that we have been able to develop computer scoring models that can achieve substantial to almost perfect agreement between human‐assigned and computer‐predicted scores. Model performance was slightly weaker for harder items targeting higher levels of the learning progression, largely due to the linguistic complexity of these responses and the sparsity of higher‐level responses in the training data set. Comparing the efficacy of different scoring approaches revealed that breaking down students' arguments into multiple components (e.g., the presence of an accurate claim or providing sufficient evidence), developing computer models for each component, and combining scores from these analytic components into a holistic score produced better results than holistic scoring approaches. However, this analytic approach was found to be differentially biased when scoring responses from English learner (EL) students as compared to responses from non‐EL students on some items. Differences in the severity between human and computer scores for EL students across these approaches are explored, and potential sources of bias in automated scoring are discussed.
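    A minimal sketch of the analytic scoring idea described above: one classifier per argument component, with component predictions combined into a holistic score. The component names, toy responses, and combination rule (counting satisfied components) are illustrative assumptions, not the study's actual rubric mapping.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical analytic components, each coded 0/1 by human raters.
COMPONENTS = ["accurate_claim", "sufficient_evidence"]

# Placeholder coded responses.
coded = pd.DataFrame({
    "response_text": [
        "the ice melts faster because the data show a higher temperature",
        "i think the ice melts faster",
        "the table shows 30 degrees so melting speeds up",
        "it just melts",
    ],
    "accurate_claim":      [1, 1, 1, 0],
    "sufficient_evidence": [1, 0, 1, 0],
})

def train_component_models(df):
    """Fit one text classifier per analytic component."""
    models = {}
    for component in COMPONENTS:
        clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
        clf.fit(df["response_text"], df[component])
        models[component] = clf
    return models

def holistic_score(models, response):
    """Combine component predictions; here, count the satisfied components."""
    return sum(int(models[c].predict([response])[0]) for c in COMPONENTS)

models = train_component_models(coded)
print(holistic_score(models, "melting is faster and the 30 degree reading supports this"))
```

    The abstract's comparison suggests that combining such component-level scores can outperform a single holistic classifier, though the bias findings for EL students indicate the component models themselves need fairness checks.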

     
  3. Abstract

    We systematically compared two coding approaches to generate training datasets for machine learning (ML): (i) a holistic approach based on learning progression levels and (ii) a dichotomous, analytic approach coding multiple concepts in student reasoning, deconstructed from holistic rubrics. We evaluated four constructed response assessment items for undergraduate physiology, each targeting five levels of a developing flux learning progression in an ion context. Human-coded datasets were used to train two ML models: (i) an ensemble of eight classification algorithms implemented in the Constructed Response Classifier (CRC), and (ii) a single classification algorithm implemented in LightSide Researcher's Workbench. Human coding agreement on approximately 700 student responses per item was high for both approaches, with Cohen's kappas ranging from 0.75 to 0.87 for holistic scoring and from 0.78 to 0.89 for analytic composite scoring. ML model performance varied across items and rubric type. For two items, training sets from both coding approaches produced similarly accurate ML models, with differences in Cohen's kappa between machine and human scores of 0.002 and 0.041. For the other two items, ML models trained with analytically coded responses and used to produce a composite score achieved better performance than models trained with holistic scores, with increases in Cohen's kappa of 0.043 and 0.117. These items used a more complex scenario involving the movement of two ions. It may be that analytic coding helps unpack this additional complexity.
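    A rough sketch of the ensemble idea behind a tool such as the Constructed Response Classifier: several distinct text classifiers vote on each response's level. The four algorithms and TF-IDF features below are illustrative assumptions; the CRC's actual eight-algorithm ensemble and feature extraction are not reproduced, and `train_texts`/`train_levels` are hypothetical inputs.

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Shared text features feeding a majority-vote ensemble of classifiers.
ensemble = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    VotingClassifier(
        estimators=[
            ("logreg", LogisticRegression(max_iter=1000)),
            ("nb", MultinomialNB()),
            ("rf", RandomForestClassifier(n_estimators=200)),
            ("svm", LinearSVC()),
        ],
        voting="hard",  # majority vote over predicted learning-progression levels
    ),
)

# Hypothetical usage with human-coded training data:
# ensemble.fit(train_texts, train_levels)
# predicted_levels = ensemble.predict(test_texts)
```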
    Telehealth technologies play a vital role in delivering quality healthcare to patients regardless of geographic location and health status. Use of telehealth peripherals allows providers a more accurate method of collecting health assessment data from the patient and delivering a more confident and accurate diagnosis, saving time and money while creating positive patient outcomes. Advanced Practice Nursing (APN) students should be confident in their ability to diagnose and treat patients through a virtual environment. This pilot simulation was completed to help examine how APN students interacted in a simulation-based education (SBE) experience with and without peripherals, funded by the National Science Foundation's Future of Work at the Human-Technology Frontier (FW-HTF) program. The SBE experience was created and deployed using the INACSL Healthcare Simulation Standards of Best Practices™ and vetted by a simulation expert. APN students (N = 24), in their first assessment course, were randomly selected to be either a patient (n = 12) or provider (n = 12) in a telehealth simulation. Student dyads (patient/provider) were randomly assigned to complete a scenario with (n = 6 dyads) or without (n = 6 dyads) the use of a peripheral. Students (providers and patients) who completed the SBE experience had an increased confidence level both with and without the use of peripherals. Students evaluated the simulation via the Simulation Effectiveness Tool-Modified (SET-M) and scored their perception of the simulation on a 1-to-5-point Likert scale. The highest-scoring areas were perceived support of learning by the faculty (M = 4.6), feeling challenged in decision-making skills (M = 4.4), and a better understanding of didactic material (M = 4.3). The lowest-scoring area was feeling more confident in decision making (M = 3.9). We also recorded students' facial expressions during the task to determine a probability score (0-100) for expressed basic emotions; results revealed that students had the highest scores for joy (M = 8.47) and surprise (M = 4.34), followed by disgust (M = 1.43), fear (M = .76), and contempt (M = .64), and the lowest scores for anger (M = .44) and sadness (M = .36). Students were also asked to complete a reflection assignment as part of the SBE experience. Students reported feeling nervous at the beginning of the SBE experience but acknowledged feeling better as the SBE experience unfolded. Based on findings from this pilot study, implications point toward the effectiveness of including simulations for nurse practitioner students to increase their confidence in performing telehealth visits and engaging in decision making. For the students, understanding that patients may be just as nervous during telehealth visits was one of the main takeaways from the experience, as well as remembering to reassure the patient and how to ask the patient to work the telehealth equipment. Therefore, providing students opportunities to practice these skills will help increase their confidence, boost their self- and emotion regulation, and improve their decision-making skills in telehealth scenarios.
  5.
    Recent work on automated scoring of student responses in educational applications has shown gains in human-machine agreement from neural models, particularly recurrent neural networks (RNNs) and pre-trained transformer (PT) models. However, prior research has neglected to investigate the reasons for improvement, in particular whether models achieve gains for the “right” reasons. Through expert analysis of saliency maps, we analyze the extent to which models attribute importance to words and phrases in student responses that align with question rubrics. We focus on responses to questions that are embedded in science units for middle school students accessed via an online classroom system. RNN and PT models were trained to predict an ordinal score from each response's text, and experts analyzed generated saliency maps for each response. Our analysis shows that RNN- and PT-based models can produce substantially different saliency profiles while often predicting the same scores for the same student responses. While there is some indication that PT models are better able to avoid spurious correlations of high-frequency words with scores, results indicate that both models focus on learning statistical correlations between scores and words and do not demonstrate an ability to learn key phrases or longer linguistic units corresponding to ideas, which are targeted by question rubrics. These results point to a need for models to better capture student ideas in educational applications.
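    One common way to produce saliency maps of the kind analyzed above is gradient-times-input attribution over token embeddings. The toy vocabulary, untrained GRU scorer, and normalization below are illustrative assumptions; the study's RNN and pre-trained transformer models are not reproduced.

```python
import torch
import torch.nn as nn

# Toy vocabulary; a real system would use a proper tokenizer.
VOCAB = {"<unk>": 0, "plants": 1, "use": 2, "sunlight": 3, "to": 4, "make": 5, "food": 6}

class TinyScorer(nn.Module):
    """A small, untrained RNN scorer standing in for the study's models."""
    def __init__(self, vocab_size=len(VOCAB), emb_dim=16, hidden=32, num_scores=3):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, num_scores)

    def forward(self, token_ids):
        embedded = self.emb(token_ids)      # (1, seq_len, emb_dim)
        embedded.retain_grad()              # keep gradients for attribution
        _, last_hidden = self.rnn(embedded)
        return self.out(last_hidden[-1]), embedded

def saliency(model, tokens):
    """Gradient-x-input magnitude per token for the predicted score class."""
    ids = torch.tensor([[VOCAB.get(t, 0) for t in tokens]])
    logits, embedded = model(ids)
    predicted = int(logits.argmax())
    logits[0, predicted].backward()         # gradients w.r.t. the token embeddings
    per_token = (embedded.grad * embedded).norm(dim=-1).squeeze(0)
    per_token = per_token / (per_token.max() + 1e-9)  # normalize to [0, 1]
    return dict(zip(tokens, per_token.tolist()))

model = TinyScorer()
print(saliency(model, "plants use sunlight to make food".split()))
```

    Experts in the study judged whether the highly attributed words and phrases corresponded to the ideas targeted by the question rubrics, rather than to spurious high-frequency terms.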