Abstract
Argumentation, a key scientific practice presented in the Framework for K-12 Science Education, requires students to construct and critique arguments, but timely evaluation of arguments in large-scale classrooms is challenging. Recent work has shown the potential of automated scoring systems for open-response assessments, leveraging machine learning (ML) and artificial intelligence (AI) to aid the scoring of written arguments in complex assessments. Moreover, research has emphasized that the features (i.e., complexity, diversity, and structure) of the assessment construct are critical to ML scoring accuracy, yet precisely how these construct features are associated with machine scoring accuracy remains unknown. This study investigated how the features associated with the assessment construct of a scientific argumentation assessment item affected machine scoring performance. Specifically, we conceptualized the construct in three dimensions: complexity, diversity, and structure. We employed human experts to code characteristics of the assessment tasks and score middle school student responses to 17 argumentation tasks aligned to three levels of a validated learning progression of scientific argumentation. We randomly selected 361 responses to use as training sets to build machine-learning scoring models for each item. The scoring models yielded a range of agreements with human consensus scores, measured by Cohen’s kappa (mean = 0.60; range 0.38–0.89), indicating good to almost perfect performance. We found that higher levels of Complexity and Diversity of the assessment task were associated with decreased model performance; similarly, the relationship between levels of Structure and model performance showed a somewhat negative linear trend. These findings highlight the importance of considering these construct characteristics when developing ML models for scoring assessments, particularly for higher-complexity items and multidimensional assessments.
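The human–machine agreement statistic used above, Cohen’s kappa, corrects raw percent agreement for the agreement two raters would reach by chance alone. As a minimal illustration (not the study’s own code), a pure-Python version for two raters scoring the same items:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement: kappa = (p_o - p_e) / (1 - p_e),
    where p_o is observed agreement and p_e is agreement expected
    from each rater's marginal score frequencies."""
    n = len(rater_a)
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / n ** 2
    return (p_o - p_e) / (1 - p_e)
```

For example, two raters who agree on 3 of 4 items with balanced marginals yield kappa = 0.5, well below the raw 75% agreement, which is the point of the chance correction.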
Utilizing Deep Learning AI to Analyze Scientific Models: Overcoming Challenges
Scientific modeling is a vital educational practice that helps students apply scientific knowledge to real-world phenomena. Despite advances in AI, challenges in accurately assessing such models persist, primarily due to the complexity of cognitive constructs and data imbalances in educational settings. This study addresses these challenges by employing diverse analytic strategies, including the Synthetic Minority Over-sampling Technique (SMOTE), aimed at enhancing fairness and efficacy in automated scoring systems. We analyze the impact of these strategies through a robust methodology, utilizing a combination of tenfold cross-validation and independent testing phases to ensure the reliability of AI assessments. Our findings highlight the effectiveness of deep learning AI in mirroring human judgment, with improvements in accuracy, precision, recall, and F1 scores across varied model assessments. Specifically, the application of SMOTE significantly improved the scoring fairness for minority class instances, which are often underrepresented in educational datasets. This study also delves into the discrepancies between AI and human evaluations, particularly in interpreting creatively expressed student models, which reveals the areas where AI technologies require further enhancements to better align with human evaluative standards. This study lays a foundation for future research to explore advanced AI techniques and training strategies, thus promoting fair and supportive feedback mechanisms that enhance student learning and creativity. By advancing AI applications in science education, this research addresses essential challenges in the automated analysis of complex student responses and supports broader academic goals.
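SMOTE, applied above to counter class imbalance, synthesizes new minority-class examples by interpolating between a real minority sample and one of its nearest minority-class neighbors. A naive stand-in for the standard imbalanced-learn implementation (the helper name and parameters are illustrative, not from the study):

```python
import random

def smote_oversample(minority, n_new, k=2, seed=0):
    """Naive SMOTE: for each synthetic point, pick a minority sample,
    find its k nearest minority neighbors (Euclidean distance), and
    interpolate a random fraction of the way toward one of them."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        neighbors = sorted(
            (p for p in minority if p is not x),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
        )[:k]
        nb = rng.choice(neighbors)
        gap = rng.random()  # fraction of the way from x toward nb
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic
```

Because each synthetic point lies on a segment between two real minority samples, the oversampled class stays inside its original region of feature space rather than duplicating exact points.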
- Award ID(s):
- 2200757
- PAR ID:
- 10589969
- Publisher / Repository:
- Springer Nature
- Date Published:
- Journal Name:
- Journal of Science Education and Technology
- ISSN:
- 1059-0145
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
This research explores a novel human-in-the-loop approach that goes beyond traditional prompt engineering approaches to harness Large Language Models (LLMs) with chain-of-thought prompting for grading middle school students’ short answer formative assessments in science and generating useful feedback. While recent efforts have successfully applied LLMs and generative AI to automatically grade assignments in secondary classrooms, the focus has primarily been on providing scores for mathematical and programming problems with little work targeting the generation of actionable insight from the student responses. This paper addresses these limitations by exploring a human-in-the-loop approach to make the process more intuitive and more effective. By incorporating the expertise of educators, this approach seeks to bridge the gap between automated assessment and meaningful educational support in the context of science education for middle school students. We have conducted a preliminary user study, which suggests that (1) co-created models improve the performance of formative feedback generation, and (2) educator insight can be integrated at multiple steps in the process to inform what goes into the model and what comes out. Our findings suggest that in-context learning and human-in-the-loop approaches may provide a scalable approach to automated grading, where the performance of the automated LLM-based grader continually improves over time, while also providing actionable feedback that can support students’ open-ended science learning.
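Chain-of-thought prompting, as described above, asks the model to reason through the rubric before committing to a score. A hypothetical sketch of how such a grading prompt might be assembled (the function name, prompt wording, and rubric format are all assumptions, not the paper’s actual prompts):

```python
def build_grading_prompt(question, rubric, response):
    """Assemble a chain-of-thought grading prompt: the model is
    instructed to reason about each rubric criterion first, then
    produce a score and one actionable suggestion for the student."""
    return (
        f"Question: {question}\n"
        f"Rubric:\n{rubric}\n"
        f"Student response: {response}\n"
        "First, reason step by step about which rubric criteria the "
        "response meets and which it misses. Then state a score and "
        "one actionable suggestion for improvement."
    )
```

In a human-in-the-loop setup, educators would review and revise both the rubric text fed into such a prompt and the generated feedback before it reaches students.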
-
Biased AI models result in unfair decisions. In response, a number of algorithmic solutions have been engineered to mitigate bias, among which the Synthetic Minority Oversampling Technique (SMOTE) has been studied, to an extent. Although the SMOTE technique and its variants have great potential to help improve fairness, there is little theoretical justification for its success. In addition, formal error and fairness bounds are not clearly given. This paper attempts to address both issues. We prove and demonstrate that synthetic data generated by oversampling underrepresented groups can mitigate algorithmic bias in AI models, while keeping the predictive errors bounded. We further compare this technique to the existing state-of-the-art fair AI techniques on five datasets using a variety of fairness metrics. We show that this approach can effectively improve fairness even when there is a significant amount of label and selection bias, regardless of the baseline AI algorithm.
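The abstract above refers to "a variety of fairness metrics" without naming them; one common choice in the fairness literature (an assumption here, not a metric the paper necessarily uses) is the demographic parity gap, the difference in positive-prediction rates between two groups:

```python
def demographic_parity_gap(y_pred, group, g0, g1):
    """Absolute difference in positive-prediction rates between
    groups g0 and g1. A gap of 0 means both groups receive
    positive predictions at the same rate."""
    def rate(g):
        preds = [p for p, gr in zip(y_pred, group) if gr == g]
        return sum(preds) / len(preds)
    return abs(rate(g0) - rate(g1))
```

Oversampling an underrepresented group, as with SMOTE, can shrink this gap by giving the model enough examples to predict positives for that group at a comparable rate.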
-
Rapid advancements in computing have enabled automatic analyses of written texts created in educational settings. The purpose of this symposium is to survey several applications of computerized text analyses used in the research and development of productive learning environments. Four featured research projects have developed or been working on (1) equitable automated scoring models for scientific argumentation for English Language Learners, (2) a real-time, adjustable formative assessment system to promote student revision of uncertainty-infused scientific arguments, (3) a web-based annotation tool to support student revision of scientific essays, and (4) a new research methodology that analyzes teacher-produced text in online professional development courses. These projects will provide unique insights towards assessment and research opportunities associated with a variety of computerized text analysis approaches.
-
Models for automated scoring of content in educational applications continue to demonstrate improvements in human-machine agreement, but it remains to be demonstrated that the models achieve gains for the “right” reasons. For providing reliable scoring and feedback, both high accuracy and connecting scoring decisions to scoring rubrics are crucial. We provide a quantitative and qualitative analysis of automated scoring models for science explanations of middle school students in an online learning environment that leverages saliency maps to explore the reasons for individual model score predictions. Our analysis reveals that top-performing models can arrive at the same predictions for very different reasons, and that current model architectures have difficulty detecting ideas in student responses beyond keywords.
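The saliency analysis described above attributes a model’s score prediction to parts of the student response. A simplified occlusion-style variant of that idea (not the paper’s gradient-based saliency maps; the helper and the toy keyword scorer are illustrative) measures each token’s importance as the score drop when it is removed:

```python
def occlusion_saliency(score_fn, tokens):
    """Saliency of each distinct token: how much the score drops when
    all occurrences of that token are removed from the response."""
    base = score_fn(tokens)
    return {t: base - score_fn([u for u in tokens if u != t])
            for t in set(tokens)}
```

Against a scorer that only counts keywords, such a map concentrates all saliency on those keywords, which is exactly the keyword-dependence failure mode the analysis above reports.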

