Title: Automated scoring of science explanations for multiple NGSS dimensions and knowledge integration
The Next Generation Science Standards (NGSS) emphasize integrating three dimensions of science learning: disciplinary core ideas, cross-cutting concepts, and science and engineering practices. In this study, we develop formative assessments that measure student understanding of the integration of these three dimensions along with automated scoring methods that distinguish among them. The formative assessments allow students to express their emerging ideas while also capturing progress in integrating core ideas, cross-cutting concepts, and practices. We describe how item and rubric design can work in concert with an automated scoring system to independently score science explanations from multiple perspectives. We describe item design considerations and provide validity evidence for the automated scores.
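The abstract does not detail the scoring architecture, so the following is an illustrative sketch only of what independent, multi-perspective scoring can look like: one text classifier per NGSS dimension, so a single explanation receives separate scores for core ideas, crosscutting concepts, and practices. The pipeline choice, dimension names, and data format are assumptions, not the paper's system.

```python
# Illustrative sketch: score one explanation independently on each NGSS
# dimension. Dimension names, features, and data format are hypothetical.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

DIMENSIONS = ["core_ideas", "crosscutting_concepts", "practices"]

def train_dimension_scorers(responses, labels_by_dimension):
    """Fit a separate rubric classifier per dimension so scores stay independent.

    responses: list of student explanation strings
    labels_by_dimension: dict mapping dimension name -> list of rubric scores
    """
    scorers = {}
    for dim in DIMENSIONS:
        model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                              LogisticRegression(max_iter=1000))
        model.fit(responses, labels_by_dimension[dim])
        scorers[dim] = model
    return scorers

def score_explanation(scorers, text):
    """Return one rubric score per NGSS dimension for a new explanation."""
    return {dim: int(model.predict([text])[0]) for dim, model in scorers.items()}
```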
Award ID(s):
1812660
PAR ID:
10184624
Author(s) / Creator(s):
Date Published:
Journal Name:
Annual Meeting of the American Educational Research Association (AERA)
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Educational research supports incorporating active engagement into K-12 education using authentic STEM experiences. While there are discipline-specific resources to provide students with such experiences, there are limited transdisciplinary opportunities that integrate engineering education and technological skill-building to contextualize core scientific concepts. Here, we present an adaptable module that integrates hands-on technology education and place-based learning to improve student understanding of key chemistry concepts as they relate to local environmental science. The module also supports disciplinary core ideas, practices, and cross-cutting concepts in accordance with the Next Generation Science Standards. We field-tested our module in three different high school courses: Chemistry, Oceanography, and Advanced Placement Environmental Science, at schools in Washington, USA. Students built spectrophotometric pH sensors using readily available electronic components and calibrated them with known pH reference standards. Students then used their sensors to measure the pH of local environmental water samples. Assessments showed significant improvement in content knowledge in all three courses relating to the environmental relevance of pH and to the design, use, and environmental application of sensors. Students also reported increased self-confidence in the material, even when their content knowledge remained the same. These findings suggest that classroom sensor building and collection of environmental data increase student understanding and self-confidence by connecting chemistry concepts to local environmental settings.
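As a rough illustration of the calibration step this module describes, the sketch below fits a linear reading-to-pH mapping from known reference standards; the readings, the linear form, and the function names are assumptions, not the module's actual procedure or data.

```python
# Hypothetical calibration sketch for a classroom spectrophotometric pH sensor.
# Reference pH values and sensor readings below are made-up illustrations.
import numpy as np

# Known pH reference standards and the sensor's raw reading for each
# (e.g., a ratio of light intensities through an indicator dye).
reference_ph = np.array([4.0, 7.0, 10.0])
sensor_readings = np.array([0.21, 0.48, 0.79])   # illustrative values

# Fit a linear calibration, reading -> pH, by least squares on the standards.
slope, intercept = np.polyfit(sensor_readings, reference_ph, deg=1)

def reading_to_ph(reading):
    """Convert a raw sensor reading to pH using the fitted calibration."""
    return slope * reading + intercept

# Estimate the pH of a local water sample from its sensor reading.
print(f"Sample pH ~ {reading_to_ph(0.55):.2f}")
```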
  2. Martin, Fred; Norouzi, Narges; Rosenthal, Stephanie (Eds.)
    This paper examines the use of LLMs to support the grading and explanation of short-answer formative assessments in K-12 science topics. While significant work has been done on programmatically scoring well-structured student assessments in math and computer science, many of these approaches produce a numerical score and stop short of providing teachers and students with explanations for the assigned scores. In this paper, we investigate few-shot, in-context learning with chain-of-thought reasoning and active learning using GPT-4 for automated assessment of students’ answers in a middle school Earth Science curriculum. Our findings from this human-in-the-loop approach demonstrate success in scoring formative assessment responses and in providing meaningful explanations for the assigned scores. We then perform a systematic analysis of the advantages and limitations of our approach. This research provides insight into how we can use human-in-the-loop methods for the continual improvement of automated grading for open-ended science assessments.
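A minimal sketch of few-shot, chain-of-thought grading of the kind described above, assuming the OpenAI Python SDK. The rubric, the worked example, and the prompt wording are invented for illustration; they are not the authors' prompt or scoring setup.

```python
# Illustrative few-shot, chain-of-thought grading sketch using GPT-4 via the
# OpenAI Python SDK. The rubric and the worked example are hypothetical.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

FEW_SHOT = """You grade middle-school Earth Science short answers on a 0-2 rubric.
Reason step by step, then end with 'Score: <n>'.

Question: Why do we see phases of the Moon?
Answer: Because Earth's shadow covers part of the Moon.
Reasoning: The response confuses lunar phases with eclipses; phases come from
the Moon's position relative to the Sun, not from Earth's shadow. Partially relevant.
Score: 1
"""

def grade(question: str, student_answer: str) -> str:
    """Return the model's chain-of-thought explanation and assigned score."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": FEW_SHOT},
            {"role": "user",
             "content": f"Question: {question}\nAnswer: {student_answer}\nReasoning:"},
        ],
        temperature=0,
    )
    return response.choices[0].message.content
```

Because the model returns its reasoning along with the score, a teacher in the loop can audit individual grades, which is the human-in-the-loop improvement cycle the paper investigates.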
  3. Abstract: Argumentation, a key scientific practice presented in the Framework for K-12 Science Education, requires students to construct and critique arguments, but timely evaluation of arguments in large-scale classrooms is challenging. Recent work has shown the potential of automated scoring systems for open-response assessments, leveraging machine learning (ML) and artificial intelligence (AI) to aid the scoring of written arguments in complex assessments. Moreover, research has emphasized that features of the assessment construct (i.e., complexity, diversity, and structure) are critical to ML scoring accuracy, yet how the assessment construct may be associated with machine scoring accuracy remains unknown. This study investigated how the features associated with the assessment construct of a scientific argumentation assessment item affected machine scoring performance. Specifically, we conceptualized the construct in three dimensions: complexity, diversity, and structure. We employed human experts to code characteristics of the assessment tasks and score middle school student responses to 17 argumentation tasks aligned to three levels of a validated learning progression of scientific argumentation. We randomly selected 361 responses to use as training sets to build machine-learning scoring models for each item. The scoring models yielded a range of agreements with human consensus scores, measured by Cohen’s kappa (mean = 0.60; range 0.38–0.89), indicating good to almost perfect performance. We found that higher levels of Complexity and Diversity of the assessment task were associated with decreased model performance; similarly, the relationship between levels of Structure and model performance showed a somewhat negative linear trend. These findings highlight the importance of considering these construct characteristics when developing ML models for scoring assessments, particularly for higher-complexity items and multidimensional assessments.
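The agreement statistic used above is straightforward to compute; here is a minimal sketch with invented score vectors, not the study's data.

```python
# Minimal sketch of the evaluation step described above: comparing machine
# scores with human consensus scores via Cohen's kappa. Scores are invented.
from sklearn.metrics import cohen_kappa_score

human_consensus = [0, 1, 2, 1, 0, 2, 1, 1]   # illustrative rubric levels
machine_scores  = [0, 1, 2, 2, 0, 2, 1, 0]

kappa = cohen_kappa_score(human_consensus, machine_scores)
print(f"Cohen's kappa = {kappa:.2f}")  # ~0.64 here; 0.60 would be 'good' agreement
```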
  4. Formative assessments can have positive effects on learning, but few exist for computing, even for basic skills such as program tracing. Instead, teachers often rely on overly broad test questions that lack the diagnostic granularity needed to measure early learning. We followed Kane's framework for assessment validity to design a formative assessment of JavaScript program tracing, developing "an argument for effectiveness for a specific use." This included: 1) a fine-grained scoring model to guide practice, 2) item design to test parts of our fine-grained model with low confound-caused variance, 3) a covering test design that samples from a space of items and covers the scoring model, and 4) a feasibility argument for effectiveness for formative use (can target and improve learning). We contribute a distillation of Kane's framework situated for computing education, and a novel application of Kane's framework to formative assessment of program tracing, focusing on scoring, generalization, and use. Our application also contributes a novel way of modeling possible conceptions of a programming language's semantics by modeling prevalent compositions of control flow and data flow graphs and the paths through them, a process for generating test items, and principles for minimizing item confounds. 
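A hypothetical sketch of what fine-grained scoring of a program-tracing item can look like: each step of the expected trace is scored separately, so the assessment can localize which tracing conception went wrong. The trace representation and the JavaScript snippet are illustrative, not the paper's scoring model.

```python
# Hypothetical sketch of fine-grained scoring for a program-tracing item:
# each step in the expected trace is scored separately, rather than only
# checking the final output. The item and trace format are illustrative.

# Expected trace for a short JavaScript snippet, e.g.:
#   let x = 2; x = x + 3; let y = x * 2;
expected_trace = [("x", 2), ("x", 5), ("y", 10)]

def score_trace(student_trace):
    """Return per-step correctness, giving the diagnostic granularity to see
    which tracing step (initialization, update, expression) went wrong."""
    results = []
    for (var, expected), (s_var, s_val) in zip(expected_trace, student_trace):
        results.append({
            "step": f"{var} = {expected}",
            "correct": (var, expected) == (s_var, s_val),
        })
    return results

# A student who mis-traces the update `x = x + 3` as overwriting x with 3:
print(score_trace([("x", 2), ("x", 3), ("y", 6)]))
```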
  5. With the widespread adoption of the Next Generation Science Standards (NGSS), science teachers and online learning environments face the challenge of evaluating students' integration of different dimensions of science learning. Recent advances in representation learning in natural language processing have proven effective across many natural language processing tasks, but a rigorous evaluation of the relative merits of these methods for scoring complex constructed response formative assessments has not previously been carried out. We present a detailed empirical investigation of feature-based, recurrent neural network, and pre-trained transformer models on scoring content in real-world formative assessment data. We demonstrate that recent neural methods can rival or exceed the performance of feature-based methods. We also provide evidence that different classes of neural models take advantage of different learning cues, and pre-trained transformer models may be more robust to spurious, dataset-specific learning cues, better reflecting scoring rubrics. 
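A minimal sketch of the feature-based class of models compared above; the responses, rubric scores, and feature choices are placeholders, not the study's dataset.

```python
# Illustrative feature-based scoring baseline: TF-IDF n-gram features plus a
# linear classifier, evaluated by cross-validation. Data is invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

responses = [
    "The ice melts because heat energy transfers from the warm air.",
    "It melts because it is ice.",
    "Thermal energy flows from warmer to cooler objects until equilibrium.",
    "Water is wet.",
]
scores = [2, 0, 2, 0]  # illustrative rubric scores

baseline = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                         LogisticRegression(max_iter=1000))
print(cross_val_score(baseline, responses, scores, cv=2).mean())
```

A pretrained-transformer comparison would instead fine-tune a model such as BERT on the same responses and labels; the study's finding is that such models can rival or exceed hand-engineered features and may rely less on spurious, dataset-specific cues.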