Title: Comparison of Machine Learning Performance Using Analytic and Holistic Coding Approaches Across Constructed Response Assessments Aligned to a Science Learning Progression
Abstract We systematically compared two coding approaches to generate training datasets for machine learning (ML): (i) a holistic approach based on learning progression levels and (ii) a dichotomous, analytic approach of multiple concepts in student reasoning, deconstructed from holistic rubrics. We evaluated four constructed response assessment items for undergraduate physiology, each targeting five levels of a developing flux learning progression in an ion context. Human-coded datasets were used to train two ML models: (i) an ensemble of eight classification algorithms implemented in the Constructed Response Classifier (CRC), and (ii) a single classification algorithm implemented in LightSide Researcher’s Workbench. Human coding agreement on approximately 700 student responses per item was high for both approaches, with Cohen’s kappas ranging from 0.75 to 0.87 on holistic scoring and from 0.78 to 0.89 on analytic composite scoring. ML model performance varied across items and rubric type. For two items, training sets from both coding approaches produced similarly accurate ML models, with differences in Cohen’s kappa between machine and human scores of 0.002 and 0.041. For the other two items, ML models trained on analytically coded responses and used to generate a composite score achieved better performance than models trained on holistic scores, with increases in Cohen’s kappa of 0.043 and 0.117. These items used a more complex scenario involving movement of two ions. It may be that analytic coding is beneficial to unpacking this additional complexity.
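The abstract reports human-human and human-machine agreement as Cohen's kappa. As a minimal sketch of how that statistic can be computed from two lists of scores, the following uses the standard unweighted formula (observed agreement corrected for chance agreement). The function name and the sample scores are hypothetical illustrations, not data or code from the study, and the study's own tools (CRC, LightSide) may compute agreement differently.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two equal-length sequences of labels."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(rater_a)
    # Proportion of responses on which the two raters assign the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical learning progression levels (1-5) for eight responses.
human_scores = [1, 2, 2, 3, 4, 5, 3, 2]
machine_scores = [1, 2, 3, 3, 4, 5, 3, 1]

if __name__ == "__main__":
    kappa = cohens_kappa(human_scores, machine_scores)
    print(f"human-machine kappa: {kappa:.3f}")
```

Comparing two coding approaches for one item then amounts to computing this value once for each trained model against the same human scores and taking the difference.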
Award ID(s):
1660643 1661263
NSF-PAR ID:
10203032
Author(s) / Creator(s):
; ; ; ; ; ;
Date Published:
Journal Name:
Journal of Science Education and Technology
ISSN:
1059-0145
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    Argumentation, a key scientific practice presented in the Framework for K-12 Science Education, requires students to construct and critique arguments, but timely evaluation of arguments in large-scale classrooms is challenging. Recent work has shown the potential of automated scoring systems for open response assessments, leveraging machine learning (ML) and artificial intelligence (AI) to aid the scoring of written arguments in complex assessments. Moreover, research has emphasized that the features (i.e., complexity, diversity, and structure) of the assessment construct are critical to ML scoring accuracy, yet how the assessment construct may be associated with machine scoring accuracy remains unknown. This study investigated how the features associated with the assessment construct of a scientific argumentation assessment item affected machine scoring performance. Specifically, we conceptualized the construct in three dimensions: complexity, diversity, and structure. We employed human experts to code characteristics of the assessment tasks and score middle school student responses to 17 argumentation tasks aligned to three levels of a validated learning progression of scientific argumentation. We randomly selected 361 responses to use as training sets to build machine-learning scoring models for each item. The scoring models yielded a range of agreements with human consensus scores, measured by Cohen’s kappa (mean = 0.60; range 0.38 − 0.89), indicating good to almost perfect performance. We found that higher levels of Complexity and Diversity of the assessment task were associated with decreased model performance; similarly, the relationship between levels of Structure and model performance showed a somewhat negative linear trend. These findings highlight the importance of considering these construct characteristics when developing ML models for scoring assessments, particularly for higher complexity items and multidimensional assessments.

     
  2. Abstract

    Argumentation is fundamental to science education, both as a prominent feature of scientific reasoning and as an effective mode of learning—a perspective reflected in contemporary frameworks and standards. The successful implementation of argumentation in school science, however, requires a paradigm shift in science assessment from the measurement of knowledge and understanding to the measurement of performance and knowledge in use. Performance tasks requiring argumentation must capture the many ways students can construct and evaluate arguments in science, yet such tasks are both expensive and resource‐intensive to score. In this study we explore how machine learning text classification techniques can be applied to develop efficient, valid, and accurate constructed‐response measures of students' competency with written scientific argumentation that are aligned with a validated argumentation learning progression. Data come from 933 middle school students in the San Francisco Bay Area and are based on three sets of argumentation items in three different science contexts. The findings demonstrate that we have been able to develop computer scoring models that can achieve substantial to almost perfect agreement between human‐assigned and computer‐predicted scores. Model performance was slightly weaker for harder items targeting higher levels of the learning progression, largely due to the linguistic complexity of these responses and the sparsity of higher‐level responses in the training data set. Comparing the efficacy of different scoring approaches revealed that breaking down students' arguments into multiple components (e.g., the presence of an accurate claim or providing sufficient evidence), developing computer models for each component, and combining scores from these analytic components into a holistic score produced better results than holistic scoring approaches. 
However, this analytical approach was found to be differentially biased when scoring responses from English learner (EL) students as compared to responses from non‐EL students on some items. Differences in the severity between human and computer scores for EL students between these approaches are explored, and potential sources of bias in automated scoring are discussed.

     
  3. Abstract

    The most common eye infection in people with diabetes is diabetic retinopathy (DR). It might cause blurred vision or even total blindness. Therefore, it is essential to promote early detection to prevent or alleviate the impact of DR. However, due to the possibility that symptoms may not be noticeable in the early stages of DR, it is difficult for doctors to identify them. Therefore, numerous predictive models based on machine learning (ML) and deep learning (DL) have been developed to determine all stages of DR. However, existing DR classification models cannot classify every DR stage or use a computationally heavy approach. Common metrics such as accuracy, F1 score, precision, recall, and AUC-ROC score are not reliable for assessing DR grading. This is because they do not account for two key factors: the severity of the discrepancy between the assigned and predicted grades and the ordered nature of the DR grading scale. 

    This research proposes computationally efficient ensemble methods for the classification of DR. These methods leverage pre-trained model weights, reducing training time and resource requirements. In addition, data augmentation techniques are used to address data limitations, improve features, and improve generalization. This combination offers a promising approach for accurate and robust DR grading. In particular, we take advantage of transfer learning using models trained on DR data and employ CLAHE for image enhancement and Gaussian blur for noise reduction. We propose a three-layer classifier that incorporates dropout and ReLU activation. This design aims to minimize overfitting while effectively extracting features and assigning DR grades. We prioritize the Quadratic Weighted Kappa (QWK) metric due to its sensitivity to label discrepancies, which is crucial for an accurate diagnosis of DR. This combined approach achieves state-of-the-art QWK scores (0.901, 0.967 and 0.944) in the Eyepacs, Aptos, and Messidor datasets.

     
  4. Involving students in scientific modeling practice is one of the most effective approaches to achieving the next generation science education learning goals. Given the complexity and multirepresentational features of scientific models, scoring student-developed models is time- and cost-intensive, remaining one of the most challenging assessment practices for science education. More importantly, teachers who rely on timely feedback to plan and adjust instruction are reluctant to use modeling tasks because they could not provide timely feedback to learners. This study utilized machine learning (ML), the most advanced artificial intelligence (AI), to develop an approach to automatically score student-drawn models and their written descriptions of those models. We developed six modeling assessment tasks for middle school students that integrate disciplinary core ideas and crosscutting concepts with the modeling practice. For each task, we asked students to draw a model and write a description of that model, which gave students with diverse backgrounds an opportunity to represent their understanding in multiple ways. We then collected student responses to the six tasks and had human experts score a subset of those responses. We used the human-scored student responses to develop ML algorithmic models (AMs) and to train the computer. Validation using new data suggests that the machine-assigned scores achieved robust agreements with human consent scores. Qualitative analysis of student-drawn models further revealed five characteristics that might impact machine scoring accuracy: Alternative expression, confusing label, inconsistent size, inconsistent position, and redundant information. We argue that these five characteristics should be considered when developing machine-scorable modeling tasks.
  5. Abstract

    The core concept of genetic information flow was identified in recent calls to improve undergraduate biology education. Previous work shows that students have difficulty differentiating between the three processes of the Central Dogma (CD; replication, transcription, and translation). We built upon this work by developing and applying an analytic coding rubric to 1050 student written responses to a three‐question item about the CD. Each response was previously coded only for correctness using a holistic rubric. Our rubric captures subtleties of student conceptual understanding of each process that previous work has not yet captured at a large scale. Regardless of holistic correctness scores, student responses included five or six distinct ideas. By analyzing common co‐occurring rubric categories in student responses, we found a common pair representing two normative ideas about the molecules produced by each CD process. By applying analytic coding to student responses preinstruction and postinstruction, we found student thinking about the processes involved was most prone to change. The combined strengths of analytic and holistic rubrics allow us to reveal mixed ideas about the CD processes and provide a detailed picture of which conceptual ideas students draw upon when explaining each CD process.

     