

Search results: Creators/Authors contains "Zhai, Xiaoming"


  1. Free, publicly-accessible full text available June 2, 2025
  2. Abstract

    Argumentation, a key scientific practice presented in the Framework for K-12 Science Education, requires students to construct and critique arguments, but timely evaluation of arguments in large-scale classrooms is challenging. Recent work has shown the potential of automated scoring systems for open-response assessments, leveraging machine learning (ML) and artificial intelligence (AI) to aid the scoring of written arguments in complex assessments. Moreover, research has emphasized that the features (i.e., complexity, diversity, and structure) of the assessment construct are critical to ML scoring accuracy, yet how the assessment construct may be associated with machine scoring accuracy remains unknown. This study investigated how the features associated with the assessment construct of a scientific argumentation assessment item affected machine scoring performance. Specifically, we conceptualized the construct in three dimensions: complexity, diversity, and structure. We employed human experts to code characteristics of the assessment tasks and score middle school student responses to 17 argumentation tasks aligned to three levels of a validated learning progression of scientific argumentation. We randomly selected 361 responses to use as training sets to build machine-learning scoring models for each item. The scoring models yielded a range of agreements with human consensus scores, measured by Cohen's kappa (mean = 0.60; range 0.38–0.89), indicating good to almost perfect performance. We found that higher levels of Complexity and Diversity of the assessment task were associated with decreased model performance; similarly, the relationship between levels of Structure and model performance showed a somewhat negative linear trend. These findings highlight the importance of considering these construct characteristics when developing ML models for scoring assessments, particularly for higher-complexity items and multidimensional assessments.

     
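The kappa values reported in the abstract above quantify agreement between machine-assigned and human consensus scores. As a minimal illustration only (not the study's scoring pipeline), the sketch below computes Cohen's kappa for a pair of hypothetical score vectors using scikit-learn.

```python
# Minimal sketch: Cohen's kappa between human consensus scores and
# machine-predicted scores. The score arrays are illustrative stand-ins.
from sklearn.metrics import cohen_kappa_score

# Hypothetical scores for ten responses on a 0-2 learning-progression scale.
human_scores = [0, 1, 2, 1, 0, 2, 1, 1, 2, 0]
machine_scores = [0, 1, 2, 2, 0, 2, 1, 0, 2, 0]

kappa = cohen_kappa_score(human_scores, machine_scores)
print(f"Cohen's kappa: {kappa:.2f}")

# Common rule-of-thumb interpretation (Landis & Koch):
#   0.41-0.60 moderate, 0.61-0.80 substantial, 0.81-1.00 almost perfect.
```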
  3. Abstract

    Most existing diagnostic models are developed to detect whether students have mastered a set of skills of interest, but few have focused on identifying what scientific misconceptions students possess. This article developed a general dual‐purpose model for simultaneously estimating students' overall ability and the presence and absence of misconceptions. The expectation‐maximization algorithm was developed to estimate the model parameters. A simulation study was conducted to evaluate to what extent the parameters can be accurately recovered under varied conditions. A set of real data in science education was also analyzed to examine the viability of the proposed model in practice.

     
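The abstract above describes an expectation-maximization (EM) routine for estimating the dual-purpose model's parameters but does not reproduce it. As a rough, generic illustration of how an EM cycle recovers latent-class parameters from simulated binary responses, here is a sketch of a plain two-class latent class model; it is not the authors' model, and all data-generating values are invented.

```python
# Illustrative EM for a simple two-class latent class model with binary items.
# This is NOT the paper's dual-purpose diagnostic model, only a generic
# demonstration of the E-step / M-step cycle on simulated data.
import numpy as np

rng = np.random.default_rng(0)

# Simulate data: 500 students, 8 binary items, two latent classes.
N, J = 500, 8
true_pi = 0.4                                  # proportion of students in class 1
true_p = np.array([[0.8] * J, [0.3] * J])      # P(correct response | class)
c = (rng.random(N) < true_pi).astype(int)      # latent class membership
X = (rng.random((N, J)) < true_p[c]).astype(float)

# EM initialization
pi = np.array([0.5, 0.5])
p = rng.uniform(0.3, 0.7, size=(2, J))

for _ in range(200):
    # E-step: posterior probability of each class for each student.
    log_lik = X @ np.log(p).T + (1 - X) @ np.log(1 - p).T   # shape (N, 2)
    log_post = log_lik + np.log(pi)
    log_post -= log_post.max(axis=1, keepdims=True)
    r = np.exp(log_post)
    r /= r.sum(axis=1, keepdims=True)

    # M-step: update class proportions and item-response probabilities.
    pi = r.mean(axis=0)
    p = np.clip((r.T @ X) / r.sum(axis=0)[:, None], 1e-6, 1 - 1e-6)

print("estimated class proportions:", np.round(pi, 2))
print("estimated P(correct) by class:", np.round(p, 2))
```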
  4. Abstract

    In response to Li, Reigh, He, and Miller's commentary, "Can we and should we use artificial intelligence for formative assessment in science," we argue that artificial intelligence (AI) is already being widely employed in formative assessment across various educational contexts. While agreeing with Li et al.'s call for further studies on equity issues related to AI, we emphasize the need for science educators to adapt to the AI revolution that has outpaced the research community. We challenge the somewhat restrictive view of formative assessment presented by Li et al., highlighting the significant contributions of AI in providing formative feedback to students, assisting teachers in assessment practices, and aiding in instructional decisions. We contend that AI-generated scores should not be equated with the entirety of formative assessment practice; no single assessment tool can capture all aspects of student thinking and backgrounds. We address concerns raised by Li et al. regarding AI bias and emphasize the importance of empirical testing and evidence-based arguments when making claims of bias. We assert that AI-based formative assessment does not necessarily lead to inequity and can, in fact, contribute to more equitable educational experiences. Furthermore, we discuss how AI can facilitate the diversification of representational modalities in assessment practices and highlight the potential benefits of AI in saving teachers' time and providing them with valuable assessment information. We call for a shift in perspective, from viewing AI as a problem to be solved to recognizing its potential as a collaborative tool in education. We emphasize the need for future research to focus on the effective integration of AI in classrooms, teacher education, and the development of AI systems that can adapt to diverse teaching and learning contexts. We conclude by underlining the importance of addressing AI bias, understanding its implications, and developing guidelines for best practices in AI-based formative assessment.

     
  5. Involving students in scientific modeling practice is one of the most effective approaches to achieving the next generation science education learning goals. Given the complexity and multirepresentational features of scientific models, scoring student-developed models is time- and cost-intensive, remaining one of the most challenging assessment practices for science education. More importantly, teachers who rely on timely feedback to plan and adjust instruction are reluctant to use modeling tasks because they cannot provide timely feedback to learners. This study utilized machine learning (ML), an advanced artificial intelligence (AI) technique, to develop an approach to automatically score student-drawn models and their written descriptions of those models. We developed six modeling assessment tasks for middle school students that integrate disciplinary core ideas and crosscutting concepts with the modeling practice. For each task, we asked students to draw a model and write a description of that model, which gave students with diverse backgrounds an opportunity to represent their understanding in multiple ways. We then collected student responses to the six tasks and had human experts score a subset of those responses. We used the human-scored student responses to develop ML algorithmic models (AMs) and to train the computer. Validation using new data suggests that the machine-assigned scores achieved robust agreement with human consensus scores. Qualitative analysis of student-drawn models further revealed five characteristics that might impact machine scoring accuracy: alternative expression, confusing label, inconsistent size, inconsistent position, and redundant information. We argue that these five characteristics should be considered when developing machine-scorable modeling tasks.
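The abstract above does not name the algorithms behind its scoring models, so the sketch below only illustrates one common way such a text scorer for written model descriptions could be built: TF-IDF features with logistic regression, trained on hypothetical human-scored responses. The data and pipeline choices are assumptions, not the study's.

```python
# Hypothetical sketch: training a text scorer for students' written model
# descriptions. The algorithm (TF-IDF + logistic regression) and all data
# are illustrative stand-ins, not the study's actual scoring models.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# A small hypothetical training set of human-scored written descriptions
# (score levels 0-2 on a rubric).
train_texts = [
    "the particles move faster when heated and spread farther apart",
    "energy transfers to the molecules, which increases their motion",
    "the water disappears because it gets hot",
    "the liquid turns into gas near the surface",
    "it just goes away",
    "nothing happens to the particles",
]
train_scores = [2, 2, 1, 1, 0, 0]

scorer = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),   # word and bigram features
    LogisticRegression(max_iter=1000),     # multi-class classifier
)
scorer.fit(train_texts, train_scores)

# Score new, unscored responses; in practice, agreement with human scores on a
# held-out validation set would be checked (e.g., with Cohen's kappa).
new_texts = ["heating makes the molecules move faster and spread out"]
print(scorer.predict(new_texts))
```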
  6. Abstract

    Argumentation is fundamental to science education, both as a prominent feature of scientific reasoning and as an effective mode of learning, a perspective reflected in contemporary frameworks and standards. The successful implementation of argumentation in school science, however, requires a paradigm shift in science assessment from the measurement of knowledge and understanding to the measurement of performance and knowledge in use. Performance tasks requiring argumentation must capture the many ways students can construct and evaluate arguments in science, yet such tasks are both expensive and resource-intensive to score. In this study we explore how machine learning text classification techniques can be applied to develop efficient, valid, and accurate constructed-response measures of students' competency with written scientific argumentation that are aligned with a validated argumentation learning progression. Data come from 933 middle school students in the San Francisco Bay Area and are based on three sets of argumentation items in three different science contexts. The findings demonstrate that we have been able to develop computer scoring models that can achieve substantial to almost perfect agreement between human-assigned and computer-predicted scores. Model performance was slightly weaker for harder items targeting higher levels of the learning progression, largely due to the linguistic complexity of these responses and the sparsity of higher-level responses in the training data set. Comparing the efficacy of different scoring approaches revealed that breaking down students' arguments into multiple components (e.g., the presence of an accurate claim or providing sufficient evidence), developing computer models for each component, and combining scores from these analytic components into a holistic score produced better results than holistic scoring approaches. However, this analytical approach was found to be differentially biased when scoring responses from English learner (EL) students as compared to responses from non-EL students on some items. Differences in severity between human and computer scores for EL students across these approaches are explored, and potential sources of bias in automated scoring are discussed.

     
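The comparison above favors scoring analytic components separately and then combining them into a holistic level. The sketch below shows what such a combination step could look like; the component names and the mapping rule are hypothetical, since the abstract does not specify the actual rubric.

```python
# Hypothetical sketch of combining analytic component scores into a holistic
# argumentation level. The components and the mapping rule are illustrative;
# the study's actual rubric is not reproduced here.
from typing import Dict

def holistic_score(components: Dict[str, int]) -> int:
    """Combine binary analytic component scores into a 0-3 holistic level.

    Components (all hypothetical): accurate_claim, sufficient_evidence,
    reasoning_links_evidence_to_claim.
    """
    if not components.get("accurate_claim", 0):
        return 0
    level = 1
    if components.get("sufficient_evidence", 0):
        level = 2
        if components.get("reasoning_links_evidence_to_claim", 0):
            level = 3
    return level

# In the analytic approach, each component score would come from its own
# trained classifier before being combined.
example = {
    "accurate_claim": 1,
    "sufficient_evidence": 1,
    "reasoning_links_evidence_to_claim": 0,
}
print(holistic_score(example))  # -> 2
```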
  7. Abstract

    This study develops a framework to conceptualize the use and evolution of machine learning (ML) in science assessment. We systematically reviewed 47 studies that applied ML in science assessment and classified them into five categories: (a) constructed response, (b) essay, (c) simulation, (d) educational game, and (e) inter-discipline. We compared the ML-based and conventional science assessments and extracted 12 critical characteristics to map three variables in a three-dimensional framework: construct, functionality, and automaticity. The 12 characteristics, used to construct a profile of ML-based science assessment for each article, were further analyzed by a two-step cluster analysis. The clusters identified for each variable were summarized into four levels to illustrate the evolution of each. We further conducted cluster analysis to identify four classes of assessment across the three variables. Based on the analysis, we conclude that ML has transformed, but not yet redefined, conventional science assessment practice in terms of fundamental purpose, the nature of the science assessment, and the relevant assessment challenges. Along with the three-dimensional framework, we propose five anticipated trends for incorporating ML in science assessment practice for future studies: addressing developmental cognition, changing the process of educational decision making, personalized science learning, borrowing 'good' to advance 'good', and integrating knowledge from other disciplines into science assessment.

     
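The review above profiles each study on 12 characteristics and then groups the profiles with a two-step cluster analysis. As a loose approximation only, the sketch below clusters a hypothetical binary profile matrix with k-means; neither the data nor the algorithm choice reflects the actual analysis.

```python
# Hypothetical sketch: clustering studies by their assessment-characteristic
# profiles. KMeans is a simple stand-in for the two-step cluster analysis
# described in the abstract; the profile matrix is invented.
import numpy as np
from sklearn.cluster import KMeans

# Rows = reviewed studies, columns = binary characteristic indicators
# (e.g., "scores constructed responses", "uses simulation logs", ...).
rng = np.random.default_rng(1)
profiles = rng.integers(0, 2, size=(47, 12))   # 47 studies x 12 characteristics

kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(profiles)
print("cluster sizes:", np.bincount(kmeans.labels_))
```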