

Title: How do middle school students think about climate change?
We use natural language processing (NLP) to train an automated scoring model to assess students’ reasoning on how to slow climate change. We use the insights from scoring over 1000 explanations to design a knowledge integration intervention and test it in three classrooms. The intervention supported students to distinguish relevant evidence, improving connections between ideas in a revised explanation. We discuss next steps for using the NLP model to support teachers and students in classrooms.
Award ID(s):
1813713
NSF-PAR ID:
10342436
Author(s) / Creator(s):
Editor(s):
Chinn, C.; Tan, E.; Chan, C.; Kali, Y.
Date Published:
Journal Name:
Proceedings of the 16th International Conference of the Learning Sciences—ICLS 2022
Page Range / eLocation ID:
2198-2199
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    Argumentation, a key scientific practice presented in the Framework for K-12 Science Education, requires students to construct and critique arguments, but timely evaluation of arguments in large-scale classrooms is challenging. Recent work has shown the potential of automated scoring systems for open-response assessments, leveraging machine learning (ML) and artificial intelligence (AI) to aid the scoring of written arguments in complex assessments. Moreover, research has shown that features of the assessment construct (i.e., complexity, diversity, and structure) are critical to ML scoring accuracy, yet how the assessment construct is associated with machine scoring accuracy remains unknown. This study investigated how features of the assessment construct of a scientific argumentation assessment item affected machine scoring performance. Specifically, we conceptualized the construct in three dimensions: complexity, diversity, and structure. We employed human experts to code characteristics of the assessment tasks and score middle school student responses to 17 argumentation tasks aligned to three levels of a validated learning progression of scientific argumentation. We randomly selected 361 responses to use as training sets to build machine-learning scoring models for each item. The scoring models yielded a range of agreements with human consensus scores, measured by Cohen’s kappa (mean = 0.60; range 0.38−0.89), indicating good to almost perfect performance. We found that higher levels of Complexity and Diversity of the assessment task were associated with decreased model performance; similarly, the relationship between levels of Structure and model performance showed a somewhat negative linear trend. These findings highlight the importance of considering these construct characteristics when developing ML models for scoring assessments, particularly for higher-complexity items and multidimensional assessments.
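Cohen’s kappa, the agreement statistic reported above, corrects raw rater agreement for chance agreement implied by each rater’s score distribution. A minimal pure-Python sketch (the score sequences below are illustrative, not data from the study):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # Observed proportion of exact agreements
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement expected from each rater's marginal label frequencies
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    expected = sum((freq_a[c] / n) * (freq_b[c] / n) for c in labels)
    return (observed - expected) / (1 - expected)

# Illustrative machine scores vs. human consensus scores on a 3-level scale
machine = [1, 2, 2, 3, 1, 2, 3, 3, 1, 2]
human   = [1, 2, 2, 3, 1, 3, 3, 3, 1, 1]
print(round(cohens_kappa(machine, human), 3))  # → 0.706
```

Here 8 of 10 scores agree (0.80 observed), but the marginal distributions alone would produce 0.32 agreement by chance, so kappa lands well below the raw agreement rate.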

     
  2. The current study examined the neural correlates of spatial rotation in eight engineering undergraduates. Mastering engineering graphics requires students to mentally visualize in 3D and mentally rotate parts when developing 2D drawings, so students’ spatial rotation skills play a significant role in learning and mastering engineering graphics. Traditionally, the assessment of students’ spatial skills involves no measurement of neural activity during performance of spatial rotation tasks. We used electroencephalography (EEG) to record neural activity while students performed the Revised Purdue Spatial Visualization Test: Visualization of Rotations (Revised PSVT:R). The two main objectives were to 1) determine whether high versus low performers on the Revised PSVT:R show differences in EEG oscillations and 2) identify EEG oscillatory frequency bands sensitive to item difficulty on the Revised PSVT:R.
Overall performance on the Revised PSVT:R determined whether participants were considered high or low performers: students scoring 90% or higher were considered high performers (5 students), whereas students scoring under 90% were considered low performers (3 students). Time-frequency analysis of the EEG data quantified power in several oscillatory frequency bands (delta, theta, alpha, beta, gamma) for comparison between low and high performers, as well as between difficulty levels of the spatial rotation problems. Although we did not find any significant effects of performer group (high, low) on EEG power, we observed a trend toward reduced absolute delta and gamma power for hard problems relative to easier problems. Decreases in delta power have been reported elsewhere for difficult relative to easy arithmetic calculations and have been attributed to greater external attention (e.g., attention to the stimuli/numbers) and, consequently, reduced internal attention (e.g., mentally performing the calculation).
In the current task, three spatial objects are presented per item. An example rotation stimulus shows a spatial object before and after rotation; a target stimulus, i.e., a spatial object before rotation, is then displayed, and students must choose the one of five multiple-choice options that correctly represents the object after the same rotation. Reduced delta power in the current task implies that students showed greater attention to the example and target stimuli for the hard problems relative to the moderate and easy problems. Preliminary findings therefore suggest that students are less efficient at encoding the target stimuli (external attention) prior to mental rotation (internal attention) as task difficulty increases. Our findings indicate that delta power may be used to identify spatial rotation items that are especially challenging for students. We may then determine the efficacy of spatial rotation interventions among engineering students, using increased delta power as an index of internal attention. Further, in future work we will use eye-tracking to assess whether our intervention decreases eye fixations (e.g., time spent viewing) on the target stimulus in the Revised PSVT:R. By simultaneously using EEG and eye-tracking, we may identify changes in internal attention and encoding of the target stimuli that are predictive of improvements in spatial rotation skills among engineering students.
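The abstract does not detail its time-frequency pipeline; as one common starting point, absolute power in the named frequency bands can be estimated from a simple periodogram. A minimal NumPy sketch on a synthetic signal (the band edges, sampling rate, and signal are illustrative assumptions, not the study’s recording parameters):

```python
import numpy as np

# Conventional EEG band boundaries in Hz (definitions vary across labs)
BANDS = {"delta": (1, 4), "theta": (4, 8), "alpha": (8, 13),
         "beta": (13, 30), "gamma": (30, 80)}

def band_power(signal, fs, band):
    """Absolute power within [lo, hi) Hz via a one-shot periodogram."""
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    psd = np.abs(np.fft.rfft(signal)) ** 2 / (fs * len(signal))
    lo, hi = band
    mask = (freqs >= lo) & (freqs < hi)
    return np.sum(psd[mask]) * (freqs[1] - freqs[0])

# Synthetic 2 s "EEG" trace: strong 2 Hz (delta) + weak 40 Hz (gamma) component
fs = 250
t = np.arange(0, 2, 1 / fs)
eeg = 3.0 * np.sin(2 * np.pi * 2 * t) + 0.5 * np.sin(2 * np.pi * 40 * t)
powers = {name: band_power(eeg, fs, rng) for name, rng in BANDS.items()}
print(max(powers, key=powers.get))  # delta dominates this synthetic trace
```

In practice a windowed estimator (e.g., Welch’s method) and per-epoch averaging would be used instead of a single periodogram, but the band-masking step is the same.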
  3. Qualitative analysis of verbal data is of central importance in the learning sciences. It is labor-intensive and time-consuming, however, which limits the amount of data researchers can include in studies. This work is a step toward building a statistical machine learning (ML) method for automated support of qualitative analyses of students' writing, here specifically scoring laboratory reports in introductory biology for sophistication of argumentation and reasoning. We start with a set of lab reports from an undergraduate biology course, scored by a four-level scheme that considers the complexity of argument structure, the scope of evidence, and the care and nuance of conclusions. Using this set of labeled data, we show that a popular natural language processing pipeline, namely vector representation of words (a.k.a. word embeddings) followed by a Long Short-Term Memory (LSTM) model that captures language generation as a state-space model, can quantitatively capture the scoring, with a high Quadratic Weighted Kappa (QWK) prediction score, when trained via a novel contrastive learning setup. We show that the ML algorithm approached the inter-rater reliability of human analysis. Ultimately, we conclude that machine learning (ML) for natural language processing (NLP) holds promise for assisting learning sciences researchers in conducting qualitative studies at much larger scales than is currently possible.
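The QWK metric reported above extends Cohen’s kappa to ordinal scores by penalizing large disagreements quadratically more than off-by-one disagreements. A minimal NumPy sketch (the 4-level scores below are illustrative, not the study’s data):

```python
import numpy as np

def quadratic_weighted_kappa(y_true, y_pred, n_levels):
    """QWK: chance-corrected agreement on ordinal scores 0..n_levels-1,
    weighting each disagreement by the squared score distance."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    # Observed confusion matrix of (true, predicted) score pairs
    observed = np.zeros((n_levels, n_levels))
    for t, p in zip(y_true, y_pred):
        observed[t, p] += 1
    # Expected matrix from the two marginal score distributions
    expected = np.outer(np.bincount(y_true, minlength=n_levels),
                        np.bincount(y_pred, minlength=n_levels)) / len(y_true)
    # Quadratic disagreement weights: 0 on the diagonal, 1 at the corners
    i, j = np.indices((n_levels, n_levels))
    weights = (i - j) ** 2 / (n_levels - 1) ** 2
    return 1 - (weights * observed).sum() / (weights * expected).sum()

# Illustrative 4-level lab-report scores (0-3): human rater vs. model
human = [0, 1, 1, 2, 3, 2, 0, 3, 1, 2]
model = [0, 1, 2, 2, 3, 2, 1, 3, 1, 1]
print(round(quadratic_weighted_kappa(human, model, 4), 3))  # → 0.842
```

Because the three disagreements above are all off by one level, the quadratic weights keep the penalty small; a single 0-vs-3 disagreement would cost nine times as much.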
  4. With the increasing prevalence of large language models (LLMs) such as ChatGPT, there is a growing need to integrate natural language processing (NLP) into K-12 education to better prepare young learners for the future AI landscape. NLP, a sub-field of AI that serves as the foundation of LLMs and many advanced AI applications, holds the potential to enrich learning in core subjects in K-12 classrooms. In this experience report, we present our efforts to integrate NLP into science classrooms with 98 middle school students across two US states, aiming to increase students’ experience and engagement with NLP models through textual data analyses and visualizations. We designed learning activities, developed an NLP-based interactive visualization platform, and facilitated classroom learning in close collaboration with middle school science teachers. This experience report aims to contribute to the growing body of work on integrating NLP into K-12 education by providing insights and practical guidelines for practitioners, researchers, and curriculum designers. 
  5. Abstract

    Neuropsychiatric disorders pose a high societal cost, but their treatment is hindered by lack of objective outcomes and fidelity metrics. AI technologies and specifically Natural Language Processing (NLP) have emerged as tools to study mental health interventions (MHI) at the level of their constituent conversations. However, NLP’s potential to address clinical and research challenges remains unclear. We therefore conducted a pre-registered systematic review of NLP-MHI studies using PRISMA guidelines (osf.io/s52jh) to evaluate their models, clinical applications, and to identify biases and gaps. Candidate studies (n = 19,756), including peer-reviewed AI conference manuscripts, were collected up to January 2023 through PubMed, PsycINFO, Scopus, Google Scholar, and ArXiv. A total of 102 articles were included to investigate their computational characteristics (NLP algorithms, audio features, machine learning pipelines, outcome metrics), clinical characteristics (clinical ground truths, study samples, clinical focus), and limitations. Results indicate a rapid growth of NLP MHI studies since 2019, characterized by increased sample sizes and use of large language models. Digital health platforms were the largest providers of MHI data. Ground truth for supervised learning models was based on clinician ratings (n = 31), patient self-report (n = 29) and annotations by raters (n = 26). Text-based features contributed more to model accuracy than audio markers. Patients’ clinical presentation (n = 34), response to intervention (n = 11), intervention monitoring (n = 20), providers’ characteristics (n = 12), relational dynamics (n = 14), and data preparation (n = 4) were commonly investigated clinical categories. Limitations of reviewed studies included lack of linguistic diversity, limited reproducibility, and population bias. 
A research framework (NLPxMHI) is developed and validated to assist computational and clinical researchers in addressing the remaining gaps in applying NLP to MHI, with the goal of improving clinical utility, data access, and fairness.

     