Search for: All records

Award ID contains: 1903304



  1. This exploratory study delves into the complex challenge of analyzing and interpreting student responses to mathematical problems, typically conveyed through image formats within online learning platforms. The main goal of this research is to identify and differentiate various student strategies within a dataset comprising image-based mathematical work. A comprehensive approach is implemented, including various image representation, preprocessing, and clustering techniques, each evaluated to fulfill the study’s objectives. The exploration spans several methods for enhanced image representation, extending from conventional pixel-based approaches to the innovative deployment of CLIP embeddings. Given the prevalent noise and variability in our dataset, an ablation study is conducted to meticulously evaluate the impact of various preprocessing steps, assessing their potency in eradicating extraneous backgrounds and noise to more precisely isolate relevant mathematical content. Two clustering approaches, k-means and hierarchical clustering, are employed to categorize images based on the student strategies that underlie their responses. Preliminary results indicate that the hierarchical clustering method could distinguish between student strategies effectively. Our study lays down a robust framework for characterizing and understanding student strategies in online mathematics problem-solving, paving the way for future research into scalable and precise analytical methodologies while introducing a novel open-source image dataset for the learning analytics research community.
    Free, publicly-accessible full text available January 1, 2025
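The pipeline in item 1 could be prototyped roughly as follows. This is a minimal sketch, assuming the openai/clip-vit-base-patch32 checkpoint, scikit-learn's agglomerative clustering, a hypothetical student_work_images/ folder, and an arbitrary cluster count, none of which are taken from the paper.

```python
# Minimal sketch: CLIP image embeddings followed by hierarchical clustering.
# Folder name and cluster count are illustrative assumptions.
from pathlib import Path

import torch
from PIL import Image
from sklearn.cluster import AgglomerativeClustering
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_images(image_paths):
    """Return one CLIP image embedding per student-work image."""
    images = [Image.open(p).convert("RGB") for p in image_paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    # L2-normalize so Euclidean distance tracks cosine similarity.
    return torch.nn.functional.normalize(feats, dim=-1).numpy()

paths = sorted(Path("student_work_images").glob("*.png"))  # hypothetical folder
embeddings = embed_images(paths)

# Hierarchical (agglomerative) clustering groups images by presumed strategy.
labels = AgglomerativeClustering(n_clusters=5).fit_predict(embeddings)
```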
  2. Human-conducted rating tasks are resource-intensive and demand significant time and financial commitments. As Large Language Models (LLMs) like GPT emerge and exhibit prowess across various domains, their potential in automating such evaluation tasks becomes evident. In this research, we leveraged four prominent LLMs: GPT-4, GPT-3.5, Vicuna, and PaLM 2, to scrutinize their aptitude in evaluating teacher-authored mathematical explanations. We utilized a detailed rubric that encompassed accuracy, explanation clarity, the correctness of mathematical notation, and the efficacy of problem-solving strategies. During our investigation, we unexpectedly discerned the influence of HTML formatting on these evaluations. Notably, GPT-4 consistently favored explanations formatted with HTML, whereas the other models displayed mixed inclinations. When gauging Inter-Rater Reliability (IRR) among these models, only Vicuna and PaLM 2 demonstrated high IRR using the conventional Cohen’s Kappa metric for explanations formatted with HTML. Intriguingly, when a more relaxed version of the metric was applied, all model pairings showcased robust agreement. These revelations not only underscore the potential of LLMs in providing feedback on student-generated content but also illuminate new avenues, such as reinforcement learning, which can harness the consistent feedback from these models. 
    Free, publicly-accessible full text available January 1, 2025
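The reliability check in item 2 can be illustrated with scikit-learn. A minimal sketch, assuming placeholder rubric scores and using weighted kappa as one possible "relaxed" variant; the paper's exact relaxation may differ.

```python
# Inter-rater reliability between two LLM raters: conventional Cohen's kappa
# plus a linearly weighted variant that penalizes near-misses less.
# The score lists are placeholder data, not the study's ratings.
from sklearn.metrics import cohen_kappa_score

vicuna_scores = [3, 4, 2, 4, 1, 3, 4, 2]   # hypothetical rubric ratings (1-4)
palm2_scores  = [3, 4, 2, 3, 1, 3, 4, 1]

# Exact-agreement Cohen's kappa.
kappa = cohen_kappa_score(vicuna_scores, palm2_scores)

# One way to relax the metric: linear weights treat one-point disagreements
# as partial agreement rather than full disagreement.
relaxed_kappa = cohen_kappa_score(vicuna_scores, palm2_scores, weights="linear")

print(f"kappa={kappa:.2f}, weighted kappa={relaxed_kappa:.2f}")
```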
  3. The development and measurable improvements in performance of large language models on natural language tasks open the opportunity to utilize large language models in an educational setting to replicate human tutoring, which is often costly and inaccessible. We are particularly interested in large language models from the GPT series, created by OpenAI. In the original study we found that the quality of explanations generated with GPT-3.5 was poor, where two different approaches to generating explanations resulted in success rates of 43% and 10%. In a replication study, we were interested in whether the measurable improvements in GPT-4 performance led to a higher rate of success for generating valid explanations compared to GPT-3.5. A replication of the original study was conducted by using GPT-4 to generate explanations for the same problems given to GPT-3.5. Using GPT-4, explanation correctness dramatically improved to a success rate of 94%. We were further interested in evaluating whether GPT-4 explanations were positively perceived compared to human-written explanations. A preregistered follow-up study was implemented where 10 evaluators were asked to rate the quality of randomized GPT-4 and teacher-created explanations. Even with 4% of problems containing some amount of incorrect content, GPT-4 explanations were preferred over human explanations.
    Free, publicly-accessible full text available January 1, 2025
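Explanation generation of the kind described in item 3 might look like the following. A minimal sketch, assuming the current OpenAI Python client, the gpt-4 model name, and an invented tutoring prompt rather than the study's actual prompts.

```python
# Sketch of prompting GPT-4 for a step-by-step math explanation.
# Prompt wording and generation settings are assumptions, not the study's.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_explanation(problem_text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "You are a middle school math tutor. Explain how to "
                        "solve the problem step by step without skipping work."},
            {"role": "user", "content": problem_text},
        ],
        temperature=0.2,
    )
    return response.choices[0].message.content

print(generate_explanation("What is 3/4 of 2/5? Show each step."))
```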
  4. We present a conversational AI tutor (CAIT) for the purpose of aiding students on middle school math problems. CAIT was created using the CLASS framework; it is an LLM fine-tuned from Vicuna on a conversational dataset created by prompting ChatGPT with problems and explanations from ASSISTments. CAIT is trained to generate scaffolding questions, provide hints, and correct mistakes on math problems. We find that CAIT identifies 60% of correct answers as correct, generates effective sub-problems 33% of the time, and has a positive sentiment 72% of the time, with the remaining 28% of interactions being neutral. This paper discusses the hurdles to further implementation of CAIT into ASSISTments, namely improving the accuracy and efficacy of sub-problems, and establishes CAIT as a proof of concept that the CLASS framework can be applied to create an effective mathematics tutorbot.
    Free, publicly-accessible full text available January 1, 2025
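The dataset construction step described in item 4 could be sketched as below. The prompt wording, model name, and helper function are assumptions for illustration, not the authors' CLASS pipeline.

```python
# Sketch: turn a problem and its explanation into a tutoring dialogue by
# prompting ChatGPT, yielding one fine-tuning example. Prompt text is invented.
from openai import OpenAI

client = OpenAI()

SCAFFOLD_PROMPT = (
    "Given the math problem and its explanation below, write a tutoring "
    "dialogue in which the tutor asks scaffolding sub-questions, gives hints, "
    "and corrects mistakes instead of revealing the answer directly.\n\n"
    "Problem: {problem}\nExplanation: {explanation}"
)

def make_dialogue_example(problem: str, explanation: str) -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user",
                   "content": SCAFFOLD_PROMPT.format(problem=problem,
                                                     explanation=explanation)}],
    )
    return response.choices[0].message.content
```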
  5. Feedback is a crucial factor in mathematics learning and instruction. Whether expressed as indicators of correctness or textual comments, feedback can help guide students’ understanding of content. Beyond this, however, teacher-written messages and comments can provide motivational and affective benefits for students. The question emerges as to what constitutes effective feedback to promote not only student learning but also motivation and engagement. Teachers may have different perceptions of what constitutes effective feedback and may use different tones in their writing to communicate their sentiment while assessing student work. This study aims to investigate trends in teacher sentiment and tone when providing feedback to students in a middle school mathematics class context. Toward this, we examine the applicability of state-of-the-art sentiment analysis methods in a mathematics context and explore the use of punctuation marks in teacher feedback messages as a measure of tone.
    Free, publicly-accessible full text available July 1, 2024
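The analysis in item 5 could be prototyped as follows. A minimal sketch, assuming an off-the-shelf Hugging Face sentiment pipeline, invented feedback messages, and simple punctuation counts as a rough tone proxy.

```python
# Sentiment scoring of teacher feedback plus punctuation counts as a tone proxy.
# Example messages and the default pipeline model are illustrative assumptions.
from collections import Counter

from transformers import pipeline

feedback_messages = [
    "Great job showing your work! Just double-check the sign in step 2.",
    "You need to explain how you got 12. Where did that number come from?",
]

sentiment = pipeline("sentiment-analysis")  # downloads a default English model

for msg in feedback_messages:
    label = sentiment(msg)[0]                       # {'label': ..., 'score': ...}
    punct = Counter(ch for ch in msg if ch in "!?.")  # crude tone signal
    print(label["label"], dict(punct), msg)
```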
  6. In order to facilitate student learning, it is important to identify and remediate misconceptions and incomplete knowledge pertaining to the assigned material. In the domain of mathematics, prior research with computer-based learning systems has utilized the commonality of incorrect answers to problems as a way of identifying potential misconceptions among students. Much of this research, however, has been limited to the use of close-ended questions, such as multiple-choice and fill-in-the-blank problems. In this study, we explore the potential usage of natural language processing and clustering methods to examine potential misconceptions across student answers to both close- and open-ended problems. We find that our proposed methods show promise for distinguishing misconception from non-conception, but may need further development to improve the interpretability of specific misunderstandings exhibited through student explanations.
    Free, publicly-accessible full text available July 1, 2024
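The clustering approach in item 6 might be prototyped like this. A minimal sketch, assuming the all-MiniLM-L6-v2 sentence encoder, invented incorrect answers, and an arbitrary cluster count.

```python
# Cluster incorrect open-ended answers with sentence embeddings so that
# answers in the same cluster plausibly share a misconception.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

incorrect_answers = [
    "I added the denominators because both fractions are being combined.",
    "You add 1/4 and 1/3 by adding tops and bottoms, so 2/7.",
    "I multiplied instead of dividing because 'of' means multiplication.",
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = encoder.encode(incorrect_answers, normalize_embeddings=True)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)
for answer, label in zip(incorrect_answers, labels):
    print(label, answer)
```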
  7. Teachers often rely on the use of a range of open-ended problems to assess students’ understanding of mathematical concepts. Beyond traditional conceptions of student open-ended work, commonly in the form of textual short-answer or essay responses, the use of figures, tables, number lines, graphs, and pictographs are other examples of open-ended work common in mathematics. While recent developments in areas of natural language processing and machine learning have led to automated methods to score student open-ended work, these methods have largely been limited to textual answers. Several computer-based learning systems allow students to take pictures of hand-written work and include such images within their answers to open-ended questions. With that, however, there are few-to-no existing solutions that support the auto-scoring of student hand-written or drawn answers to questions. In this work, we build upon an existing method for auto-scoring textual student answers and explore the use of OpenAI/CLIP, a deep learning embedding method designed to represent both images and text, as well as Optical Character Recognition (OCR) to improve model performance. We evaluate the performance of our method on a dataset of student open-responses that contains both text- and image-based responses, and find a reduction of model error in the presence of images when controlling for other answer-level features.
    Free, publicly-accessible full text available July 1, 2024
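The feature construction in item 7 could be sketched as below. Assumptions include the openai/clip-vit-base-patch32 checkpoint, pytesseract for OCR, and a hypothetical image file; the downstream scoring model is omitted.

```python
# Combine a CLIP image embedding with OCR text extracted from a photographed
# answer; both can then be fed to a separate scoring model (not shown).
import numpy as np
import pytesseract
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def image_answer_features(image_path: str):
    """Return (image_embedding, ocr_text) for one student-uploaded image."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        image_emb = model.get_image_features(**inputs).squeeze(0).numpy()
    ocr_text = pytesseract.image_to_string(image)  # handwriting may OCR poorly
    return image_emb / np.linalg.norm(image_emb), ocr_text

emb, text = image_answer_features("uploaded_work.jpg")  # hypothetical file
```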
  8. Background: Teachers often rely on the use of open-ended questions to assess students' conceptual understanding of assigned content. Particularly in the context of mathematics, teachers use these types of questions to gain insight into the processes and strategies adopted by students in solving mathematical problems beyond what is possible through more close-ended problem types. While these types of problems are valuable to teachers, the variation in student responses to these questions makes it difficult, and time-consuming, to evaluate and provide directed feedback. It is a well-studied concept that feedback, both in terms of a numeric score but more importantly in the form of teacher-authored comments, can help guide students as to how to improve, leading to increased learning. It is for this reason that teachers need better support not only for assessing students' work but also in providing meaningful and directed feedback to students. Objectives: In this paper, we seek to develop, evaluate, and examine machine learning models that support automated open response assessment and feedback. Methods: We build upon the prior research in the automatic assessment of student responses to open-ended problems and introduce a novel approach that leverages student log data combined with machine learning and natural language processing methods. Utilizing sentence-level semantic representations of student responses to open-ended questions, we propose a collaborative filtering-based approach to both predict student scores and recommend appropriate feedback messages for teachers to send to their students. Results and Conclusion: We find that our method outperforms previously published benchmarks across three different metrics for the task of predicting student performance. Through an error analysis, we identify several areas where future work may be able to improve upon our approach.
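A neighborhood-style reading of the collaborative filtering idea in item 8 is sketched below; the sentence encoder, the tiny graded set, and k are illustrative assumptions rather than the paper's model.

```python
# Embed a new response, find the most similar previously graded responses,
# and reuse their scores and teacher feedback. Data and k are placeholders.
import numpy as np
from sentence_transformers import SentenceTransformer

graded = [
    {"text": "I divided both sides by 3 to isolate x, so x = 4.", "score": 4,
     "feedback": "Nice clear justification of each step."},
    {"text": "x = 4", "score": 2,
     "feedback": "Correct answer, but explain how you solved for x."},
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")
graded_emb = encoder.encode([g["text"] for g in graded], normalize_embeddings=True)

def score_and_recommend(new_response: str, k: int = 1):
    query = encoder.encode([new_response], normalize_embeddings=True)[0]
    sims = graded_emb @ query                 # cosine similarity (unit vectors)
    top = np.argsort(sims)[::-1][:k]          # indices of k nearest graded items
    predicted_score = float(np.mean([graded[i]["score"] for i in top]))
    suggested_feedback = [graded[i]["feedback"] for i in top]
    return predicted_score, suggested_feedback

print(score_and_recommend("I got x = 4 by dividing each side by 3."))
```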
  9. Automatic short answer grading is an important research direction in the exploration of how to use artificial intelligence (AI)-based tools to improve education. Current state-of-the-art approaches use neural language models to create vectorized representations of students' responses, followed by classifiers to predict the score. However, these approaches have several key limitations, including i) they use pre-trained language models that are not well-adapted to educational subject domains and/or student-generated text and ii) they almost always train one model per question, ignoring the linkage across questions and resulting in a significant model storage problem due to the size of advanced language models. In this paper, we study the problem of automatic short answer grading for students’ responses to math questions and propose a novel framework for this task. First, we use MathBERT, a variant of the popular language model BERT adapted to mathematical content, as our base model and fine-tune it on the downstream task of student response grading. Second, we use an in-context learning approach that provides scoring examples as input to the language model to provide additional context information and promote generalization to previously unseen questions. We evaluate our framework on a real-world dataset of student responses to open-ended math questions and show that our framework (often significantly) outperforms existing approaches, especially for new questions that are not seen during training.
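The input construction in item 9 might look like the following. A minimal sketch, assuming a MathBERT checkpoint is available (tbs17/MathBERT on the Hugging Face hub is used here as an assumed name), a 5-point score scale, and invented marker tokens for the in-context scoring examples.

```python
# Score a response with a BERT-style classifier whose input prepends a few
# graded examples for the same question. Checkpoint name, marker tokens, and
# label count are assumptions for illustration.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

CHECKPOINT = "tbs17/MathBERT"  # assumed MathBERT checkpoint; swap in your own
tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSequenceClassification.from_pretrained(CHECKPOINT, num_labels=5)

def build_input(question, scored_examples, new_response):
    """Concatenate the question, graded examples, and the response to score."""
    context = " ".join(f"[EX] {t} [SCORE] {s}" for t, s in scored_examples)
    return f"{question} {context} [RESPONSE] {new_response}"

text = build_input(
    "Explain why 1/2 + 1/3 is not 2/5.",
    [("You can't add denominators; use a common denominator of 6.", 4)],
    "Because the pieces are different sizes, you rewrite both as sixths first.",
)
inputs = tokenizer(text, truncation=True, return_tensors="pt")
with torch.no_grad():
    predicted_score = model(**inputs).logits.argmax(dim=-1).item()
```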