This content will become publicly available on June 1, 2026

Title: Can LLMs Assist with Education Research? The Case of Computer Science Standards Analysis
Introduction: Recent AI advances, particularly the introduction of large language models (LLMs), have expanded the capacity to automate various tasks, including the analysis of text. This capability may be especially helpful in education research, where a lack of resources often hampers the ability to perform various kinds of analyses, particularly those requiring a high level of domain expertise and/or a large set of textual data. For instance, we recently coded approximately 10,000 state K-12 computer science standards, requiring over 200 hours of work by subject matter experts. If LLMs can complete a task such as this, the savings in human resources would be immense.
Research Questions: This study explores two research questions: (1) How do LLMs compare to humans in the performance of an education research task? and (2) What do errors in LLM performance on this task suggest about current LLM capabilities and limitations?
Methodology: We used a random sample of state K-12 computer science standards. We compared the output of three LLMs – ChatGPT, Llama, and Claude – to the work of human subject matter experts in coding the relationship between each state standard and a set of national K-12 standards. Specifically, the LLMs and the humans determined whether each state standard was identical to, similar to, based on, or different from the national standards and (if it was not different) which national standard it resembled.
Results: Each of the LLMs identified a different national standard than the subject matter expert in about half of the instances. When an LLM identified the same standard, it usually categorized the type of relationship (i.e., identical to, similar to, based on) in the same way as the human expert. However, the LLMs sometimes misidentified 'identical' standards.
Discussion: Our results suggest that LLMs are not currently capable of matching human performance on the task of classifying learning standards. The misidentification of some state standards as identical to national standards – when they clearly were not – is an interesting error, given that traditional computing technologies can easily identify identical text. Similarly, some of the mismatches between LLM and human performance indicate clear errors on the part of the LLMs. However, some of the mismatches are difficult to assess, given the ambiguity inherent in this task and the potential for human error. We conclude the paper with recommendations for the use of LLMs in education research based on these findings.
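To make the coding setup concrete, here is a minimal sketch of how a state standard might be matched against a catalog of national (CSTA) standards with an LLM and then compared with an expert's code. The prompt wording, model name, and data layout are illustrative assumptions, not the prompts or pipeline used in the paper.

```python
# Hypothetical sketch: ask an LLM which national (CSTA) standard a state standard
# resembles and how, then compare its codes to an expert's. Model name, prompt
# wording, and data layout are illustrative assumptions, not the paper's pipeline.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

RELATIONSHIPS = ["identical to", "similar to", "based on", "different from"]

def classify_standard(state_standard: str, csta_standards: dict[str, str]) -> str:
    """Ask the model for a CSTA identifier and a relationship label."""
    catalog = "\n".join(f"{sid}: {text}" for sid, text in csta_standards.items())
    prompt = (
        "You are coding K-12 computer science standards.\n\n"
        f"State standard: {state_standard}\n\n"
        f"CSTA standards:\n{catalog}\n\n"
        "Reply with the closest CSTA identifier and one relationship "
        f"({', '.join(RELATIONSHIPS)}), or 'none: different from'."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # stand-in; the study compared ChatGPT, Llama, and Claude
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

def agreement_rate(llm_codes: list[str], expert_codes: list[str]) -> float:
    """Fraction of standards for which the LLM's code exactly matches the expert's."""
    return sum(a == b for a, b in zip(llm_codes, expert_codes)) / len(expert_codes)
```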
Award ID(s):
2311746
PAR ID:
10643575
Author(s) / Creator(s):
Publisher / Repository:
ASEE Conferences
Date Published:
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. In the United States, state learning standards guide curriculum, assessment, teacher certification, and other key drivers of the student learning experience. Investigating standards allows us to answer many important questions about the field of K-12 computer science (CS) education. Our team has created a dataset of state-level K-12 CS standards for all US states that currently have such standards (n = 42). This dataset was created by CS subject matter experts, who, for each of the approximately 10,000 state CS standards, manually tagged its assigned grade level/band, category/topic, and, if applicable, which CSTA standard it is identical or similar to. We also determined each standard's cognitive complexity using Bloom's Revised Taxonomy. Using the dataset, we were able to analyze each state's CS standards using a variety of metrics and approaches. To our knowledge, this is the first comprehensive, publicly available dataset of state CS standards that includes the factors mentioned above. We believe that this dataset will be useful to other CS education researchers, including those who want to better understand the state and national landscape of K-12 CS education in the US, the characteristics of CS learning standards, the coverage of particular CS topics (e.g., cybersecurity, AI), and many other topics. In this lightning talk, we will introduce the dataset's features as well as some tools that we have developed (e.g., to determine a standard's Bloom's level) that may be useful to others who use the dataset.
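As an illustration of the kind of tooling mentioned above, the following is a minimal, hypothetical sketch of a verb-based Bloom's Revised Taxonomy tagger. The verb lists and the highest-level-wins convention are assumptions, not the authors' actual tool.

```python
# Hypothetical verb-lookup tagger for Bloom's Revised Taxonomy levels.
# Verb lists are illustrative and incomplete; levels are checked from most to
# least complex so the highest matched level wins (one possible convention).
BLOOM_VERBS = {
    "Create":     {"create", "design", "develop", "construct", "program"},
    "Evaluate":   {"evaluate", "justify", "critique", "assess", "debug"},
    "Analyze":    {"analyze", "decompose", "differentiate", "examine", "test"},
    "Apply":      {"use", "implement", "apply", "demonstrate", "solve"},
    "Understand": {"describe", "explain", "summarize", "discuss", "compare"},
    "Remember":   {"list", "identify", "recall", "define", "recognize"},
}

def bloom_level(standard_text: str) -> str:
    """Return the most complex Bloom's level whose verbs appear in the standard text."""
    words = {w.strip(".,;:()").lower() for w in standard_text.split()}
    for level, verbs in BLOOM_VERBS.items():  # dicts preserve insertion order
        if words & verbs:
            return level
    return "Unclassified"

print(bloom_level("Create programs that use variables to store and modify data."))
# -> "Create" (matched before "use", which would otherwise indicate Apply)
```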
  2. Introduction: Because developing integrated computer science (CS) curriculum is a resource-intensive process, there is interest in leveraging the capabilities of AI tools, including large language models (LLMs), to streamline this task. However, given the novelty of LLMs, little is known about their ability to generate appropriate curriculum content. Research Question: How do current LLMs perform on the task of creating appropriate learning activities for integrated computer science education? Methods: We tested two LLMs (Claude 3.5 Sonnet and ChatGPT-4o) by providing them with a subset of national learning standards for both CS and language arts and asking them to generate a high-level description of learning activities that met standards for both disciplines. Four humans rated the LLM output – using an aggregate rating approach – in terms of (1) whether it met the CS learning standard, (2) whether it met the language arts learning standard, (3) whether it was equitable, and (4) its overall quality. Results: For Claude AI, 52% of the activities met language arts standards, 64% met CS standards, and the average quality rating was middling. For ChatGPT, 75% of the activities met language arts standards, 63% met CS standards, and the average quality rating was low. Virtually all activities from both LLMs were rated as neither actively promoting nor inhibiting equitable instruction. Discussion: Our results suggest that LLMs are not (yet) able to create appropriate learning activities from learning standards. The activities were generally not usable by classroom teachers without further elaboration and/or modification. There were also grammatical errors in the output, something not common with LLM-produced text. Further, standards in one or both disciplines were often not addressed, and the quality of the activities was often low. We conclude with recommendations for the use of LLMs in curriculum development in light of these findings.
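A minimal sketch of how ratings like those reported above might be aggregated is shown below. The rubric fields and the 1-5 quality scale are assumptions, since the paper's aggregate rating approach is not detailed here.

```python
# Hypothetical aggregation of four raters' judgments of LLM-generated activities.
# The rubric fields and 1-5 quality scale are assumptions; the paper's
# "aggregate rating approach" may differ.
from statistics import mean

def summarize(ratings: list[dict]) -> dict:
    """ratings: one dict per (activity, rater) pair with boolean standard checks and a quality score."""
    return {
        "pct_meets_language_arts": 100 * mean(r["meets_language_arts"] for r in ratings),
        "pct_meets_cs":            100 * mean(r["meets_cs"] for r in ratings),
        "avg_quality":             mean(r["quality_1_to_5"] for r in ratings),
    }

claude_ratings = [
    {"meets_language_arts": True,  "meets_cs": True,  "quality_1_to_5": 3},
    {"meets_language_arts": False, "meets_cs": True,  "quality_1_to_5": 2},
    # ... one entry per activity per rater
]
print(summarize(claude_ratings))
```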
  3. Code LLMs have the potential to make it easier for non-experts to understand and write code. However, current Code LLM benchmarks rely on a single expert-written prompt per problem, making it hard to generalize their success to non-expert users. In this paper, we present a new natural-language-to-code benchmark of prompts written by a key population of non-experts: beginning programmers. StudentEval contains 1,749 prompts written by 80 students who have only completed one introductory Python course. StudentEval contains multiple non-expert prompts for each problem, enabling exploration of key factors in prompt success. We use StudentEval to evaluate 12 Code LLMs and find that StudentEval is a better discriminator of model performance than existing benchmarks. Our analysis of student prompting strategies reveals that nondeterministic LLM sampling can mislead students about the quality of their descriptions, a finding with key implications for Code LLMs in education.
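For context, a StudentEval-style evaluation ultimately reduces to checking whether code generated from a student's prompt passes the problem's test cases. The harness below is an illustrative sketch, not the benchmark's actual implementation.

```python
# Illustrative StudentEval-style check: does code generated from a student's
# prompt pass the problem's test cases? Field names and the exec-based harness
# are assumptions, not the benchmark's actual implementation.
def passes_tests(generated_code: str, test_cases: list[tuple[str, object]]) -> bool:
    """test_cases: (expression, expected_value) pairs evaluated against the generated code."""
    namespace: dict = {}
    try:
        exec(generated_code, namespace)  # define the function the prompt asked for
        return all(eval(expr, namespace) == expected for expr, expected in test_cases)
    except Exception:
        return False

sample_completion = "def total(nums):\n    return sum(nums)"
tests = [("total([1, 2, 3])", 6), ("total([])", 0)]
print(passes_tests(sample_completion, tests))  # -> True
```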
  4. Introduction: Learning standards are a crucial determinant of computer science (CS) education at the K-12 level, but they are not often researched despite their importance. We sought to address this gap with a mixed-methods study examining state and national K-12 CS standards. Research Question: What are the similarities and differences between state and national computer science standards? Methods: We tagged the state CS standards (n = 9695) according to their grade band/level, topic, course, and similarity to a Computer Science Teachers Association (CSTA) standard. We also analyzed the content of standards similar to CSTA standards to determine their topics, cognitive complexity, and other features. Results: We found some commonalities amidst broader diversity in approaches to organization and content across the states, relative to the CSTA standards. The content analysis showed that a common difference between state and CSTA standards is that the state standards tend to include concrete examples. We also found differences across states in how similar their standards are to CSTA standards, as well as differences in how cognitively complex the standards are. Discussion: Standards writers face many tensions and trade-offs, and this analysis shows how – in general terms – various states have chosen to manage those trade-offs in writing standards. For example, adding examples can improve clarity and specificity, but perhaps at the cost of brevity and longevity. A better understanding of the landscape of state standards can assist future standards writers, curriculum developers, and researchers in their work. 
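As a rough illustration of the similarity analysis described above, the sketch below summarizes, per state, the share of standards tagged as identical or similar to a CSTA standard. The file and column names are assumptions about how such a tagged dataset might be stored.

```python
# Hypothetical summary of how each state's standards relate to CSTA standards.
# The file name and columns ("state", "csta_relationship") are assumptions
# about how a tagged dataset like the one described might be stored.
import pandas as pd

df = pd.read_csv("state_cs_standards_tagged.csv")  # hypothetical file

# Share of each state's standards tagged as identical or similar to a CSTA standard
summary = (
    df.assign(matches_csta=df["csta_relationship"].isin(["identical", "similar"]))
      .groupby("state")["matches_csta"]
      .mean()
      .mul(100)
      .round(1)
      .sort_values(ascending=False)
)
print(summary.head())
```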
  5. Thematic Analysis (TA) is a fundamental method in healthcare research for analyzing transcript data, but it is resource-intensive and difficult to scale for large, complex datasets. This study investigates the potential of large language models (LLMs) to augment the inductive TA process in high-stakes healthcare settings. Focusing on interview transcripts from parents of children with Anomalous Aortic Origin of a Coronary Artery (AAOCA), a rare congenital heart disease, we propose an LLM-Enhanced Thematic Analysis (LLM-TA) pipeline. Our pipeline integrates an affordable state-of-the-art LLM (GPT-4o mini), LangChain, and prompt engineering with chunking techniques to analyze nine detailed transcripts following the inductive TA framework. We evaluate the LLM-generated themes against human-generated results using thematic similarity metrics, LLM-assisted assessments, and expert reviews. Results demonstrate that our pipeline outperforms existing LLM-assisted TA methods significantly. While the pipeline alone has not yet reached human-level quality in inductive TA, it shows great potential to improve scalability, efficiency, and accuracy while reducing analyst workload when working collaboratively with domain experts. We provide practical recommendations for incorporating LLMs into high-stakes TA workflows and emphasize the importance of close collaboration with domain experts to address challenges related to real-world applicability and dataset complexity. 
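A minimal sketch of the chunk-and-theme step such a pipeline might use is shown below, assuming GPT-4o mini accessed through LangChain. Chunk sizes, prompt wording, and the single pass over each transcript are illustrative and do not reproduce the paper's LLM-TA pipeline.

```python
# Hypothetical chunk-and-theme step for LLM-assisted inductive thematic analysis,
# using GPT-4o mini via LangChain. Chunk sizes, prompt wording, and the single
# pass over each transcript are illustrative; the paper's LLM-TA pipeline is richer.
from langchain_openai import ChatOpenAI
from langchain_text_splitters import RecursiveCharacterTextSplitter

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=200)

def candidate_themes(transcript: str) -> list[str]:
    """Collect candidate themes from each chunk of a transcript for later merging and expert review."""
    themes: list[str] = []
    for chunk in splitter.split_text(transcript):
        reply = llm.invoke(
            "You are assisting with inductive thematic analysis of an interview "
            "with a parent of a child with AAOCA. List 2-4 candidate themes, one "
            f"per line, grounded only in this excerpt:\n\n{chunk}"
        )
        themes.extend(
            line.strip("- ").strip() for line in reply.content.splitlines() if line.strip()
        )
    return themes
```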