
Title: Can AI Develop Curriculum? Integrated Computer Science As a Test Case (Research to Practice)
Introduction: Because developing integrated computer science (CS) curriculum is a resource-intensive process, there is interest in leveraging the capabilities of AI tools, including large language models (LLMs), to streamline this task. However, given the novelty of LLMs, little is known about their ability to generate appropriate curriculum content. Research Question: How do current LLMs perform on the task of creating appropriate learning activities for integrated computer science education? Methods: We tested two LLMs (Claude 3.5 Sonnet and ChatGPT 4-o) by providing them with a subset of national learning standards for both CS and language arts and asking them to generate a high-level description of learning activities that met standards for both disciplines. Four humans rated the LLM output – using an aggregate rating approach – in terms of (1) whether it met the CS learning standard, (2) whether it met the language arts learning standard, (3) whether it was equitable, and (4) its overall quality. Results: For Claude AI, 52% of the activities met language arts standards, 64% met CS standards, and the average quality rating was middling. For ChatGPT, 75% of the activities met language arts standards, 63% met CS standards, and the average quality rating was low. Virtually all activities from both LLMs were rated as neither actively promoting nor inhibiting equitable instruction. Discussion: Our results suggest that LLMs are not (yet) able to create appropriate learning activities from learning standards. The activities were generally not usable by classroom teachers without further elaboration and/or modification. There were also grammatical errors in the output, something not common with LLM-produced text. Further, standards in one or both disciplines were often not addressed, and the quality of the activities was often low. We conclude with recommendations for the use of LLMs in curriculum development in light of these findings.
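As a rough illustration of the generation task described in the abstract above (not the authors' published protocol; the model identifiers, prompt wording, and example standards are assumptions), the task might be posed to the two models through their Python SDKs:

```python
# Illustrative sketch only; the study's actual prompts and pipeline are not
# published here. The example standards and prompt wording are assumptions.
from anthropic import Anthropic
from openai import OpenAI

cs_standard = "1B-AP-10: Create programs that include sequences, events, loops, and conditionals."
ela_standard = "W.5.2: Write informative/explanatory texts to examine a topic and convey ideas clearly."

prompt = (
    "Write a high-level description of one classroom learning activity that "
    "addresses BOTH of the following standards.\n"
    f"Computer science standard: {cs_standard}\n"
    f"Language arts standard: {ela_standard}"
)

# Claude 3.5 Sonnet via the Anthropic SDK
claude_text = Anthropic().messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=400,
    messages=[{"role": "user", "content": prompt}],
).content[0].text

# GPT-4o via the OpenAI SDK
gpt_text = OpenAI().chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
).choices[0].message.content
```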
Award ID(s):
2311746
PAR ID:
10643576
Author(s) / Creator(s):
Publisher / Repository:
ASEE Conferences
Date Published:
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Introduction: Recent AI advances, particularly the introduction of large language models (LLMs), have expanded the capacity to automate various tasks, including the analysis of text. This capability may be especially helpful in education research, where lack of resources often hampers the ability to perform various kinds of analyses, particularly those requiring a high level of expertise in a domain and/or a large set of textual data. For instance, we recently coded approximately 10,000 state K-12 computer science standards, requiring over 200 hours of work by subject matter experts. If LLMs are capable of completing a task such as this, the savings in human resources would be immense. Research Questions: This study explores two research questions: (1) How do LLMs compare to humans in the performance of an education research task? and (2) What do errors in LLM performance on this task suggest about current LLM capabilities and limitations? Methodology: We used a random sample of state K-12 computer science standards. We compared the output of three LLMs – ChatGPT, Llama, and Claude – to the work of human subject matter experts in coding the relationship between each state standard and a set of national K-12 standards. Specifically, the LLMs and the humans determined whether each state standard was identical to, similar to, based on, or different from the national standards and (if it was not different) which national standard it resembled. Results: Each of the LLMs identified a different national standard than the subject matter expert in about half of the instances. When the LLM identified the same standard, it usually categorized the type of relationship (i.e., identical to, similar to, based on) in the same way as the human expert. However, the LLMs sometimes misidentified ‘identical’ standards. Discussion: Our results suggest that LLMs are not currently capable of matching human performance on the task of classifying learning standards. The misidentification of some state standards as identical to national standards – when they clearly were not – is an interesting error, given that traditional computing technologies can easily identify identical text. Similarly, some of the mismatches between the LLM and human performance indicate clear errors on the part of the LLMs. However, some of the mismatches are difficult to assess, given the ambiguity inherent in this task and the potential for human error. We conclude the paper with recommendations for the use of LLMs in education research based on these findings.
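A minimal sketch of how the coding task above might be posed and scored, assuming a prompt-per-standard design; the prompt wording, data shapes, and agreement measure are illustrative, not the study's instruments:

```python
# Hypothetical sketch of the standards-coding task; not the study's code.
RELATIONS = ["identical to", "similar to", "based on", "different from"]

def build_prompt(state_standard: str, national_standards: list[str]) -> str:
    """Ask a model to relate one state standard to the national standards."""
    numbered = "\n".join(f"{i}. {s}" for i, s in enumerate(national_standards, 1))
    return (
        "Classify the state standard's relationship to the national standards "
        f"as one of: {', '.join(RELATIONS)}. If it is not 'different from', "
        "name the single national standard it most resembles.\n"
        f"State standard: {state_standard}\n"
        f"National standards:\n{numbered}"
    )

def percent_agreement(human_codes: list[str], llm_codes: list[str]) -> float:
    """Share of standards on which the model chose the same code as the expert."""
    return sum(h == m for h, m in zip(human_codes, llm_codes)) / len(human_codes)
```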
  2. The authors discuss how exposure to Generative Artificial Intelligence (GenAI) tools led them to reconsider their approach to supporting secondary preservice teachers in curriculum development and to acknowledge both the challenges and opportunities these technologies present in teacher preparation. An upper-division undergraduate class designed to support curricular development of all secondary teachers was redesigned to incorporate the use of large language models (LLMs) such as ChatGPT. The findings indicate that GenAI assists in planning, assessment design, and the development of assignments that require human insight beyond AI capabilities. The study highlights the benefits of GenAI in curriculum development and addresses concerns about academic integrity and equitable access. The authors recommend that educators explore GenAI’s potential to support learning and develop strategies for its responsible use in teacher preparation.
  3. Large Language Models (LLMs) have made significant strides in various intelligent tasks but still struggle with complex action reasoning tasks that require systematic search. To address this limitation, we introduce a method that bridges the natural language understanding capability of LLMs with the symbolic reasoning capability of action languages (formal languages for reasoning about actions). Our approach, termed LLM+AL, leverages the LLM's strengths in semantic parsing and commonsense knowledge generation alongside the action language's expertise in automated reasoning based on encoded knowledge. We compare LLM+AL against state-of-the-art LLMs, including ChatGPT-4, Claude 3 Opus, Gemini Ultra 1.0, and o1-preview, using benchmarks for complex reasoning about actions. Our findings indicate that while all methods exhibit various errors, LLM+AL, with relatively simple human corrections, consistently leads to correct answers, whereas using LLMs alone does not yield improvements even after human intervention. LLM+AL also contributes to automated generation of action languages.
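Read as a pipeline, the approach separates translation from search. The sketch below is a conceptual outline only; the function and class names (encode_with_llm, ActionLanguageSolver) are hypothetical placeholders and not the paper's implementation:

```python
# Conceptual outline of an LLM + action-language pipeline; all names are
# hypothetical placeholders, not the paper's implementation.
def encode_with_llm(problem_text: str) -> str:
    """Ask an LLM to translate the problem into an action-language encoding
    (fluents, actions, preconditions, effects)."""
    raise NotImplementedError("call an LLM of your choice here")

class ActionLanguageSolver:
    """Stand-in for an automated reasoner over action-language programs."""
    def solve(self, encoding: str) -> str:
        raise NotImplementedError("call a symbolic reasoner here")

def llm_plus_al(problem_text: str) -> str:
    encoding = encode_with_llm(problem_text)       # semantic parsing, commonsense
    # A human may inspect and lightly correct the encoding before solving.
    return ActionLanguageSolver().solve(encoding)  # systematic symbolic search
```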
  4. In the early stages of K-12 Computer Science (CS) curriculum development, standards were not yet established, and the primary objective, especially in younger grades, was to spark students' interest in CS. While this remains a vital goal, the development of the CS standards underscores the importance of standards-aligned curriculum, ensuring equitable, content-rich CS education for all students. We show that standards alignment is most useful when it includes details about which aspects of the standards a curriculum aligns with. This paper describes our process of decomposing five middle school CS standards into granular learning targets using an evidence-centered design approach and mapping the learning targets onto individual lessons from one widely popular middle school CS curriculum. We discuss the potential implications of this work for curriculum design, curriculum selection, and teacher professional learning.
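One lightweight way to record such an alignment is a nested mapping from a standard to its granular learning targets and from each target to the lessons that address it. The record below is invented for illustration: the standard ID is a real CSTA identifier, but the decomposition and lesson numbers are not from the paper:

```python
# Invented example of an alignment record; the decomposition into learning
# targets and the lesson numbers are illustrative only.
alignment = {
    "2-AP-11": {  # "Create clearly named variables ... and perform operations on their values"
        "learning_targets": {
            "Chooses variable names that convey purpose": ["Lesson 3"],
            "Uses variables to store and update data while a program runs": ["Lesson 3", "Lesson 7"],
        },
    }
}

# Curriculum selection or professional-learning tools could then ask, for example,
# which lessons address a given target:
lessons = alignment["2-AP-11"]["learning_targets"]["Chooses variable names that convey purpose"]
```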
  5. With the increasing prevalence of large language models (LLMs) such as ChatGPT, there is a growing need to integrate natural language processing (NLP) into K-12 education to better prepare young learners for the future AI landscape. NLP, a sub-field of AI that serves as the foundation of LLMs and many advanced AI applications, holds the potential to enrich learning in core subjects in K-12 classrooms. In this experience report, we present our efforts to integrate NLP into science classrooms with 98 middle school students across two US states, aiming to increase students’ experience and engagement with NLP models through textual data analyses and visualizations. We designed learning activities, developed an NLP-based interactive visualization platform, and facilitated classroom learning in close collaboration with middle school science teachers. This experience report aims to contribute to the growing body of work on integrating NLP into K-12 education by providing insights and practical guidelines for practitioners, researchers, and curriculum designers. 
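To give a sense of the kind of textual analysis and visualization such activities involve, here is a minimal word-frequency sketch; the sample text and charting choices are assumptions for illustration, not the authors' platform:

```python
# Minimal illustration of a classroom-style text analysis: count word
# frequencies in a short passage and plot the most common words.
from collections import Counter
import matplotlib.pyplot as plt

text = "Plants need light and water. Light helps plants make food."
words = [w.strip(".,").lower() for w in text.split()]
counts = Counter(words).most_common(5)

labels, values = zip(*counts)
plt.bar(labels, values)  # bar chart of the five most common words
plt.title("Most common words in the sample text")
plt.show()
```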