Exposing students to low-quality assessments such as poorly constructed multiple-choice questions (MCQs) and short answer questions (SAQs) is detrimental to their learning, making it essential to evaluate these assessments accurately. Existing evaluation methods are often difficult to scale and fail to consider the assessments' pedagogical value within course materials. Online crowds offer a scalable and cost-effective source of intelligence, but often lack the necessary domain expertise. Advances in Large Language Models (LLMs) offer automation and scalability, but may also lack precise domain knowledge. To explore these trade-offs, we compare the effectiveness and reliability of crowdsourced and LLM-based methods for assessing the quality of 30 MCQs and SAQs across six educational domains using two standardized evaluation rubrics. We analyzed the performance of 84 crowdworkers from Amazon Mechanical Turk and Prolific, comparing their quality evaluations to those made by three LLMs: GPT-4, Gemini 1.5 Pro, and Claude 3 Opus. We found that crowdworkers on Prolific consistently delivered the highest-quality assessments, and that GPT-4 emerged as the most effective LLM for this task. Our study reveals that while traditional crowdsourced methods often yield more accurate assessments, LLMs can match this accuracy on specific evaluative criteria. These results provide evidence for a hybrid approach to educational content evaluation, integrating the scalability of AI with the nuanced judgment of humans. We offer feasibility considerations for using AI to supplement human judgment in educational assessment.
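The study's actual prompts and rubrics are not reproduced here, but the following is a minimal sketch of how an LLM such as GPT-4 can be asked to score an MCQ against rubric criteria. The rubric items, scoring scale, and prompt wording are illustrative assumptions rather than the standardized rubrics used in the study; the call uses the OpenAI Python SDK and assumes an API key in the environment.

```python
# Minimal sketch: prompting an LLM to rate one MCQ against rubric criteria.
# The criteria and 1-5 scale below are illustrative placeholders.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

RUBRIC = [
    "The stem poses a single, clearly worded question.",
    "Exactly one option is defensibly correct.",
    "Distractors are plausible and target common misconceptions.",
    "The item aligns with the stated learning objective.",
]

def rate_mcq(stem: str, options: list[str], keyed_answer: str) -> str:
    criteria = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(RUBRIC))
    prompt = (
        "Rate the following multiple-choice question on each criterion "
        "from 1 (poor) to 5 (excellent) and briefly justify each score.\n\n"
        f"Criteria:\n{criteria}\n\n"
        f"Stem: {stem}\nOptions: {options}\nKeyed answer: {keyed_answer}"
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep scoring as repeatable as possible
    )
    return response.choices[0].message.content

print(rate_mcq(
    stem="Which gas do plants primarily absorb during photosynthesis?",
    options=["Oxygen", "Carbon dioxide", "Nitrogen", "Hydrogen"],
    keyed_answer="Carbon dioxide",
))
```

In a comparison like the one described above, the same rubric would be shown to crowdworkers and to each LLM so that the resulting scores are directly comparable.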
                    
                            
                            Generative Students: Using LLM-Simulated Student Profiles to Support Question Item Evaluation
                        
                    
    
Evaluating the quality of automatically generated question items has been a long-standing challenge. In this paper, we leverage LLMs to simulate student profiles and generate responses to multiple-choice questions (MCQs). The generative students' responses to MCQs can in turn support question item evaluation. We propose Generative Students, a prompt architecture designed based on the KLI framework. A generative student profile is a function of the list of knowledge components the student has mastered, is confused about, or has no evidence of knowledge of. We instantiate the Generative Students concept in the subject domain of heuristic evaluation. We created 45 generative students using GPT-4 and had them respond to 20 MCQs. We found that the generative students produced logical and believable responses that were aligned with their profiles. We then compared the generative students' responses to real students' responses on the same set of MCQs and found a high correlation. Moreover, there was considerable overlap in the difficult questions identified by generative students and real students. A subsequent case study demonstrated that an instructor could improve question quality based on the signals provided by Generative Students.
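As a rough illustration of the profile idea, the sketch below encodes a generative student as the three knowledge-component (KC) lists described above and renders them into a system prompt. The KC labels, wording, and helper names are hypothetical; the paper's actual KLI-based prompt architecture may differ.

```python
# Minimal sketch of a generative student profile: three knowledge-component
# (KC) lists rendered into a prompt. KC labels and wording are hypothetical.
from dataclasses import dataclass, field

@dataclass
class GenerativeStudent:
    mastered: list[str] = field(default_factory=list)     # KCs the student has mastered
    confused: list[str] = field(default_factory=list)     # KCs the student is confused about
    no_evidence: list[str] = field(default_factory=list)  # KCs with no evidence of knowledge

    def to_system_prompt(self) -> str:
        return (
            "You are simulating a student answering multiple-choice questions "
            "about heuristic evaluation.\n"
            f"You have mastered: {', '.join(self.mastered) or 'nothing yet'}.\n"
            f"You are confused about: {', '.join(self.confused) or 'nothing'}.\n"
            f"You show no evidence of knowing: {', '.join(self.no_evidence) or 'nothing'}.\n"
            "Answer each question as this student plausibly would, including "
            "mistakes on concepts you are confused about or have not learned."
        )

# Hypothetical profile: strong on two Nielsen heuristics, shaky on a third.
student = GenerativeStudent(
    mastered=["visibility of system status", "match between system and the real world"],
    confused=["error prevention"],
    no_evidence=["aesthetic and minimalist design"],
)
print(student.to_system_prompt())  # would be passed to the LLM as the system message
```

Varying which KCs fall into each list is what produces a population of distinct simulated students whose aggregate answer patterns can flag problematic items.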
        
    
- Award ID(s): 2302564
- PAR ID: 10519692
- Publisher / Repository: L@S '24: Proceedings of the Eleventh (2024) ACM Conference on Learning @ Scale
- Date Published:
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
- 
This work-in-progress paper describes a collaborative effort between engineering education and machine learning researchers to automate analysis of written responses to conceptually challenging questions in mechanics. These qualitative questions are often used in large STEM classes to support active learning pedagogies; they require minimal calculation and focus on the application of underlying physical phenomena to various situations. Active learning pedagogies using this type of question have been demonstrated to increase student achievement (Freeman et al., 2014; Hake, 1998) and engagement (Deslauriers et al., 2011) of all students (Haak et al., 2011). To emphasize reasoning and sense-making, we use the Concept Warehouse (Koretsky et al., 2014), an audience response system where students provide written justifications to concept questions. Written justifications better prepare students for discussions with peers and in the whole class and can also improve students' answer choices (Koretsky et al., 2016a, 2016b). In addition to their use as a tool to foster learning, written explanations can also provide valuable information to concurrently assess that learning (Koretsky and Magana, 2019). However, in practice, there has been limited deployment of written justifications with concept questions, in part because they provide a daunting amount of information for instructors to process and for researchers to analyze. In this study, we describe the initial evaluation of large pre-trained generative sequence-to-sequence language models (Raffel et al., 2019; Brown et al., 2020) to automate the laborious coding process of student written responses. Adapting machine learning algorithms in this context is challenging since each question targets specific concepts that elicit their own unique reasoning processes. This exploratory project seeks to use responses collected through the Concept Warehouse to identify viable strategies for adapting machine learning to support instructors and researchers in identifying salient aspects of student thinking and understanding with these conceptually challenging questions. (A minimal sketch of this automated coding step appears after this list.)
- 
This paper presents the design and analysis of a pilot problem set deployed to engineering students to assess their retention of physics knowledge at the start of a statics course. The problem set was developed using the NSF-IUSE (grant #2315492) Learning Map project (LMap) and piloted in the spring and fall of 2024. The LMap process is rooted in the Analysis, Design, Development, Implementation, and Evaluation (ADDIE) model [1] and Backward Design [2,3], extending these principles to course sequences to align learning outcomes, assessments, and instructional practices. The primary motivation for this problem set (Statics Knowledge Inventory, SKI) was to evaluate students' understanding and retention of physics concepts at the beginning of a statics course. The SKI includes a combination of multiple-choice questions (MCQ) and procedural problems, filling a gap in widely-used concept inventories for physics and statics, such as the Force Concept Inventory (FCI) and Statics Concept Inventory (SCI), which evaluate learning gains within a course, rather than knowledge retention across courses. Using the LMap analysis and instructor consultations, we identified overlapping concepts and topics between Physics and Statics courses, referred to here as “interdependent learning outcomes” (ILOs). The problem set includes 15 questions—eight MCQs and seven procedural problems. Unlike most concept inventories, procedural problems were added to provide insight into students’ problem-solving approach and conceptual understanding. These problems require students to perform calculations, demonstrate their work, and assess their conceptual understanding of key topics, and allow the instructors to assess essential prerequisite skills like drawing free-body diagrams (FBDs), computing forces and moments, and performing basic vector calculation and unit conversions. Problems were selected and adapted from physics and statics textbooks, supplemented by instructor-designed questions to ensure full coverage of the ILOs. We used the revised 2D Bloom’s Taxonomy [4] and a 3D representation of it [5] to classify each problem within a 6x4 matrix (six cognitive processes x four knowledge dimensions). This classification provided instructors and students with a clear understanding of the cognitive level required for each problem. Additionally, we measured students’ perceived confidence and difficulty in each problem using two questions on a 3-point Likert scale. The first iteration of the problem set was administered to 19 students in the spring 2024 statics course. After analyzing their performance, we identified areas for improvement and revised the problem set, removing repetitive MCQs and restructuring the procedural problems into scaffolded, multi-part questions with associated rubrics for evaluation. The revised version, consisting of five MCQs and six procedural problems, was deployed to 136 students in the fall 2024 statics course. A randomly selected subset of student answers from the second iteration was graded and analyzed to compare with the first. This analysis will inform future efforts to evaluate knowledge retention and transfer in key skills across sequential courses. In collaboration with research teams developing concept inventories for mechanics courses, we aim to integrate these procedural problems into future inventories.
- 
We reflect on our ongoing journey in the educational Cybersecurity Assessment Tools (CATS) Project to create two concept inventories for cybersecurity. We identify key steps in this journey and important questions we faced. We explain the decisions we made and discuss the consequences of those decisions, highlighting what worked well and what might have gone better. The CATS Project is creating and validating two concept inventories—conceptual tests of understanding—that can be used to measure the effectiveness of various approaches to teaching and learning cybersecurity. The Cybersecurity Concept Inventory (CCI) is for students who have recently completed any first course in cybersecurity; the Cybersecurity Curriculum Assessment (CCA) is for students who have recently completed an undergraduate major or track in cybersecurity. Each assessment tool comprises 25 multiple-choice questions (MCQs) of various difficulties that target the same five core concepts, but the CCA assumes greater technical background. Key steps include defining project scope, identifying the core concepts, uncovering student misconceptions, creating scenarios, drafting question stems, developing distractor answer choices, generating educational materials, performing expert reviews, recruiting student subjects, organizing workshops, building community acceptance, forming a team and nurturing collaboration, adopting tools, and obtaining and using funding. Creating effective MCQs is difficult and time-consuming, and cybersecurity presents special challenges. Because cybersecurity issues are often subtle, where the adversarial model and details matter greatly, it is challenging to construct MCQs for which there is exactly one best but non-obvious answer. We hope that our experiences and lessons learned may help others create more effective concept inventories and assessments in STEM.
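Returning to the Concept Warehouse item earlier in this list: the sketch below shows how a pre-trained sequence-to-sequence model in the T5 family (Raffel et al., 2019) could be prompted to assign a code to a student's written justification, using the Hugging Face transformers library. The label set, prompt format, and model choice are illustrative assumptions, not the project's actual coding scheme.

```python
# Minimal sketch: coding a written justification with a pre-trained seq2seq model.
# The label set and prompt format are illustrative, not the project's coding scheme.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

MODEL_NAME = "google/flan-t5-base"  # an instruction-tuned T5 variant
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

LABELS = ["correct reasoning", "partially correct reasoning", "misconception", "off-topic"]

def code_justification(question: str, justification: str) -> str:
    prompt = (
        f"Question: {question}\n"
        f"Student justification: {justification}\n"
        f"Classify the justification as one of: {', '.join(LABELS)}."
    )
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    outputs = model.generate(**inputs, max_new_tokens=10)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

print(code_justification(
    question="Two blocks of different mass are dropped from the same height. "
             "Ignoring air resistance, which hits the ground first?",
    justification="The heavier block falls faster because gravity pulls on it harder.",
))
```

In practice, a zero-shot prompt like this would likely be a starting point; fine-tuning on a sample of instructor-coded responses is the more common route to coding schemes that are specific to each concept question.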