Title: Generating multiple choice questions from a textbook: LLMs match human performance on most metrics
Multiple choice questions are traditionally expensive to produce. Recent advances in large language models (LLMs) have led to fine-tuned LLMs that generate questions competitive with human-authored questions. However, the relative capabilities of ChatGPT-family models have not yet been established for this task. We present a carefully controlled human evaluation of three conditions: a fine-tuned, augmented version of Macaw, instruction-tuned Bing Chat with zero-shot prompting, and human-authored questions from a college science textbook. Our results indicate that on six of seven measures tested, the performance of both LLMs was not significantly different from human performance. Analysis of LLM errors further suggests that Macaw and Bing Chat have different failure modes for this task: Macaw tends to repeat answer options, whereas Bing Chat tends to omit the specified answer from the answer options. For Macaw, removing error items from the analysis results in performance on par with humans on all metrics; for Bing Chat, removing error items improves performance but does not reach human-level performance.
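For context on the fine-tuned Macaw condition, the sketch below shows how Macaw's multi-angle, slot-based input format can be queried for multiple-choice question generation using the Hugging Face transformers library and the publicly released allenai/macaw-large checkpoint. The particular angle (asking for $question$ and $mcoptions$ given $answer$ and $context$) and the example passage are illustrative assumptions; the augmented, fine-tuned variant and prompts evaluated in the paper are not reproduced here.

    # Minimal sketch, not the paper's fine-tuned model: query the public
    # allenai/macaw-large checkpoint with an angle that asks for a question
    # and multiple-choice options given an answer and a context passage.
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    tokenizer = AutoTokenizer.from_pretrained("allenai/macaw-large")
    model = AutoModelForSeq2SeqLM.from_pretrained("allenai/macaw-large")

    # Requested output slots come first, then the slots whose values are given.
    angle_input = (
        "$question$ ; $mcoptions$ ; $answer$ = mitochondria ; "
        "$context$ = The mitochondrion produces most of the cell's ATP."
    )
    input_ids = tokenizer.encode(angle_input, return_tensors="pt")
    output = model.generate(input_ids, max_length=200)
    print(tokenizer.batch_decode(output, skip_special_tokens=True)[0])
    # Output takes the form "$question$ = ... ; $mcoptions$ = (A) ... (B) ..."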
Award ID(s):
1934745
NSF-PAR ID:
10447593
Author(s) / Creator(s):
Date Published:
Journal Name:
Workshop on Empowering Education with LLMs - the Next-Gen Interface and Content Generation at the AIED’23 Conference
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Moore, S.; Stamper, J.; Cao, T.; Liu, Z.; Hu, X.; Lu, Y.; Liang, J.; Khosravi, H.; Denny, P.; Singh, A. (Eds.)
    Multiple choice questions are traditionally expensive to produce. Recent advances in large language models (LLMs) have led to fine-tuned LLMs that generate questions competitive with human-authored questions. However, the relative capabilities of ChatGPT-family models have not yet been established for this task. We present a carefully controlled human evaluation of three conditions: a fine-tuned, augmented version of Macaw, instruction-tuned Bing Chat with zero-shot prompting, and human-authored questions from a college science textbook. Our results indicate that on six of seven measures tested, the performance of both LLMs was not significantly different from human performance. Analysis of LLM errors further suggests that Macaw and Bing Chat have different failure modes for this task: Macaw tends to repeat answer options, whereas Bing Chat tends to omit the specified answer from the answer options. For Macaw, removing error items from the analysis results in performance on par with humans on all metrics; for Bing Chat, removing error items improves performance but does not reach human-level performance.
  2. Fancsali, Stephen E.; Rus, Vasile (Eds.)

    Multi-angle question answering models have recently been proposed that promise to perform related tasks like question generation. However, performance on related tasks has not been thoroughly studied. We investigate a leading model called Macaw on the task of multiple choice question generation and evaluate its performance on three angles that systematically reduce the complexity of the task. Our results indicate that despite the promise of generalization, Macaw performs poorly on untrained angles. Even on a trained angle, Macaw fails to generate four distinct multiple-choice options on 17% of inputs. We propose augmenting multiple-choice options by paraphrasing angle input and show this increases overall success to 97.5%. A human evaluation comparing the augmented multiple-choice questions with textbook questions on the same topic reveals that Macaw questions broadly score highly but below human questions.
  3. Multi-angle question answering models have recently been proposed that promise to perform related tasks like question generation. However, performance on related tasks has not been thoroughly studied. We investigate a leading model called Macaw on the task of multiple choice question generation and evaluate its performance on three angles that systematically reduce the complexity of the task. Our results indicate that despite the promise of generalization, Macaw performs poorly on untrained angles. Even on a trained angle, Macaw fails to generate four distinct multiple-choice options on 17% of inputs. We propose augmenting multiple choice options by paraphrasing angle input and show this increases overall success to 97.5%. A human evaluation comparing the augmented multiple-choice questions with textbook questions on the same topic reveals that Macaw questions broadly score highly but below human questions. 
  4. Background Chatbots are being piloted to draft responses to patient questions, but patients’ ability to distinguish between provider and chatbot responses and patients’ trust in chatbots’ functions are not well established. Objective This study aimed to assess the feasibility of using ChatGPT (Chat Generative Pre-trained Transformer) or a similar artificial intelligence–based chatbot for patient-provider communication. Methods A survey study was conducted in January 2023. Ten representative, nonadministrative patient-provider interactions were extracted from the electronic health record. Patients’ questions were entered into ChatGPT with a request for the chatbot to respond using approximately the same word count as the human provider’s response. In the survey, each patient question was followed by a provider- or ChatGPT-generated response. Participants were informed that 5 responses were provider generated and 5 were chatbot generated. Participants were asked, and incentivized financially, to correctly identify the response source. Participants were also asked about their trust in chatbots’ functions in patient-provider communication, using a Likert scale from 1 to 5. Results A US-representative sample of 430 study participants aged 18 and older were recruited on Prolific, a crowdsourcing platform for academic studies. In all, 426 participants filled out the full survey. After removing participants who spent less than 3 minutes on the survey, 392 respondents remained. Overall, 53.3% (209/392) of respondents analyzed were women, and the average age was 47.1 (range 18-91) years. The correct classification of responses ranged from 49% (192/392) to 85.7% (336/392) for different questions. On average, chatbot responses were identified correctly in 65.5% (1284/1960) of the cases, and human provider responses were identified correctly in 65.1% (1276/1960) of the cases. On average, trust in chatbots’ functions was weakly positive (mean Likert score 3.4 out of 5), with lower trust as the health-related complexity of the task in the questions increased. Conclusions ChatGPT responses to patient questions were weakly distinguishable from provider responses. Laypeople appear to trust the use of chatbots to answer lower-risk health questions. It is important to continue studying patient-chatbot interaction as chatbots move from administrative to more clinical roles in health care.
  5. Instruction-tuned large language models (LLMs), such as ChatGPT, have led to promising zero-shot performance in discriminative natural language understanding (NLU) tasks. This involves querying the LLM with a prompt containing the question and the candidate labels to choose from. The question-answering capabilities of ChatGPT arise from its pre-training on large amounts of human-written text, as well as its subsequent fine-tuning on human preferences, which motivates us to ask: does ChatGPT also inherit humans’ cognitive biases? In this paper, we study the primacy effect of ChatGPT: the tendency to select labels at earlier positions as the answer. We have two main findings: i) ChatGPT’s decision is sensitive to the order of labels in the prompt; ii) ChatGPT is clearly more likely to select labels at earlier positions as the answer. We hope that our experiments and analyses provide additional insights into building more reliable ChatGPT-based solutions. We release the source code at https://github.com/wangywUST/PrimacyEffectGPT.
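The label-order probe described in the last item can be made concrete with a small sketch: issue the same zero-shot classification prompt with the candidate labels in every order and compare which label the model picks. The example question, labels, and prompt wording are illustrative assumptions, and the call to a chat model is left as a placeholder rather than a specific API, since that study's exact prompts are not reproduced here.

    # Minimal sketch of a primacy-effect probe: present the same candidate
    # labels in every order and check whether the chosen label shifts with
    # its position in the prompt.
    import itertools

    question = "The plot was dull but the acting was superb. What is the overall sentiment?"
    labels = ["positive", "negative", "neutral"]

    for order in itertools.permutations(labels):
        prompt = (
            f"Question: {question}\n"
            f"Answer with exactly one of these labels: {', '.join(order)}"
        )
        # answer = chat_model(prompt)  # placeholder: call whatever chat client is available
        print(prompt, end="\n\n")

    # A primacy effect shows up if the selected label changes with its position,
    # with the first-listed label chosen disproportionately often.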