Title: Identifying Generative Artificial Intelligence Chatbot Use on Multiple-Choice, General Chemistry Exams Using Rasch Analysis
Generative artificial intelligence (AI) technology is expected to have a profound impact on chemical education. While there are certainly positive uses, some of which are already being implemented, there is reasonable concern about its use in cheating. Efforts are underway to detect generative AI usage on open-ended questions, lab reports, and essays, but its detection on multiple-choice exams is largely unexplored. Here we propose the use of Rasch analysis to identify the unique behavioral pattern of ChatGPT on General Chemistry II multiple-choice exams. While raw statistics (e.g., average, ability, outfit) were insufficient to readily identify ChatGPT instances, a strategy of fixing the ability scale on high-success questions and then refitting the outcomes dramatically enhanced its outlier behavior in terms of the Z-standardized outfit statistic and ability displacement. Setting the detection threshold to a true positive rate (TPR) of 1.0, a false positive rate (FPR) of <0.1 was obtained across a majority of the 20 exams investigated here. Furthermore, the receiver operating characteristic curve (i.e., FPR vs TPR) exhibited outstanding areas under the curve of >0.9 for nearly all exams. While limitations of this method are described and the analysis is by no means exhaustive, these outcomes suggest that the unique behavior patterns of generative AI chatbots can be identified using Rasch modeling and fit statistics.
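The two detection metrics named in the abstract (FPR at a threshold chosen so that TPR = 1.0, and the area under the ROC curve) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names are hypothetical, the scores are made-up numbers standing in for a per-examinee outlier statistic such as the Z-standardized outfit, and no Rasch fitting is performed here.

```python
def fpr_at_full_tpr(scores, is_ai):
    """Set the threshold at the lowest score among known AI instances,
    so every AI instance is flagged (TPR = 1.0), then report the
    fraction of human examinees that are flagged as well (FPR)."""
    ai_scores = [s for s, a in zip(scores, is_ai) if a]
    human_scores = [s for s, a in zip(scores, is_ai) if not a]
    threshold = min(ai_scores)
    false_positives = sum(1 for s in human_scores if s >= threshold)
    return threshold, false_positives / len(human_scores)


def roc_auc(scores, is_ai):
    """Area under the ROC curve, computed as the probability that a
    randomly chosen AI instance scores higher than a randomly chosen
    human examinee (ties count as one half)."""
    ai_scores = [s for s, a in zip(scores, is_ai) if a]
    human_scores = [s for s, a in zip(scores, is_ai) if not a]
    wins = sum(
        1.0 if a > h else 0.5 if a == h else 0.0
        for a in ai_scores
        for h in human_scores
    )
    return wins / (len(ai_scores) * len(human_scores))


# Hypothetical outlier scores: ten human examinees, then two AI instances.
scores = [0.2, -0.5, 1.1, 0.8, 2.6, -1.0, 0.3, 1.9, 0.0, 0.6, 2.4, 3.1]
is_ai = [False] * 10 + [True] * 2

threshold, fpr = fpr_at_full_tpr(scores, is_ai)
auc = roc_auc(scores, is_ai)
print(f"threshold={threshold}, FPR={fpr:.2f}, AUC={auc:.2f}")
```

In this toy data one human examinee scores above the weakest AI instance, so catching every AI instance costs one false positive out of ten human examinees.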
Award ID(s):
2327754
PAR ID:
10572086
Author(s) / Creator(s):
Publisher / Repository:
American Chemical Society
Date Published:
Journal Name:
Journal of Chemical Education
Volume:
101
Issue:
8
ISSN:
0021-9584
Page Range / eLocation ID:
3216 to 3223
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Harmful textual content is pervasive on social media, poisoning online communities and negatively impacting participation. A common approach to this issue is developing detection models that rely on human annotations. However, the tasks required to build such models expose annotators to harmful and offensive content and may require significant time and cost to complete. Generative AI models have the potential to understand and detect harmful textual content. We used ChatGPT to investigate this potential and compared its performance with MTurker annotations for three frequently discussed concepts related to harmful textual content on social media: Hateful, Offensive, and Toxic (HOT). We designed five prompts to interact with ChatGPT and conducted four experiments eliciting HOT classifications. Our results show that ChatGPT can achieve an accuracy of approximately 80% when compared to MTurker annotations. Specifically, the model displays a more consistent classification for non-HOT comments than HOT comments compared to human annotations. Our findings also suggest that ChatGPT classifications align with the provided HOT definitions. However, ChatGPT classifies “hateful” and “offensive” as subsets of “toxic.” Moreover, the choice of prompts used to interact with ChatGPT impacts its performance. Based on these insights, our study provides several meaningful implications for employing ChatGPT to detect HOT content, particularly regarding the reliability and consistency of its performance, its understanding and reasoning of the HOT concept, and the impact of prompts on its performance. Overall, our study provides guidance on the potential of using generative AI models for moderating large volumes of user-generated textual content on social media. 
  2. Abstract In the face of climate change, climate literacy is becoming increasingly important. With wide access to generative AI tools, such as OpenAI’s ChatGPT, we explore the potential of AI platforms for ordinary citizens asking climate literacy questions. Here, we focus on a global scale: we collect responses from ChatGPT (GPT-3.5 and GPT-4) to climate change-related hazard prompts over multiple iterations using the OpenAI API, and compare the results with credible hazard risk indices. We find general agreement in these comparisons and consistency in ChatGPT's responses across iterations. GPT-4 displayed fewer errors than GPT-3.5. Generative AI tools may be used in climate literacy, a timely topic of importance, but must be scrutinized for potential biases and inaccuracies moving forward and considered in a social context. Future work should identify and disseminate best practices for optimal use across various generative AI tools.
  3. Abstract Background: Spreading depolarizations (SDs) are a biomarker and a potentially treatable mechanism of worsening brain injury after traumatic brain injury (TBI). Noninvasive detection of SDs could transform critical care for brain injury patients but has remained elusive. Current methods to detect SDs are based on invasive intracranial recordings with limited spatial coverage. In this study, we establish the feasibility of automated SD detection through noninvasive scalp electroencephalography (EEG) for patients with severe TBI. Methods: Building on our recent WAVEFRONT algorithm, we designed an automated SD detection method. This algorithm, with learnable parameters and improved velocity estimation, extracts and tracks propagating power depressions using low-density EEG. The dataset for testing our algorithm contains 700 total SDs in 12 severe TBI patients who underwent decompressive hemicraniectomy (DHC), labeled using ground-truth intracranial EEG recordings. We utilize simultaneously recorded, continuous, low-density (19 electrodes) scalp EEG signals to quantify the detection accuracy of WAVEFRONT in terms of true positive rate (TPR), false positive rate (FPR), as well as the accuracy of estimating SD frequency. Results: WAVEFRONT achieves the best average validation accuracy using Delta band EEG: 74% TPR with less than 1.5% FPR. Further, preliminary evidence suggests WAVEFRONT can estimate how frequently SDs may occur. Conclusions: We establish the feasibility, and quantify the performance, of noninvasive SD detection after severe TBI using an automated algorithm. The algorithm, WAVEFRONT, can also potentially be used for diagnosis, monitoring, and tailoring treatments for worsening brain injury. Extension of these results to patients with intact skulls requires further study.
  4. Abstract Though consistently shown to detect mammographically occult cancers, breast ultrasound has been noted to have high false-positive rates. In this work, we present an AI system that achieves radiologist-level accuracy in identifying breast cancer in ultrasound images. Developed on 288,767 exams, consisting of 5,442,907 B-mode and Color Doppler images, the AI achieves an area under the receiver operating characteristic curve (AUROC) of 0.976 on a test set consisting of 44,755 exams. In a retrospective reader study, the AI achieves a higher AUROC than the average of ten board-certified breast radiologists (AUROC: 0.962 AI, 0.924 ± 0.02 radiologists). With the help of the AI, radiologists decrease their false positive rates by 37.3% and reduce requested biopsies by 27.8%, while maintaining the same level of sensitivity. This highlights the potential of AI in improving the accuracy, consistency, and efficiency of breast ultrasound diagnosis. 
  5. Abstract The early detection of neurodevelopmental disorders (NDDs) can significantly improve patient outcomes. The differential burden of non-synonymous de novo mutation among NDD cases and controls indicates that de novo coding variation can be used to identify a subset of samples that will likely display an NDD phenotype. Thus, we have developed an approach for the accurate prediction of NDDs with very low false positive rate (FPR) using de novo coding variation for a small subset of cases. We use a shallow neural network that integrates de novo likely gene-disruptive and missense variants, measures of gene constraint, and conservation information to predict a small subset of NDD cases at very low FPR and prioritizes NDD risk genes for future clinical study. 