

Title: Generative AI tools can enhance climate literacy but must be checked for biases and inaccuracies
Abstract

In the face of climate change, climate literacy is becoming increasingly important. With wide access to generative AI tools, such as OpenAI’s ChatGPT, we explore the potential of AI platforms for ordinary citizens asking climate literacy questions. Here, we focus on a global scale and collect responses from ChatGPT (GPT-3.5 and GPT-4) to climate change-related hazard prompts over multiple iterations using OpenAI’s API, and compare the results with credible hazard risk indices. We find general agreement between ChatGPT’s responses and the indices, and consistency in ChatGPT’s answers across iterations. GPT-4 displayed fewer errors than GPT-3.5. Generative AI tools may be used to support climate literacy, a timely topic of importance, but they must be scrutinized for potential biases and inaccuracies moving forward and considered in a social context. Future work should identify and disseminate best practices for optimal use across various generative AI tools.
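As a rough illustration of the querying procedure described above, the sketch below repeatedly asks OpenAI's chat models a hazard question for a handful of countries and stores the answers for later comparison; the model names, prompt wording, and country list are illustrative assumptions rather than the study's exact protocol.

    # A minimal sketch of repeated querying via OpenAI's API; prompts,
    # models, and countries are placeholders, not the authors' protocol.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    countries = ["Bangladesh", "Norway", "Kenya"]   # hypothetical subset
    models = ["gpt-3.5-turbo", "gpt-4"]
    n_iterations = 5                                # repeat to gauge consistency

    responses = {}
    for model in models:
        for country in countries:
            prompt = (f"On a scale of 1 to 10, how exposed is {country} to "
                      "climate change-related flood hazards? Answer with a number.")
            responses[(model, country)] = [
                client.chat.completions.create(
                    model=model,
                    messages=[{"role": "user", "content": prompt}],
                ).choices[0].message.content
                for _ in range(n_iterations)
            ]

    # The per-country answers can then be averaged across iterations and
    # compared against published hazard risk indices to look for agreement.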

NSF-PAR ID: 10503579
Publisher / Repository: Nature Publishing Group
Journal Name: Communications Earth & Environment
Volume: 5
Issue: 1
ISSN: 2662-4435
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this
  1. Abstract

    Importance

    The study highlights the potential of large language models, specifically GPT-3.5 and GPT-4, in processing complex clinical data and extracting meaningful information with minimal training data. By developing and refining prompt-based strategies, we can significantly enhance the models’ performance, making them viable tools for clinical NER tasks and possibly reducing the reliance on extensive annotated datasets.

    Objectives

    This study quantifies the capabilities of GPT-3.5 and GPT-4 for clinical named entity recognition (NER) tasks and proposes task-specific prompts to improve their performance.

    Materials and Methods

    We evaluated these models on 2 clinical NER tasks: (1) to extract medical problems, treatments, and tests from clinical notes in the MTSamples corpus, following the 2010 i2b2 concept extraction shared task, and (2) to identify nervous system disorder-related adverse events from safety reports in the vaccine adverse event reporting system (VAERS). To improve the GPT models' performance, we developed a clinical task-specific prompt framework that includes (1) baseline prompts with task description and format specification, (2) annotation guideline-based prompts, (3) error analysis-based instructions, and (4) annotated samples for few-shot learning. We assessed each prompt's effectiveness and compared the models to BioClinicalBERT.
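    As a loose illustration of this framework, the sketch below assembles the four components into a single prompt string; the task wording, guideline excerpt, error notes, and few-shot examples are placeholders rather than the prompts used in the study.

        # Illustrative assembly of the four prompt components described above;
        # all text passed in is placeholder content, not the study's prompts.
        from typing import Sequence, Tuple

        def build_ner_prompt(note_text: str,
                             guideline: str = "",
                             error_notes: str = "",
                             examples: Sequence[Tuple[str, str]] = ()) -> str:
            # (1) baseline prompt: task description and output format
            parts = [
                "Extract all medical problems, treatments, and tests "
                "from the clinical note.",
                "Return one entity per line as <entity type> | <entity text>.",
            ]
            # (2) annotation guideline-based instructions
            if guideline:
                parts.append("Annotation guidelines:\n" + guideline)
            # (3) error analysis-based instructions
            if error_notes:
                parts.append("Common mistakes to avoid:\n" + error_notes)
            # (4) annotated samples for few-shot learning
            for note, annotation in examples:
                parts.append("Note:\n" + note + "\nEntities:\n" + annotation)
            parts.append("Note:\n" + note_text + "\nEntities:")
            return "\n\n".join(parts)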

    Results

    Using baseline prompts, GPT-3.5 and GPT-4 achieved relaxed F1 scores of 0.634 and 0.804 for MTSamples and 0.301 and 0.593 for VAERS. Additional prompt components consistently improved model performance. When all 4 components were used, GPT-3.5 and GPT-4 achieved relaxed F1 scores of 0.794 and 0.861 for MTSamples and 0.676 and 0.736 for VAERS, demonstrating the effectiveness of our prompt framework. Although these results trail BioClinicalBERT (F1 of 0.901 for the MTSamples dataset and 0.802 for VAERS), they are very promising considering the few training samples needed.

    Discussion

    The study’s findings suggest a promising direction in leveraging LLMs for clinical NER tasks. However, while the performance of GPT models improved with task-specific prompts, there is a need for further development and refinement. LLMs like GPT-4 show potential to approach the performance of state-of-the-art models like BioClinicalBERT, but they still require careful prompt engineering and an understanding of task-specific knowledge. The study also underscores the importance of evaluation schemas that accurately reflect the capabilities and performance of LLMs in clinical settings.

    Conclusion

    While direct application of GPT models to clinical NER tasks falls short of optimal performance, our task-specific prompt framework, incorporating medical knowledge and training samples, significantly enhances GPT models' feasibility for potential clinical applications.

  2. This paper presents an innovative testing framework, testFAILS, designed for the rigorous evaluation of AI Linguistic Systems, with a particular emphasis on various iterations of ChatGPT. Leveraging orthogonal array coverage, this framework provides a robust mechanism for assessing AI systems, addressing the critical question, "How should we evaluate AI?" While the Turing test has traditionally been the benchmark for AI evaluation, we argue that current publicly available chatbots, despite their rapid advancements, have yet to meet this standard. However, the pace of progress suggests that achieving Turing test-level performance may be imminent. In the interim, the need for effective AI evaluation and testing methodologies remains paramount. Our research, which is ongoing, has already validated several versions of ChatGPT, and we are currently conducting comprehensive testing on the latest models, including ChatGPT-4, Bard and Bing Bot, and the LLaMA model. The testFAILS framework is designed to be adaptable, ready to evaluate new bot versions as they are released. Additionally, we have tested available chatbot APIs and developed our own application, AIDoctor, utilizing the ChatGPT-4 model and Microsoft Azure AI technologies. 
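    As a small illustration of orthogonal-array coverage, the sketch below derives nine chatbot test configurations from four three-level factors using the standard L9 Taguchi array; the factor names and levels are hypothetical, not the factors used in testFAILS.

        # Orthogonal-array test selection; factors and levels are invented.
        FACTORS = {
            "language":      ["English", "Spanish", "Mandarin"],
            "topic":         ["medicine", "law", "programming"],
            "prompt_length": ["short", "medium", "long"],
            "persona":       ["student", "expert", "adversarial"],
        }

        # Standard L9 orthogonal array: 9 runs cover every pair of levels
        # across the four three-level factors at least once.
        L9 = [
            (0, 0, 0, 0), (0, 1, 1, 1), (0, 2, 2, 2),
            (1, 0, 1, 2), (1, 1, 2, 0), (1, 2, 0, 1),
            (2, 0, 2, 1), (2, 1, 0, 2), (2, 2, 1, 0),
        ]

        test_cases = [
            {name: levels[i] for (name, levels), i in zip(FACTORS.items(), row)}
            for row in L9
        ]

        for case in test_cases:
            print(case)  # each dict is one chatbot test configuration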
  3. Generative AI, exemplified by ChatGPT, Dall-E 2, and Stable Diffusion, comprises exciting new applications consuming growing quantities of computing. We study the compute, energy, and carbon impacts of generative AI inference. Using ChatGPT as an exemplar, we create a workload model and compare request direction approaches (Local, Balance, CarbonMin), assessing their power use and carbon impacts. Our workload model shows that for ChatGPT-like services, inference dominates emissions, in one year producing 25x the carbon emissions of training GPT-3. The workload model characterizes user experience, and experiments show that carbon emissions-aware algorithms (CarbonMin) can both maintain user experience and reduce carbon emissions dramatically (35%). We also consider a future scenario (2035 workload and power grids), and show that CarbonMin can reduce emissions by 56%. In both cases, the key is intelligent direction of requests to locations with low-carbon power. Combined with hardware technology advances, CarbonMin can keep the emissions increase to only 20% compared to 2022 levels for a 55x greater workload. Finally, we consider datacenter headroom to increase the effectiveness of shifting. With headroom, CarbonMin reduces 2035 emissions by 71%.
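    The sketch below is a toy version of the carbon-aware request direction idea behind CarbonMin: route each request to the lowest-carbon region that still has capacity. Region names, carbon intensities, and capacities are invented, and the real policy also accounts for latency, user experience, and datacenter headroom.

        # Toy carbon-aware request routing; all numbers are invented.
        regions = {
            # region: [carbon intensity (gCO2/kWh), remaining request capacity]
            "hydro-north": [25, 300],
            "mixed-east":  [180, 500],
            "coal-south":  [650, 800],
        }

        def carbon_min(regions):
            """Pick the lowest-carbon region that still has capacity."""
            candidates = [(ci, name) for name, (ci, cap) in regions.items() if cap > 0]
            if not candidates:
                return None
            _, name = min(candidates)
            regions[name][1] -= 1  # consume one slot of capacity
            return name

        for _ in range(5):
            print(carbon_min(regions))  # stays on "hydro-north" until it fills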
  4. The purpose of this study was to gauge the attitudes towards artificial intelligence (AI) use in the science classroom by science teachers at the start of generative AI chatbot popularity (March 2023). The lens of distributed cognition afforded an opportunity to gather thoughts, opinions, and perceptions from 24 secondary science educators as well as three AI chatbots. Purposeful sampling was used to form the initial science educator focus groups, and both human and AI participants individually responded to an attitudes survey as well as an epistemic cognition questionnaire over a 2-week period. In addition to participating in the study, AI—specifically OpenAI’s ChatGPT—was used to create two of the three survey instruments and served as an analysis tool for the qualitative results of this mixed-methods study. Results from the qualitative data suggest that secondary science educators are cautiously optimistic about the inclusion of AI in the classroom; however, there is a need to modify teacher preparation to incorporate AI training. Further, ethical concerns were identified about plagiarism, knowledge generation, and what constitutes original thinking with AI use. A one-way ANOVA revealed that there was a significant effect of subject taught on attitudes towards AI in the classroom at the p < 0.05 level for the four conditions: F(3, 23) = 6.743, p = .002. The partial eta squared of 0.47 indicates a large effect size with practical significance. This study serves as an artifact of knowledge about knowledge at the beginning of a technological revolution.
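    For readers unfamiliar with the reported statistics, the sketch below shows how a one-way ANOVA and partial eta squared of this kind can be computed; the attitude scores grouped by subject taught are invented placeholders, not the study's data.

        # One-way ANOVA with effect size; the scores below are invented.
        from scipy import stats

        groups = {  # hypothetical attitude scores by subject taught
            "biology":   [3.2, 3.8, 4.1, 3.5, 3.9, 4.0],
            "chemistry": [2.1, 2.6, 2.4, 2.8, 2.3, 2.5],
            "physics":   [3.9, 4.2, 4.5, 4.1, 4.4, 4.0],
            "earth sci": [3.0, 3.3, 2.9, 3.1, 3.4, 3.2],
        }

        f_stat, p_value = stats.f_oneway(*groups.values())

        # For a one-way design, partial eta squared equals
        # SS_between / (SS_between + SS_within).
        scores = [x for g in groups.values() for x in g]
        grand_mean = sum(scores) / len(scores)
        ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                         for g in groups.values())
        ss_within = sum((x - sum(g) / len(g)) ** 2
                        for g in groups.values() for x in g)
        eta_sq = ss_between / (ss_between + ss_within)

        print(f"F({len(groups) - 1}, {len(scores) - len(groups)}) = {f_stat:.3f}, "
              f"p = {p_value:.3f}, partial eta^2 = {eta_sq:.2f}")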
  5. Artificial Intelligence (AI) is a transformative force in communication and messaging strategy, with potential to disrupt traditional approaches. Large language models (LLMs), a form of AI, are capable of generating high-quality, humanlike text. We investigate the persuasive quality of AI-generated messages to understand how AI could impact public health messaging. Specifically, through a series of studies designed to characterize and evaluate generative AI in developing public health messages, we analyze COVID-19 pro-vaccination messages generated by GPT-3, a state-of-the-art instantiation of a large language model. Study 1 is a systematic evaluation of GPT-3's ability to generate pro-vaccination messages. Study 2 then observed people's perceptions of curated GPT-3-generated messages compared to human-authored messages released by the CDC (Centers for Disease Control and Prevention), finding that GPT-3 messages were perceived as more effective, stronger arguments, and evoked more positive attitudes than CDC messages. Finally, Study 3 assessed the role of source labels on perceived quality, finding that while participants preferred AI-generated messages, they expressed a dispreference for messages that were labeled as AI-generated. The results suggest that, with human supervision, AI can be used to create effective public health messages, but that individuals prefer their public health messages to come from human institutions rather than AI sources. We propose best practices for assessing generative outputs of large language models in future social science research and ways health professionals can use AI systems to augment public health messaging.
