Advances in large language models (LLMs) have empowered a variety of applications. However, there is still a significant gap in research when it comes to understanding and enhancing the capabilities of LLMs in the field of mental health. In this work, we present a comprehensive evaluation of multiple LLMs on various mental health prediction tasks via online text data, including Alpaca, Alpaca-LoRA, FLAN-T5, GPT-3.5, and GPT-4. We conduct a broad range of experiments, covering zero-shot prompting, few-shot prompting, and instruction fine-tuning. The results indicate promising yet limited performance of LLMs with zero-shot and few-shot prompt designs for mental health tasks. More importantly, our experiments show that instruction fine-tuning can significantly boost the performance of LLMs for all tasks simultaneously. Our best fine-tuned models, Mental-Alpaca and Mental-FLAN-T5, outperform the best prompt design of GPT-3.5 (25 and 15 times bigger) by 10.9% on balanced accuracy and the best of GPT-4 (250 and 150 times bigger) by 4.8%. They further perform on par with the state-of-the-art task-specific language model. We also conduct an exploratory case study on LLMs' capability on mental health reasoning tasks, illustrating the promising capability of certain models such as GPT-4. We summarize our findings into a set of action guidelines for potential methods to enhance LLMs' capability for mental health tasks. At the same time, we emphasize important limitations that must be addressed before deployment in real-world mental health settings, such as known racial and gender biases. We highlight the important ethical risks accompanying this line of research.
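The balanced-accuracy comparison in the abstract above can be made concrete. As a minimal sketch with toy labels (not the paper's data), balanced accuracy is the mean of per-class recall, which keeps a majority-class-only predictor from looking deceptively strong on imbalanced mental health datasets:

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall; robust to class imbalance."""
    labels = sorted(set(y_true))
    recalls = []
    for label in labels:
        hits = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        total = sum(t == label for t in y_true)
        recalls.append(hits / total)
    return sum(recalls) / len(recalls)

# Illustrative labels only: 3 positive posts, 1 control post.
y_true = ["depressed", "depressed", "depressed", "control"]
y_pred = ["depressed", "depressed", "control", "control"]
# Per-class recall: 2/3 for "depressed", 1.0 for "control" -> mean 5/6.
```

A model that always predicts the majority class scores 1/k on balanced accuracy over k classes, which is why the metric suits the imbalanced prediction tasks described above.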
Assessing Student Explanations with Large Language Models Using Fine-Tuning and Few-Shot Learning
The practice of soliciting self-explanations from students is widely recognized for its pedagogical benefits. However, the labor-intensive effort required to manually assess students' explanations makes it impractical for classroom settings. As a result, many current solutions to gauge students' understanding during class are often limited to multiple choice or fill-in-the-blank questions, which are less effective at exposing misconceptions or helping students to understand and integrate new concepts. Recent advances in large language models (LLMs) present an opportunity to assess student explanations in real-time, making explanation-based classroom response systems feasible for implementation. In this work, we investigate LLM-based approaches for assessing the correctness of students' explanations in response to undergraduate computer science questions. We investigate alternative prompting approaches for multiple LLMs (i.e., Llama 2, GPT-3.5, and GPT-4) and compare their performance to fine-tuned FLAN-T5 models. The results suggest that the highest accuracy and weighted F1 score were achieved by fine-tuning FLAN-T5, while an in-context learning approach with GPT-4 attains the highest macro F1 score.
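The split result above (fine-tuned FLAN-T5 best on weighted F1, GPT-4 best on macro F1) comes down to how the two averages treat class imbalance. A pure-Python sketch with illustrative toy labels (not the study's data):

```python
from collections import Counter

def f1(y_true, y_pred, label):
    """Per-class F1 for a single label."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def macro_f1(y_true, y_pred):
    """Unweighted mean over classes: every class counts equally."""
    labels = sorted(set(y_true))
    return sum(f1(y_true, y_pred, l) for l in labels) / len(labels)

def weighted_f1(y_true, y_pred):
    """Mean weighted by class frequency: majority classes dominate."""
    counts = Counter(y_true)
    return sum(counts[l] / len(y_true) * f1(y_true, y_pred, l) for l in counts)

# Imbalanced toy set: 8 "correct" explanations vs. 2 "incorrect".
y_true = ["correct"] * 8 + ["incorrect"] * 2
y_pred = ["correct"] * 9 + ["incorrect"]
```

Here the classifier handles the majority class well (F1 = 16/17) but the minority class poorly (F1 = 2/3), so weighted F1 exceeds macro F1; a model that is evenly competent across classes is favored by the macro average instead.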
                            - Award ID(s):
- 2111473
- PAR ID:
- 10522886
- Editor(s):
- Kochmar, E; Bexte, M; Burstein, J; Horbach, A; Laarmann-Quante, R; Tack, A; Yaneva, V; Yuan, Z
- Publisher / Repository:
- Proceedings of the 19th Workshop on Innovative Use of NLP for Building Educational Applications, Association for Computational Linguistics
- Date Published:
- Page Range / eLocation ID:
- 403-413
- Format(s):
- Medium: X
- Location:
- Mexico City, Mexico
- Sponsoring Org:
- National Science Foundation
More Like this
- 
We report explorations into prompt engineering with large pre-trained language models that were not fine-tuned to solve the legal entailment task (Task 4) of the 2023 COLIEE competition. Our most successful strategy used simple text similarity measures to retrieve articles and queries from the training set. We describe our efforts to optimize performance with both OpenAI's GPT-4 and FLAN-T5. We also used an ensemble approach to find the best combination of models and prompts. Finally, we analyze our results and suggest ideas for future improvements.
- 
Learning outcomes are clear and concise statements that describe what students should be able to do or know at the end of a particular course. These statements are crucial in instructional planning, curriculum development, and assessment of student progress and learning. Although there is no universal guidance on how to develop learning outcomes, Bloom's taxonomy is one widely used framework that helps instructors develop outcomes that reflect different levels of thinking, from basic remembering to creative problem-solving. This study investigates the potential of generative AI, specifically GPT-4, in classifying course learning outcomes according to their respective cognitive levels within the revised Bloom's taxonomy. To assess the effectiveness of GenAI, we conducted a comparative study using a dataset of 1000 annotated learning outcomes. We tested multiple prompt engineering strategies, including zero-shot, few-shot, chain-of-thought, rhetorical situation, and multiple binary questions, leveraging GPT-4. Classification performance was evaluated using accuracy, Cohen's κ, and F1-score. The results indicate that the prompt incorporating rhetorical context and domain-specific knowledge achieved the highest classification performance, while the multiple binary question approach underperformed even compared to the zero-shot method. Furthermore, we compared the best-performing prompting strategy with a state-of-the-art classification model, BERT. Although the fine-tuned BERT model showed superior performance, prompt-based classification exhibited moderate to substantial agreement with expert annotations. Overall, this article demonstrates the potential of leveraging large language models to advance both theoretical understanding and practical application within the field of education and natural language processing.
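The zero-shot and few-shot setups contrasted above differ only in whether labeled demonstrations precede the query. The templates below are an illustrative sketch, not the study's actual prompt wording:

```python
# The six cognitive levels of the revised Bloom's taxonomy.
BLOOM_LEVELS = ["Remember", "Understand", "Apply", "Analyze", "Evaluate", "Create"]

def zero_shot_prompt(outcome: str) -> str:
    """Ask for a level directly, with no worked examples."""
    return (
        "Classify the following course learning outcome into one level of the "
        f"revised Bloom's taxonomy ({', '.join(BLOOM_LEVELS)}).\n"
        f"Learning outcome: {outcome}\nLevel:"
    )

def few_shot_prompt(outcome: str, examples: list) -> str:
    """Prepend labeled (outcome, level) demonstrations before the query."""
    shots = "\n".join(f"Learning outcome: {o}\nLevel: {lvl}" for o, lvl in examples)
    return (
        "Classify each course learning outcome into one level of the revised "
        f"Bloom's taxonomy ({', '.join(BLOOM_LEVELS)}).\n"
        f"{shots}\nLearning outcome: {outcome}\nLevel:"
    )
```

The chain-of-thought and rhetorical-situation variants extend this pattern with reasoning instructions or audience framing before the query.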
- 
Martin, Fred; Norouzi, Narges; Rosenthal, Stephanie (Eds.) This paper examines the use of LLMs to support the grading and explanation of short-answer formative assessments in K12 science topics. While significant work has been done on programmatically scoring well-structured student assessments in math and computer science, many of these approaches produce a numerical score and stop short of providing teachers and students with explanations for the assigned scores. In this paper, we investigate few-shot, in-context learning with chain-of-thought reasoning and active learning using GPT-4 for automated assessment of students' answers in a middle school Earth Science curriculum. Our findings from this human-in-the-loop approach demonstrate success in scoring formative assessment responses and in providing meaningful explanations for the assigned score. We then perform a systematic analysis of the advantages and limitations of our approach. This research provides insight into how we can use human-in-the-loop methods for the continual improvement of automated grading for open-ended science assessments.
- 
Social media has revolutionized communication, allowing people worldwide to connect and interact instantly. However, it has also led to increases in cyberbullying, which poses a significant threat to children and adolescents globally, affecting their mental health and well-being. It is critical to accurately detect the roles of individuals involved in cyberbullying incidents to effectively address the issue on a large scale. This study explores the use of machine learning models to detect the roles involved in cyberbullying interactions. After examining the AMiCA dataset and addressing class imbalance issues, we evaluate the performance of various models built with four underlying LLMs (i.e., BERT, RoBERTa, T5, and GPT-2) for role detection. Our analysis shows that oversampling techniques help improve model performance. The best model, a fine-tuned RoBERTa using oversampled data, achieved an overall F1 score of 83.5%, increasing to 89.3% after applying a prediction threshold. The top-2 F1 score without thresholding was 95.7%. Our method outperforms previously proposed models. After investigating the per-class model performance and confidence scores, we show that the models perform well in classes with more samples and less contextual confusion (e.g., Bystander Other), but struggle with classes with fewer samples (e.g., Bystander Assistant) and more contextual ambiguity (e.g., Harasser and Victim). This work highlights current strengths and limitations in the development of accurate models with limited data and complex scenarios.
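The prediction-threshold step reported above can be sketched as confidence-based abstention: predictions below a confidence cutoff are withheld, trading coverage for quality on the retained subset. This is a generic illustration with hypothetical role labels and scores (and accuracy rather than the paper's F1), not the authors' exact procedure:

```python
def eval_with_threshold(examples, threshold):
    """examples: (true_label, predicted_label, confidence) triples.
    Predictions below the confidence threshold are abstained from;
    quality is measured only on the retained predictions."""
    kept = [(t, p) for t, p, c in examples if c >= threshold]
    coverage = len(kept) / len(examples)
    accuracy = sum(t == p for t, p in kept) / len(kept) if kept else 0.0
    return accuracy, coverage

# Hypothetical role predictions with model confidence scores.
preds = [
    ("Harasser", "Harasser", 0.92),
    ("Victim", "Victim", 0.81),
    ("Bystander Assistant", "Harasser", 0.42),
    ("Victim", "Bystander Other", 0.35),
    ("Bystander Other", "Bystander Other", 0.77),
]
```

Raising the threshold here filters out the two low-confidence mistakes, so accuracy on retained samples rises while coverage drops, which mirrors how a threshold can lift the reported F1.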
 An official website of the United States government