Advances in large language models (LLMs) have empowered a variety of applications. However, there is still a significant gap in research when it comes to understanding and enhancing the capabilities of LLMs in the field of mental health. In this work, we present a comprehensive evaluation of multiple LLMs on various mental health prediction tasks via online text data, including Alpaca, Alpaca-LoRA, FLAN-T5, GPT-3.5, and GPT-4. We conduct a broad range of experiments, covering zero-shot prompting, few-shot prompting, and instruction fine-tuning. The results indicate a promising yet limited performance of LLMs with zero-shot and few-shot prompt designs for mental health tasks. More importantly, our experiments show that instruction finetuning can significantly boost the performance of LLMs for all tasks simultaneously. Our best-finetuned models, Mental-Alpaca and Mental-FLAN-T5, outperform the best prompt design of GPT-3.5 (25 and 15 times bigger) by 10.9% on balanced accuracy and the best of GPT-4 (250 and 150 times bigger) by 4.8%. They further perform on par with the state-of-the-art task-specific language model. We also conduct an exploratory case study on LLMs' capability on mental health reasoning tasks, illustrating the promising capability of certain models such as GPT-4. We summarize our findings into a set of action guidelines for potential methods to enhance LLMs' capability for mental health tasks. Meanwhile, we also emphasize the important limitations before achieving deployability in real-world mental health settings, such as known racial and gender bias. We highlight the important ethical risks accompanying this line of research. 
                        more » 
                        « less   
                    This content will become publicly available on May 20, 2026
                            
                            Belief and Cognitive State in Text and Speech
                        
                    
    
            This thesis investigates the computational modeling of belief and related cognitive states as expressed in text and speech. Understanding how speakers or authors convey commitment, certainty, and emotions is crucial for language understanding, yet poses significant challenges for current NLP systems. We present a comprehensive study spanning multiple facets of belief prediction. We begin by re-examining the widely used FactBank corpus, correcting a critical projection error and establishing new state-of-the-art results for author-only belief prediction through multi-task learning and error analysis. We then tackle the more complex task of source-and-target belief prediction, introducing a novel generative framework using Flan-T5. This includes developing a structured database representation for FactBank and proposing a linearized tree generation approach, culminating in the BeLeaf system for visualization and analysis, which achieves state-of-the-art performance on both FactBank and the MDP corpus. With the rise of large language models (LLMs), we investigate their zero-shot capabilities for the source-and-target belief task. We propose Unified and Hybrid prompting frameworks, finding that while current LLMs struggle, particularly with nested beliefs, our Hybrid approach paired with reasoning-focused LLMs achieves new state-of-the-art results on FactBank. Finally, we explore the role of multimodality among multiple cognitive states. We present the first study on multimodal belief prediction using the CB-Prosody corpus, demonstrating that integrating audio features via fine-tuned Whisper models significantly improves performance over text-only BERT models. We further introduce Synthetic Audio Data (SAD), showing that even synthetic audio generated by TTS systems provides orthogonal, beneficial signals for various cognitive state tasks (belief, emotion, sentiment). We conclude by presenting OmniVox, the first systematic evaluation of omni-LLMs for zero-shot emotion recognition directly from audio, demonstrating their competitiveness with fine-tuned models and analyzing their acoustic reasoning capabilities. 
        more » 
        « less   
        
    
                            - Award ID(s):
- 2125295
- PAR ID:
- 10615180
- Publisher / Repository:
- ERIC
- Date Published:
- Format(s):
- Medium: X
- Institution:
- Stony Brook University
- Sponsoring Org:
- National Science Foundation
More Like this
- 
            
- 
            Abstract Large language models (LLMs) have been shown to have significant potential in few-shot learning across various fields, even with minimal training data. However, their ability to generalize to unseen tasks in more complex fields, such as biology and medicine has yet to be fully evaluated. LLMs can offer a promising alternative approach for biological inference, particularly in cases where structured data and sample size are limited, by extracting prior knowledge from text corpora. Here we report our proposed few-shot learning approach, which uses LLMs to predict the synergy of drug pairs in rare tissues that lack structured data and features. Our experiments, which involved seven rare tissues from different cancer types, demonstrate that the LLM-based prediction model achieves significant accuracy with very few or zero samples. Our proposed model, the CancerGPT (with ~ 124M parameters), is comparable to the larger fine-tuned GPT-3 model (with ~ 175B parameters). Our research contributes to tackling drug pair synergy prediction in rare tissues with limited data, and also advancing the use of LLMs for biological and medical inference tasks.more » « less
- 
            The adoption of large language models (LLMs) in healthcare has garnered significant research interest, yet their performance remains limited due to a lack of domain‐specific knowledge, medical reasoning skills, and their unimodal nature, which restricts them to text‐only inputs. To address these limitations, we propose MultiMedRes, a multimodal medical collaborative reasoning framework that simulates human physicians’ communication by incorporating a learner agent to proactively acquire information from domain‐specific expert models. MultiMedRes addresses medical multimodal reasoning problems through three steps i) Inquire: The learner agent decomposes complex medical reasoning problems into multiple domain‐specific sub‐problems; ii) Interact: The agent engages in iterative “ask‐answer” interactions with expert models to obtain domain‐specific knowledge; and iii) Integrate: The agent integrates all the acquired domain‐specific knowledge to address the medical reasoning problems (e.g., identifying the difference of disease levels and abnormality sizes between medical images). We validate the effectiveness of our method on the task of difference visual question answering for X‐ray images. The experiments show that our zero‐shot prediction achieves state‐of‐the‐art performance, surpassing fully supervised methods, which demonstrates that MultiMedRes could offer trustworthy and interpretable assistance to physicians in monitoring the treatment progression of patients, paving the way for effective human–AI interaction and collaboration.more » « less
- 
            Situations and events evoke emotions in humans, but to what extent do they inform the prediction of emotion detection models? This work investigates how well human-annotated emotion triggers correlate with features that models deemed salient in their prediction of emotions. First, we introduce a novel dataset EmoTrigger, consisting of 900 social media posts sourced from three different datasets; these were annotated by experts for emotion triggers with high agreement. Using EmoTrigger, we evaluate the ability of large language models (LLMs) to identify emotion triggers, and conduct a comparative analysis of the features considered important for these tasks between LLMs and fine-tuned models. Our analysis reveals that emotion triggers are largely not considered salient features for emotion prediction models, instead there is intricate interplay between various features and the task of emotion detection.more » « less
- 
            The ability of generative language models (GLMs) to generate text has improved considerably in the last few years, enabling their use for generative data augmentation. In this work, we propose CONDA, an approach to further improve GLM’s ability to generate synthetic data by reformulating data generation as context generation for a given question-answer (QA) pair and leveraging QA datasets for training context generators. Then, we cast downstream tasks into the same question answering format and adapt the fine-tuned context generators to the target task domain. Finally, we use the fine-tuned GLM to generate relevant contexts, which are in turn used as synthetic training data for their corresponding tasks. We perform extensive experiments on multiple classification datasets and demonstrate substantial improvements in performance for both few- and zero-shot settings. Our analysis reveals that QA datasets that require high-level reasoning abilities (e.g., abstractive and common-sense QA datasets) tend to give the best boost in performance in both few-shot and zero-shot settings.more » « less
 An official website of the United States government
An official website of the United States government 
				
			 
					 
					
