This content will become publicly available on February 26, 2026

Title: Artificial intelligence in qualitative analysis: a practical guide and reflections based on results from using GPT to analyze interview data in a substance use program
Language-based texts provide valuable insights into people's lived experiences. While traditional qualitative analysis is used to capture these nuances, new paradigms are needed to scale qualitative research effectively. Artificial intelligence presents an unprecedented opportunity to expand the scale of analysis for obtaining such nuances. The study tests the application of GPT-4, a large language model, in qualitative data analysis using an existing set of text data derived from 60 qualitative interviews. Specifically, the study provides a practical guide for social and behavioral researchers, illustrating core elements and key processes, demonstrating its reliability by comparing GPT-generated codes with researchers' codes, and evaluating its capacity for theory-driven qualitative analysis. The study followed a three-step approach: (1) prompt engineering, (2) reliability assessment by comparison of GPT-generated codes with researchers' codes, and (3) evaluation of theory-driven thematic analysis on psychological constructs. The study underscores the utility of GPT's capabilities in coding and analyzing text data with established qualitative methods while highlighting the need for qualitative expertise to guide GPT applications. Recommendations for further exploration are also discussed.
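The abstract describes the coding workflow only at a high level. As a rough illustration of the prompt-engineering step, the sketch below shows one way a single codebook code could be applied to an interview excerpt through the OpenAI chat API; the model name, codebook entry, prompt wording, and excerpt are illustrative assumptions, not materials from the study.

```python
# Minimal sketch of applying one qualitative code to one interview excerpt.
# The codebook entry, prompt wording, and excerpt are hypothetical examples.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

CODEBOOK_ENTRY = (
    "Code: PERCEIVED_BARRIERS\n"
    "Definition: Statements describing obstacles the participant faces "
    "in accessing or continuing substance use treatment."
)

def code_excerpt(excerpt: str) -> str:
    """Ask the model whether a single interview excerpt fits one codebook code."""
    prompt = (
        f'{CODEBOOK_ENTRY}\n\n'
        f'Interview excerpt:\n"{excerpt}"\n\n'
        'Does this excerpt fit the code? Answer YES or NO, then give a '
        'one-sentence justification.'
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are assisting with qualitative coding."},
            {"role": "user", "content": prompt},
        ],
        temperature=0,  # deterministic output is easier to compare against human coders
    )
    return response.choices[0].message.content

print(code_excerpt("I couldn't get to the clinic because the bus only runs twice a day."))
```

In a reliability check along the lines of step (2), outputs from calls like this would be collected across excerpts and compared against the researchers' codes.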
Award ID(s):
2449011
PAR ID:
10590730
Author(s) / Creator(s):
Publisher / Repository:
Springer Nature
Date Published:
Journal Name:
Quality & Quantity
ISSN:
0033-5177
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Qualitative coding, or content analysis, is more than just labeling text: it is a reflexive interpretive practice that shapes research questions, refines theoretical insights, and illuminates subtle social dynamics. As large language models (LLMs) become increasingly adept at nuanced language tasks, questions arise about whether—and how—they can assist in large-scale coding without eroding the interpretive depth that distinguishes qualitative analysis from traditional machine learning and other quantitative approaches to natural language processing. In this paper, we present a hybrid approach that preserves hermeneutic value while incorporating LLMs to scale the application of codes to large data sets that are impractical for manual coding. Our workflow retains the traditional cycle of codebook development and refinement, adding an iterative step to adapt definitions for machine comprehension, before ultimately replacing manual with automated text categorization. We demonstrate how to rewrite code descriptions for LLM-interpretation, as well as how structured prompts and prompting the model to explain its coding decisions (chain-of-thought) can substantially improve fidelity. Empirically, our case study of socio-historical codes highlights the promise of frontier AI language models to reliably interpret paragraph-long passages representative of a humanistic study. Throughout, we emphasize ethical and practical considerations, preserving space for critical reflection, and the ongoing need for human researchers’ interpretive leadership. These strategies can guide both traditional and computational scholars aiming to harness automation effectively and responsibly—maintaining the creative, reflexive rigor of qualitative coding while capitalizing on the efficiency afforded by LLMs.
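As a purely illustrative sketch of the workflow this abstract describes, the snippet below shows a codebook entry rewritten for machine comprehension, a structured prompt, and a request that the model explain its decision before answering; the code name, definition, exclusions, and JSON schema are assumptions, not the authors' materials.

```python
# Sketch of a machine-readable codebook entry plus a structured, chain-of-thought
# style prompt. All names and wording below are illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI()

MACHINE_READABLE_CODE = {
    "name": "ECONOMIC_HARDSHIP",
    "definition": "The passage explicitly mentions lack of money, debt, "
                  "unemployment, or inability to afford goods or services.",
    "exclusions": "General complaints about society without a personal financial detail.",
}

def code_passage(passage: str) -> dict:
    """Apply one code to one passage and return the model's reasoning and decision."""
    prompt = (
        "You are applying one qualitative code to one passage.\n"
        f"Code name: {MACHINE_READABLE_CODE['name']}\n"
        f"Definition: {MACHINE_READABLE_CODE['definition']}\n"
        f"Do NOT apply the code when: {MACHINE_READABLE_CODE['exclusions']}\n\n"
        f"Passage:\n{passage}\n\n"
        "First reason step by step about whether the definition is met, then answer. "
        "Respond as JSON with keys 'reasoning' and 'applies' (true/false)."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
        temperature=0,
    )
    return json.loads(response.choices[0].message.content)
```

Asking for the reasoning alongside the decision makes the automated step auditable, which is one way to preserve space for the critical reflection the authors emphasize.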
  2. Large Language Models (LLMs) have gained attention in research and industry, aiming to streamline processes and enhance text analysis performance. Thematic Analysis (TA), a prevalent qualitative method for analyzing interview content, often requires at least two human experts to review and analyze data. This study demonstrates the feasibility of LLM-Assisted Thematic Analysis (LATA) using GPT-4 and Gemini. Specifically, we conducted semi-structured interviews with 14 researchers to gather insights on their experiences generating and analyzing Online Social Network (OSN) communications datasets. Following Braun and Clarke's six-phase TA framework with an inductive approach, we initially analyzed our interview transcripts with human experts. Subsequently, we iteratively designed prompts to guide LLMs through a similar process. We compare and discuss the manually analyzed outcomes with responses generated by LLMs and achieve a cosine similarity score up to 0.76, demonstrating a promising prospect for LATA. Additionally, the study delves into researchers' experiences navigating the complexities of collecting and analyzing OSN data, offering recommendations for future research and application designers. 
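The similarity figure reported above can be pictured with a few lines of code. The sketch below computes cosine similarity between a human-written and an LLM-generated theme summary using TF-IDF vectors; this representation is a simple stand-in and not necessarily the one used in the study, and the summaries are invented examples.

```python
# Cosine similarity between a human theme summary and an LLM-generated one.
# TF-IDF is used here only for illustration; the texts are made-up examples.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

human_theme = ("Researchers worry about ethical and legal constraints when "
               "collecting online social network data at scale.")
llm_theme = ("Participants describe ethical, legal, and platform-policy barriers "
             "to gathering large OSN datasets.")

# Fit a shared vocabulary over both summaries, then compare the two vectors.
vectors = TfidfVectorizer().fit_transform([human_theme, llm_theme])
score = cosine_similarity(vectors[0], vectors[1])[0, 0]
print(f"cosine similarity: {score:.2f}")
```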
  3. This study explores the potential of the large language model GPT-4 as an automated tool for qualitative data analysis by educational researchers, examining which techniques are most successful for different types of constructs. Specifically, we assess three different prompt engineering strategies — Zero-shot, Few-shot, and Few-shot with contextual information — as well as the use of embeddings. We do so in the context of qualitatively coding three distinct educational datasets: Algebra I semi-personalized tutoring session transcripts, student observations in a game-based learning environment, and debugging behaviours in an introductory programming course. We evaluated the performance of each approach based on its inter-rater agreement with human coders and explored how different methods vary in effectiveness depending on a construct’s degree of clarity, concreteness, objectivity, granularity, and specificity. Our findings suggest that while GPT-4 can code a broad range of constructs, no single method consistently outperforms the others, and the selection of a particular method should be tailored to the specific properties of the construct and context being analyzed. We also found that GPT-4 has the most difficulty with the same constructs that human coders find more difficult to reach inter-rater reliability on.
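Inter-rater agreement between GPT-4 and human coders is commonly summarized with a chance-corrected statistic such as Cohen's kappa. The sketch below shows that computation for one construct; the binary labels are placeholders standing in for parsed model outputs and human codes, and the exact procedure is not specified by the abstract.

```python
# Agreement between human codes and parsed GPT-4 codes for one construct.
# The label vectors are placeholder examples, not data from the study.
from sklearn.metrics import cohen_kappa_score

human_labels = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]   # human coder: construct present (1) / absent (0)
gpt4_labels  = [1, 0, 1, 0, 0, 0, 1, 0, 1, 1]   # GPT-4 output mapped to the same binary scheme

kappa = cohen_kappa_score(human_labels, gpt4_labels)
print(f"Cohen's kappa: {kappa:.2f}")
```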
  4. Does prompting a large language model (LLM) like GPT-3 with explanations improve in-context learning? We study this question on two NLP tasks that involve reasoning over text, namely question answering and natural language inference. We test the performance of four LLMs on three textual reasoning datasets using prompts that include explanations in multiple different styles. For these tasks, we find that including explanations in the prompts for OPT, GPT-3 (davinci), and InstructGPT (text-davinci-001) only yields small to moderate accuracy improvements over standard few-shot learning. However, text-davinci-002 is able to benefit more substantially. We further show that explanations generated by the LLMs may not entail the models' predictions nor be factually grounded in the input, even on simple tasks with extractive explanations. However, these flawed explanations can still be useful as a way to verify LLMs' predictions post-hoc. Through analysis in our three settings, we show that explanations judged by humans to be good (logically consistent with the input and the prediction) are more likely to co-occur with accurate predictions. Following these observations, we train calibrators using automatically extracted scores that assess the reliability of explanations, allowing us to improve performance post-hoc across all of our datasets.
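The post-hoc calibration idea can be pictured as a small classifier trained on features of each explanation to predict whether the accompanying answer is correct. The sketch below uses explanation length and lexical overlap with the input as stand-in features and a tiny invented training set; neither the features nor the data are taken from the paper.

```python
# Illustrative calibrator: predict answer reliability from explanation features.
# Feature choices and the toy training set are assumptions for demonstration only.
from sklearn.linear_model import LogisticRegression

def features(explanation: str, input_text: str) -> list:
    exp_tokens = set(explanation.lower().split())
    inp_tokens = set(input_text.lower().split())
    overlap = len(exp_tokens & inp_tokens) / max(len(exp_tokens), 1)
    return [float(len(exp_tokens)), overlap]

# Toy training set: (input passage, model explanation, 1 if the model's answer was correct).
examples = [
    ("the meeting was moved to friday", "the passage says the meeting is on friday", 1),
    ("she bought three apples", "apples are a kind of fruit grown on trees", 0),
    ("the train departs at 9am", "the text states a 9am departure time", 1),
    ("he studied physics in berlin", "berlin is the capital of germany", 0),
]

X = [features(expl, inp) for inp, expl, _ in examples]
y = [label for _, _, label in examples]

calibrator = LogisticRegression().fit(X, y)

# At test time, a low predicted reliability can flag an answer for human review.
score = calibrator.predict_proba(
    [features("the text states a friday meeting", "the meeting was moved to friday")]
)[0, 1]
print(f"estimated reliability: {score:.2f}")
```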
  5. In the dynamic field of educational technology, there is an increasing emphasis on incorporating artificial intelligence (AI) into educational settings. Through interviews with mentors and students, this study compares the effectiveness and reliability of AI-generated qualitative codes with human-generated codes in addressing student challenges during Innovation Competitions and Programs (ICPs), such as hackathons, idea competitions, and pitch competitions. While ICPs encourage creativity and innovation, participants often encounter significant challenges. The methodology involves analyzing qualitative responses about the challenges faced by students involved in the ICPs. Preliminary findings suggest that AI-generated codes offer improved efficiency and objectivity, while human evaluators provide crucial nuanced insights into student challenges. The results showed a high level of agreement between human and AI-generated overall themes that highlight student challenges during ICPs. However, a low agreement was found in mapping AI-generated codes to transcript files. Based on the identified codes and themes, several recommendations were made to make ICPs more inclusive learning events for all students.