This content will become publicly available on February 26, 2026

Title: Artificial intelligence in qualitative analysis: a practical guide and reflections based on results from using GPT to analyze interview data in a substance use program
Language-based text provides valuable insights into people’s lived experiences. While traditional qualitative analysis is used to capture these nuances, new paradigms are needed to scale qualitative research effectively. Artificial intelligence presents an unprecedented opportunity to expand the scale of analysis for obtaining such nuances. The study tests the application of GPT-4, a large language model, in qualitative data analysis using an existing set of text data derived from 60 qualitative interviews. Specifically, the study provides a practical guide for social and behavioral researchers, illustrating core elements and key processes, demonstrating its reliability by comparing GPT-generated codes with researchers’ codes, and evaluating its capacity for theory-driven qualitative analysis. The study followed a three-step approach: (1) prompt engineering, (2) reliability assessment by comparison of GPT-generated codes with researchers’ codes, and (3) evaluation of theory-driven thematic analysis on psychological constructs. The study underscores the utility of GPT’s capabilities in coding and analyzing text data with established qualitative methods while highlighting the need for qualitative expertise to guide GPT applications. Recommendations for further exploration are also discussed.
Award ID(s):
2449011
PAR ID:
10590730
Author(s) / Creator(s):
;
Publisher / Repository:
Springer Nature
Date Published:
Journal Name:
Quality & Quantity
ISSN:
0033-5177
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Qualitative coding, or content analysis, is more than just labeling text: it is a reflexive interpretive practice that shapes research questions, refines theoretical insights, and illuminates subtle social dynamics. As large language models (LLMs) become increasingly adept at nuanced language tasks, questions arise about whether, and how, they can assist in large-scale coding without eroding the interpretive depth that distinguishes qualitative analysis from traditional machine learning and other quantitative approaches to natural language processing. In this paper, we present a hybrid approach that preserves hermeneutic value while incorporating LLMs to scale the application of codes to large data sets that are impractical for manual coding. Our workflow retains the traditional cycle of codebook development and refinement, adding an iterative step to adapt definitions for machine comprehension, before ultimately replacing manual with automated text categorization. We demonstrate how to rewrite code descriptions for LLM interpretation, as well as how structured prompts and prompting the model to explain its coding decisions (chain-of-thought) can substantially improve fidelity. Empirically, our case study of socio-historical codes highlights the promise of frontier AI language models to reliably interpret paragraph-long passages representative of a humanistic study. Throughout, we emphasize ethical and practical considerations, preserving space for critical reflection, and the ongoing need for human researchers’ interpretive leadership. These strategies can guide both traditional and computational scholars aiming to harness automation effectively and responsibly, maintaining the creative, reflexive rigor of qualitative coding while capitalizing on the efficiency afforded by LLMs.
  2. This study explores the potential of the large language model GPT-4 as an automated tool for qualitative data analysis by educational researchers, exploring which techniques are most successful for different types of constructs. Specifically, we assess three different prompt engineering strategies (Zero-shot, Few-shot, and Few-shot with contextual information) as well as the use of embeddings. We do so in the context of qualitatively coding three distinct educational datasets: Algebra I semi-personalized tutoring session transcripts, student observations in a game-based learning environment, and debugging behaviours in an introductory programming course. We evaluated the performance of each approach based on its inter-rater agreement with human coders and explored how different methods vary in effectiveness depending on a construct’s degree of clarity, concreteness, objectivity, granularity, and specificity. Our findings suggest that while GPT-4 can code a broad range of constructs, no single method consistently outperforms the others, and the selection of a particular method should be tailored to the specific properties of the construct and context being analyzed. We also found that GPT-4 has the most difficulty with the same constructs that human coders find more difficult to reach inter-rater reliability on.
  3. Does prompting a large language model (LLM) like GPT-3 with explanations improve in-context learning? We study this question on two NLP tasks that involve reasoning over text, namely question answering and natural language inference. We test the performance of four LLMs on three textual reasoning datasets using prompts that include explanations in multiple different styles. For these tasks, we find that including explanations in the prompts for OPT, GPT-3 (davinci), and InstructGPT (text-davinci-001) only yields small to moderate accuracy improvements over standard few-shot learning. However, text-davinci-002 is able to benefit more substantially. We further show that explanations generated by the LLMs may not entail the models' predictions nor be factually grounded in the input, even on simple tasks with extractive explanations. However, these flawed explanations can still be useful as a way to verify LLMs' predictions post-hoc. Through analysis in our three settings, we show that explanations judged by humans to be good (logically consistent with the input and the prediction) are more likely to co-occur with accurate predictions. Following these observations, we train calibrators using automatically extracted scores that assess the reliability of explanations, allowing us to improve performance post-hoc across all of our datasets.
  4. In the dynamic field of educational technology, there is an increasing emphasis on incorporating artificial intelligence (AI) into educational settings. Through interviews with mentors and students, this study compares the effectiveness and reliability of AI-generated qualitative codes with human-generated codes in addressing student challenges during Innovation Competitions and Programs (ICPs), such as hackathons, idea competitions, and pitch competitions. While ICPs encourage creativity and innovation, participants often encounter significant challenges. The methodology involves analyzing qualitative responses to student challenges from students involved in the ICPs. Preliminary findings suggest that AI-generated codes offer improved efficiency and objectivity, while human evaluators provide crucial nuanced insights into student challenges. The results showed a high level of agreement between human and AI-generated overall themes that highlight student challenges during ICPs. However, low agreement was found in mapping AI-generated codes to transcript files. Based on the identified codes and themes, several recommendations were made to make ICPs more inclusive learning events for all students.
  5. In this theory paper, we set out to consider, as a matter of methodological interest, the use of quantitative measures of inter-coder reliability (e.g., percentage agreement, correlation, Cohen’s Kappa) as necessary and/or sufficient correlates for quality within qualitative research in engineering education. It is well known that the phrase qualitative research represents a diverse body of scholarship conducted across a range of epistemological viewpoints and methodologies. Given this diversity, we concur with those who state that it is ill-advised to propose recipes or stipulate requirements for achieving qualitative research validity and reliability. Yet, as qualitative researchers ourselves, we repeatedly find the need to communicate the validity and reliability, or quality, of our work to different stakeholders, including funding agencies and the public. One method for demonstrating quality, which is increasingly used in qualitative research in engineering education, is the practice of reporting quantitative measures of agreement between two or more people who code the same qualitative dataset. In this theory paper, we address this common practice in two ways. First, we identify instances in which inter-coder reliability measures may not be appropriate or adequate for establishing quality in qualitative research. We query research that suggests that the numerical measure itself is the goal of qualitative analysis, rather than the depth and texture of the interpretations that are revealed. Second, we identify complexities or methodological questions that may arise during the process of establishing inter-coder reliability, which are not often addressed in empirical publications. To achieve these purposes, in this paper we will ground our work in a review of qualitative articles, published in the Journal of Engineering Education, that have employed inter-rater or inter-coder reliability as evidence of research validity.
In our review, we will examine the disparate measures and scores (from 40% agreement to 97% agreement) used as evidence of quality, as well as the theoretical perspectives within which these measures have been employed. Then, using our own comparative case study research as an example, we will highlight the questions and the challenges that we faced as we worked to meet rigorous standards of evidence in our qualitative coding analysis. We will explain the processes we undertook and the challenges we faced as we assigned codes to a large qualitative data set approached from a postpositivist perspective. We will situate these coding processes within the larger methodological literature and, in light of contrasting literature, we will describe the principled decisions we made while coding our own data. We will use this review of qualitative research and our own qualitative research experiences to elucidate inconsistencies and unarticulated issues related to evidence for qualitative validity as a means to generate further discussion regarding quality in qualitative coding processes.
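Several of the abstracts above compare human-assigned and model-assigned codes using inter-coder reliability measures such as percentage agreement and Cohen’s Kappa. As a rough illustration of what those two measures compute, here is a minimal Python sketch. It is not taken from any of the listed papers: the function names, the code labels, and the two example code sequences are hypothetical, and it assumes each coder assigns exactly one code per text segment.

```python
from collections import Counter


def percent_agreement(coder_a, coder_b):
    """Fraction of segments on which the two coders assigned the same code."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("Both coders must label the same non-empty set of segments.")
    matches = sum(a == b for a, b in zip(coder_a, coder_b))
    return matches / len(coder_a)


def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(coder_a)
    p_observed = percent_agreement(coder_a, coder_b)

    # Chance agreement: for each code, the product of the two coders'
    # marginal proportions, summed over all codes.
    counts_a = Counter(coder_a)
    counts_b = Counter(coder_b)
    p_expected = sum(counts_a[code] * counts_b[code] for code in counts_a) / (n * n)

    if p_expected == 1.0:
        # Both coders used one identical code throughout; kappa is undefined.
        # Returning 1.0 here is a convention chosen for this sketch.
        return 1.0
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical codes for eight interview segments (illustrative data only).
human_codes = ["barrier", "support", "barrier", "support",
               "barrier", "other", "support", "barrier"]
model_codes = ["barrier", "support", "support", "support",
               "barrier", "other", "barrier", "barrier"]

print(f"Percent agreement: {percent_agreement(human_codes, model_codes):.2f}")
print(f"Cohen's kappa:     {cohens_kappa(human_codes, model_codes):.2f}")
```

Because kappa discounts the agreement expected by chance, it is lower than the raw percentage agreement whenever the coders disagree at all, which is one reason the two figures are not interchangeable when they are reported as evidence of quality.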