
This content will become publicly available on April 8, 2026

Title: Streamlining Field Note Analysis: Leveraging GPT for Further Insights
As an integral part of qualitative research inquiry, field notes provide important data from researchers embedded in research sites. However, field notes can vary significantly, influenced by the researchers' immersion in the field, prior knowledge, beliefs, interests, and perspectives. As a consequence, their interpretation presents significant challenges. This study offers a preliminary investigation into the potential of using large language models to assist researchers with the analysis and interpretation of field note data. Our methodology consisted of two phases. First, a researcher deductively coded field notes from six classroom implementations of a novel elementary-level mathematics curriculum. In the second phase, we prompted ChatGPT-4 to code the same field notes, using the codebook, definitions, examples, and deductive coding approach employed by the researcher. We also prompted ChatGPT to provide justifications for its coding decisions. We then calculated agreements and disagreements between ChatGPT and the researcher, organized the data in a contingency table, computed Cohen's Kappa, and structured the data into a confusion matrix; using the researcher's coding as the "gold standard", we calculated performance measures, specifically accuracy, precision, recall, and F1 score. Our findings revealed that while the researcher and ChatGPT appeared to generally agree on how frequently the different codes were applied, overall agreement, as measured by Cohen's Kappa, was low. In contrast, applying measures from information science at the code level revealed more nuanced results. Moreover, coupled with ChatGPT's justifications of its coding decisions, these findings provided insights that can help support the iterative improvement of codebooks.
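The agreement analysis described above can be reproduced with standard tooling. The following is a minimal sketch, not the authors' code, using scikit-learn to compute Cohen's Kappa, a confusion matrix, and code-level precision, recall, and F1 with the researcher's coding treated as the gold standard; the code labels and excerpts are illustrative placeholders.

```python
# Minimal sketch (not the authors' code): comparing a researcher's codes with
# ChatGPT's codes for the same field-note excerpts, treating the researcher's
# coding as the "gold standard". Code labels below are placeholders.
from sklearn.metrics import (accuracy_score, cohen_kappa_score,
                             confusion_matrix, precision_recall_fscore_support)

researcher = ["engagement", "confusion", "engagement", "off-task", "confusion"]
chatgpt    = ["engagement", "engagement", "engagement", "off-task", "confusion"]
labels = sorted(set(researcher) | set(chatgpt))

# Overall agreement and chance-corrected agreement
print("Accuracy:", accuracy_score(researcher, chatgpt))
print("Cohen's Kappa:", cohen_kappa_score(researcher, chatgpt))

# Confusion matrix (rows: researcher, columns: ChatGPT)
print(confusion_matrix(researcher, chatgpt, labels=labels))

# Code-level precision, recall, and F1 against the researcher's coding
precision, recall, f1, _ = precision_recall_fscore_support(
    researcher, chatgpt, labels=labels, zero_division=0)
for lab, p, r, f in zip(labels, precision, recall, f1):
    print(f"{lab}: precision={p:.2f} recall={r:.2f} F1={f:.2f}")
```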
Award ID(s):
2031382
PAR ID:
10617343
Author(s) / Creator(s):
Publisher / Repository:
American Educational Research Association
Date Published:
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1.
    A qualitative study was conducted to understand how middle and high school students with visual impairments (VI) engage in Science, Technology, Engineering and Mathematics (STEM). The Readiness Academy, a Project-Based Learning (PBL) intervention, was designed to provide a week-long, immersive, outdoor, and inquiry-based science education program to students with VI. We analyzed 187 photographs, camp associate intern notes, and researcher memos first using emotion coding, followed by process coding to structure initial codes and categories into seven research activities. We used axial coding as a secondary cycle coding method to determine four consistent themes across all research activities: apprenticeship, collaboration, accessibility, and independence. We found that the inclusion of purposeful accessibility, such as assistive technology and multisensory experiences, supported how students with VI engaged in STEM education. The findings reflect how students dynamically fulfilled roles as apprentices, collaborative members, and independent researchers within the program’s context of PBL and outdoor science education. 
  2. In this theory paper, we set out to consider, as a matter of methodological interest, the use of quantitative measures of inter-coder reliability (e.g., percentage agreement, correlation, Cohen's Kappa, etc.) as necessary and/or sufficient correlates for quality within qualitative research in engineering education. It is well known that the phrase qualitative research represents a diverse body of scholarship conducted across a range of epistemological viewpoints and methodologies. Given this diversity, we concur with those who state that it is ill-advised to propose recipes or stipulate requirements for achieving qualitative research validity and reliability. Yet, as qualitative researchers ourselves, we repeatedly find the need to communicate the validity and reliability, or quality, of our work to different stakeholders, including funding agencies and the public. One method for demonstrating quality, which is increasingly used in qualitative research in engineering education, is the practice of reporting quantitative measures of agreement between two or more people who code the same qualitative dataset. In this theory paper, we address this common practice in two ways. First, we identify instances in which inter-coder reliability measures may not be appropriate or adequate for establishing quality in qualitative research. We question research that suggests that the numerical measure itself is the goal of qualitative analysis, rather than the depth and texture of the interpretations that are revealed. Second, we identify complexities or methodological questions that may arise during the process of establishing inter-coder reliability, which are not often addressed in empirical publications. To achieve these purposes, we ground our work in a review of qualitative articles, published in the Journal of Engineering Education, that have employed inter-rater or inter-coder reliability as evidence of research validity. In our review, we examine the disparate measures and scores (from 40% agreement to 97% agreement) used as evidence of quality, as well as the theoretical perspectives within which these measures have been employed. Then, using our own comparative case study research as an example, we highlight the questions and challenges that we faced as we worked to meet rigorous standards of evidence in our qualitative coding analysis. We explain the processes we undertook and the challenges we faced as we assigned codes to a large qualitative dataset approached from a post-positivist perspective. We situate these coding processes within the larger methodological literature and, in light of contrasting literature, describe the principled decisions we made while coding our own data. We use this review of qualitative research and our own qualitative research experiences to elucidate inconsistencies and unarticulated issues related to evidence for qualitative validity, as a means to generate further discussion regarding quality in qualitative coding processes.
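As a brief illustration of why a single agreement number can mislead, the sketch below (invented data, not drawn from the article) shows two coders reaching 80% raw agreement while Cohen's Kappa comes out slightly negative once chance agreement is removed.

```python
# Minimal sketch: percentage agreement vs. Cohen's Kappa on invented codes.
# Kappa corrects for the agreement expected by chance when one code dominates.
from collections import Counter

coder_a = ["A"] * 45 + ["B"] * 5
coder_b = ["A"] * 40 + ["B"] * 5 + ["A"] * 5   # mostly "A", like coder_a

n = len(coder_a)
observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n

# Expected agreement by chance, from each coder's marginal code frequencies
freq_a, freq_b = Counter(coder_a), Counter(coder_b)
expected = sum((freq_a[c] / n) * (freq_b[c] / n)
               for c in set(coder_a) | set(coder_b))

kappa = (observed - expected) / (1 - expected)
print(f"agreement = {observed:.2f}, expected by chance = {expected:.2f}, kappa = {kappa:.2f}")
```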
  3. Abstract. Background: Systematic literature reviews (SLRs) are foundational for synthesizing evidence across diverse fields and are especially important in guiding research and practice in health and biomedical sciences. However, they are labor intensive due to manual data extraction from multiple studies. As large language models (LLMs) gain attention for their potential to automate research tasks and extract basic information, understanding their ability to accurately extract explicit data from academic papers is critical for advancing SLRs. Objective: Our study aimed to explore the capability of LLMs to extract both explicitly outlined study characteristics and deeper, more contextual information requiring nuanced evaluations, using ChatGPT (GPT-4). Methods: We screened the full text of a sample of COVID-19 modeling studies and analyzed three basic measures of study settings (ie, analysis location, modeling approach, and analyzed interventions) and three complex measures of behavioral components in models (ie, mobility, risk perception, and compliance). To extract data on these measures, two researchers independently extracted 60 data elements using manual coding and compared them with the responses from ChatGPT to 420 queries spanning 7 iterations. Results: ChatGPT's accuracy improved as prompts were refined, showing improvements of 33% and 23% between the initial and final iterations for extracting study settings and behavioral components, respectively. In the initial prompts, 26 (43.3%) of 60 ChatGPT responses were correct. However, in the final iteration, ChatGPT extracted 43 (71.7%) of the 60 data elements, showing better performance in extracting explicitly stated study settings (28/30, 93.3%) than in extracting subjective behavioral components (15/30, 50%). Nonetheless, the varying accuracy across measures highlighted its limitations. Conclusions: Our findings underscore LLMs' utility in extracting basic as well as explicit data in SLRs by using effective prompts. However, the results reveal significant limitations in handling nuanced, subjective criteria, emphasizing the necessity for human oversight.
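The following is a minimal sketch of the kind of extraction query described above, assuming the OpenAI Python SDK; the prompt wording, data-element names, and model identifier are placeholders rather than the study's actual prompts or configuration.

```python
# Minimal sketch of an LLM-based extraction query (placeholders, not the
# study's prompts). Accuracy would then be the share of extracted data
# elements that match the researchers' manual coding.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "From the study text below, extract: (1) analysis location, "
    "(2) modeling approach, (3) analyzed interventions. "
    "Answer 'not reported' when a field is not explicitly stated.\n\n{text}"
)

def extract_settings(full_text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": PROMPT.format(text=full_text)}],
        temperature=0,
    )
    return response.choices[0].message.content
```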
  4. Fields in the social sciences, such as education research, have started to expand the use of computer-based research methods to supplement traditional research approaches. Natural language processing techniques, such as topic modeling, may support qualitative data analysis by providing early categories that researchers may interpret and refine. This study contributes to this body of research and answers the following research questions: (RQ1) What is the relative coverage of the latent Dirichlet allocation (LDA) topic model and human coding in terms of the breadth of the topics/themes extracted from the text collection? (RQ2) What is the relative depth or level of detail among identified topics using LDA topic models and human coding approaches? A dataset of student reflections was qualitatively analyzed using LDA topic modeling and human coding approaches, and the results were compared. The findings suggest that topic models can provide reliable coverage and depth of themes present in a textual collection comparable to human coding but require manual interpretation of topics. The breadth and depth of human coding output is heavily dependent on the expertise of coders and the size of the collection; these factors are better handled in the topic modeling approach. 
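For readers unfamiliar with the workflow, the following is a minimal sketch of an LDA pass over short reflection texts using scikit-learn; the reflections and parameter choices are invented for illustration and are not the study's data or settings.

```python
# Minimal sketch of LDA topic modeling over short reflection texts.
# The researcher would then interpret and refine the emitted topics.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

reflections = [
    "I struggled with fractions but the group activity helped",
    "Working with my partner made the word problems easier",
    "The teacher's examples clarified how to set up equations",
]

vectorizer = CountVectorizer(stop_words="english")
doc_term = vectorizer.fit_transform(reflections)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(doc_term)

# Print the top words per topic as candidate themes
terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = [terms[i] for i in weights.argsort()[::-1][:5]]
    print(f"topic {k}: {', '.join(top)}")
```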
  5. Skarnitzl, R. (Ed.)
    While motion capture is rapidly becoming the gold standard for research on the intricacies of co-speech gesture and its relationship to speech, traditional marker-based motion capture technology is not always feasible, meaning researchers must code video data manually. We compare two methods for coding co-speech gestures of the hands and arms in video data of spontaneous speech: manual coding and semi-automated coding using OpenPose, a markerless motion capture software. We provide a comparison of the temporal alignment of gesture apexes based on video recordings of interviews with speakers of Medumba (Grassfields Bantu). Our results show a close correlation between the computationally calculated apexes and our hand-annotated apexes, suggesting that both methods are equally valid for coding video data. The use of markerless motion capture technology for gesture coding will enable more rapid coding of manual gestures, while still allowing 
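A minimal sketch of how the temporal alignment of apexes might be quantified, assuming apex times from manual annotation and from OpenPose-based estimation are already available for the same gestures; the values below are invented for illustration.

```python
# Minimal sketch: correlate hand-annotated gesture apex times with
# computationally derived apex times (all values are invented).
from scipy.stats import pearsonr

manual_apexes   = [1.24, 3.80, 5.12, 7.95, 10.40]   # hand-annotated, seconds
openpose_apexes = [1.30, 3.76, 5.20, 8.01, 10.35]   # OpenPose-derived, seconds

r, p_value = pearsonr(manual_apexes, openpose_apexes)
print(f"Pearson r = {r:.3f} (p = {p_value:.3g})")

# Mean absolute offset as a simple complementary measure of alignment
offsets = [abs(m - o) for m, o in zip(manual_apexes, openpose_apexes)]
print(f"mean absolute offset = {sum(offsets) / len(offsets):.3f} s")
```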