
Title: LSTM Neural Network Assisted Regex Development for Qualitative Coding
Regular expression (regex) based automated qualitative coding helps reduce researchers' effort in manually coding text data without sacrificing the transparency of the coding process. However, researchers using regex-based approaches struggle with low recall, or a high false negative rate, during classifier development. Advanced natural language processing techniques, such as topic modeling, latent semantic analysis, and neural network classification models, help address this problem in various ways. The latest advance in this direction is the discovery of the so-called negative reversion set (NRS), in which false negative items appear more frequently than in the negative set as a whole. This helps regex classifier developers identify missing items more quickly and thus improve classification recall. This paper simulates the use of the NRS in realistic coding scenarios and compares the number of items requiring manual coding under NRS sampling and random sampling during classifier refinement. Results on a data set with 50,818 items and six associated qualitative codes show that, on average, NRS sampling reduces the required manual coding by 50% to 63% compared with random sampling.
Award ID(s): 2100320
PAR ID: 10354430
Author(s) / Creator(s): ; ; ;
Editor(s): Barany, A.; Damsa, C.
Date Published:
Journal Name: Advances in Quantitative Ethnography: Fourth International Conference, International Conference on Quantitative Ethnography 2022
Format(s): Medium: X
Sponsoring Org: National Science Foundation
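The NRS-versus-random-sampling comparison described in the abstract can be illustrated with a small Python sketch. It assumes a regex classifier's negative predictions and per-item scores from a trained LSTM are already available; the threshold and function names below are illustrative assumptions, not the paper's implementation.

```python
import random

def negative_reversion_set(regex_negative_ids, lstm_scores, threshold=0.5):
    """Regex-negative items that the (assumed) LSTM scores as likely positive."""
    return [i for i in regex_negative_ids if lstm_scores[i] >= threshold]

def sample_for_manual_coding(candidate_ids, budget, seed=0):
    """Draw up to `budget` items for human review."""
    rng = random.Random(seed)
    return rng.sample(candidate_ids, min(budget, len(candidate_ids)))

def false_negatives_found(sampled_ids, true_labels):
    """Count sampled items that are actually positive, i.e., regex misses."""
    return sum(1 for i in sampled_ids if true_labels[i])

# NRS sampling draws review items from the reversion set; random sampling draws
# from all regex negatives. When false negatives are concentrated in the NRS,
# fewer manually coded items are needed to surface the same number of misses.
```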
More Like this
  1. Mitrovic, A.; Bosch, N. (Ed.)
    Regular expression (regex) coding has advantages for text analysis. Humans are often able to quickly construct intelligible coding rules with high precision; that is, researchers can identify words and word patterns that correctly classify examples of a particular concept. It is also usually easy to identify false positives and refine the regex classifier so that positive items are accurately captured. Ensuring that a regex list is complete is a bigger challenge, however, because the concepts to be identified are often sparsely distributed in the data, which makes it difficult to find examples of false negatives. For this reason, regex-based classifiers tend to suffer from low recall; that is, they often miss items that should be classified as positive. In this paper, we provide a neural network solution to this problem by identifying a negative reversion set, in which false negative items occur much more frequently than in the data set as a whole. The regex classifier can thus be improved more quickly by adding missing regexes based on the false negatives found in the negative reversion set. This study used an existing data set collected from a simulation-based learning environment for which researchers had previously defined six codes and developed classifiers with validated regex lists. We randomly constructed incomplete (partial) regex lists and used neural network models to identify negative reversion sets in which the frequency of false negatives increased from a range of 3%-8% in the full data set to a range of 12%-52% in the negative reversion set. Based on this finding, we propose an interactive coding mechanism in which human-developed regex classifiers provide input for training machine learning algorithms, and the machine learning algorithms "smartly" select highly suspected false negative items for humans to review, so that regex classifiers can be developed more quickly.
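To make the regex-list idea above concrete, the sketch below classifies an item as positive if any pattern in a deliberately incomplete, made-up list matches; items worded in ways the list does not anticipate become false negatives, which is the recall problem the negative reversion set is meant to expose.

```python
import re

# Hypothetical, intentionally incomplete regex list for an example code such as
# "asks a question" -- real lists are developed and validated by researchers.
QUESTION_PATTERNS = [
    r"\bwhy\b",
    r"\bhow (do|does|did)\b",
    r"\?\s*$",
]

def regex_classify(text, patterns=QUESTION_PATTERNS):
    """Positive if any regex in the list matches; negative otherwise."""
    return any(re.search(p, text, flags=re.IGNORECASE) for p in patterns)

print(regex_classify("Why does the pump overheat?"))  # True  (captured)
print(regex_classify("I wonder what causes that"))    # False (a false negative)
```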
  2. The quality of parent–child interaction is critical for child cognitive development. The Dyadic Parent–Child Interaction Coding System (DPICS) is commonly used to assess parent and child behaviors. However, manual annotation of DPICS codes by parent–child interaction therapists is a time-consuming task. To assist therapists in the coding task, researchers have begun to explore the use of artificial intelligence in natural language processing to classify DPICS codes automatically. In this study, we utilized datasets from the DPICS book manual, five families, and an open-source PCIT dataset. To train DPICS code classifiers, we fine-tuned the pre-trained RoBERTa model. Our study shows that fine-tuning the pre-trained RoBERTa model achieves the highest results among the methods compared for sentence-based DPICS code classification. For the DPICS manual dataset, the overall accuracy was 72.3% (72.2% macro-precision, 70.5% macro-recall, and 69.6% macro-F-score). For the PCIT dataset, the overall accuracy was 79.8% (80.4% macro-precision, 79.7% macro-recall, and 79.8% macro-F-score), surpassing the previous best result of 78.3% accuracy (79% precision, 77% recall) averaged over the eight DPICS classes. These results show that fine-tuning the pre-trained RoBERTa model can provide valuable assistance to experts in the labeling process.
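A minimal sentence-classification fine-tuning sketch using the Hugging Face transformers library is shown below. The example sentences, the eight-class label space, and the hyperparameters are placeholders, not the study's actual data or configuration.

```python
import torch
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

# Placeholder data: sentences with integer labels for the eight DPICS classes.
train_texts = ["Good job putting the blocks away.", "Put that down right now."]
train_labels = [0, 1]

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=8)

class SentenceDataset(torch.utils.data.Dataset):
    """Tokenized sentences paired with their code labels."""
    def __init__(self, texts, labels):
        self.enc = tokenizer(texts, truncation=True, padding=True)
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

args = TrainingArguments(output_dir="dpics_roberta", num_train_epochs=3,
                         per_device_train_batch_size=16)
trainer = Trainer(model=model, args=args,
                  train_dataset=SentenceDataset(train_texts, train_labels))
trainer.train()
```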
  3. This research proposes a framework for categorizing radial tire mode shapes using machine learning (ML) based classification and feature-recognition algorithms, advancing the development of a digital twin for tire performance analysis. Tire mode shape categorization is required to identify modal features in a specific frequency range so that driving performance can be maximized and safety ensured. However, the categorization work requires substantial manual effort to interpret the modes. This study therefore proposes an ML-based classification tool to replace the conventional categorization process, with two primary objectives: (1) create a database by categorizing the tire mode shapes based on the identified features, and (2) develop an ML-based surrogate model that classifies the tire mode shapes without manual effort. The feature map of each tire mode shape is built with the Zernike annular moment descriptor (ZAMD). The mode shapes are categorized using the correlation values derived from the modal assurance criterion (MAC) over all ZAMD values for each tire mode shape, and the appropriate labels are created accordingly. Decision tree, random forest, and XGBoost algorithms, representative supervised-learning algorithms for classification, are implemented for surrogate model development. The best-performing classifier categorizes the mode shapes without any manual effort, with a high accuracy of 99.5%.
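The modal assurance criterion (MAC) used above has the standard form MAC(φi, φj) = |φiᴴφj|² / ((φiᴴφi)(φjᴴφj)). A minimal sketch of computing MAC-based correlations from ZAMD feature vectors and feeding them to an off-the-shelf classifier follows; the ZAMD extraction itself is omitted, and all variable names are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def mac(phi_i, phi_j):
    """Modal assurance criterion between two feature (or mode shape) vectors."""
    num = np.abs(np.vdot(phi_i, phi_j)) ** 2            # vdot conjugates the first argument
    den = np.real(np.vdot(phi_i, phi_i) * np.vdot(phi_j, phi_j))
    return num / den

def mac_features(zamd_vectors, reference_vectors):
    """Correlation of each mode's ZAMD vector against labeled reference vectors."""
    return np.array([[mac(v, r) for r in reference_vectors] for v in zamd_vectors])

# Hypothetical usage:
#   X = mac_features(zamd_vectors, reference_vectors)
#   clf = RandomForestClassifier().fit(X, category_labels)
```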
  4.
    We systematically compared two coding approaches for generating training datasets for machine learning (ML): (i) a holistic approach based on learning progression levels, and (ii) a dichotomous, analytic approach coding multiple concepts in student reasoning, deconstructed from the holistic rubrics. We evaluated four constructed-response assessment items for undergraduate physiology, each targeting five levels of a developing flux learning progression in an ion context. Human-coded datasets were used to train two ML models: (i) an ensemble of eight classification algorithms implemented in the Constructed Response Classifier (CRC), and (ii) a single classification algorithm implemented in LightSide Researcher's Workbench. Human coding agreement on approximately 700 student responses per item was high for both approaches, with Cohen's kappas ranging from 0.75 to 0.87 for holistic scoring and from 0.78 to 0.89 for analytic composite scoring. ML model performance varied across items and rubric type. For two items, training sets from both coding approaches produced similarly accurate ML models, with differences in Cohen's kappa between machine and human scores of 0.002 and 0.041. For the other two items, ML models trained on analytically coded responses and combined into a composite score achieved better performance than models trained on holistic scores, with increases in Cohen's kappa of 0.043 and 0.117. These items used a more complex scenario involving the movement of two ions; it may be that analytic coding helps unpack this additional complexity.
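Cohen's kappa, the agreement statistic reported throughout this comparison, corrects raw agreement for chance. A minimal computation with scikit-learn is sketched below; the scores are made up, standing in for human and machine labels on the same responses.

```python
from sklearn.metrics import cohen_kappa_score

# Placeholder labels: learning-progression levels assigned by a human coder
# and predicted by a trained model for the same eight responses.
human_scores   = [1, 2, 2, 3, 4, 5, 3, 2]
machine_scores = [1, 2, 3, 3, 4, 5, 3, 1]

print(cohen_kappa_score(human_scores, machine_scores))
```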
  5. Rotello, C. (Ed.)
    Position-specific intrusions of items from prior lists are rare but important phenomena that distinguish broad classes of theory in serial memory. They are uniquely predicted by position coding theories, which assume that items on all lists are associated with the same set of codes representing their positions. Activating a position code activates the items associated with it in the current and prior lists in proportion to their distance from the activated position; thus, prior list intrusions are most likely to come from the coded position. Alternative "item dependent" theories, based on associations between items and contexts built from items, have difficulty accounting for the position specificity of prior list intrusions. We tested the position coding account with a position-cued recognition task designed to produce prior list interference. Cuing a position should activate a position code, which should activate items in nearby positions in the current and prior lists. We presented lures from the prior list to test for position-specific activation in response time and error rate; lures from nearby positions should interfere more. We found no evidence for such interference in 10 experiments, falsifying the position coding prediction. We then ran two serial recall experiments with the same materials and found position-specific prior list intrusions. These results challenge all theories of serial memory: position coding theories can explain the prior list intrusions in serial recall but not the absence of prior list interference in cued recognition, while item dependent theories can explain the absence of prior list interference in cued recognition but not the occurrence of prior list intrusions in serial recall.