skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: LSTM Neural Network Assisted Regex Development for Qualitative Coding
Regular expression (regex) based automated qualitative coding helps reduce researchers’ effort in manually coding text data, without sacrificing transparency of the coding process. However, researchers using regex based approaches struggle with low recall or high false negative rate during classifier development. Advanced natural language processing techniques, such as topic modeling, latent semantic analysis and neural network classification models help solve this problem in various ways. The latest advance in this direction is the discovery of the so called “negative reversion set (NRS)”, in which false negative items appear more frequently than in the negative set. This helps regex classifier developers more quickly identify missing items and thus improve classification recall. This paper simulates the use of NRS in real coding scenarios and compares the required manual coding items between NRS sampling and random sampling in the process of classifier refinement. The result using one data set with 50,818 items and six associated qualitative codes shows that, on average, using NRS sampling, the required manual coding size could be reduced by 50% to 63%, comparing with random sampling.  more » « less
Award ID(s):
2100320
PAR ID:
10354430
Author(s) / Creator(s):
; ; ;
Editor(s):
Barany, A.; Damsa, C.
Date Published:
Journal Name:
Advances in Quantitative Ethnography: Fourth International Conference, International Conference on Quantitative Ethnography 2022
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Mitrovic, A.; Bosch, N. (Ed.)
    Regular expression (regex) coding has advantages for text analysis. Humans are often able to quickly construct intelligible coding rules with high precision. That is, researchers can identify words and word patterns that correctly classify examples of a particular concept. And, it is often easy to identify false positives and improve the regex classifier so that the positive items are accurately captured. However, ensuring that a regex list is complete is a bigger challenge, because the concepts to be identified in data are often sparsely distributed, which makes it difficult to identify examples of \textit{false negatives}. For this reason, regex-based classifiers suffer by having low recall. That is, it often misses items that should be classified as positive. In this paper, we provide a neural network solution to this problem by identifying a \textit{negative reversion set}, in which false negative items occur much more frequently than in the data set as a whole. Thus, the regex classifier can be more quickly improved by adding missing regexes based on the false negatives found from the negative reversion set. This study used an existing data set collected from a simulation-based learning environment for which researchers had previously defined six codes and developed classifiers with validated regex lists. We randomly constructed incomplete (partial) regex lists and used neural network models to identify negative reversion sets in which the frequency of false negatives increased from a range of 3\\%-8\\% in the full data set to a range of 12\\%-52\\% in the negative reversion set. Based on this finding, we propose an interactive coding mechanism in which human-developed regex classifiers provide input for training machine learning algorithms and machine learning algorithms ``smartly" select highly suspected false negative items for human to more quickly develop regex classifiers. 
    more » « less
  2. The quality of parent–child interaction is critical for child cognitive development. The Dyadic Parent–Child Interaction Coding System (DPICS) is commonly used to assess parent and child behaviors. However, manual annotation of DPICS codes by parent–child interaction therapists is a time-consuming task. To assist therapists in the coding task, researchers have begun to explore the use of artificial intelligence in natural language processing to classify DPICS codes automatically. In this study, we utilized datasets from the DPICS book manual, five families, and an open-source PCIT dataset. To train DPICS code classifiers, we employed the pre-trained fine-tuned model RoBERTa as our learning algorithm. Our study shows that fine-tuning the pre-trained RoBERTa model achieves the highest results compared to other methods in sentence-based DPICS code classification assignments. For the DPICS manual dataset, the overall accuracy was 72.3% (72.2% macro-precision, 70.5% macro-recall, and 69.6% macro-F-score). Meanwhile, for the PCIT dataset, the overall accuracy was 79.8% (80.4% macro-precision, 79.7% macro-recall, and 79.8% macro-F-score), surpassing the previous highest results of 78.3% accuracy (79% precision, 77% recall) averaged over the eight DPICS classes. These results show that fine-tuning the pre-trained RoBERTa model could provide valuable assistance to experts in the labeling process. 
    more » « less
  3. This work investigates how different forms of input elicitation obtained from crowdsourcing can be utilized to improve the quality of inferred labels for image classification tasks, where an image must be labeled as either positive or negative depending on the presence/absence of a specified object. Five types of input elicitation methods are tested: binary classification (positive or negative); the ( x, y )-coordinate of the position participants believe a target object is located; level of confidence in binary response (on a scale from 0 to 100%); what participants believe the majority of the other participants' binary classification is; and participant's perceived difficulty level of the task (on a discrete scale). We design two crowdsourcing studies to test the performance of a variety of input elicitation methods and utilize data from over 300 participants. Various existing voting and machine learning (ML) methods are applied to make the best use of these inputs. In an effort to assess their performance on classification tasks of varying difficulty, a systematic synthetic image generation process is developed. Each generated image combines items from the MPEG-7 Core Experiment CE-Shape-1 Test Set into a single image using multiple parameters (e.g., density, transparency, etc.) and may or may not contain a target object. The difficulty of these images is validated by the performance of an automated image classification method. Experiment results suggest that more accurate results can be achieved with smaller training datasets when both the crowdsourced binary classification labels and the average of the self-reported confidence values in these labels are used as features for the ML classifiers. Moreover, when a relatively larger properly annotated dataset is available, in some cases augmenting these ML algorithms with the results (i.e., probability of outcome) from an automated classifier can achieve even higher performance than what can be obtained by using any one of the individual classifiers. Lastly, supplementary analysis of the collected data demonstrates that other performance metrics of interest, namely reduced false-negative rates, can be prioritized through special modifications of the proposed aggregation methods. 
    more » « less
  4. The overall purpose of this paper is to demonstrate how data preprocessing, training size variation, and subsampling can dynamically change the performance metrics of imbalanced text classification. The methodology encompasses using two different supervised learning classification approaches of feature engineering and data preprocessing with the use of five machine learning classifiers, five imbalanced sampling techniques, specified intervals of training and subsampling sizes, statistical analysis using R and tidyverse on a dataset of 1000 portable document format files divided into five labels from the World Health Organization Coronavirus Research Downloadable Articles of COVID-19 papers and PubMed Central databases of non-COVID-19 papers for binary classification that affects the performance metrics of precision, recall, receiver operating characteristic area under the curve, and accuracy. One approach that involves labeling rows of sentences based on regular expressions significantly improved the performance of imbalanced sampling techniques verified by performing statistical analysis using a t-test documenting performance metrics of iterations versus another approach that automatically labels the sentences based on how the documents are organized into positive and negative classes. The study demonstrates the effectiveness of ML classifiers and sampling techniques in text classification datasets, with different performance levels and class imbalance issues observed in manual and automatic methods of data processing. 
    more » « less
  5. Abstract This research proposes a framework for categorizing the radial tire mode shapes using machine learning (ML) based classification and feature recognition algorithms, advancing the development of a digital twin for tire performance analysis. Tire mode shape categorization is required to identify modal features in a specific frequency range to maximize driving performance and secure safety. However, the mode categorization work requires a lot of manual effort to interpret modes. Therefore, this study suggests an ML-based classification tool to replace the conventional categorization process with two primary objectives: (1) create a database by categorizing the tire mode shapes based on the identified features and (2) develop an ML-based surrogate model to classify the tire mode shapes without manual effort. The feature map of the tire mode shape is built with the Zernike annular moment descriptor (ZAMD). The mode shapes are categorized using the correlation value derived by the modal assurance criteria (MAC) with all ZAMD values for each tire mode shape and subsequently creating the appropriate labels. The decision tree, random forests, and XGBoost, the representative supervised-learning algorithms for classification, are implemented for surrogate model development. The best-performed classifier can categorize the mode shapes without any manual effort with a high accuracy of 99.5%. 
    more » « less