Title: Neural recall network: A neural network solution to low recall problem in regex-based qualitative coding
Regular expression (regex) coding has advantages for text analysis. Humans can often quickly construct intelligible coding rules with high precision; that is, researchers can identify words and word patterns that correctly classify examples of a particular concept. It is also often easy to identify false positives and refine the regex classifier so that positive items are accurately captured. However, ensuring that a regex list is complete is a bigger challenge, because the concepts to be identified are often sparsely distributed in the data, which makes it difficult to find examples of false negatives. For this reason, regex-based classifiers suffer from low recall: they often miss items that should be classified as positive. In this paper, we provide a neural network solution to this problem by identifying a negative reversion set, in which false negative items occur much more frequently than in the data set as a whole. The regex classifier can thus be improved more quickly by adding missing regexes based on the false negatives found in the negative reversion set. This study used an existing data set collected from a simulation-based learning environment for which researchers had previously defined six codes and developed classifiers with validated regex lists. We randomly constructed incomplete (partial) regex lists and used neural network models to identify negative reversion sets in which the frequency of false negatives increased from a range of 3%-8% in the full data set to a range of 12%-52% in the negative reversion set. Based on this finding, we propose an interactive coding mechanism in which human-developed regex classifiers provide input for training machine learning algorithms, and the machine learning algorithms "smartly" select highly suspected false negative items for humans to review, so that regex classifiers can be developed more quickly.
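The negative reversion set idea can be illustrated with a small, purely hypothetical sketch: a partial regex list labels a toy corpus, a crude word-overlap score stands in for the paper's neural network (it is NOT the authors' model), and the regex-negative items the stand-in ranks as most positive-like form the candidate set where false negatives concentrate. All data, patterns, and names below are invented for illustration.

```python
import re

# Toy corpus for a hypothetical "frustration" code.
docs = [
    "this is so frustrating",         # caught by the partial regex list
    "i am really annoyed right now",  # false negative: regex list misses it
    "the design works great",
    "i give up, this is annoying",    # false negative: regex list misses it
    "let's test the battery next",
]

# Incomplete (partial) regex list: the developer found "frustrat" but not "annoy".
partial_regexes = [r"frustrat"]

def regex_label(text):
    """1 if any regex in the partial list matches, else 0."""
    return int(any(re.search(p, text) for p in partial_regexes))

labels = [regex_label(d) for d in docs]

# Crude stand-in for the paper's neural network: score each item by word
# overlap with regex-positive items (illustration only).
pos_vocab = set()
for d, y in zip(docs, labels):
    if y == 1:
        pos_vocab.update(d.split())

def model_score(text):
    words = set(text.split())
    return len(words & pos_vocab) / max(len(words), 1)

# Negative reversion set: regex-negative items the model ranks most
# positive-like; false negatives should concentrate here.
negatives = [d for d, y in zip(docs, labels) if y == 0]
nrs = sorted(negatives, key=model_score, reverse=True)[:2]
print(nrs)  # the "annoying" false negative surfaces first
```

A developer reviewing `nrs` instead of a random sample of negatives finds the missing "annoy" pattern with far less manual coding.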
Award ID(s):
2100320
PAR ID:
10354394
Editor(s):
Mitrovic, A.; Bosch, N.
Date Published:
Journal Name:
Proceedings of the 15th International Conference on Educational Data Mining
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Barany, A.; Damsa, C. (Ed.)
    Regular expression (regex) based automated qualitative coding helps reduce researchers' effort in manually coding text data without sacrificing the transparency of the coding process. However, researchers using regex-based approaches struggle with low recall, or a high false negative rate, during classifier development. Advanced natural language processing techniques, such as topic modeling, latent semantic analysis, and neural network classification models, help solve this problem in various ways. The latest advance in this direction is the discovery of the so-called "negative reversion set" (NRS), in which false negative items appear more frequently than in the negative set as a whole. This helps regex classifier developers more quickly identify missing items and thus improve classification recall. This paper simulates the use of the NRS in realistic coding scenarios and compares the number of items that must be manually coded under NRS sampling versus random sampling during classifier refinement. Results on one data set with 50,818 items and six associated qualitative codes show that, on average, NRS sampling reduced the required manual coding by 50% to 63% compared with random sampling.
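A back-of-the-envelope calculation shows why concentrating false negatives reduces hand-coding effort. The rates below are invented for illustration, chosen only to fall within the ranges these abstracts report; they are not measurements from either study.

```python
# Illustrative rates only (within the ranges reported above):
fn_rate_overall = 0.05  # share of false negatives among all regex-negative items
fn_rate_nrs = 0.12      # share of false negatives inside the NRS

target_fns = 10  # false negatives the developer wants to inspect

# Expected number of items to hand-code before finding target_fns
# false negatives under each sampling scheme.
expected_random = target_fns / fn_rate_overall  # 200 items
expected_nrs = target_fns / fn_rate_nrs        # ~83 items

reduction = 1 - expected_nrs / expected_random
print(expected_random, round(expected_nrs, 1), round(reduction, 2))
# 200.0 83.3 0.58
```

With these illustrative rates, NRS sampling cuts the expected hand-coding effort by roughly 58%, in line with the 50%-63% reduction the simulation study reports.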
  2. Barany, A.; Damsa, C. (Ed.)
    In quantitative ethnography (QE) studies, which often involve large datasets that cannot be entirely hand-coded by human raters, researchers have used supervised machine learning approaches to develop automated classifiers. However, QE researchers are rightly concerned with the amount of human coding that may be required to develop classifiers that achieve the high levels of accuracy that QE studies typically require. In this study, we compare a neural network, a powerful traditional supervised learning approach, with nCoder, an active learning technique commonly used in QE studies, to determine which technique requires the least human coding to produce a sufficiently accurate classifier. To do this, we constructed multiple training sets from a large dataset used in prior QE studies and designed a Monte Carlo simulation to test the performance of the two techniques systematically. Our results show that nCoder can achieve high predictive accuracy with significantly less human-coded data than a neural network.
  3. In label-noise learning, estimating the transition matrix is a hot topic, as the matrix plays an important role in building statistically consistent classifiers. Traditionally, the transition from clean labels to noisy labels (i.e., the clean-label transition matrix (CLTM)) has been widely exploited to learn a clean-label classifier from noisy data. Motivated by the fact that classifiers mostly output Bayes optimal labels for prediction, in this paper we directly model the transition from Bayes optimal labels to noisy labels (i.e., the Bayes-label transition matrix (BLTM)) and learn a classifier to predict Bayes optimal labels. Note that given only noisy data, it is ill-posed to estimate either the CLTM or the BLTM. Favorably, however, Bayes optimal labels have less uncertainty than clean labels: the class posteriors of Bayes optimal labels are one-hot vectors, while those of clean labels are not. This yields two advantages for estimating the BLTM: (a) a set of examples with theoretically guaranteed Bayes optimal labels can be collected out of noisy data, and (b) the feasible solution space is much smaller. By exploiting these advantages, we estimate the BLTM parametrically with a deep neural network, leading to better generalization and superior classification performance.
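The first of the two advantages above, collecting examples whose Bayes optimal label is effectively known, can be sketched as a simple filter on estimated noisy-class posteriors. This is a toy illustration with invented numbers, not the paper's actual estimator: examples whose posterior is close to one-hot are kept as anchors.

```python
# Toy: estimated noisy-class posteriors for five examples over two classes.
# All values are invented for illustration.
posteriors = [
    [0.98, 0.02],  # near one-hot -> Bayes optimal label confidently class 0
    [0.55, 0.45],  # uncertain -> discard
    [0.03, 0.97],  # near one-hot -> class 1
    [0.60, 0.40],  # uncertain -> discard
    [0.91, 0.09],  # near one-hot -> class 0
]

threshold = 0.9  # keep only examples whose max posterior is near one-hot

# (index, inferred Bayes optimal label) pairs: a smaller, cleaner subset
# on which a transition matrix could then be estimated.
anchors = [(i, p.index(max(p))) for i, p in enumerate(posteriors)
           if max(p) >= threshold]
print(anchors)  # [(0, 0), (2, 1), (4, 0)]
```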
  4. Rotello, C (Ed.)
    Position-specific intrusions of items from prior lists are rare but important phenomena that distinguish broad classes of theory in serial memory. They are uniquely predicted by position coding theories, which assume items on all lists are associated with the same set of codes representing their positions. Activating a position code activates the items associated with it in current and prior lists in proportion to their distance from the activated position. Thus, prior list intrusions are most likely to come from the coded position. Alternative "item dependent" theories, based on associations between items and contexts built from items, have difficulty accounting for the position specificity of prior list intrusions. We tested the position coding account with a position-cued recognition task designed to produce prior list interference. Cuing a position should activate a position code, which should activate items in nearby positions in the current and prior lists. We presented lures from the prior list to test for position-specific activation in response time and error rate; lures from nearby positions should interfere more. We found no evidence for such interference in 10 experiments, falsifying the position coding prediction. We ran two serial recall experiments with the same materials and found position-specific prior list intrusions. These results challenge all theories of serial memory: position coding theories can explain the prior list intrusions in serial recall but not the absence of prior list interference in cued recognition, while item dependent theories can explain the absence of prior list interference in cued recognition but not the occurrence of prior list intrusions in serial recall.
  5. Contrastive learning is a self-supervised representation learning method that achieves milestone performance in various classification tasks. However, because it is unsupervised, it suffers from the false negative sample problem: randomly drawn negative samples that are assumed to have a different label but actually have the same label as the anchor. This deteriorates the performance of contrastive learning, as it contradicts the motivation of contrasting semantically similar and dissimilar pairs. This has drawn attention to the importance of finding legitimate negative samples, which requires distinguishing between (1) true vs. false negatives and (2) easy vs. hard negatives. Previous work, however, was limited to statistical approaches that handle false negative and hard negative samples through hyperparameter tuning. In this paper, we go beyond the statistical approach and explore the connection between hard negative samples and data bias. We introduce a novel debiased contrastive learning method that explores hard negatives by relative difficulty, referencing a bias-amplifying counterpart. We propose a triplet loss for training a biased encoder that focuses more on easy negative samples. We theoretically show that the triplet loss amplifies the bias in self-supervised representation learning, and we empirically show that the proposed method improves downstream classification performance.
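The triplet loss mentioned above has the standard form max(0, d(a, p) - d(a, n) + margin). The sketch below shows only this generic form with invented vectors; it is not the paper's bias-amplifying variant or its training setup.

```python
def euclidean(u, v):
    # Euclidean distance between two equal-length vectors.
    return sum((x - y) ** 2 for x, y in zip(u, v)) ** 0.5

def triplet_loss(anchor, positive, negative, margin=1.0):
    # Standard triplet loss: require the positive to be closer to the
    # anchor than the negative, by at least `margin`; zero loss once
    # that constraint is satisfied.
    return max(0.0,
               euclidean(anchor, positive)
               - euclidean(anchor, negative)
               + margin)

a, p, n = [0.0, 0.0], [0.0, 1.0], [3.0, 4.0]
print(triplet_loss(a, p, n))  # d(a,p)=1, d(a,n)=5 -> max(0, 1 - 5 + 1) = 0.0
```

Swapping the positive and negative makes the constraint maximally violated, and the loss becomes positive, which is what drives the encoder to reorder the embedding space.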