skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Does Active Learning Reduce Human Coding?: A Systematic Comparison of a Neural Network with nCoder
In quantitative ethnography (QE) studies which often involve large da-tasets that cannot be entirely hand-coded by human raters, researchers have used supervised machine learning approaches to develop automated classi-fiers. However, QE researchers are rightly concerned with the amount of human coding that may be required to develop classifiers that achieve the high levels of accuracy that QE studies typically require. In this study, we compare a neural network, a powerful traditional supervised learning ap-proach, with nCoder, an active learning technique commonly used in QE studies, to determine which technique requires the least human coding to produce a sufficiently accurate classifier. To do this, we constructed multi-ple training sets from a large dataset used in prior QE studies and designed a Monte Carlo simulation to test the performance of the two techniques sys-tematically. Our results show that nCoder can achieve high predictive accu-racy with significantly less human-coded data than a neural network.  more » « less
Award ID(s):
2100320
PAR ID:
10354410
Author(s) / Creator(s):
; ; ; ;
Editor(s):
Barany, A.; Damsa, C.
Date Published:
Journal Name:
Advances in Quantitative Ethnography: Fourth International Conference, International Conference on Quantitative Ethnography 2022
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Mitrovic, A.; Bosch, N. (Ed.)
    Regular expression (regex) coding has advantages for text analysis. Humans are often able to quickly construct intelligible coding rules with high precision. That is, researchers can identify words and word patterns that correctly classify examples of a particular concept. And, it is often easy to identify false positives and improve the regex classifier so that the positive items are accurately captured. However, ensuring that a regex list is complete is a bigger challenge, because the concepts to be identified in data are often sparsely distributed, which makes it difficult to identify examples of \textit{false negatives}. For this reason, regex-based classifiers suffer by having low recall. That is, it often misses items that should be classified as positive. In this paper, we provide a neural network solution to this problem by identifying a \textit{negative reversion set}, in which false negative items occur much more frequently than in the data set as a whole. Thus, the regex classifier can be more quickly improved by adding missing regexes based on the false negatives found from the negative reversion set. This study used an existing data set collected from a simulation-based learning environment for which researchers had previously defined six codes and developed classifiers with validated regex lists. We randomly constructed incomplete (partial) regex lists and used neural network models to identify negative reversion sets in which the frequency of false negatives increased from a range of 3\\%-8\\% in the full data set to a range of 12\\%-52\\% in the negative reversion set. Based on this finding, we propose an interactive coding mechanism in which human-developed regex classifiers provide input for training machine learning algorithms and machine learning algorithms ``smartly" select highly suspected false negative items for human to more quickly develop regex classifiers. 
    more » « less
  2. This Research paper discusses the opportunities that utilizing a computer program can present in analyzing large amounts of qualitative data collected through a survey tool. When working with longitudinal qualitative data, there are many challenges that researchers face. The coding scheme may evolve over time requiring re-coding of early data. There may be long periods of time between data analysis. Typically, multiple researchers will participate in the coding, but this may introduce bias or inconsistencies. Ideally the same researchers would be analyzing the data, but often there is some turnover in the team, particularly when students assist with the coding. Computer programs can enable automated or semi-automated coding helping to reduce errors and inconsistencies in the coded data. In this study, a modeling survey was developed to assess student awareness of model types and administered in four first-year engineering courses across the three universities over the span of three years. The data collected from this survey consists of over 4,000 students’ open-ended responses to three questions about types of models in science, technology, engineering, and mathematics (STEM) fields. A coding scheme was developed to identify and categorize model types in student responses. Over two years, two undergraduate researchers analyzed a total of 1,829 students’ survey responses after ensuring intercoder reliability was greater than 80% for each model category. However, with much data remaining to be coded, the research team developed a MATLAB program to automatically implement the coding scheme and identify the types of models students discussed in their responses. MATLAB coded results were compared to human-coded results (n = 1,829) to assess reliability; results matched between 81%-99% for the different model categories. Furthermore, the reliability of the MATLAB coded results are within the range of the interrater reliability measured between the 2 undergraduate researchers (86-100% for the five model categories). With good reliability of the program, all 4,358 survey responses were coded; results showing the number and types of models identified by students are presented in the paper. 
    more » « less
  3. Although seismic industry has been investigating decades on solving the first break picking problems automatically, there are still enormous challenges during the investigation. Even till today, there are not solid solutions to avoid human labors to manually pick data by geophysicists. With the raise of deep learning and powerful hardware, many of those challenges can be overcome. In this work, we propose a deep semi-supervised neural network to achieve automatic picking for the first break in seismic data. The network is designed to perform with both unlabeled data and a limited amount of real data with labels. Initial feature representation is learning in a discriminative unsupervised manner on real datasets without labels. Since no assumptions are made with regard to the difference of underlying distributions between the synthetic and real data, our model has more marginal gain to compensate for the distribution drifting compare to the supervised learning models. In addition, the network is capable of updating itself through continuous learning. The system is able to identify labeling anomalies onsite and update the model through active learning. In simulation, we show our proposed deep semi-supervised neural network can achieve high accuracy on first break picking. Comparing with the supervised neural networks, our proposed network shows the advantage on using both labeled and unlabeled data set to achieve higher accuracy. 
    more » « less
  4. PurposeTo develop a strategy for training a physics‐guided MRI reconstruction neural network without a database of fully sampled data sets. MethodsSelf‐supervised learning via data undersampling (SSDU) for physics‐guided deep learning reconstruction partitions available measurements into two disjoint sets, one of which is used in the data consistency (DC) units in the unrolled network and the other is used to define the loss for training. The proposed training without fully sampled data is compared with fully supervised training with ground‐truth data, as well as conventional compressed‐sensing and parallel imaging methods using the publicly available fastMRI knee database. The same physics‐guided neural network is used for both proposed SSDU and supervised training. The SSDU training is also applied to prospectively two‐fold accelerated high‐resolution brain data sets at different acceleration rates, and compared with parallel imaging. ResultsResults on five different knee sequences at an acceleration rate of 4 shows that the proposed self‐supervised approach performs closely with supervised learning, while significantly outperforming conventional compressed‐sensing and parallel imaging, as characterized by quantitative metrics and a clinical reader study. The results on prospectively subsampled brain data sets, in which supervised learning cannot be used due to lack of ground‐truth reference, show that the proposed self‐supervised approach successfully performs reconstruction at high acceleration rates (4, 6, and 8). Image readings indicate improved visual reconstruction quality with the proposed approach compared with parallel imaging at acquisition acceleration. ConclusionThe proposed SSDU approach allows training of physics‐guided deep learning MRI reconstruction without fully sampled data, while achieving comparable results with supervised deep learning MRI trained on fully sampled data. 
    more » « less
  5. null (Ed.)
    Abstract We systematically compared two coding approaches to generate training datasets for machine learning (ML): (i) a holistic approach based on learning progression levels and (ii) a dichotomous, analytic approach of multiple concepts in student reasoning, deconstructed from holistic rubrics. We evaluated four constructed response assessment items for undergraduate physiology, each targeting five levels of a developing flux learning progression in an ion context. Human-coded datasets were used to train two ML models: (i) an 8-classification algorithm ensemble implemented in the Constructed Response Classifier (CRC), and (ii) a single classification algorithm implemented in LightSide Researcher’s Workbench. Human coding agreement on approximately 700 student responses per item was high for both approaches with Cohen’s kappas ranging from 0.75 to 0.87 on holistic scoring and from 0.78 to 0.89 on analytic composite scoring. ML model performance varied across items and rubric type. For two items, training sets from both coding approaches produced similarly accurate ML models, with differences in Cohen’s kappa between machine and human scores of 0.002 and 0.041. For the other items, ML models trained with analytic coded responses and used for a composite score, achieved better performance as compared to using holistic scores for training, with increases in Cohen’s kappa of 0.043 and 0.117. These items used a more complex scenario involving movement of two ions. It may be that analytic coding is beneficial to unpacking this additional complexity. 
    more » « less