Search for: All records

Creators/Authors contains: "Cai, Z."


  1. Free, publicly-accessible full text available January 1, 2024
  2. Mitrovic, A. ; Bosch, N. (Ed.)
    Regular expression (regex) coding has advantages for text analysis. Humans are often able to quickly construct intelligible coding rules with high precision; that is, researchers can identify words and word patterns that correctly classify examples of a particular concept. It is also often easy to identify false positives and improve the regex classifier so that the positive items are accurately captured. However, ensuring that a regex list is complete is a bigger challenge, because the concepts to be identified in data are often sparsely distributed, which makes it difficult to identify examples of false negatives. For this reason, regex-based classifiers suffer from low recall; that is, they often miss items that should be classified as positive. In this paper, we provide a neural network solution to this problem by identifying a negative reversion set, in which false negative items occur much more frequently than in the data set as a whole. Thus, the regex classifier can be more quickly improved by adding missing regexes based on the false negatives found in the negative reversion set. This study used an existing data set collected from a simulation-based learning environment for which researchers had previously defined six codes and developed classifiers with validated regex lists. We randomly constructed incomplete (partial) regex lists and used neural network models to identify negative reversion sets in which the frequency of false negatives increased from a range of 3%-8% in the full data set to a range of 12%-52% in the negative reversion set. Based on this finding, we propose an interactive coding mechanism in which human-developed regex classifiers provide input for training machine learning algorithms, and machine learning algorithms "smartly" select highly suspected false negative items so that humans can more quickly develop regex classifiers.
  3. Barany, A. ; Damsa, C. (Ed.)
    Automated qualitative coding based on regular expressions (regex) helps reduce researchers’ effort in manually coding text data without sacrificing the transparency of the coding process. However, researchers using regex-based approaches struggle with low recall, or a high false negative rate, during classifier development. Advanced natural language processing techniques, such as topic modeling, latent semantic analysis, and neural network classification models, help solve this problem in various ways. The latest advance in this direction is the discovery of the so-called “negative reversion set” (NRS), in which false negative items appear more frequently than in the negative set. This helps regex classifier developers more quickly identify missing items and thus improve classification recall. This paper simulates the use of NRS in real coding scenarios and compares the number of items that must be manually coded under NRS sampling and under random sampling during classifier refinement. The result, using one data set with 50,818 items and six associated qualitative codes, shows that, on average, NRS sampling reduces the required manual coding size by 50% to 63% compared with random sampling.
  4. Wasson, B. ; Zörgő, S. (Ed.)
  5. Barany, A. ; Damsa, C. (Ed.)
    In quantitative ethnography (QE) studies, which often involve large datasets that cannot be entirely hand-coded by human raters, researchers have used supervised machine learning approaches to develop automated classifiers. However, QE researchers are rightly concerned with the amount of human coding that may be required to develop classifiers that achieve the high levels of accuracy that QE studies typically require. In this study, we compare a neural network, a powerful traditional supervised learning approach, with nCoder, an active learning technique commonly used in QE studies, to determine which technique requires the least human coding to produce a sufficiently accurate classifier. To do this, we constructed multiple training sets from a large dataset used in prior QE studies and designed a Monte Carlo simulation to test the performance of the two techniques systematically. Our results show that nCoder can achieve high predictive accuracy with significantly less human-coded data than a neural network.
  6.
    The Arctic has warmed faster than the global mean in recent decades, but previous studies show that future Arctic temperature projections carry large uncertainties. In this study, near-surface mean temperatures in the Arctic are analyzed from 22 models participating in phase 6 of the Coupled Model Intercomparison Project (CMIP6). Compared with the ERA5 reanalysis, most CMIP6 models underestimate the observed mean temperature in the Arctic during 1979–2014. The largest cold biases are found over the Greenland Sea, the Barents Sea, and the Kara Sea. Under the SSP1-2.6, SSP2-4.5, and SSP5-8.5 scenarios, the multimodel ensemble mean of 22 CMIP6 models exhibits significant Arctic warming in the future, and the warming rate is more than twice that of the global/Northern Hemisphere mean. Model spread is the largest contributor to the overall uncertainty in projections, accounting for 55.4% of the total uncertainty at the start of projections in 2015 and remaining at 32.9% at the end of projections in 2095. Internal variability uncertainty accounts for 39.3% of the total uncertainty at the start of projections but decreases to 6.5% at the end of the twenty-first century, while scenario uncertainty rapidly increases from 5.3% to 60.7% over the period from 2015 to 2095. The largest model uncertainties are found to correspond to a consistent cold bias over the oceanic regions in the models, which is connected with excessive sea ice area caused by weak Atlantic poleward heat transport. These results suggest that large intermodel spread and uncertainties exist in the CMIP6 models’ simulation and projection of Arctic near-surface temperature, and that the ocean and land in the Arctic respond differently to greenhouse gas forcing. Future research needs to pay more attention to the different characteristics and mechanisms of Arctic Ocean and land warming to reduce the spread.
  7. Weinberger, A. ; Chen, W. ; Hernández-Leo, D. ; Chen, B. (Ed.)
    In this paper, we describe iPlan, a web-based software platform for constructing localized, reduced-form models of land-use impacts, enabling students, civic representatives, and others without specialized knowledge of land-use planning practices to explore and evaluate possible solutions to complex, multi-objective land-use problems in their own local contexts. 
  8.