skip to main content

Title: Coding Qualitative Data at Scale: Guidance for Large Coder Teams Based on 18 Studies
We outline a process for using large coder teams (10 + coders) to code large-scale qualitative data sets. The process reflects experience recruiting and managing large teams of novice and trainee coders for 18 projects in the last decade, each engaging a coding team of 12 (minimum) to 54 (maximum) coders. We identify four unique challenges to large coder teams that are not presently discussed in the methodological literature: (1) recruiting and training coders, (2) providing coder compensation and incentives, (3) maintaining data quality and ensuring coding reliability at scale, and (4) building team cohesion and morale. For each challenge, we provide associated guidance. We conclude with a discussion of advantages and disadvantages of large coder teams for qualitative research and provide notes of caution for anyone considering hiring and/or managing large coder teams for research (whether in academia, government and non-profit sectors, or industry).
Authors:
; ; ; ; ; ;
Award ID(s):
2017491
Publication Date:
NSF-PAR ID:
10347979
Journal Name:
International Journal of Qualitative Methods
Volume:
21
Page Range or eLocation-ID:
160940692210758
ISSN:
1609-4069
Sponsoring Org:
National Science Foundation
More Like this
  1. In this theory paper, we set out to consider, as a matter of methodological interest, the use of quantitative measures of inter-coder reliability (e.g., percentage agreement, correlation, Cohen’s Kappa, etc.) as necessary and/or sufficient correlates for quality within qualitative research in engineering education. It is well known that the phrase qualitative research represents a diverse body of scholarship conducted across a range of epistemological viewpoints and methodologies. Given this diversity, we concur with those who state that it is ill advised to propose recipes or stipulate requirements for achieving qualitative research validity and reliability. Yet, as qualitative researchers ourselves, we repeatedly find the need to communicate the validity and reliability—or quality—of our work to different stakeholders, including funding agencies and the public. One method for demonstrating quality, which is increasingly used in qualitative research in engineering education, is the practice of reporting quantitative measures of agreement between two or more people who code the same qualitative dataset. In this theory paper, we address this common practice in two ways. First, we identify instances in which inter-coder reliability measures may not be appropriate or adequate for establishing quality in qualitative research. We query research that suggests that the numerical measure itselfmore »is the goal of qualitative analysis, rather than the depth and texture of the interpretations that are revealed. Second, we identify complexities or methodological questions that may arise during the process of establishing inter-coder reliability, which are not often addressed in empirical publications. To achieve this purposes, in this paper we will ground our work in a review of qualitative articles, published in the Journal of Engineering Education, that have employed inter-rater or inter-coder reliability as evidence of research validity. In our review, we will examine the disparate measures and scores (from 40% agreement to 97% agreement) used as evidence of quality, as well as the theoretical perspectives within which these measures have been employed. Then, using our own comparative case study research as an example, we will highlight the questions and the challenges that we faced as we worked to meet rigorous standards of evidence in our qualitative coding analysis, We will explain the processes we undertook and the challenges we faced as we assigned codes to a large qualitative data set approached from a post positivist perspective. We will situate these coding processes within the larger methodological literature and, in light of contrasting literature, we will describe the principled decisions we made while coding our own data. We will use this review of qualitative research and our own qualitative research experiences to elucidate inconsistencies and unarticulated issues related to evidence for qualitative validity as a means to generate further discussion regarding quality in qualitative coding processes.« less
  2. Introduction Social media has created opportunities for children to gather social support online (Blackwell et al., 2016; Gonzales, 2017; Jackson, Bailey, & Foucault Welles, 2018; Khasawneh, Rogers, Bertrand, Madathil, & Gramopadhye, 2019; Ponathil, Agnisarman, Khasawneh, Narasimha, & Madathil, 2017). However, social media also has the potential to expose children and adolescents to undesirable behaviors. Research showed that social media can be used to harass, discriminate (Fritz & Gonzales, 2018), dox (Wood, Rose, & Thompson, 2018), and socially disenfranchise children (Page, Wisniewski, Knijnenburg, & Namara, 2018). Other research proposes that social media use might be correlated to the significant increase in suicide rates and depressive symptoms among children and adolescents in the past ten years (Mitchell, Wells, Priebe, & Ybarra, 2014). Evidence based research suggests that suicidal and unwanted behaviors can be promulgated through social contagion effects, which model, normalize, and reinforce self-harming behavior (Hilton, 2017). These harmful behaviors and social contagion effects may occur more frequently through repetitive exposure and modelling via social media, especially when such content goes “viral” (Hilton, 2017). One example of viral self-harming behavior that has generated significant media attention is the Blue Whale Challenge (BWC). The hearsay about this challenge is that individuals at allmore »ages are persuaded to participate in self-harm and eventually kill themselves (Mukhra, Baryah, Krishan, & Kanchan, 2017). Research is needed specifically concerning BWC ethical concerns, the effects the game may have on teenagers, and potential governmental interventions. To address this gap in the literature, the current study uses qualitative and content analysis research techniques to illustrate the risk of self-harm and suicide contagion through the portrayal of BWC on YouTube and Twitter Posts. The purpose of this study is to analyze the portrayal of BWC on YouTube and Twitter in order to identify the themes that are presented on YouTube and Twitter posts that share and discuss BWC. In addition, we want to explore to what extent are YouTube videos compliant with safe and effective suicide messaging guidelines proposed by the Suicide Prevention Resource Center (SPRC). Method Two social media websites were used to gather the data: 60 videos and 1,112 comments from YouTube and 150 posts from Twitter. The common themes of the YouTube videos, comments on those videos, and the Twitter posts were identified using grounded, thematic content analysis on the collected data (Padgett, 2001). Three codebooks were built, one for each type of data. The data for each site were analyzed, and the common themes were identified. A deductive coding analysis was conducted on the YouTube videos based on the nine SPRC safe and effective messaging guidelines (Suicide Prevention Resource Center, 2006). The analysis explored the number of videos that violated these guidelines and which guidelines were violated the most. The inter-rater reliabilities between the coders ranged from 0.61 – 0.81 based on Cohen’s kappa. Then the coders conducted consensus coding. Results & Findings Three common themes were identified among all the posts in the three social media platforms included in this study. The first theme included posts where social media users were trying to raise awareness and warning parents about this dangerous phenomenon in order to reduce the risk of any potential participation in BWC. This was the most common theme in the videos and posts. Additionally, the posts claimed that there are more than 100 people who have played BWC worldwide and provided detailed description of what each individual did while playing the game. These videos also described the tasks and different names of the game. Only few videos provided recommendations to teenagers who might be playing or thinking of playing the game and fewer videos mentioned that the provided statistics were not confirmed by reliable sources. The second theme included posts of people that either criticized the teenagers who participated in BWC or made fun of them for a couple of reasons: they agreed with the purpose of BWC of “cleaning the society of people with mental issues,” or they misunderstood why teenagers participate in these kind of challenges, such as thinking they mainly participate due to peer pressure or to “show off”. The last theme we identified was that most of these users tend to speak in detail about someone who already participated in BWC. These videos and posts provided information about their demographics and interviews with their parents or acquaintances, who also provide more details about the participant’s personal life. The evaluation of the videos based on the SPRC safe messaging guidelines showed that 37% of the YouTube videos met fewer than 3 of the 9 safe messaging guidelines. Around 50% of them met only 4 to 6 of the guidelines, while the remaining 13% met 7 or more of the guidelines. Discussion This study is the first to systematically investigate the quality, portrayal, and reach of BWC on social media. Based on our findings from the emerging themes and the evaluation of the SPRC safe messaging guidelines we suggest that these videos could contribute to the spread of these deadly challenges (or suicide in general since the game might be a hoax) instead of raising awareness. Our suggestion is parallel with similar studies conducted on the portrait of suicide in traditional media (Fekete & Macsai, 1990; Fekete & Schmidtke, 1995). Most posts on social media romanticized people who have died by following this challenge, and younger vulnerable teens may see the victims as role models, leading them to end their lives in the same way (Fekete & Schmidtke, 1995). The videos presented statistics about the number of suicides believed to be related to this challenge in a way that made suicide seem common (Cialdini, 2003). In addition, the videos presented extensive personal information about the people who have died by suicide while playing the BWC. These videos also provided detailed descriptions of the final task, including pictures of self-harm, material that may encourage vulnerable teens to consider ending their lives and provide them with methods on how to do so (Fekete & Macsai, 1990). On the other hand, these videos both failed to emphasize prevention by highlighting effective treatments for mental health problems and failed to encourage teenagers with mental health problems to seek help and providing information on where to find it. YouTube and Twitter are capable of influencing a large number of teenagers (Khasawneh, Ponathil, Firat Ozkan, & Chalil Madathil, 2018; Pater & Mynatt, 2017). We suggest that it is urgent to monitor social media posts related to BWC and similar self-harm challenges (e.g., the Momo Challenge). Additionally, the SPRC should properly educate social media users, particularly those with more influence (e.g., celebrities) on elements that boost negative contagion effects. While the veracity of these challenges is doubted by some, posting about the challenges in unsafe manners can contribute to contagion regardless of the challlenges’ true nature.« less
  3. Abstract: Jury notetaking can be controversial despite evidence suggesting benefits for recall and understanding. Research on note taking has historically focused on the deliberation process. Yet, little research explores the notes themselves. We developed a 10-item coding guide to explore what jurors take notes on (e.g., simple vs. complex evidence) and how they take notes (e.g., gist vs. specific representation). In general, jurors made gist representations of simple and complex information in their notes. This finding is consistent with Fuzzy Trace Theory (Reyna & Brainerd, 1995) and suggests notes may serve as a general memory aid, rather than verbatim representation. Summary: The practice of jury notetaking in the courtroom is often contested. Some states allow it (e.g., Nebraska: State v. Kipf, 1990), while others forbid it (e.g., Louisiana: La. Code of Crim. Proc., Art. 793). Some argue notes may serve as a memory aid, increase juror confidence during deliberation, and help jurors engage in the trial (Hannaford & Munsterman, 2001; Heuer & Penrod, 1988, 1994). Others argue notetaking may distract jurors from listening to evidence, that juror notes may be given undue weight, and that those who took notes may dictate the deliberation process (Dann, Hans, & Kaye, 2005). Whilemore »research has evaluated the efficacy of juror notes on evidence comprehension, little work has explored the specific content of juror notes. In a similar project on which we build, Dann, Hans, and Kaye (2005) found jurors took on average 270 words of notes each with 85% including references to jury instructions in their notes. In the present study we use a content analysis approach to examine how jurors take notes about simple and complex evidence. We were particularly interested in how jurors captured gist and specific (verbatim) information in their notes as they have different implications for information recall during deliberation. According to Fuzzy Trace Theory (Reyna & Brainerd, 1995), people extract “gist” or qualitative meaning from information, and also exact, verbatim representations. Although both are important for helping people make well-informed judgments, gist-based understandings are purported to be even more important than verbatim understanding (Reyna, 2008; Reyna & Brainer, 2007). As such, it could be useful to examine how laypeople represent information in their notes during deliberation of evidence. Methods Prior to watching a 45-minute mock bank robbery trial, jurors were given a pen and notepad and instructed they were permitted to take notes. The evidence included testimony from the defendant, witnesses, and expert witnesses from prosecution and defense. Expert testimony described complex mitochondrial DNA (mtDNA) evidence. The present analysis consists of pilot data representing 2,733 lines of notes from 52 randomly-selected jurors across 41 mock juries. Our final sample for presentation at AP-LS will consist of all 391 juror notes in our dataset. Based on previous research exploring jury note taking as well as our specific interest in gist vs. specific encoding of information, we developed a coding guide to quantify juror note-taking behaviors. Four researchers independently coded a subset of notes. Coders achieved acceptable interrater reliability [(Cronbach’s Alpha = .80-.92) on all variables across 20% of cases]. Prior to AP-LS, we will link juror notes with how they discuss scientific and non-scientific evidence during jury deliberation. Coding Note length. Before coding for content, coders counted lines of text. Each notepad line with at minimum one complete word was coded as a line of text. Gist information vs. Specific information. Any line referencing evidence was coded as gist or specific. We coded gist information as information that did not contain any specific details but summarized the meaning of the evidence (e.g., “bad, not many people excluded”). Specific information was coded as such if it contained a verbatim descriptive (e.g.,“<1 of people could be excluded”). We further coded whether this information was related to non-scientific evidence or related to the scientific DNA evidence. Mentions of DNA Evidence vs. Other Evidence. We were specifically interested in whether jurors mentioned the DNA evidence and how they captured complex evidence. When DNA evidence was mention we coded the content of the DNA reference. Mentions of the characteristics of mtDNA vs nDNA, the DNA match process or who could be excluded, heteroplasmy, references to database size, and other references were coded. Reliability. When referencing DNA evidence, we were interested in whether jurors mentioned the evidence reliability. Any specific mention of reliability of DNA evidence was noted (e.g., “MT DNA is not as powerful, more prone to error”). Expert Qualification. Finally, we were interested in whether jurors noted an expert’s qualifications. All references were coded (e.g., “Forensic analyst”). Results On average, jurors took 53 lines of notes (range: 3-137 lines). Most (83%) mentioned jury instructions before moving on to case specific information. The majority of references to evidence were gist references (54%) focusing on non-scientific evidence and scientific expert testimony equally (50%). When jurors encoded information using specific references (46%), they referenced non-scientific evidence and expert testimony equally as well (50%). Thirty-three percent of lines were devoted to expert testimony with every juror including at least one line. References to the DNA evidence were usually focused on who could be excluded from the FBIs database (43%), followed by references to differences between mtDNA vs nDNA (30%), and mentions of the size of the database (11%). Less frequently, references to DNA evidence focused on heteroplasmy (5%). Of those references that did not fit into a coding category (11%), most focused on the DNA extraction process, general information about DNA, and the uniqueness of DNA. We further coded references to DNA reliability (15%) as well as references to specific statistical information (14%). Finally, 40% of jurors made reference to an expert’s qualifications. Conclusion Jury note content analysis can reveal important information about how jurors capture trial information (e.g., gist vs verbatim), what evidence they consider important, and what they consider relevant and irrelevant. In our case, it appeared jurors largely created gist representations of information that focused equally on non-scientific evidence and scientific expert testimony. This finding suggests note taking may serve not only to represent information verbatim, but also and perhaps mostly as a general memory aid summarizing the meaning of evidence. Further, jurors’ references to evidence tended to be equally focused on the non-scientific evidence and the scientifically complex DNA evidence. This observation suggests jurors may attend just as much to non-scientific evidence as they to do complex scientific evidence in cases involving complicated evidence – an observation that might inform future work on understanding how jurors interpret evidence in cases with complex information. Learning objective: Participants will be able to describe emerging evidence about how jurors take notes during trial.« less
  4. We present an ethnographic study of secure software development processes in a software company using the anthropological research method of participant observation. Two PhD students in computer science trained in qualitative methods were embedded in a software company for 1.5 years of total research time. The researchers participated in everyday work activities such as coding and meetings, and observed software (in)security phenomena both through investigating historical data (code repositories and ticketing system records), and through pen-testing the developed software and observing developers’ and management’s reactions to the discovered vulnerabilities. Our study found that 1) security vulnerabilities are sometimes intentionally introduced and/or overlooked due to the difficulty in managing the various stakeholders’ responsibilities in an economic ecosystem, and cannot be simply blamed on developers’ lack of knowledge or skills; 2) accidental vulnerabilities discovered in the pen-testing process produce different reactions in the development team, often times contrary to what a security researcher would predict. These findings highlight the nuanced nature of the root causes of software vulnerabilities and indicate the need to take into account a significant amount of contextual information to understand how and why software vulnerabilities emerge during software development. Rather than simply addressing deficits in developer knowledge ormore »practice, this research sheds light on at times forgotten human factors that significantly impact the security of software developed by actual companies. Our analysis also shows that improving software security in the development process can benefit from a co-creation model, where security experts work side by side with software developers to better identify security concerns and provide tools that are readily applicable within the specific context of the software development workflow.« less
  5. In 2011, the National Science Foundation launched the I-Corps Program and as of today close to one hundred institutions are participating through Nodes or Sites program. While both program focus on providing training and funds to accelerate the implementation of innovative ideas to market, they have different implementation models and thus challenges. For I-Corps Sites, while each institution utilizes similar approaches on the implementation, including an I-Corps team formation, knowledge and skills training, customer discovery and guidance from experienced entrepreneurs, each ecosystem is unique because the program outcomes are closely related to the entrepreneurial culture both on campus and also in the surrounding local community. A major challenge for Sites is recruiting quality teams and having access to qualified mentors to provide guidance to teams. In this paper, we will present the implementation of a Site in a large public institution located away from a large metropolitan area, the challenges we addressed both in recruiting teams and mentors, and how the program has evolved in its current state. In addition, authors will be able to present on data from the program evaluation which will include findings from pre- and post-quizzes on knowledge of entrepreneurship terms and pre- and post-program surveysmore »that captured changes in perceptions of entrepreneurship, such as interest in entrepreneurship, confidence in value position, and self-efficacy in entrepreneurship, marketing/business planning, and customer interview. In this paper, we will present data from five I-Corps Site cohorts representing close to fifty student teams. Since program participants represent a diverse group (33% females and 15% ethnic minorities) and also wide range of educational levels (freshman to graduate students), we are able to evaluate program impact also with respect to gender, race/ethnicity, and classification. This paper will provide valuable information for institutions interested in pursuing an I-Corps grant and to those who are already have a grant but are looking for additional ways to further enhance program impact on their campus.« less