- Award ID(s):
- NSF-PAR ID:
- Date Published:
- Journal Name:
- Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
- Page Range / eLocation ID:
- 1041 to 1052
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
Recent studies show that Natural Language Processing (NLP) technologies propagate societal biases about demographic groups associated with attributes such as gender, race, and nationality. To create interventions and mitigate these biases and associated harms, it is vital to be able to detect and measure such biases. While existing works propose bias evaluation and mitigation methods for various tasks, there remains a need to cohesively understand the biases and the specific harms they measure, and how different measures compare with each other. To address this gap, this work presents a practical framework of harms and a series of questions that practitioners can answer to guide the development of bias measures. As a validation of our framework and documentation questions, we also present several case studies of how existing bias measures in NLP—both intrinsic measures of bias in representations and extrinsic measures of bias of downstream applications—can be aligned with different harms and how our proposed documentation questions facilitates more holistic understanding of what bias measures are measuring.more » « less
Perlman, Marcus (Ed.)Longstanding cross-linguistic work on event representations in spoken languages have argued for a robust mapping between an event’s underlying representation and its syntactic encoding, such that–for example–the agent of an event is most frequently mapped to subject position. In the same vein, sign languages have long been claimed to construct signs that visually represent their meaning, i.e., signs that are iconic. Experimental research on linguistic parameters such as plurality and aspect has recently shown some of them to be visually universal in sign, i.e. recognized by non-signers as well as signers, and have identified specific visual cues that achieve this mapping. However, little is known about what makes action representations in sign language iconic, or whether and how the mapping of underlying event representations to syntactic encoding is visually apparent in the form of a verb sign. To this end, we asked what visual cues non-signers may use in evaluating transitivity (i.e., the number of entities involved in an action). To do this, we correlated non-signer judgments about transitivity of verb signs from American Sign Language (ASL) with phonological characteristics of these signs. We found that non-signers did not accurately guess the transitivity of the signs, but that non-signer transitivity judgments can nevertheless be predicted from the signs’ visual characteristics. Further, non-signers cue in on just those features that code event representations across sign languages, despite interpreting them differently. This suggests the existence of visual biases that underlie detection of linguistic categories, such as transitivity, which may uncouple from underlying conceptual representations over time in mature sign languages due to lexicalization processes.more » « less
Abstract: Jury notetaking can be controversial despite evidence suggesting benefits for recall and understanding. Research on note taking has historically focused on the deliberation process. Yet, little research explores the notes themselves. We developed a 10-item coding guide to explore what jurors take notes on (e.g., simple vs. complex evidence) and how they take notes (e.g., gist vs. specific representation). In general, jurors made gist representations of simple and complex information in their notes. This finding is consistent with Fuzzy Trace Theory (Reyna & Brainerd, 1995) and suggests notes may serve as a general memory aid, rather than verbatim representation. Summary: The practice of jury notetaking in the courtroom is often contested. Some states allow it (e.g., Nebraska: State v. Kipf, 1990), while others forbid it (e.g., Louisiana: La. Code of Crim. Proc., Art. 793). Some argue notes may serve as a memory aid, increase juror confidence during deliberation, and help jurors engage in the trial (Hannaford & Munsterman, 2001; Heuer & Penrod, 1988, 1994). Others argue notetaking may distract jurors from listening to evidence, that juror notes may be given undue weight, and that those who took notes may dictate the deliberation process (Dann, Hans, & Kaye, 2005). While research has evaluated the efficacy of juror notes on evidence comprehension, little work has explored the specific content of juror notes. In a similar project on which we build, Dann, Hans, and Kaye (2005) found jurors took on average 270 words of notes each with 85% including references to jury instructions in their notes. In the present study we use a content analysis approach to examine how jurors take notes about simple and complex evidence. We were particularly interested in how jurors captured gist and specific (verbatim) information in their notes as they have different implications for information recall during deliberation. According to Fuzzy Trace Theory (Reyna & Brainerd, 1995), people extract “gist” or qualitative meaning from information, and also exact, verbatim representations. Although both are important for helping people make well-informed judgments, gist-based understandings are purported to be even more important than verbatim understanding (Reyna, 2008; Reyna & Brainer, 2007). As such, it could be useful to examine how laypeople represent information in their notes during deliberation of evidence. Methods Prior to watching a 45-minute mock bank robbery trial, jurors were given a pen and notepad and instructed they were permitted to take notes. The evidence included testimony from the defendant, witnesses, and expert witnesses from prosecution and defense. Expert testimony described complex mitochondrial DNA (mtDNA) evidence. The present analysis consists of pilot data representing 2,733 lines of notes from 52 randomly-selected jurors across 41 mock juries. Our final sample for presentation at AP-LS will consist of all 391 juror notes in our dataset. Based on previous research exploring jury note taking as well as our specific interest in gist vs. specific encoding of information, we developed a coding guide to quantify juror note-taking behaviors. Four researchers independently coded a subset of notes. Coders achieved acceptable interrater reliability [(Cronbach’s Alpha = .80-.92) on all variables across 20% of cases]. Prior to AP-LS, we will link juror notes with how they discuss scientific and non-scientific evidence during jury deliberation. Coding Note length. Before coding for content, coders counted lines of text. Each notepad line with at minimum one complete word was coded as a line of text. Gist information vs. Specific information. Any line referencing evidence was coded as gist or specific. We coded gist information as information that did not contain any specific details but summarized the meaning of the evidence (e.g., “bad, not many people excluded”). Specific information was coded as such if it contained a verbatim descriptive (e.g.,“<1 of people could be excluded”). We further coded whether this information was related to non-scientific evidence or related to the scientific DNA evidence. Mentions of DNA Evidence vs. Other Evidence. We were specifically interested in whether jurors mentioned the DNA evidence and how they captured complex evidence. When DNA evidence was mention we coded the content of the DNA reference. Mentions of the characteristics of mtDNA vs nDNA, the DNA match process or who could be excluded, heteroplasmy, references to database size, and other references were coded. Reliability. When referencing DNA evidence, we were interested in whether jurors mentioned the evidence reliability. Any specific mention of reliability of DNA evidence was noted (e.g., “MT DNA is not as powerful, more prone to error”). Expert Qualification. Finally, we were interested in whether jurors noted an expert’s qualifications. All references were coded (e.g., “Forensic analyst”). Results On average, jurors took 53 lines of notes (range: 3-137 lines). Most (83%) mentioned jury instructions before moving on to case specific information. The majority of references to evidence were gist references (54%) focusing on non-scientific evidence and scientific expert testimony equally (50%). When jurors encoded information using specific references (46%), they referenced non-scientific evidence and expert testimony equally as well (50%). Thirty-three percent of lines were devoted to expert testimony with every juror including at least one line. References to the DNA evidence were usually focused on who could be excluded from the FBIs database (43%), followed by references to differences between mtDNA vs nDNA (30%), and mentions of the size of the database (11%). Less frequently, references to DNA evidence focused on heteroplasmy (5%). Of those references that did not fit into a coding category (11%), most focused on the DNA extraction process, general information about DNA, and the uniqueness of DNA. We further coded references to DNA reliability (15%) as well as references to specific statistical information (14%). Finally, 40% of jurors made reference to an expert’s qualifications. Conclusion Jury note content analysis can reveal important information about how jurors capture trial information (e.g., gist vs verbatim), what evidence they consider important, and what they consider relevant and irrelevant. In our case, it appeared jurors largely created gist representations of information that focused equally on non-scientific evidence and scientific expert testimony. This finding suggests note taking may serve not only to represent information verbatim, but also and perhaps mostly as a general memory aid summarizing the meaning of evidence. Further, jurors’ references to evidence tended to be equally focused on the non-scientific evidence and the scientifically complex DNA evidence. This observation suggests jurors may attend just as much to non-scientific evidence as they to do complex scientific evidence in cases involving complicated evidence – an observation that might inform future work on understanding how jurors interpret evidence in cases with complex information. Learning objective: Participants will be able to describe emerging evidence about how jurors take notes during trial.more » « less
Question-answering datasets require a broad set of reasoning skills. We show how to use question decompositions to teach language models these broad reasoning skills in a robust fashion. Specifically, we use widely available QDMR representations to programmatically create hard-to-cheat synthetic contexts for real questions in six multi-step reasoning datasets. These contexts are carefully designed to avoid common reasoning shortcuts prevalent in real contexts that prevent models from learning the right skills. This results in a pretraining dataset, named TeaBReaC, containing 525K multi-step questions (with associated formal programs) covering about 900 reasoning patterns. We show that pretraining standard language models (LMs) on TeaBReaC before fine-tuning them on target datasets improves their performance by up to 13 F1 points across 4 multi-step QA datasets, with up to 21 point gain on more complex questions. The resulting models also demonstrate higher robustness, with a 5-8 F1 point improvement on two contrast sets. Furthermore, TeaBReaC pretraining substantially improves model performance and robustness even when starting with numerate LMs pretrained using recent methods (e.g., PReasM, POET). Our work thus shows how to effectively use decomposition-guided contexts to robustly teach multi-step reasoning.more » « less
Women and people of color continue to be underrepresented in many STEM fields and careers. Many studies have linked societal biases against the mathematical abilities of women and people of color to this underrepresentation, as well as to earlier measures of mathematical confidence and performance. Recent studies have shown that teachers may unintentionally have biases that reflect those in broader society. Yet, many studies on teachers’ reports of students’ abilities use data in the field—not experimental data—and thus often cannot say if the findings reflect bias or actual differences. The few experimental studies conducted suggest bias against the abilities of girls and students of color, but the prior work has limitations, which we seek to address (e.g., local samples, no exploration of moderators, no preregistration).
In this preregistered experiment of 458 teachers across the U.S., we randomly assigned gender- and race-specific names to solutions to math problems, then asked teachers to rate the correctness of the solution, as well as the student’s math ability and effort. Teachers also completed scales reflecting their own beliefs and dispositions, which we then assessed how those beliefs/dispositions moderated their biases. We used multilevel modeling to account for the nested data structure.
Consistent with our preregistered hypotheses, when the solution was not fully correct, findings suggest teachers thought boys had higher ability, even though the same teachers did not report differences in the correctness of the solution or perceived effort. Moreover, teachers who reported that gender disparities no longer exist in society were particularly likely to underestimate girls’ abilities. Although findings revealed no evidence of racial bias on average, teachers’ math anxiety moderated their ability judgments of students from different races, albeit with only marginal significance; teachers with high math anxiety tended to assume that White students had higher math ability than students of color.
The present research identifies teachers’ beliefs and dispositions that moderate their gender and racial biases. This experimental evidence sheds new light on why even low-performing boys consistently report higher math confidence and pursue STEM—namely, their teachers believe they have higher mathematical ability.