Title: Characterizing English Variation across Social Media Communities with BERT
Much previous work characterizing language variation across Internet social groups has focused on the types of words used by these groups. We extend this type of study by employing BERT to characterize variation in the senses of words as well, analyzing two months of English comments in 474 Reddit communities. The specificity of different sense clusters to a community, combined with the specificity of a community’s unique word types, is used to identify cases where a social group’s language deviates from the norm. We validate our metrics using user-created glossaries and draw on sociolinguistic theories to connect language variation with trends in community behavior. We find that communities with highly distinctive language are medium-sized, and their loyal and highly engaged users interact in dense networks.
Award ID(s): 1813470
PAR ID: 10274010
Author(s) / Creator(s): ;
Editor(s): Daelemans, Walter
Date Published:
Journal Name: Transactions of the Association for Computational Linguistics
Volume: 9
ISSN: 2307-387X
Page Range / eLocation ID: 538-556
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like This
  1. The study of language shift, the replacement of one language by another in a community or a subgroup of a speech community, is a prime topic for sociolinguistic analysis: shift is almost always the result of social factors. This paper argues for focusing research on the study of shift in process and, to that end, studying the different kinds of speakers in shifting communities. Linguists' prevalent response to massive, global language shift is language documentation. Although the need for documentation is clear, there have been inadvertent consequences: valorizing last speakers, promoting linguistic purism, and devaluing L2 language learners who, in many communities, represent the future of the language. The urgency of documenting and describing languages with relatively small numbers of elderly speakers has led the linguistic community to focus almost exclusively on such groups and to overlook both larger speech communities in earlier stages of shift and the wide range of speaker types in shift communities. From a social standpoint, the result is that we are often failing to do the language work in precisely those communities where reversing language shift is still relatively easy. From a scientific standpoint, we are missing the opportunity to study language change in process and the chance to study speaker variation in a shift situation. Variation in proficiency and performance across shifting speakers is not random but systematic, and it correlates with a set of social and cognitive factors.
  2. In this article, we compare two languages that are approximately fifty years old, Central Taurus Sign Language (CTSL) and Lengua de Señas Nicaragüense (LSN), by means of two studies. Study 1 analyzes emerging phonology, specifically the size and complexity of the handshape inventories of the two languages, and Study 2 analyzes emerging information packaging in complex predicates, specifically for agency and number. In both studies, we compare data across three groups of CTSL signers and three groups of Nicaraguan signers. The results of both studies show variation across languages and cohorts; the patterns of variation, we argue, are grounded in factors of community size, contact among signers, and the sociocultural makeup of the community, factors that are used in large typological studies on spoken languages. The main findings are as follows: (1) the patterns observed across the Nicaraguan groups display more variation than those across the CTSL groups, and (2) the variation among Nicaraguan groups demonstrates that homesigns exhibit a wide range of forms that were pared down in the first decade of LSN and developed and reorganized during LSN's second decade. We suggest that a more precise and nuanced manner of describing sign language communities is needed, one that considers the following: (1) the degree to which the cultural practices are shared; (2) the size of the deaf community; (3) the ratio of deaf signers to hearing L2 signers; and (4) the rate at which new child learners are added. We also call for more comparative work on new sign languages that will assist in determining the effects and interactions of factors of interest to researchers of signed and spoken languages.
  3. This paper outlines a new model of language revitalisation that understands language to be a characteristic of a nexus of social activities rather than an independent object. Language use is one of an overall set of factors contributing to the wellbeing of a particular community. Our model treats language as one node (or a cluster of nodes) in a complex system of interacting behaviours. Changes to another node or in the language node(s) itself can impact overall social wellbeing, something often ignored by linguists (but not by other social scientists working in Indigenous communities). Disruption to an existing network occurs within a time frame; the longer the disruption, the more likely that the network redefines the group. Variables that define the language ecology operate on multiple levels. For the group and for individuals within the group, there can be considerable variation in usage and proficiency over time. Sustainability cannot be reduced to simple cause-and-effect relationships between sociocultural variables. The next phase of language revitalisation projects should be built around the concept of language activity as part of promoting community wellbeing. The use of complex networks, which have been applied to human wellbeing in other contexts, supports our argument.
  4. Much work in NLP has used computational methods to explore sociolinguistic variation in text. In this paper, we argue that memes, as multimodal forms of language composed of visual templates and text, also exhibit meaningful social variation. We construct a computational pipeline to cluster individual instances of memes into templates and semantic variables, taking advantage of their multimodal structure in doing so. We apply this method to a large collection of meme images from Reddit and make available the resulting SEMANTICMEMES dataset of 3.8M images clustered by their semantic function. We use these clusters to analyze linguistic variation in memes, discovering not only that socially meaningful variation in meme usage exists between subreddits, but also that patterns of meme innovation and acculturation within these communities align with previous findings on written language.
  5. It is now well established that memory representations of words are acoustically rich. Alongside this development, a related line of work has shown that the robustness of memory encoding varies widely depending on who is speaking. In this dissertation, I explore the cognitive basis of memory asymmetries at a larger linguistic level (spoken sentences), using the mechanism of socially guided attention allocation to explain how listeners dynamically shift cognitive resources based on the social characteristics of speech. This dissertation consists of three empirical studies designed to investigate the factors that pattern asymmetric memory for spoken language. In the first study, I explored specificity effects at the level of the sentence. While previous research on specificity has centered on the lexical item as the unit of study, I showed that talker-specific memory patterns are also robust at a larger linguistic level, making it likely that acoustic detail is fundamental to human speech perception more broadly. In the second study, I introduced a set of diverse talkers and showed that memory patterns vary widely within this group, and that the memorability of individual talkers is somewhat consistent across listeners. In the third study, I showed that memory behaviors do not depend merely on the speech characteristics of the talker or on the content of the sentence, but on the unique relationship between these two. Memory dramatically improved when the semantic content of sentences was congruent with widely held social associations with talkers based on their speech, and this effect was particularly pronounced when listeners had a high cognitive load during encoding. These data collectively provide evidence that listeners allocate attentional resources on an ad hoc, socially guided basis. Listeners subconsciously draw on fine-grained phonetic information and social associations to dynamically adapt low-level cognitive processes while understanding spoken language and encoding it to memory. This approach positions variation in speech not as an obstacle to perception, but as an information source that humans readily recruit to aid in the seamless understanding of spoken language.