

Title: Why are Some Words More Frequent than Others? New Insights from Network Science
Why are some words more frequent than others? Surprisingly, the obvious answers to this seemingly simple question, e.g., that frequent words reflect greater communicative needs, are either wrong or incomplete. We show that a word’s frequency is strongly associated with its position in a semantic association network: more centrally located words are more frequent. But is a word’s centrality in a network merely a reflection of the inherent centrality of the word’s meaning? Through cross-linguistic comparisons, we found that differences in the frequency of translation equivalents are predicted by differences in the words’ network structures in the different languages. Specifically, frequency was linked to how many connections a word had and to its capacity to bridge words that are typically not linked. This suggests that a word’s frequency (and with it, its meaning) may change as a function of the word’s associations with other words.
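The network position described in the abstract can be illustrated with a toy association graph. This is a minimal sketch using invented cue–associate edges and a plain degree count, not the paper's data or analysis; a fuller treatment would also compute a bridging measure such as betweenness centrality.

```python
from collections import defaultdict

# Invented cue-associate pairs, standing in for a semantic association network:
edges = [
    ("dog", "cat"), ("dog", "bone"), ("dog", "house"), ("dog", "friend"),
    ("cat", "mouse"), ("house", "home"), ("axolotl", "salamander"),
]

# Build an undirected adjacency structure.
adj = defaultdict(set)
for a, b in edges:
    adj[a].add(b)
    adj[b].add(a)

# Degree centrality: number of distinct associates. On the paper's account,
# more centrally located words (like "dog" in this toy graph) should also
# be more frequent than peripheral ones (like "axolotl").
degree = {w: len(nbrs) for w, nbrs in adj.items()}
most_central = max(degree, key=degree.get)
print(most_central, degree[most_central])  # → dog 4
```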
Award ID(s):
2020969
PAR ID:
10547763
Author(s) / Creator(s):
; ;
Editor(s):
Nölle, J; Raviv, L; Graham, E; Hartmann, S; Jadoul, Y; Josserand, M; Matzinger, T; Mudd, K; Pleyer, M; Slonimska, A; Wacewicz, S; Watson, S
Publisher / Repository:
The Evolution of Language: Proceedings of the 15th International Conference (Evolang XV)
Date Published:
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Many changes in our digital corpus have been brought about by the interplay between rapid advances in digital communication and the current environment characterized by pandemics, political polarization, and social unrest. One such change is the pace with which new words enter the mass vocabulary and the frequency at which meanings, perceptions, and interpretations of existing expressions change. The current state-of-the-art algorithms do not allow for an intuitive and rigorous detection of these changes in word meanings over time. We propose a dynamic graph-theoretic approach to inferring the semantics of words and phrases (“terms”) and detecting temporal shifts. Our approach represents each term as a stochastic time-evolving set of contextual words and is a count-based distributional semantic model in nature. We use local clustering techniques to assess the structural changes in a given word’s contextual words. We demonstrate the efficacy of our method by investigating the changes in the semantics of the phrase “Chinavirus”. We conclude that the term took on a much more pejorative meaning when the White House used the term in the second half of March 2020, although the effect appears to have been temporary. We make both the dataset and the code used to generate this paper’s results available. 
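The count-based contextual representation described in item 1 can be sketched as follows. The mini-corpora, window size, and Jaccard-based shift score here are illustrative stand-ins: the paper's method uses local clustering on a stochastic time-evolving graph, not simple set overlap.

```python
from collections import Counter

def context_profile(sentences, term, window=2):
    """Counts of words co-occurring with `term` within +/- `window` tokens."""
    profile = Counter()
    for sent in sentences:
        toks = sent.lower().split()
        for i, tok in enumerate(toks):
            if tok == term:
                lo, hi = max(0, i - window), min(len(toks), i + window + 1)
                profile.update(t for j, t in enumerate(toks[lo:hi], lo) if j != i)
    return profile

def jaccard_shift(p1, p2):
    """1 minus the Jaccard overlap of the two context-word sets;
    higher values indicate a larger apparent shift in usage."""
    s1, s2 = set(p1), set(p2)
    if not (s1 | s2):
        return 0.0
    return 1 - len(s1 & s2) / len(s1 | s2)

# Invented mini-corpora for two time slices:
early = ["the virus spread in wuhan", "scientists study the virus genome"]
late = ["politicians blame the virus on rivals", "angry posts about the virus"]
shift = jaccard_shift(context_profile(early, "virus"),
                      context_profile(late, "virus"))
print(round(shift, 2))  # → 0.89 (the context words barely overlap)
```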
  2. It is well-known that children rapidly learn words, following a range of heuristics. What is less well appreciated is that – because most words are polysemous and have multiple meanings (e.g., ‘glass’ can label a material and drinking vessel) – children will often be learning a new meaning for a known word, rather than an entirely new word. Across four experiments we show that children flexibly adapt a well-known heuristic – the shape bias – when learning polysemous words. Consistent with previous studies, we find that children and adults preferentially extend a new object label to other objects of the same shape. But we also find that when a new word for an object (‘a gup’) has previously been used to label the material composing that object (‘some gup’), children and adults override the shape bias, and are more likely to extend the object label by material (Experiments 1 and 3). Further, we find that, just as an older meaning of a polysemous word constrains interpretations of a new word meaning, encountering a new word meaning leads learners to update their interpretations of an older meaning (Experiment 2). Finally, we find that these effects only arise when learners can perceive that a word’s meanings are related, not when they are arbitrarily paired (Experiment 4). Together, these findings show that children can exploit cues from polysemy to infer how new word meanings should be extended, suggesting that polysemy may facilitate word learning and invite children to construe categories in new ways. 
  3. What makes a word easy to learn? Early-learned words are frequent and tend to name concrete referents. But words typically do not occur in isolation. Some words are predictable from their contexts; others are less so. Here, we investigate whether predictability relates to when children start producing different words (age of acquisition; AoA). We operationalized predictability in terms of a word's surprisal in child-directed speech, computed using n-gram and long short-term memory (LSTM) language models. Predictability derived from LSTMs was generally a better predictor than predictability derived from n-gram models. Across five languages, average surprisal was positively correlated with the AoA of predicates and function words but not nouns. Controlling for concreteness and word frequency, more predictable predicates and function words were learned earlier. Differences in predictability between languages were associated with cross-linguistic differences in AoA: the same word (when it was a predicate) was produced earlier in languages where the word was more predictable. 
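A toy version of the surprisal measure in item 3, using an add-one-smoothed bigram model over an invented child-directed mini-corpus (the paper trains much larger n-gram and LSTM models):

```python
import math
from collections import Counter

def bigram_surprisal(corpus_sents, target):
    """Average surprisal, -log2 p(target | previous word), under an
    add-one-smoothed bigram model trained on `corpus_sents`."""
    bigrams, unigrams = Counter(), Counter()
    for sent in corpus_sents:
        toks = ["<s>"] + sent.lower().split()
        unigrams.update(toks)
        bigrams.update(zip(toks, toks[1:]))
    vocab = len(unigrams)
    vals = []
    for (prev, nxt), n in bigrams.items():
        if nxt == target:
            p = (n + 1) / (unigrams[prev] + vocab)
            vals.extend([-math.log2(p)] * n)
    return sum(vals) / len(vals) if vals else None

# Invented snippets: "doggy" always follows the frequent frame "the __",
# while "ball" follows a word ("red") seen only once, so it is less predictable.
corpus = ["look at the doggy", "where is the doggy", "nice red ball"]
print(bigram_surprisal(corpus, "doggy"))  # → 2.0
print(bigram_surprisal(corpus, "doggy") < bigram_surprisal(corpus, "ball"))  # → True
```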
  4. Corpus data suggests that frequent words have lower rates of replacement and regularization. It is not clear, however, whether this holds due to stronger selection against innovation among high-frequency words or due to weaker drift at high frequencies. Here, we report two experiments designed to probe this question. Participants were tasked with learning a simple miniature language consisting of two nouns and two plural markers. After exposing plural markers to drift and selection of varying strengths, we tracked noun regularization. Regularization was greater for low- than for high-frequency nouns, with no detectable effect of selection. Our results therefore suggest that lower rates of regularization of more frequent words may be due to drift alone. 
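The drift-only account in item 4 can be sketched with a Wright-Fisher-style resampling simulation. The token counts, generation count, and two-variant setup are illustrative assumptions, not the experiment's actual design; the point is simply that, absent selection, rarer words drift faster.

```python
import random

def drift(n_uses, p0=0.5, generations=100, seed=0):
    """One run of neutral (selection-free) change: each generation, the
    variant's frequency is resampled from n_uses independent uses."""
    rng = random.Random(seed)
    p = p0
    for _ in range(generations):
        p = sum(rng.random() < p for _ in range(n_uses)) / n_uses
    return p

def mean_dev(n_uses, runs=20):
    """Mean distance from the starting frequency across several runs."""
    return sum(abs(drift(n_uses, seed=s) - 0.5) for s in range(runs)) / runs

# Low-frequency nouns (few uses per generation) drift toward one variant,
# i.e., regularize, much faster than high-frequency nouns:
print(mean_dev(10) > mean_dev(1000))  # → True
```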
  5.
    Fringe groups and organizations have a long history of using euphemisms (ordinary-sounding words with a secret meaning) to conceal what they are discussing. Nowadays, one common use of euphemisms is to evade content moderation policies enforced by social media platforms. Existing tools for enforcing policy automatically rely on keyword searches for words on a "ban list", but these are notoriously imprecise: even when limited to swearwords, they can still cause embarrassing false positives. When a commonly used ordinary word acquires a euphemistic meaning, adding it to a keyword-based ban list is hopeless: consider "pot" (storage container or marijuana?) or "heater" (household appliance or firearm?). The current generation of social media companies instead hires staff to check posts manually, but this is expensive, inhumane, and not much more effective. It is usually apparent to a human moderator that a word is being used euphemistically, but they may not know what the secret meaning is, and therefore whether the message violates policy. Also, when a euphemism is banned, the group that used it need only invent another one, leaving moderators one step behind. This paper demonstrates unsupervised algorithms that, by analyzing words in their sentence-level context, can both detect words being used euphemistically and identify the secret meaning of each word. Compared to the existing state of the art, which uses context-free word embeddings, our algorithm for detecting euphemisms achieves 30–400% higher detection accuracy of unlabeled euphemisms in a text corpus. Our algorithm for revealing euphemistic meanings of words is the first of its kind, as far as we are aware. In the arms race between content moderators and policy evaders, our algorithms may help shift the balance in the direction of the moderators. 
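A crude sketch in the spirit of item 5: score a candidate word by the cosine overlap between its bag of context words and that of known banned terms. The corpus and seed terms are invented, and the paper's actual algorithms use sentence-level contextual embeddings rather than raw co-occurrence counts; this only illustrates why context, not the word itself, is what gives a euphemism away.

```python
from collections import Counter

def context_words(sentences, term, window=3):
    """Counts of words appearing within `window` tokens of `term`."""
    out = Counter()
    for s in sentences:
        toks = s.lower().split()
        for i, t in enumerate(toks):
            if t == term:
                out.update(toks[max(0, i - window):i] + toks[i + 1:i + 1 + window])
    return out

def euphemism_score(corpus, candidate, seed_terms):
    """Cosine similarity between the candidate's context profile and the
    pooled context profile of known banned terms."""
    cand = context_words(corpus, candidate)
    seed = Counter()
    for t in seed_terms:
        seed.update(context_words(corpus, t))
    dot = sum(cand[w] * seed[w] for w in cand)
    norm = (sum(v * v for v in cand.values()) * sum(v * v for v in seed.values())) ** 0.5
    return dot / norm if norm else 0.0

# Invented posts: "pot" shares its contexts with the seed term "marijuana";
# "kettle" does not, so it would not be flagged.
corpus = [
    "he wants to buy some marijuana tonight",
    "police found marijuana in the car",
    "he wants to buy some pot tonight",
    "police found pot in the car",
    "put the kettle on the stove",
    "the kettle is boiling water",
]
print(euphemism_score(corpus, "pot", ["marijuana"]))     # high (shared contexts)
print(euphemism_score(corpus, "kettle", ["marijuana"]))  # low
```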