Title: Quantifying Semantic Similarity Across Languages
Do all languages convey semantic knowledge in the same way? If language simply mirrors the structure of the world, the answer should be a qualified “yes”. If, however, languages impose structure as much as they reflect it, then even the ostensibly “same” word in different languages may mean quite different things. We provide a first pass at a large-scale quantification of cross-linguistic semantic alignment for approximately 1,000 meanings in 55 languages. We find that translation equivalents in some domains (e.g., Time, Quantity, and Kinship) exhibit high alignment across languages, while the structure of other domains (e.g., Politics, Food, Emotions, and Animals) exhibits substantial cross-linguistic variability. Our measure of semantic alignment correlates with known phylogenetic distances between languages: more phylogenetically distant languages show less semantic alignment. We also find that semantic alignment correlates with cultural distances between the societies speaking the languages, suggesting a rich co-adaptation of language and culture even in domains of experience that appear most constrained by the natural world.
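One concrete way to operationalize such an alignment measure is second-order (representational) similarity: compute each language's internal similarity structure over the translation equivalents of the same meanings, then correlate the two structures. The sketch below is a minimal illustration under that assumption; `alignment_score` and the random stand-ins for real monolingual word embeddings are ours, not the paper's code.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def alignment_score(vecs_a, vecs_b):
    """Second-order similarity between two languages.

    vecs_a, vecs_b: (n_meanings, dim) arrays of word vectors for the
    translation equivalents of the same meanings in each language.
    Returns the Spearman correlation between the two within-language
    cosine-similarity structures: high values mean the meanings relate
    to one another the same way in both languages.
    """
    sims_a = 1.0 - pdist(vecs_a, metric="cosine")  # condensed pairwise similarities
    sims_b = 1.0 - pdist(vecs_b, metric="cosine")
    rho, _ = spearmanr(sims_a, sims_b)
    return rho

# Toy usage: language B is a noisy copy of language A, so alignment is high.
rng = np.random.default_rng(0)
lang_a = rng.normal(size=(20, 50))
lang_b = lang_a + 0.1 * rng.normal(size=(20, 50))
print(alignment_score(lang_a, lang_b))
```

Running this over all language pairs yields a pairwise alignment matrix that can then be correlated with phylogenetic or cultural distance matrices, as the abstract describes.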
Award ID(s):
1734260
PAR ID:
10074361
Author(s) / Creator(s):
Date Published:
Journal Name:
Proceedings of the 40th Annual Conference of the Cognitive Science Society (CogSci 2018)
Page Range / eLocation ID:
2551–2556
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
1. The view that words are arbitrary is a foundational assumption about language, used to set human languages apart from nonhuman communication. We present here a study of the alignment between the semantic and phonological structure (systematicity) of American Sign Language (ASL) and, for comparison, two spoken languages, English and Spanish. Across all three languages, words that are semantically related are more likely to be phonologically related, highlighting systematic alignment between word form and word meaning. Critically, there is a significant effect of iconicity (a perceived physical resemblance between word form and word meaning) on this alignment: words are most likely to be phonologically related when they are semantically related and iconic. This phenomenon is particularly widespread in ASL: half of the signs in the ASL lexicon are iconically related to other signs, i.e., there is a nonarbitrary relationship between form and meaning that is shared across signs. Taken together, the results reveal that iconicity can act as a driving force behind the alignment between the semantic and phonological structure of spoken and signed languages, but languages may differ in the extent to which iconicity structures the lexicon. Theories of language must account for iconicity as a possible organizing principle of the lexicon.
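As a rough illustration of how form-meaning systematicity can be measured, the sketch below correlates pairwise form distance with pairwise meaning distance over a lexicon. It uses Levenshtein distance over spellings as a crude stand-in for phonological distance and is our simplification, not the study's procedure.

```python
import itertools
import numpy as np
from scipy.stats import pearsonr

def edit_distance(a, b):
    """Levenshtein distance, a crude proxy for phonological relatedness."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def systematicity(words, meaning_vecs):
    """Correlation between form distance and meaning distance across word pairs.

    A positive correlation indicates systematic alignment: semantically
    related words tend to be phonologically related as well.
    """
    form, meaning = [], []
    for (w1, v1), (w2, v2) in itertools.combinations(zip(words, meaning_vecs), 2):
        form.append(edit_distance(w1, w2))
        meaning.append(float(np.linalg.norm(np.asarray(v1) - np.asarray(v2))))
    r, _ = pearsonr(form, meaning)
    return r
```

Because word pairs are not independent observations, significance in practice is assessed with a Mantel-style permutation test rather than the parametric p-value.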
2. We investigate how Multilingual BERT (mBERT) encodes grammar by examining how the high-order grammatical feature of morphosyntactic alignment (how different languages define what counts as a “subject”) is manifested across the embedding spaces of different languages. To understand if and how morphosyntactic alignment affects contextual embedding spaces, we train classifiers to recover the subjecthood of mBERT embeddings in transitive sentences (which do not contain overt information about morphosyntactic alignment) and then evaluate them zero-shot on intransitive sentences (where subjecthood classification depends on alignment), within and across languages. We find that the resulting classifier distributions reflect the morphosyntactic alignment of their training languages. Our results demonstrate that mBERT representations are influenced by high-level grammatical features that are not manifested in any one input sentence, and that this influence is robust across languages. Further examining the characteristics that our classifiers rely on, we find that features such as passive voice, animacy, and case strongly correlate with classification decisions, suggesting that mBERT does not encode subjecthood purely syntactically; rather, subjecthood is encoded as continuous and dependent on semantic and discourse factors, as proposed in much of the functional linguistics literature. Together, these results provide insight into how grammatical features manifest in contextual embedding spaces, at a level of abstraction not covered by previous work.
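A bare-bones version of this probing setup, assuming the Hugging Face transformers library and scikit-learn; the toy two-argument training set is ours, whereas the study trains on many sentences per language:

```python
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased")

def embed_word(sentence, word):
    """Mean mBERT embedding of `word`'s subword tokens within `sentence`."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]           # (seq_len, 768)
    target = tokenizer(word, add_special_tokens=False)["input_ids"]
    ids = enc["input_ids"][0].tolist()
    for i in range(len(ids) - len(target) + 1):
        if ids[i:i + len(target)] == target:
            return hidden[i:i + len(target)].mean(dim=0).numpy()
    raise ValueError(f"{word!r} not found in {sentence!r}")

# Train on the arguments of a transitive sentence (1 = subject, 0 = object)...
X = [embed_word("The dog chased the cat.", "dog"),
     embed_word("The dog chased the cat.", "cat")]
clf = LogisticRegression().fit(X, [1, 0])

# ...then evaluate zero-shot on an intransitive subject.
print(clf.predict([embed_word("The dog slept.", "dog")]))
```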
3. In gradual typing, different languages perform different dynamic type checks for the same program even though the languages share the same static type system. This raises the question of whether, given a gradually typed language, the combination of the translation that injects checks into well-typed terms and the dynamic semantics that determines their behavior sufficiently enforces the static type system of the language. Neither type soundness, nor complete monitoring, nor any other meta-theoretic property of gradually typed languages to date provides a satisfying answer. In response, we present vigilance, a semantic analytical instrument that defines when the check-injecting translation and dynamic semantics of a gradually typed language are adequate for its static type system. Technically, vigilance asks whether a given translation-and-semantics combination enforces the complete run-time typing history of a value, which consists of all of the types associated with the value. We show that the standard combination for so-called Natural gradual typing is vigilant for the standard simple type system, but the standard combination for Transient gradual typing is not. At the same time, the standard combination for Transient is vigilant for a tag type system, but the standard combination for Natural is not. Hence, we clarify the comparative type-level reasoning power of the two most studied approaches to sound gradual typing. Furthermore, as an exercise that demonstrates how vigilance can guide design, we introduce and examine a new theoretical static gradual type system, dubbed truer, that is stronger than tag typing and more faithfully reflects the type-level reasoning power that the dynamic semantics of Transient gradual typing can guarantee.
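To make the Natural/Transient contrast concrete, here is a toy Python sketch (ours, not the paper's formal semantics) of enforcing the type int -> int on an untyped value. Natural wraps the function and checks its full behavior, preserving the value's typing history; Transient checks only the first-order tag, so part of that history is never enforced:

```python
def natural_cast(f):
    """Natural: wrap the function so every call checks argument and result."""
    if not callable(f):
        raise TypeError("expected a function")
    def wrapped(x):
        if not isinstance(x, int):
            raise TypeError("argument: expected int")
        result = f(x)
        if not isinstance(result, int):
            raise TypeError("result: expected int")
        return result
    return wrapped

def transient_cast(f):
    """Transient: check only the first-order tag at the boundary."""
    if not callable(f):
        raise TypeError("expected a function")
    return f  # the int -> int part of the typing history is never enforced

untyped = lambda x: "oops"   # crosses the boundary at type int -> int

try:
    natural_cast(untyped)(1)
except TypeError as e:
    print("Natural rejects it:", e)
print("Transient lets it through:", transient_cast(untyped)(1))
```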
4. We can easily evaluate similarities between concepts within semantic domains, e.g., doctor and nurse, or violin and piano. Here, we show that people are also able to evaluate similarities across domains, e.g., aligning doctors with pianos and nurses with violins. We argue that understanding how people do this is important for understanding conceptual organization and the ubiquity of metaphorical language. We asked people to answer questions of the form “If a nurse were an animal, they would be a(n)…” (Experiments 1 and 2) and asked them to explain the basis for their response (Experiment 1). People converged to a surprising degree (e.g., 20% answered “cat”). In Experiment 3, we presented people with cross-domain mappings of the form “If a nurse were an animal, they would be a cat” and asked them to indicate how good each mapping was. The results showed that the targets people chose and their goodness ratings of a given response were predicted by similarity along abstract semantic dimensions such as valence, speed, and genderedness. Reliance on such dimensions was also the most common explanation for their responses. Altogether, we show that people can evaluate similarity between very different domains in predictable ways, suggesting either that seemingly concrete concepts are represented along relatively abstract dimensions (e.g., weak-strong) or that they can be readily projected onto them.
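The finding suggests a simple model: represent each concept as a point along a few abstract dimensions and map it to the nearest point in the target domain. The sketch below implements that idea with invented ratings; the dimension values are ours, purely for illustration:

```python
import numpy as np

# Hypothetical 0-1 ratings on (valence, speed, genderedness); invented values.
PEOPLE = {"nurse": (0.8, 0.5, 0.8), "doctor": (0.7, 0.6, 0.3)}
ANIMALS = {"cat": (0.7, 0.5, 0.7), "dog": (0.8, 0.6, 0.3), "snail": (0.4, 0.0, 0.5)}

def best_mapping(source_dims, targets):
    """Map a concept to the target with the nearest dimension profile."""
    s = np.asarray(source_dims)
    return min(targets, key=lambda t: np.linalg.norm(s - np.asarray(targets[t])))

for person, dims in PEOPLE.items():
    animal = best_mapping(dims, ANIMALS)
    print(f"If a {person} were an animal, they would be a {animal}")
```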
5. A real-world text corpus sometimes comprises not only text documents but also semantic links between them (e.g., academic papers in a bibliographic network are linked by citations and co-authorships). Text documents and semantic connections form a text-rich network, which empowers a wide range of downstream tasks such as classification and retrieval. However, pretraining methods for such structures are still lacking, making it difficult to build one generic model that can be adapted to various tasks on text-rich networks. Current pretraining objectives, such as masked language modeling, purely model texts and do not take inter-document structure information into consideration. To this end, we propose PATTON, our PretrAining on TexT-Rich NetwOrk framework. PATTON includes two pretraining strategies, network-contextualized masked language modeling and masked node prediction, to capture the inherent dependency between textual attributes and network structure. We conduct experiments on four downstream tasks across five datasets from both academic and e-commerce domains, where PATTON outperforms baselines significantly and consistently.
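A schematic of how the two objectives might be combined in one training step, assuming PyTorch; the heads, loss forms (e.g., regressing a masked neighbor's embedding rather than PATTON's actual formulation), and shapes are our simplifications of the paper's method:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PattonStyleObjectives(nn.Module):
    """Sketch: joint masked language modeling + masked node prediction."""

    def __init__(self, encoder, vocab_size, hidden=768):
        super().__init__()
        self.encoder = encoder                         # any network-contextualized
        self.mlm_head = nn.Linear(hidden, vocab_size)  # text encoder
        self.node_head = nn.Linear(hidden, hidden)

    def forward(self, tokens, mlm_labels, masked_neighbor_emb):
        h = self.encoder(tokens)                       # (batch, seq, hidden)
        # Objective 1: network-contextualized masked language modeling.
        mlm_loss = F.cross_entropy(
            self.mlm_head(h).transpose(1, 2),          # (batch, vocab, seq)
            mlm_labels, ignore_index=-100)
        # Objective 2: predict a masked neighbor's embedding from [CLS].
        node_loss = F.mse_loss(self.node_head(h[:, 0]), masked_neighbor_emb)
        return mlm_loss + node_loss
```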