

Title: A Cross-Linguistic Pressure for Uniform Information Density in Word Order
Abstract

While natural languages differ widely in both canonical word order and word order flexibility, their word orders still follow shared cross-linguistic statistical patterns, often attributed to functional pressures. In the effort to identify these pressures, prior work has compared real and counterfactual word orders. Yet one functional pressure has been overlooked in such investigations: the uniform information density (UID) hypothesis, which holds that information should be spread evenly throughout an utterance. Here, we ask whether a pressure for UID may have influenced word order patterns cross-linguistically. To this end, we use computational models to test whether real orders lead to greater information uniformity than counterfactual orders. In our empirical study of 10 typologically diverse languages, we find that: (i) among SVO languages, real word orders consistently have greater uniformity than reverse word orders, and (ii) only linguistically implausible counterfactual orders consistently exceed the uniformity of real orders. These findings are compatible with a pressure for information uniformity in the development and usage of natural languages.
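The comparison described in the abstract can be illustrated with a minimal sketch. It makes several assumptions that are not stated in the abstract: uniformity is operationalized as the variance of per-word surprisal (one common choice, with lower variance meaning more uniform information), the language model is a toy add-alpha bigram model rather than the models used in the study, and the only counterfactual order is full reversal. All function names (`train_bigram`, `surprisals`, `surprisal_variance`, `mean_variance`) are hypothetical.

```python
import math
from collections import Counter


def train_bigram(sentences, alpha=1.0):
    """Fit an add-alpha smoothed bigram model; return a function P(word | prev)."""
    vocab = {w for s in sentences for w in s} | {"</s>"}
    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        tokens = ["<s>"] + s + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            unigrams[prev] += 1
            bigrams[(prev, word)] += 1
    vocab_size = len(vocab)

    def prob(prev, word):
        return (bigrams[(prev, word)] + alpha) / (unigrams[prev] + alpha * vocab_size)

    return prob


def surprisals(sentence, prob):
    """Per-token surprisal in bits, including the end-of-sentence token."""
    tokens = ["<s>"] + sentence + ["</s>"]
    return [-math.log2(prob(p, w)) for p, w in zip(tokens, tokens[1:])]


def surprisal_variance(sentence, prob):
    """Population variance of surprisal; lower means more uniform information."""
    s = surprisals(sentence, prob)
    mean = sum(s) / len(s)
    return sum((x - mean) ** 2 for x in s) / len(s)


def mean_variance(corpus, transform=lambda s: s):
    """Apply a word-order transform, fit a model on the result, and average
    the surprisal variance. Assumption: this toy trains and evaluates on the
    same data; a real experiment would evaluate on held-out sentences."""
    data = [transform(s) for s in corpus]
    prob = train_bigram(data)
    return sum(surprisal_variance(s, prob) for s in data) / len(data)


if __name__ == "__main__":
    corpus = [
        ["the", "dog", "chased", "the", "cat"],
        ["the", "cat", "saw", "a", "bird"],
        ["a", "bird", "sang"],
    ]
    print("real order:    ", mean_variance(corpus))
    print("reversed order:", mean_variance(corpus, lambda s: s[::-1]))
```

On a corpus this small the two numbers carry no linguistic meaning; the sketch only shows the shape of the comparison, in which a model is fit separately to each word-order variant and the resulting uniformity scores are compared.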

 
Award ID(s):
2121074
NSF-PAR ID:
10488152
Publisher / Repository:
MIT Press
Journal Name:
Transactions of the Association for Computational Linguistics
Volume:
11
ISSN:
2307-387X
Page Range / eLocation ID:
1048 to 1065
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    To model behavioral and neural correlates of language comprehension in naturalistic environments, researchers have turned to broad‐coverage tools from natural‐language processing and machine learning. Where syntactic structure is explicitly modeled, prior work has relied predominantly on context‐free grammars (CFGs), yet such formalisms are not sufficiently expressive for human languages. Combinatory categorial grammars (CCGs) are sufficiently expressive directly compositional models of grammar with flexible constituency that affords incremental interpretation. In this work, we evaluate whether a more expressive CCG provides a better model than a CFG for human neural signals collected with functional magnetic resonance imaging (fMRI) while participants listen to an audiobook story. We further test between variants of CCG that differ in how they handle optional adjuncts. These evaluations are carried out against a baseline that includes estimates of next‐word predictability from a transformer neural network language model. Such a comparison reveals unique contributions of CCG structure‐building predominantly in the left posterior temporal lobe: CCG‐derived measures offer a superior fit to neural signals compared to those derived from a CFG. These effects are spatially distinct from bilateral superior temporal effects that are unique to predictability. Neural effects for structure‐building are thus separable from predictability during naturalistic listening, and those effects are best characterized by a grammar whose expressive power is motivated on independent linguistic grounds.

     
  2. Different languages might have different word orders. In this paper, we investigate cross-lingual transfer and posit that an order-agnostic model will perform better when transferring to distant foreign languages. To test our hypothesis, we train dependency parsers on an English corpus and evaluate their transfer performance on 30 other languages. Specifically, we compare encoders and decoders based on Recurrent Neural Networks (RNNs) and modified self-attentive architectures. The former relies on sequential information while the latter is more flexible at modeling word order. Rigorous experiments and detailed analysis show that RNN-based architectures transfer well to languages that are close to English, while self-attentive models have better overall cross-lingual transferability and perform especially well on distant languages.
  3. Abstract

    Relationships between body shape and escape performance are well established for many species. However, organisms can face multiple selection pressures that might impose competing demands. Many fishes use fast starts for escaping predator attacks, whereas some species of gobiid fishes have evolved the ability to climb waterfalls out of predator-dense habitats. The ancestral ‘powerburst’ climbing mechanism uses lateral body undulations to move up waterfalls, whereas a derived ‘inching’ mechanism uses rectilinear locomotion. We examined whether fast-start performance is impacted by selection imposed from the new functional demands of climbing. We predicted that non-climbing species would show morphology and fast-start performance that facilitate predator evasion, because these fish live consistently with predators and are not constrained by the demands of climbing. We also predicted that, by using lateral undulations, powerburst climbers would show escape performance superior to that of inchers. We compared fast starts and body shape across six goby species. As predicted, non-climbing fish exhibited distinct morphology and responded more frequently to an attack stimulus than climbing species. Contrary to our predictions, we found no differences in escape performance among climbing styles. These results indicate that selection for a competing pressure need not limit the ability of prey to escape predator attacks.

     
  4. INTRODUCTION

    A major challenge in genomics is discerning which bases among billions alter organismal phenotypes and affect health and disease risk. Evidence of past selective pressure on a base, whether highly conserved or fast evolving, is a marker of functional importance. Bases that are unchanged in all mammals may shape phenotypes that are essential for organismal health. Bases that are evolving quickly in some species, or changed only in species that share an adaptive trait, may shape phenotypes that support survival in specific niches. Identifying bases associated with exceptional capacity for cellular recovery, such as in species that hibernate, could inform therapeutic discovery.

    RATIONALE

    The power and resolution of evolutionary analyses scale with the number and diversity of species compared. By analyzing genomes for hundreds of placental mammals, we can detect which individual bases in the genome are exceptionally conserved (constrained) and likely to be functionally important in both coding and noncoding regions. By including species that represent all orders of placental mammals and aligning genomes using a method that does not require designating humans as the reference species, we explore unusual traits in other species.

    RESULTS

    Zoonomia’s mammalian comparative genomics resources are the most comprehensive and statistically well-powered produced to date, with a protein-coding alignment of 427 mammals and a whole-genome alignment of 240 placental mammals representing all orders. We estimate that at least 10.7% of the human genome is evolutionarily conserved relative to neutrally evolving repeats and identify about 101 million significantly constrained single bases (false discovery rate < 0.05). We cataloged 4552 ultraconserved elements at least 20 bases long that are identical in more than 98% of the 240 placental mammals. Many constrained bases have no known function, illustrating the potential for discovery using evolutionary measures. Eighty percent are outside protein-coding exons, and half have no functional annotations in the Encyclopedia of DNA Elements (ENCODE) resource. Constrained bases tend to vary less within human populations, which is consistent with purifying selection. Species threatened with extinction have few substitutions at constrained sites, possibly because severely deleterious alleles have been purged from their small populations. By pairing Zoonomia’s genomic resources with phenotype annotations, we find genomic elements associated with phenotypes that differ between species, including olfaction, hibernation, brain size, and vocal learning. We associate genomic traits, such as the number of olfactory receptor genes, with physical phenotypes, such as the number of olfactory turbinals. By comparing hibernators and nonhibernators, we implicate genes involved in mitochondrial disorders, protection against heat stress, and longevity in this physiologically intriguing phenotype. Using a machine learning–based approach that predicts tissue-specific cis-regulatory activity in hundreds of species using data from just a few, we associate changes in noncoding sequence with traits for which humans are exceptional: brain size and vocal learning.

    CONCLUSION

    Large-scale comparative genomics opens new opportunities to explore how genomes evolved as mammals adapted to a wide range of ecological niches and to discover what is shared across species and what is distinctively human. High-quality data for consistently defined phenotypes are necessary to realize this potential. Through partnerships with researchers in other fields, comparative genomics can address questions in human health and basic biology while guiding efforts to protect the biodiversity that is essential to these discoveries.

    Figure: Comparing genomes from 240 species to explore the evolution of placental mammals. Our new phylogeny (black lines) has alternating gray and white shading, which distinguishes mammalian orders (labeled around the perimeter). Rings around the phylogeny annotate species phenotypes. Seven species with diverse traits are illustrated, with black lines marking their branch in the phylogeny. Sequence conservation across species is described at the top left. IMAGE CREDIT: K. MORRILL
  5. Abstract

    Reading entails transforming visual symbols to sound and meaning. This process depends on specialized circuitry in the visual cortex, the visual word form area (VWFA). Recent findings suggest that this text‐selective cortex comprises at least two distinct subregions: the more posterior VWFA‐1 is sensitive to visual features, while the more anterior VWFA‐2 processes higher level language information. Here, we explore whether these two subregions also exhibit different patterns of functional connectivity. To this end, we capitalize on two complementary datasets: Using the Natural Scenes Dataset (NSD), we identify text‐selective responses in high‐quality 7T adult data (N = 8), and investigate functional connectivity patterns of VWFA‐1 and VWFA‐2 at the individual level. We then turn to the Healthy Brain Network (HBN) database to assess whether these patterns replicate in a large developmental sample (N = 224; age 6–20 years), and whether they relate to reading development. In both datasets, we find that VWFA‐1 is primarily correlated with bilateral visual regions. In contrast, VWFA‐2 is more strongly correlated with language regions in the frontal and lateral parietal lobes, particularly the bilateral inferior frontal gyrus. Critically, these patterns do not generalize to adjacent face‐selective regions, suggesting a specific relationship between VWFA‐2 and the frontal language network. No correlations were observed between functional connectivity and reading ability. Together, our findings support the distinction between subregions of the VWFA, and suggest that functional connectivity patterns in the ventral temporal cortex are consistent over a wide range of reading skills.

     