NSF PAR Search | NSF Public Access Repository


Bigger is not always better: The importance of human-scale language modeling for psycholinguistics

https://doi.org/10.1016/j.jml.2025.104650

Wilcox, Ethan Gotlieb; Hu, Michael Y; Mueller, Aaron; Warstadt, Alex; Choshen, Leshem; Zhuang, Chengxu; Williams, Adina; Cotterell, Ryan; Linzen, Tal (October 2025, Journal of Memory and Language)

Free, publicly-accessible full text available October 1, 2026
What Makes Reading Comprehension Questions Difficult?

https://doi.org/10.18653/v1/2022.acl-long.479

Sugawara, Saku; Nangia, Nikita; Warstadt, Alex; Bowman, Samuel (January 2022, Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers))

Full Text Available
When Do You Need Billions of Words of Pretraining Data?

Zhang, Yian; Warstadt, Alex; Li, Haau-Sing; Bowman, Samuel R. (January 2021, Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics)
NLP is currently dominated by language models like RoBERTa which are pretrained on billions of words. But what exact knowledge or skills do Transformer LMs learn from large-scale pretraining that they cannot learn from less data? To explore this question, we adopt five styles of evaluation: classifier probing, information-theoretic probing, unsupervised relative acceptability judgments, unsupervised language model knowledge probing, and fine-tuning on NLU tasks. We then draw learning curves that track the growth of these different measures of model ability with respect to pretraining data volume using the MiniBERTas, a group of RoBERTa models pretrained on 1M, 10M, 100M and 1B words. We find that these LMs require only about 10M to 100M words to learn to reliably encode most syntactic and semantic features we test. They need a much larger quantity of data in order to acquire enough commonsense knowledge and other skills required to master typical downstream NLU tasks. The results suggest that, while the ability to encode linguistic features is almost certainly necessary for language understanding, it is likely that other, unidentified, forms of knowledge are the major drivers of recent improvements in language understanding among large pretrained models.
Full Text Available
What Ingredients Make for an Effective Crowdsourcing Protocol for Difficult NLU Data Collection Tasks?

Nangia, Nikita; Sugawara, Saku; Trivedi, Harsh; Warstadt, Alex; Vania, Clara; Bowman, Samuel R. (January 2021, Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics)
Crowdsourcing is widely used to create data for common natural language understanding tasks. Despite the importance of these datasets for measuring and refining model understanding of language, there has been little focus on the crowdsourcing methods used for collecting the datasets. In this paper, we compare the efficacy of interventions that have been proposed in prior work as ways of improving data quality. We use multiple-choice question answering as a testbed and run a randomized trial by assigning crowdworkers to write questions under one of four different data collection protocols. We find that asking workers to write explanations for their examples is an ineffective stand-alone strategy for boosting NLU example difficulty. However, we find that training crowdworkers, and then using an iterative process of collecting data, sending feedback, and qualifying workers based on expert judgments is an effective means of collecting challenging data. But using crowdsourced, instead of expert judgments, to qualify workers and send feedback does not prove to be effective. We observe that the data from the iterative protocol with expert assessments is more challenging by several measures. Notably, the human–model gap on the unanimous agreement portion of this data is, on average, twice as large as the gap for the baseline protocol data.
Full Text Available
Can Neural Networks Acquire a Structural Bias from Raw Linguistic Data?

Warstadt, Alex; Bowman, Samuel R. (January 2020, Proceedings of the Annual Meeting of the Cognitive Science Society)
We evaluate whether BERT, a widely used neural network for sentence processing, acquires an inductive bias towards forming structural generalizations through pretraining on raw data. We conduct four experiments testing its preference for structural vs. linear generalizations in different structure-dependent phenomena. We find that BERT makes a structural generalization in 3 out of 4 empirical domains (subject-auxiliary inversion, reflexive binding, and verb tense detection in embedded clauses) but makes a linear generalization when tested on NPI licensing. We argue that these results are the strongest evidence so far from artificial learners supporting the proposition that a structural bias can be acquired from raw data. If this conclusion is correct, it is tentative evidence that some linguistic universals can be acquired by learners without innate biases. However, the precise implications for human language acquisition are unclear, as humans learn language from significantly less data than BERT.
Full Text Available
BLiMP: The Benchmark of Linguistic Minimal Pairs for English

https://doi.org/10.1162/tacl_a_00321

Warstadt, Alex; Parrish, Alicia; Liu, Haokun; Mohananey, Anhad; Peng, Wei; Wang, Sheng-Fu; Bowman, Samuel R. (December 2020, Transactions of the Association for Computational Linguistics)
We introduce The Benchmark of Linguistic Minimal Pairs (BLiMP), a challenge set for evaluating the linguistic knowledge of language models (LMs) on major grammatical phenomena in English. BLiMP consists of 67 individual datasets, each containing 1,000 minimal pairs (that is, pairs of minimally different sentences that contrast in grammatical acceptability and isolate specific phenomena in syntax, morphology, or semantics). We generate the data according to linguist-crafted grammar templates, and human aggregate agreement with the labels is 96.4%. We evaluate n-gram, LSTM, and Transformer (GPT-2 and Transformer-XL) LMs by observing whether they assign a higher probability to the acceptable sentence in each minimal pair. We find that state-of-the-art models identify morphological contrasts related to agreement reliably, but they struggle with some subtle semantic and syntactic phenomena, such as negative polarity items and extraction islands.
Full Text Available
Are Natural Language Inference Models IMPPRESsive? Learning IMPlicature and PRESupposition

https://doi.org/10.18653/v1/2020.acl-main.768

Jeretic, Paloma; Warstadt, Alex; Bhooshan, Suvrat; Williams, Adina (January 2020, Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics)
Natural language inference (NLI) is an increasingly important task for natural language understanding, which requires one to infer whether a sentence entails another. However, the ability of NLI models to make pragmatic inferences remains understudied. We create an IMPlicature and PRESupposition diagnostic dataset (IMPPRES), consisting of 32K semi-automatically generated sentence pairs illustrating well-studied pragmatic inference types. We use IMPPRES to evaluate whether BERT, InferSent, and BOW NLI models trained on MultiNLI (Williams et al., 2018) learn to make pragmatic inferences. Although MultiNLI appears to contain very few pairs illustrating these inference types, we find that BERT learns to draw pragmatic inferences. It reliably treats scalar implicatures triggered by “some” as entailments. For some presupposition triggers like “only”, BERT reliably recognizes the presupposition as an entailment, even when the trigger is embedded under an entailment canceling operator like negation. BOW and InferSent show weaker evidence of pragmatic reasoning. We conclude that NLI training encourages models to learn some, but not all, pragmatic inferences.
Full Text Available
Learning Which Features Matter: RoBERTa Acquires a Preference for Linguistic Generalizations (Eventually)

https://doi.org/10.18653/v1/2020.emnlp-main.16

Warstadt, Alex; Zhang, Yian; Li, Xiaocheng; Liu, Haokun; Bowman, Samuel R. (January 2020, Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP))
One reason pretraining on self-supervised linguistic tasks is effective is that it teaches models features that are helpful for language understanding. However, we want pretrained models to learn not only to represent linguistic features, but also to use those features preferentially during fine-tuning. With this goal in mind, we introduce a new English-language diagnostic set called MSGS (the Mixed Signals Generalization Set), which consists of 20 ambiguous binary classification tasks that we use to test whether a pretrained model prefers linguistic or surface generalizations during fine-tuning. We pretrain RoBERTa from scratch on quantities of data ranging from 1M to 1B words and compare their performance on MSGS to the publicly available RoBERTa_BASE. We find that models can learn to represent linguistic features with little pretraining data, but require far more data to learn to prefer linguistic generalizations over surface ones. Eventually, with about 30B words of pretraining data, RoBERTa_BASE does consistently demonstrate a linguistic bias with some regularity. We conclude that while self-supervised pretraining is an effective way to learn helpful inductive biases, there is likely room to improve the rate at which models learn which features matter.
Full Text Available
Neural Network Acceptability Judgments

https://doi.org/10.1162/tacl_a_00290

Warstadt, Alex; Singh, Amanpreet; Bowman, Samuel R. (March 2019, Transactions of the Association for Computational Linguistics)

This paper investigates the ability of artificial neural networks to judge the grammatical acceptability of a sentence, with the goal of testing their linguistic competence. We introduce the Corpus of Linguistic Acceptability (CoLA), a set of 10,657 English sentences labeled as grammatical or ungrammatical from published linguistics literature. As baselines, we train several recurrent neural network models on acceptability classification, and find that our models outperform unsupervised models by Lau et al. (2016) on CoLA. Error-analysis on specific grammatical phenomena reveals that both Lau et al.’s models and ours learn systematic generalizations like subject-verb-object order. However, all models we test perform far below human level on a wide range of grammatical constructions.
Full Text Available
Investigating BERT’s Knowledge of Language: Five Analysis Methods with NPIs

https://doi.org/10.18653/v1/D19-1286

Warstadt, Alex; Cao, Yu; Grosu, Ioana; Peng, Wei; Blix, Hagen; Nie, Yining; Alsop, Anna; Bordia, Shikha; Liu, Haokun; Parrish, Alicia; et al. (November 2019, Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP))

Full Text Available
