<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>What Can String Probability Tell Us About Grammaticality?</title></titleStmt>
			<publicationStmt>
				<publisher>Transactions of the Association for Computational Linguistics</publisher>
				<date>01/12/2026</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10670804</idno>
					<idno type="doi">10.1162/TACL.a.611</idno>
					<title level='j'>Transactions of the Association for Computational Linguistics</title>
<idno>2307-387X</idno>
<biblScope unit="volume">14</biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Jennifer Hu</author><author>Ethan Gotlieb Wilcox</author><author>Siyuan Song</author><author>Kyle Mahowald</author><author>Roger P Levy</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab>What have language models (LMs) learned about grammar? This question remains hotly debated, with major ramifications for linguistic theory. However, since probability and grammaticality are distinct notions in linguistics, it is not obvious what string probabilities can reveal about an LM’s underlying grammatical knowledge. We present a theoretical analysis of the relationship between grammar, meaning, and string probability, based on simple assumptions about the generative process of corpus data. Our framework makes three predictions, which we validate empirically using 280K sentence pairs in English and Chinese: (1) correlation between the probability of strings within minimal pairs, i.e., string pairs with minimal semantic differences; (2) correlation between models’ and humans’ deltas within minimal pairs; and (3) poor separation in probability space between unpaired grammatical and ungrammatical strings. Our analyses give theoretical grounding for using probability to learn about LMs’ structural knowledge, and suggest directions for future work in LM grammatical evaluation.</ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Understanding what probabilistic language models (LMs) can learn about grammar has major ramifications for theories of language learning and structure <ref type="bibr">(Linzen, 2019;</ref><ref type="bibr">Warstadt and Bowman, 2022;</ref><ref type="bibr">Baroni, 2022;</ref><ref type="bibr">Piantadosi, 2024)</ref>. In the past decade, there have been many efforts to evaluate LMs' grammatical knowledge (e.g., <ref type="bibr">Warstadt et al., 2020;</ref><ref type="bibr">Hu et al., 2020;</ref><ref type="bibr">Linzen et al., 2016;</ref><ref type="bibr">Tjuatja et al., 2025)</ref>, with some asserting that models have largely achieved grammatical competence (e.g., <ref type="bibr">Mahowald et al., 2024)</ref> and others much more skeptical (e.g., <ref type="bibr">Dentella et al., 2023;</ref><ref type="bibr">Lan et al., 2024;</ref><ref type="bibr">Fox and Katzir, 2024)</ref>.</p><p>Some linguistic theories would posit that the ideal competence grammar would assign 0 probability to all ungrammatical strings. But LMs, by their nature, will assign non-zero probability to all strings. And by virtue of what they are designed for (modeling language in real-world contexts), it is not a desirable property of LMs that they assign 0 probability to ungrammatical strings. After all, in any realistic application setting, LMs would need to be able to interpret and handle ungrammatical utterances. If we are willing to accept that LMs will assign non-zero probability to ungrammatical strings, while potentially being able to represent grammatical generalizations in a theoretically meaningful way, then the scientific task of assessing grammatical knowledge in LMs requires working around this property.</p><p>Part of the field's uncertainty over LMs' grammatical competence stems from uncertainty over how to best assess grammatical knowledge in models. 
Given the success and convenience of prompting methods, a tempting approach is to simply ''ask'' models what sentences are grammatical or not <ref type="bibr">(Dentella et al., 2023;</ref><ref type="bibr">Katzir, 2023)</ref>, just as is commonly done for humans <ref type="bibr">(Sch&#252;tze, 2016;</ref><ref type="bibr">Sprouse and Almeida, 2012;</ref><ref type="bibr">Mahowald et al., 2016)</ref>. But answering ''Is this sentence grammatical?'' requires more than just knowledge of grammar: It requires knowing what grammaticality means, as well as other auxiliary abilities such as being able to (truthfully) answer questions. As a result, this method systematically underestimates grammatical competence in LMs <ref type="bibr">(Hu and Levy, 2023;</ref><ref type="bibr">Hu et al., 2024)</ref>. It's easy to see the problem if we imagine a model trained without ever seeing the word ''grammatical'': It would have the same underlying knowledge of linguistic structure but be unable to answer the question.</p><p>An alternate approach is to measure the probabilities that models assign to strings, with the logic that models should assign higher probability to grammatical versus ungrammatical strings. But it is not immediately obvious that this is the best way to assess LMs' grammatical abilities, as grammaticality and probability are fundamentally distinct notions in linguistics <ref type="bibr">(Chomsky, 1957;</ref><ref type="bibr">Berwick, 2018)</ref>. A well-known illustration of this distinction is the sentence ''Colorless green ideas sleep furiously'' <ref type="bibr">(Chomsky, 1957)</ref>. The string has low probability (at least when it was originally coined), but, crucially, people still have the intuition that it is grammatical in a way ''*Furiously sleep ideas green colorless'' isn't. This line of thinking might make it seem like assessing models via string probability is fundamentally flawed. 
Indeed, critics have argued that the distinction between likelihood and grammaticality is ''entirely foreign'' <ref type="bibr">(Katzir, 2023)</ref> to LMs, making them unsuitable models of grammatical competence.</p><p>However, <ref type="bibr">Fox and Katzir (2024)</ref>, <ref type="bibr">Lan et al. (2024)</ref>, and others go on to note that, in some cases, probability may be aligned enough with grammaticality that it can be informative. And in practice, probability-based evaluations of grammatical knowledge in LMs use minimal pairs, i.e., pairs of sentences that differ only slightly from each other, and which form a grammaticality contrast <ref type="bibr">(Marvin and Linzen, 2018;</ref><ref type="bibr">Futrell et al., 2019;</ref><ref type="bibr">Warstadt et al., 2020;</ref><ref type="bibr">Hu et al., 2020;</ref><ref type="bibr">Wilcox et al., 2023;</ref><ref type="bibr">Hu et al., 2024)</ref>. Intuitively, researchers construct minimal pairs to isolate a specific grammatical contrast and factor out other properties that might affect string probability, such as length or lexical frequencies. But even this practice has been criticized in recent work. For example, <ref type="bibr">Leivada et al. (2024a)</ref> write: ''If LMs need specific comparisons in order to tell apart grammatical from ungrammatical sentences, this already counts as an inherent discrepancy from humans, who are able to make such judgments without such a comparison''. If string probability offers a window into grammaticality, they argue, then it should be possible to find a threshold on probability that separates grammatical and ungrammatical sentences <ref type="bibr">(Leivada et al., 2024a,b)</ref>.</p><p>Here, we give a formal argument for why the minimal pair approach can be appropriate, and does not necessarily elide the distinction between grammaticality and string probability. 
Broadly, our framework states that the probability of a string comes from two latent variables: the string's message and the string's grammaticality. The logic of minimal pair judgments follows naturally from this framework. All else equal, grammatical sentences obtain higher probability than ungrammatical sentences. So if two utterances convey the exact same message, but one is grammatical and one isn't, then the grammatical one should have a higher probability. In practice, this is hard to do, since any utterances that differ in the words they contain will convey at least slightly different messages. But if the messages are sufficiently close, then the minimal pair assumption can be used, and comparing the probabilities of the two strings will give insight into grammaticality.</p><p>Our framework also makes it clear that the probability of a message can overwhelm the contribution of grammaticality in determining string probability. Uncontroversially, a model that is well-calibrated to the world should assign higher probability to the string ''He went to the store'' than the string ''Cordelia went to the store'', despite the fact that both sentences are grammatical, since it is far more probable to express a message about some person with male pronouns than a person with an uncommon name such as Cordelia. But models will face competing pressures if tasked with comparing (a) ''Cordelia went to the store herself'' vs. (b) ''*He went to the store herself''. The former is clearly (more) grammatical, but the latter seems to convey a more probable message.<ref type="foot">foot_1</ref> Under our framework, a model that assigns higher probability to (b) than (a) would not necessarily be failing to capture the distinction in grammaticality. 
Rather, because the pair is not appropriately controlled, the probabilities of the two strings are confounded with the probabilities of their messages.</p><p>In the rest of this paper, we first give a formal characterization of string probabilities in corpora and models. Then, we derive three predictions and test them empirically on 280K sentence pairs in English and Chinese. First, we predict a correlation between the log-probability of grammatical and ungrammatical sentences that convey roughly the same message, since their message probability is controlled for. Second, we predict that differences in human acceptability judgments on appropriately controlled minimal pairs will align with differences in log-probability of the strings in the pair. Finally, we predict a lack of separation in string probability space between unpaired grammatical and ungrammatical sentences. While this phenomenon has been taken to indicate a failure of models to capture grammaticality <ref type="bibr">(Leivada et al., 2024a,b)</ref>, we argue that it follows from our framework under reasonable assumptions.</p><p>More generally, we see our contribution as providing theoretical grounding for the practice of using minimal-pair probability comparisons to assess LMs' grammatical knowledge. While this practice is widely used in NLP, it has generally been given only brief and informal justification in empirical work (e.g., <ref type="bibr">Warstadt et al., 2020)</ref>, or rejected altogether (e.g., <ref type="bibr">Leivada et al., 2024a)</ref>. We use our theoretical analysis and empirical results to motivate recommendations for future work on grammatical evaluation of LMs.</p><p>2 Theoretical Framework</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Strings, Messages, and Grammaticality</head><p>We consider words w drawn from a vocabulary &#931;, and strings s &#8712; &#931;*, which are sequences of words. We write w n for the word at index n in a string s N = [w 1 , ..., w n , ..., w N ], where 1 &#8804; n &#8804; N. Let S denote a random variable that ranges over strings. Additionally, let M be a random variable that ranges over possible messages m &#8712; M. Finally, let G be a binary random variable. When G = 1, the intended message m is realized according to the grammatical rules of the language. When G = 0, m is not realized according to the grammatical rules of the language; that is, there is an error in the process of realizing the message in string form.</p><p>In our framework, the probability of a string s is influenced by possible underlying messages and whether those messages are grammatically realized. We therefore write this probability as:</p><p>Natural language is ambiguous, so strings often have more than one meaning. It is also arguably the case that some meanings can equivalently be realized by more than one choice regarding string realization (e.g., ''Sam gave presents to the children'' and ''Sam gave the children presents'' both realizing the same description of a transfer-of-possession event). For simplicity of mathematical treatment, however, our framework treats messages as equivalence classes of meanings plus string realization choices, with the specific underlying meaning probabilistically marginalized out. This is formalized in the following assumption:</p><p>Assumption 1. Deterministic mapping from messages to strings when G = 1.</p><p>We assume that P (s|m, G = 1) and P (m|s, G = 1) are deterministic. That is, given an intended message m, there is only one way to realize m according to the rules of the grammar.</p></div>
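<div xmlns="http://www.tei-c.org/ns/1.0"><p>The marginalization over the two latent variables described in this section can be written out explicitly. The following is a sketch consistent with the definitions above, not necessarily the exact form of the original display: it makes no independence assumption between M and G, and the symbol m_s (our notation, introduced only for illustration) denotes the unique message whose grammatical realization is s. The first line holds for any string; the second applies Assumption 1 to a grammatical string s. For an ungrammatical string, the first term of the second line vanishes and only the errorful term remains.</p>
<p>
```latex
P(s) \;=\; \sum_{m \in \mathcal{M}} \sum_{g \in \{0,1\}} P(m, G = g)\, P(s \mid m, G = g)

P(s) \;=\; P(m_s, G = 1) \;+\; \sum_{m \in \mathcal{M}} P(m, G = 0)\, P(s \mid m, G = 0)
\qquad \text{(grammatical } s\text{)}
```
</p></div>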
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Definition 1. Grammatical string</head><p>We say that a string s is grammatical if, for some message m &#8712; M, P (s|m, G = 1) = 1. Note that by Assumption 1, for every grammatical string s, P (m|s, G = 1) = 1 for exactly one m.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Definition 2. Ungrammatical string</head><p>We say that a string s is ungrammatical if there is no m for which P (s|m, G = 1) &gt; 0. By Assumption 1, this is equivalent to saying there is no m for which P (s|m, G = 1) = 1. Note that this does not mean that a string s is ungrammatical if for some m, P (s|m, G = 0) &gt; 0, as grammatical strings can be generated by an errorful realization of a message.</p><p>There is no ''meaning'' of an ungrammatical string in the sense that there is a unique message associated with a grammatical string (by Assumption 1). But there is a probability distribution over messages associated with an ungrammatical string, and in some cases it will be useful to specify the ''most likely message'' of an ungrammatical string. The details of how ungrammatical strings relate to messages will depend on an error model, which we discuss in Section 2.2.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.1">Toy Example</head><p>We now walk through a simple example to illustrate the key intuitions and definitions discussed above. Consider the set of 8 strings formed by crossing two possible values of three syntactic elements: ''{The, A} {moon, moons} {emerge, emerges}''. These strings are visualized as nodes on a cube in Figure <ref type="figure">1a</ref>. Each edge between strings represents one edit: in this case, swapping ''The''/''A'', ''moon''/''moons'', or ''emerge''/''emerges''. Here, there are three strings {s 1 , s 2 , s 3 } which can be viewed as the error-free realizations of three messages M = {m 1 , m 2 , m 3 }, respectively:</p><p>(2)</p><p>s 1 = ''The moon emerges.''</p><p>s 2 = ''A moon emerges.''</p><p>s 3 = ''The moons emerge.''</p><p>Therefore, s 1 , s 2 , and s 3 are grammatical. However, note that even these grammatical strings could be generated by errorful processes for certain messages: For example, presumably</p><p>The other 5 strings in the set can each be viewed as errorful realizations of any of the messages m 1 , m 2 , or m 3 . Furthermore, there is no m &#8712; M for which any of these strings is the error-free realization. Therefore, the other 5 strings are ungrammatical.</p><p>Although for a given m these strings are all errorful realizations, they might not all be equally likely. In our framework, string probability is influenced by the likelihood of (potential) underlying messages. Here, we expect P (m 1 ) &gt; P (m 2 ) &gt; P (m 3 ): i.e., it is most probable (at least on Earth) to express a message about a unique moon, somewhat probable to express a message about a single moon, and improbable to express a message about multiple moons. These differences in message probabilities can conflict with grammaticality in practice. 
Figure <ref type="figure">1b</ref> shows surprisals (i.e., negative log probability) assigned by GPT-2 <ref type="bibr">(Radford et al., 2019)</ref> to each of the 8 toy strings from Figure <ref type="figure">1a</ref>. While the 3 grammatical strings have lower surprisals on average than the 5 ungrammatical strings, we also see that the strings which mention a singular moon tend to have lower surprisal. For example, GPT-2 assigns higher surprisal to the grammatical ''The moons emerge'' than the ungrammatical ''*The moon emerge''.</p><p>Another factor that affects the probability of ungrammatical strings is the way they are realized under a reasonable error model. We discuss this in more detail in the following section.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Error Model</head><p>The role of errors in speech comprehension and production has been studied extensively (e.g., <ref type="bibr">Levy, 2008;</ref><ref type="bibr">Goldrick, 2011)</ref>. We adopt a set of minimal working assumptions required for a basic error model. Let s * (m) be the ''grammatical realization'' of m: i.e., the string s such that P (s|m, G = 1) = 1. By Assumption 1, this grammatical realization is unique. Then, P (s|m, G = 0) is concentrated in an ''error neighborhood'' of s * (m) that excludes s * (m). That is, although it violates the language's syntax, an ungrammatical realization of a message m tends to be mostly similar to the grammatical realization of m.</p><p>We then define the ''error distance'' from one string s 1 to another s 2 , conditioned on m, as D(s 1 &#8594; s 2 |m). This distance ranges over non-negative integer values, with D(s 1 &#8594; s 2 |m) = 0 if and only if s 1 = s 2 , and D(s 1 &#8594; s 2 |m) = 1 if the two strings differ by a single error. We assume that the probability of each error step is some small value centered around &#1013;. Thus the number of error steps d is geometrically distributed, P (d) &#8776; (1 &#8722; &#1013;)&#1013;^d , and</p><p>For any specific string s &#8800; s * (m) , this gives us:<ref type="foot">foot_2</ref> </p><p>for s &#8800; s * (m) , where K (which we treat as constant) denotes the number of different possible errors that could be made at any given error step. This distance can be thought of as the number of errors required to change s to s * (m) , if m is the intended message.</p><p>Returning to the notion of ''meaning'' of ungrammatical strings, while ungrammatical strings do not have a unique message in our framework, they can be thought of as corresponding to messages from nearby grammatical strings. 
If we assume that errors are relatively rare and the space of messages is relatively sparse, then in many cases the most likely message of an ungrammatical string will be transparent.</p></div>
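<div xmlns="http://www.tei-c.org/ns/1.0"><p>The geometric error model of this section can be written out explicitly. The following is one form consistent with the definitions above, offered as a sketch rather than as the exact original formulation. It relies on an added assumption (ours, for illustration only) that the K^d possible error sequences of length d are equally likely, and d abbreviates the error distance D(s*(m) &#8594; s | m).</p>
<p>
```latex
P\big(D(s^{*}(m) \rightarrow s \mid m) = d \;\big|\; m, G = 0\big) \;\approx\; (1 - \epsilon)\,\epsilon^{d}

P(s \mid m, G = 0) \;\approx\; (1 - \epsilon)\left(\frac{\epsilon}{K}\right)^{d},
\qquad s \neq s^{*}(m)
```
</p>
<p>Under this reading, each additional error multiplies the probability of an ungrammatical realization by roughly &#1013;/K, which is why strings close to s*(m) dominate the error neighborhood when &#1013; is small.</p></div>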
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.1">Toy Example</head><p>Returning to the toy example from Section 2.1.1 and Figure <ref type="figure">1a</ref>, we can think of each edit between a grammatical string (blue node) and ungrammatical string (red node) as an error. We can see how messages relate to ungrammatical strings. Consider an ungrammatical string such as ''*A moon emerge''. The most likely message of this string is m 2 , corresponding to the message associated with the closest grammatical string, ''A moon emerges''. While ''*A moon emerge'' could in principle be an ungrammatical realization of the other two messages (m 1 and m 3 ), the realization process would involve more errors, making m 1 and m 3 less likely to be the underlying message.</p></div>
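<div xmlns="http://www.tei-c.org/ns/1.0"><p>The reasoning in this toy example can be sketched numerically. The snippet below is a minimal illustration under stated assumptions, not part of the original analysis: the message priors, the values of EPSILON and K, the use of word-level Hamming distance as the error distance, and the simplification that the error rate does not depend on the message are all hypothetical choices made here for illustration.</p>
<p>
```python
# Hypothetical toy numbers: PRIOR, EPSILON and K are illustrative assumptions,
# not values taken from the paper.
GRAMMATICAL = {
    "m1": ("The", "moon", "emerges"),
    "m2": ("A", "moon", "emerges"),
    "m3": ("The", "moons", "emerge"),
}
PRIOR = {"m1": 0.6, "m2": 0.3, "m3": 0.1}  # P(m1) > P(m2) > P(m3)
EPSILON = 0.05  # assumed probability of one more error step
K = 3           # assumed number of possible errors per step (one per word slot)


def error_distance(string, realization):
    """Count word positions that differ (Hamming distance, assumed here)."""
    return sum(1 for a, b in zip(string, realization) if a != b)


def errorful_likelihood(string, message):
    """Sketch of P(s | m, G = 0) under the assumed geometric error model."""
    d = error_distance(string, GRAMMATICAL[message])
    if d == 0:
        return 0.0  # the error neighborhood excludes s*(m) itself
    return (1 - EPSILON) * (EPSILON / K) ** d


def message_posterior(string):
    """Posterior over messages for an errorful string, normalized to sum to 1."""
    scores = {m: PRIOR[m] * errorful_likelihood(string, m) for m in PRIOR}
    total = sum(scores.values())
    return {m: score / total for m, score in scores.items()}


posterior = message_posterior(("A", "moon", "emerge"))
most_likely = max(posterior, key=posterior.get)
print(most_likely, posterior)
```
</p>
<p>With these assumed values, the single-error message m 2 receives most of the posterior mass even though m 1 has the higher prior, because the two-error paths from m 1 and m 3 are penalized by an extra factor of &#1013;/K.</p></div>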
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3">Minimal Pairs</head><p>We now arrive at our definition of minimal pairs.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Definition 3. Meaning-matched pair</head><p>A meaning-matched pair is a pair of strings (s, s &#8242; ) such that (1) s is the grammatical realization of some message m; (2) s &#8242; is ungrammatical; and (3) s &#8242; is a reasonably likely ungrammatical realization of m; i.e.:</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Definition 4. Minimal pair</head><p>A meaning-matched minimal pair, or minimal pair for short, is a meaning-matched pair (s, s &#8242; ) such that D(s &#8594; s &#8242; |m) = 1, where m = arg max M P (M |s).</p><p>In other words, a minimal pair is a meaning-matched pair where there is only one thing ''wrong'' with the ungrammatical string s &#8242; .</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3.1">Toy Example</head><p>Recall our simple example from Section 2.1.1 and Figure <ref type="figure">1a</ref>. Here, the set of meaning-matched pairs is given by every pair of grammatical and ungrammatical strings (i.e., blue and red in Figure <ref type="figure">1a</ref>). In this case, there are 15 such pairs, formed by pairing each of {s 1 , s 2 , s 3 } with each of the five ungrammatical strings. Accordingly, the set of minimal pairs is the subset of meaning-matched pairs which include one of the grammatical strings {s 1 , s 2 , s 3 } and another string which is one edge (i.e., error) away from the grammatical string. In this example, there are 7 such pairs.</p><p>An example meaning-matched pair which is not a minimal pair would be (s 1 = ''The moon emerges'', s &#8242; = ''*A moons emerge''), as m 1 is the message associated with s 1 , and there are multiple errors needed to generate s &#8242; from m 1 .</p></div>
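<div xmlns="http://www.tei-c.org/ns/1.0"><p>The pair counts in this toy example can be checked by enumeration. The short sketch below rebuilds the 8 strings of the cube and counts meaning-matched and minimal pairs; treating one word swap as one error is the assumption already implied by the edges of the cube, and the variable names are ours.</p>
<p>
```python
from itertools import product

DETERMINERS = ("The", "A")
NOUNS = ("moon", "moons")
VERBS = ("emerge", "emerges")

# The three error-free realizations s1, s2, s3 from the toy example.
GRAMMATICAL = {
    ("The", "moon", "emerges"),
    ("A", "moon", "emerges"),
    ("The", "moons", "emerge"),
}

all_strings = list(product(DETERMINERS, NOUNS, VERBS))
ungrammatical = [s for s in all_strings if s not in GRAMMATICAL]


def edits(a, b):
    """Number of word swaps (cube edges) separating two strings."""
    return sum(1 for x, y in zip(a, b) if x != y)


# Every grammatical/ungrammatical pairing is a meaning-matched pair here.
meaning_matched = [(g, u) for g in sorted(GRAMMATICAL) for u in ungrammatical]
# Minimal pairs are the meaning-matched pairs exactly one edit apart.
minimal = [(g, u) for g, u in meaning_matched if edits(g, u) == 1]

print(len(all_strings), len(ungrammatical))  # 8 5
print(len(meaning_matched), len(minimal))    # 15 7
```
</p>
<p>The pair (''The moon emerges'', ''*A moons emerge'') appears in the first list but not the second, since the two strings differ in all three positions.</p></div>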
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Three Predictions</head><p>We now describe three predictions that fall out of our framework, with additional assumptions that we specify along the way. The full derivation for each prediction is given in Section A. After outlining these predictions, the rest of the paper is dedicated to testing them empirically.</p><p>Prediction 1. Correlation between the log-probability of grammatical and ungrammatical strings within minimal pairs.</p><p>If string probability depended only on grammaticality G, then all ungrammatical strings would receive the same (near-)zero probability. In contrast, our framework states that M also plays a role. We predict that the probability of a grammatical string is primarily determined by the probability of its message, and the probability of an ungrammatical string is primarily determined by the probability of the message of the nearest grammatical neighbor string. We therefore expect to see a correlation between the log-probability of the grammatical string and the log-probability of the ungrammatical string across sets of minimal pairs (Prediction 1a).</p><p>However, minimal pairs are a theoretical ideal, and in practice not all researcher-constructed minimal pairs will be truly ''minimal''. When we consider pairs where the most probable messages of the grammatical and ungrammatical strings are less similar (i.e., when the pairs are ''less minimal''), the contribution of M is less controlled. Therefore, we predict a weaker correlation for pairs that are less minimal (Prediction 1b).</p><p>Prediction 2. Correlation between differences in log-probability and human acceptability judgments within minimal pairs.</p><p>Native speaker acceptability judgments vary with both grammatical well-formedness and meaning plausibility <ref type="bibr">(Sch&#252;tze, 2016)</ref>. Using our framework, we operationalize meaning plausibility with log P (m), and grammatical well-formedness with the number of errors. 
Then, if we consider minimal pairs (s, s &#8242; ) where s and s &#8242; share the same understood message, the difference in acceptability judgments depends primarily on the error probability of taking s to s &#8242; , as does the difference in string log-probability. We therefore predict that differences in string probability are correlated with differences in human acceptability judgments, within minimal pairs (Prediction 2a). As the ''minimalness'' of the pair decreases, the contribution of message M increases, and we expect to find weaker correlation between log-probability differences and acceptability judgment differences (Prediction 2b).</p><p>Prediction 3. Potentially poor separation based on probability between grammatical / ungrammatical strings.</p><p>In illustrating the distinction between probability and grammaticality, <ref type="bibr">Chomsky (1957)</ref> wrote: ''If we rank the sequences of a given length in order of statistical approximation to English, we will find both grammatical and ungrammatical sequences scattered throughout the list''. By viewing probability as influenced by grammaticality and meaning, our framework provides a theoretical basis for Chomsky's prediction: That is, string probability does not separate grammatical and ungrammatical strings. While <ref type="bibr">Leivada et al. (2024a,b)</ref> argue that this lack of separation means that it is problematic to use probability to measure grammaticality, we note that this prediction falls directly out of our theoretical framework.</p><p>Importantly, the expected degree of separation depends on how grammatical and ungrammatical strings are pooled together. When the strings come from minimal pairs, we would expect to see better separation on the basis of probability, as the message probabilities are controlled for. 
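When there are no constraints on the relationship between the grammatical/ungrammatical strings, however, the distributions of messages associated with the grammatical and ungrammatical strings can be very different from each other, and string probability might achieve poor separation.</p><p>We note that so far we have only discussed pure string probability, which has been investigated in recent studies <ref type="bibr">(Leivada et al., 2024a,b)</ref>. There are reasons to expect that grammatical/ungrammatical strings would not be separated by pure probability: When there is no upper bound on the length of grammatical strings and at least some ungrammatical string is assigned non-zero probability (as is the case for LMs), then there must be an infinite set of grammatical strings that are assigned lower probability than that ungrammatical string. Indeed, our framework also suggests that replacing pure string probability with a ''normalizing'' function of string probability and other grammar-independent features should bring the distributions of messages associated with grammatical and ungrammatical strings closer together, and thereby increase the separability. This prediction provides a novel justification for previous work which has hypothesized that grammaticality and probability are linked through a complex function <ref type="bibr">(Pauls and Klein, 2012;</ref><ref type="bibr">Lau et al., 2017;</ref><ref type="bibr">Tjuatja et al., 2025)</ref>.</p>
<figure type="table"><head>Table 1</head><figDesc>(a) Datasets and (b) models used in our experiments. ''# items'' = # pairs for each dataset except CoLA, and # sentences for CoLA. ''SCaMP-P/I'' = plausible/implausible subsets of SCaMP.</figDesc>
<table><head>(a) Datasets</head>
<row role="label"><cell>Dataset</cell><cell>Language</cell><cell># items</cell><cell>Reference</cell><cell>Prediction 1</cell><cell>Prediction 2</cell><cell>Prediction 3</cell></row>
<row><cell>BLiMP</cell><cell>English</cell><cell>66993</cell><cell>Warstadt et al. (2020)</cell><cell>&#10003;</cell><cell>&#10007;</cell><cell>&#10003;</cell></row>
<row><cell>SCaMP-P</cell><cell>English</cell><cell>67000</cell><cell>McCoy and Griffiths (2025)</cell><cell>&#10003;</cell><cell>&#10007;</cell><cell>&#10003;</cell></row>
<row><cell>SCaMP-I</cell><cell>English</cell><cell>67000</cell><cell>McCoy and Griffiths (2025)</cell><cell>&#10003;</cell><cell>&#10007;</cell><cell>&#10003;</cell></row>
<row><cell>SyntaxGym</cell><cell>English</cell><cell>1018</cell><cell>Hu et al. (2020)</cell><cell>&#10003;</cell><cell>&#10007;</cell><cell>&#10003;</cell></row>
<row><cell>ZhoBLiMP</cell><cell>Chinese</cell><cell>35400</cell><cell>Liu et al. (2024)</cell><cell>&#10003;</cell><cell>&#10007;</cell><cell>&#10007;</cell></row>
<row><cell>SLING</cell><cell>Chinese</cell><cell>39976</cell><cell>Song et al. (2022)</cell><cell>&#10003;</cell><cell>&#10007;</cell><cell>&#10007;</cell></row>
<row><cell>CoLA</cell><cell>English</cell><cell>8551</cell><cell>Warstadt et al. (2019)</cell><cell>&#10007;</cell><cell>&#10007;</cell><cell>&#10003;</cell></row>
<row><cell>LI</cell><cell>English</cell><cell>1883</cell><cell>Sprouse et al. (2013); Mahowald et al. (2016)</cell><cell>&#10007;</cell><cell>&#10003;</cell><cell>&#10003;</cell></row>
<row><cell>HLL</cell><cell>Chinese</cell><cell>213</cell><cell>Chen et al. (2020)</cell><cell>&#10007;</cell><cell>&#10003;</cell><cell>&#10007;</cell></row>
</table>
<table><head>(b) Models</head>
<row role="label"><cell>Model</cell><cell>HuggingFace ID</cell><cell>Language</cell><cell># params</cell><cell>Vocab size</cell><cell>Training data</cell></row>
<row><cell>GPT-2</cell><cell>gpt2</cell><cell>English</cell><cell>124M</cell><cell>50257</cell><cell>40 GB</cell></row>
<row><cell>Llama-3-70B</cell><cell>meta-llama/Meta-Llama-3-70B</cell><cell>English</cell><cell>70B</cell><cell>128256</cell><cell>15T tokens</cell></row>
<row><cell>GPT-2 ZH</cell><cell>uer/gpt2-chinese-cluecorpussmall</cell><cell>Chinese</cell><cell>102M</cell><cell>21128</cell><cell>100 GB</cell></row>
<row><cell>Llama-3-8B ZH</cell><cell>hfl/llama-3-chinese-8b</cell><cell>Chinese</cell><cell>8B</cell><cell>128256</cell><cell>120 GB</cell></row>
</table></figure>
<p>In the rest of this paper, we report the results of three experiments designed to empirically test the three predictions spelled out above.</p></div>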
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Prediction 1: Correlation Between Grammatical/Ungrammatical Log-Probability Within Minimal Pairs</head><p>We now investigate the first set of predictions made by the theory in Section 2: (a) probabilities of grammatical and ungrammatical strings in minimal pairs should be correlated, and (b) this correlation should be weaker for less ''minimal'' pairs. Our code and data are all publicly available at <ref type="url">https://github.com/jennhu/probability-grammaticality</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Evaluation Materials</head><p>To test Prediction 1, we need a set of paired grammatical and ungrammatical sentences with varying degrees of ''minimalness''. We test our theory on existing data sets from two typologically different languages, English and Mandarin Chinese, as our theoretical framework is language-agnostic and only makes basic assumptions about the generative process of corpus strings. We evaluate models on five datasets, summarized in Table <ref type="table">1a</ref>: BLiMP <ref type="bibr">(Warstadt et al., 2020)</ref>, SCaMP (McCoy and Griffiths, 2025), and SyntaxGym<ref type="foot">foot_4</ref>  <ref type="bibr">(Hu et al., 2020)</ref> in English, and ZhoBLiMP <ref type="bibr">(Liu et al., 2024)</ref> and SLING <ref type="bibr">(Song et al., 2022)</ref> in Chinese. Each dataset is proposed to contain ''minimal pairs'', although the pairs may diverge to varying degrees from the theoretical ideal defined in Definition 4.</p><p>One attractive feature of these datasets is that they collectively vary in terms of semantic plausibility. For example, SyntaxGym was manually designed to avoid implausible sentences; BLiMP, because of its templatic generation, includes a mix of plausible and implausible; and SCaMP includes plausible and implausible subsets. This allows us to test our framework on a space of messages M that is not restricted to probable or commonplace ones: If a pair of sentences shares the same underlying message, then probability can reveal information about grammaticality, regardless of how probable the message itself is a priori.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Measuring Minimalness</head><p>To empirically evaluate Prediction 1b, we need a way of quantifying the ''minimalness'' of a putative minimal pair. According to our framework, a natural way to measure the minimalness of a pair (s, s &#8242; ) would be to measure the similarity between the message of s (i.e., m = arg max M P (M |s)) and the message of the closest grammatical string to s &#8242; . That is: If the ungrammatical string s &#8242; were in a ''true'' minimal pair (according to Definition 4) with another grammatical string s * , how similar in meaning is s to s * ?</p><p>In practice, this quantity is difficult to estimate systematically, as it involves specifying the closest grammatical string to each ungrammatical string in a dataset of minimal pairs. We approximated this quantity by measuring the similarity between the message of s and the ''message'' of s &#8242; , taking a usage-based approach to meaning. Namely, we adopted the assumption that sentences that convey similar messages will be closer in a high-dimensional embedding space learned for meaning-based tasks. To quantify the minimalness of a pair, we therefore measured the cosine distance between the embeddings of the grammatical and ungrammatical sentences in the pair, with smaller distances indicating more minimal pairs. <ref type="foot">4</ref> In order to compute correlations at different levels of minimalness (as required to test Predictions 1b and 2b), we grouped pairs into 10 equally-sized bins based on the within-pair cosine distance.</p></div>
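<div xmlns="http://www.tei-c.org/ns/1.0"><p>The minimalness measure and binning step described above can be sketched as follows. This is a minimal illustration, not the released analysis code: the sentence embeddings are assumed to be given as plain lists of floats, the function names are ours, and rank-based assignment is one simple way (assumed here) to obtain equally-sized bins.</p>
<p>
```python
import math


def cosine_distance(u, v):
    """One minus the cosine similarity of two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def equal_sized_bins(distances, n_bins=10):
    """Assign each pair a bin index in 0..n_bins-1 by the rank of its distance.

    Bin 0 holds the most minimal pairs (smallest within-pair distance).
    """
    order = sorted(range(len(distances)), key=lambda i: distances[i])
    bins = [0] * len(distances)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(distances)
    return bins


# Hypothetical 2-d embeddings for the two sentences of four putative pairs.
pairs = [
    ([1.0, 0.0], [1.0, 0.1]),
    ([1.0, 0.0], [0.5, 0.5]),
    ([0.0, 1.0], [0.0, 1.0]),
    ([1.0, 1.0], [1.0, 0.0]),
]
distances = [cosine_distance(g, u) for g, u in pairs]
print(equal_sized_bins(distances, n_bins=2))
```
</p>
<p>Binning by rank rather than by fixed distance thresholds keeps the number of pairs per bin equal, so that correlations computed within each bin rest on comparable sample sizes.</p></div>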
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Computing String Probability</head><p>We compute the probability of a sentence by aggregating the probability of each token conditioned on all previous tokens using autoregressive language models. In practice, we do this by computing the log probability of each token conditioned on its left context, and summing these values to get the log probability of the full sentence.</p><p>The predictions of our framework apply to any probabilistic model that has learned a reasonably accurate estimate of the distribution P(S). We felt this to be a reasonable assumption for moderately sized Transformer models trained on Internet-scale text. Although an interesting avenue for further research, directly testing the influence of specific factors such as model size was not the key motivation of our experiments, so we simply chose two open-source base (i.e., not fine-tuned) models for each language, covering multiple sizes and model families (see Table <ref type="table">1b</ref> for details). We evaluated GPT-2 <ref type="bibr">(Radford et al., 2019)</ref> and Llama-3-70B (AI@Meta, 2024) on the English datasets. For the Chinese datasets, we evaluated a GPT-2 model trained on CLUECorpusSmall <ref type="bibr">(Xu et al., 2020)</ref>, which we refer to as GPT-2 ZH <ref type="bibr">(Zhao et al., 2019)</ref>, and Llama-3-8B ZH <ref type="bibr">(Cui et al., 2023)</ref>.</p></div>
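<div xmlns="http://www.tei-c.org/ns/1.0"><p>The summation described above is the standard chain-rule decomposition. Written out as a sketch, and borrowing the symbols of Section 6.1 as an assumption (p_&#952; for the LM, w_n for the n-th token, s_&lt;n for its left context, N for the number of tokens), it reads:</p>
<p><![CDATA[
```latex
\log p_\theta(s) \;=\; \sum_{n=1}^{N} \log p_\theta\!\left(w_n \mid s_{<n}\right)
```
]]></p></div>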
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4">Results</head><p>The results (shown in Figure <ref type="figure">2</ref>) largely confirm our first set of predictions. As suggested by Prediction 1a, Figure <ref type="figure">2a</ref> shows a strong positive correlation between the log probability of the ungrammatical and grammatical sentences in each pair, across all datasets and models. Furthermore, Figure <ref type="figure">2b</ref> shows that the Pearson r correlation between grammatical and ungrammatical log probability decreases as the pairs become less controlled for meaning (i.e., as the cosine distance increases), as suggested by Prediction 1b. These patterns hold for multiple models, datasets, and languages.</p></div>
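<div xmlns="http://www.tei-c.org/ns/1.0"><p>To make the per-bin analysis concrete, the following is a minimal sketch of how a Pearson correlation between grammatical and ungrammatical log probabilities could be computed within each cosine-distance bin. The function names and the input layout (three parallel lists) are hypothetical and chosen only for illustration.</p>
<p><![CDATA[
```python
import math
from collections import defaultdict


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def pearson_by_bin(gram_logprobs, ungram_logprobs, bins):
    """Return {bin index: Pearson r} over the pairs that fall in each bin."""
    grouped = defaultdict(lambda: ([], []))
    for g, u, b in zip(gram_logprobs, ungram_logprobs, bins):
        grouped[b][0].append(g)
        grouped[b][1].append(u)
    return {b: pearson_r(gs, us) for b, (gs, us) in sorted(grouped.items())}
```
]]></p>
<p>Under Prediction 1b, the values returned for higher-distance bins would be expected to be lower than those for the most minimal bins.</p></div>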
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Prediction 2: Probability Differences Align with Acceptability Differences</head><p>We now investigate the second set of predictions: (a) human acceptability judgment differences and string log-probability differences should be correlated within minimal pairs, and (b) this correlation should be weaker for pairs that are less minimal.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">Evaluation Materials</head><p>To test Prediction 2, we need human acceptability judgments of each sentence in isolation. Our evaluation datasets are summarized in Table <ref type="table">1a</ref>.</p><p>We only used existing data and did not collect any new human judgments. The English dataset (LI) includes pairs from <ref type="bibr">Sprouse et al. (2013)</ref> and <ref type="bibr">Mahowald et al. (2016)</ref>. Both studies randomly sampled paired sentences (grammatical vs. ungrammatical/questionable) from Linguistic Inquiry journal papers and collected acceptability judgments for each sentence from native English speakers. The Chinese dataset (HLL) includes pairs from <ref type="bibr">Chen et al. (2020)</ref>, where the authors collected acceptability judgments for each sentence within related groups (pairs, triples, or n-tuples) from a Chinese syntax textbook <ref type="bibr">(Huang et al., 2009)</ref>.<note place="foot" n="5">For the groups with more than two sentences, we created pairs by juxtaposing all possible grammatical vs. ungrammatical/questionable pairings. That is, for a given group with grammatical sentences {g_1, ..., g_n} and ungrammatical sentences {u_1, ..., u_m}, we create all the possible pairs (g_i, u_j) for i &#8712; [1, n] and j &#8712; [1, m]; usually there are 2-4 sentences in a group.</note></p><p>Each of the three data sources we use originally measured acceptability judgments on a 7-point Likert scale for each sentence in isolation. We z-scored (centered and scaled) all Likert score ratings within participants and then calculated the mean z-score for each sentence. For each pair, the difference in mean z-scores between the grammatical and ungrammatical sentences indicates the human acceptability judgment difference.</p><p>Filtering for Meaning-matched Pairs. While the grammatical and ungrammatical sentences in both datasets are presented by their original authors as ''pairs'', they vary widely in terms of how similar they are to each other. For example, the LI dataset contains the pair formed by grammatical sentence ''The apples fell just a short fall to the lower deck, and so were not too badly bruised'' and ungrammatical sentence ''*The submarine emerged an abrupt emergence''. This sort of pair fails to meet our definition of a ''meaning-matched'' pair (Definition 3): any reasonable value of &#948; would exclude this pair, as the ungrammatical sentence is extremely unlikely to be an errorful realization of the message of the grammatical sentence. These pairs are also potentially problematic for our analyses. We assume that messages do not vary dramatically with grammaticality (Assumption 3; Section A). However, when grammatical/ungrammatical sentences are paired in the way that LI and HLL were curated, there could be systematic differences in how messages are distributed across different values of grammaticality. That is, for radically non-meaning-matched pairs, the ungrammatical variant could map onto a systematically higher-probability or lower-probability message than the grammatical one. This is less of a concern when the pairs are tightly matched for meaning.</p><p>We therefore only kept sentence pairs which are reasonably ''meaning-matched''. We defined an empirical threshold that guards against pairs like the one above, but still allows for enough variability in minimalness to test our predictions. In practice, we did this by only keeping pairs where the Levenshtein edit distance between the strings was below the 75th quantile across all pairs.</p></div>
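<div xmlns="http://www.tei-c.org/ns/1.0"><p>The preprocessing steps in this section can be sketched as follows. This is an illustrative reading, not the authors' implementation: the function names are hypothetical, the edit distance is computed over characters (the text does not say whether characters or tokens were used), the population standard deviation is used for z-scoring, and the quantile is taken with Python's ''inclusive'' method. Participants whose ratings are all identical would need separate handling, since their standard deviation is zero.</p>
<p><![CDATA[
```python
import statistics
from collections import defaultdict


def mean_z_scores(ratings):
    """ratings: iterable of (participant, sentence_id, likert_score) tuples.

    Z-scores each rating within its participant, then averages per sentence.
    """
    ratings = list(ratings)
    by_participant = defaultdict(list)
    for participant, _, score in ratings:
        by_participant[participant].append(score)
    stats = {p: (statistics.mean(v), statistics.pstdev(v))
             for p, v in by_participant.items()}
    z_by_sentence = defaultdict(list)
    for participant, sentence_id, score in ratings:
        mean, sd = stats[participant]
        z_by_sentence[sentence_id].append((score - mean) / sd)
    return {s: statistics.mean(z) for s, z in z_by_sentence.items()}


def levenshtein(a, b):
    """Character-level edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def filter_pairs(pairs, quantile=0.75):
    """Keep (grammatical, ungrammatical) pairs whose edit distance is
    strictly below the given quantile of all pairwise distances."""
    distances = [levenshtein(g, u) for g, u in pairs]
    cut_points = statistics.quantiles(distances, n=100, method="inclusive")
    cutoff = cut_points[int(quantile * 100) - 1]
    return [pair for pair, d in zip(pairs, distances) if d < cutoff]
```
]]></p>
<p>The human acceptability difference for a pair is then the mean z-score of its grammatical sentence minus that of its ungrammatical sentence.</p></div>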
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">Results</head><p>The results for these analyses are displayed in Figure <ref type="figure">3</ref>. Figure <ref type="figure">3a</ref> shows that human Likert score differences are correlated with log probability differences (Prediction 2). We also note that similar results have been reported for other languages <ref type="bibr">(Suijkerbuijk et al., 2025)</ref>.</p><p>Figure <ref type="figure">3b</ref> shows that the degree of alignment (Pearson r correlation coefficient) decreases as cosine distance increases for the English LI dataset, as predicted by Prediction 2b. However, for the Chinese HLL dataset, we do not find clear evidence for Prediction 2b. There could be several reasons for the differences between the LI and HLL results. First, we note that LI has a substantially wider spread of cosine distances between the 70th and 95th quantiles, which is where the clearest drop in correlations is seen. It could also be the case that humans' acceptability judgments might reflect slightly different factors in Chinese versus English. Another potential cause might be practical issues with the models trained on Chinese data: e.g., a lower-quality sentence embedding model might not faithfully represent the similarity in underlying message between sentences.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Prediction 3: Poor Separation of Grammatical/Ungrammatical Strings</head><p>We now test our final prediction: potentially poor separation between the probability of grammatical and ungrammatical strings. As discussed in Section 3 and Section A.3, our framework predicts that the limiting factor for such separation is the variance of messages associated with grammatical and ungrammatical string sets. Therefore, in addition to examining raw string probability, in this section we introduce several normalizing transformations that reduce variance in messages and should thereby also increase separability. Specifically, we propose a novel scoring function that represents the Bayes factor between different generative processes of observed strings. We also give a novel derivation for the Syntactic Log Odds Ratio (SLOR; <ref type="bibr">Pauls and Klein, 2012;</ref><ref type="bibr">Lau et al., 2017)</ref>, as equivalent to the average Pointwise Mutual Information between a word and its preceding context. We find that neither raw string probability nor any normalized string probabilities result in good separation of grammatical/ungrammatical strings, in line with our prediction.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.1">Transformations of Probability</head><p>In addition to the notation introduced in Section 2.1, we define a language model p_&#952;(s) : &#931;* &#8594; R as a function from strings to probabilities.</p><p>Here, we use w to refer to tokens instead of words.</p><p>In practice, the LMs we work with are autoregressive, assigning probabilities to tokens given their preceding context. That is, they are functions of the type s^N &#8594; R^N, mapping strings of length N to N-dimensional vectors of probabilities. We consider linking functions f : R^N &#8594; R that map these vectors of token probabilities p_&#952;(w_n | s_&lt;n) to scores. Below, we enumerate several candidates for this function. For brevity, we will write f(s) instead of f(p_&#952;(s)).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Metric 1. Probability</head><p>A natural starting point is to test whether raw probability p_&#952;(s) can separate grammatical and ungrammatical strings in a language. Equivalently, we consider the log of this joint probability, or the sum of the log probabilities assigned to each word.</p><p>The ability of this metric to separate grammatical and ungrammatical sentences has been explicitly investigated <ref type="bibr">(Lau et al., 2017;</ref><ref type="bibr">Leivada et al., 2024a)</ref>, and is also the implicit standard for minimal pair comparisons (e.g., <ref type="bibr">Warstadt et al., 2020)</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Metric 2. Bayes Factor: Uniform Distribution</head><p>When determining whether a sentence is grammatical, comprehenders may consider the data-generating process that was likely to produce the string. One rational approach would be to evaluate the competing evidence for two hypotheses: that the string was produced by the grammar, and that the string was produced by a non-grammatical generative process. We instantiate this intuition by considering the Bayes Factor, or the ratio of the likelihoods, of the sentence under two hypotheses. Let H_G denote the hypothesis that the grammar is the generating process, which we estimate with an LM's distribution p_&#952;. And let H_uniform denote the hypothesis that the generating process is simply a uniform distribution over the vocabulary &#931;. Given an observed string s of N tokens, we can define the log Bayes factor between H_G and H_uniform as:</p><formula notation="TeX">\log \frac{p_\theta(s)}{\prod_{n=1}^{N} 1/|\Sigma|} = \log p_\theta(s) + N \log |\Sigma|</formula><p>Next, we consider the average statistical association between a word and its context. To instantiate this hypothesis, we use the pointwise mutual information (PMI) between a word and its preceding context. The PMI between realizations of two random variables is the log ratio of their joint probability to the product of their marginal probabilities (i.e., their joint probability under independence). Using average PMI, we derive the following transformation:</p><formula notation="TeX">\frac{1}{N} \sum_{n=1}^{N} \log \frac{p_\theta(w_n \mid s_{&lt;n})}{p(w_n)}</formula><p>where p(w_n) is the unigram (i.e., frequency) estimate of word w_n. This metric is equivalent to the Syntactic Log-Odds Ratio (SLOR) proposed by <ref type="bibr">Pauls and Klein (2012)</ref>, which has also been investigated by <ref type="bibr">Lau et al. (2017)</ref>, although the connection to PMI has not been previously established.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Metric 5. Mean Probability</head><p>As a simple variation of Equation ( <ref type="formula">5</ref>) that controls for length, we consider mean log probability.</p></div>
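<div xmlns="http://www.tei-c.org/ns/1.0"><p>As a rough sketch of how the linking functions described in this section might be computed from per-token log probabilities, consider the code below. It rests on stated assumptions rather than the authors' code: the uniform-distribution Bayes factor is taken to compare the LM's string probability against a probability of 1/|&#931;| per token, SLOR is taken as the mean difference between conditional and unigram log probabilities, and all function and parameter names are hypothetical.</p>
<p><![CDATA[
```python
import math


def log_probability(token_logprobs):
    """Sum of per-token conditional log probabilities (log p(s))."""
    return sum(token_logprobs)


def log_bayes_factor_uniform(token_logprobs, vocab_size):
    """log p(s) minus the log probability of s under a uniform token model.

    Assumption: the uniform hypothesis assigns 1 / vocab_size to every token.
    """
    n = len(token_logprobs)
    return sum(token_logprobs) + n * math.log(vocab_size)


def slor(token_logprobs, unigram_logprobs):
    """Mean of (conditional log prob - unigram log prob), i.e. average PMI."""
    n = len(token_logprobs)
    return sum(lp - up for lp, up in zip(token_logprobs, unigram_logprobs)) / n


def mean_log_probability(token_logprobs):
    """Length-normalized log probability."""
    return sum(token_logprobs) / len(token_logprobs)
```
]]></p>
<p>Each function maps one string's token-level scores to a single number, so any of them can be passed to a separability analysis unchanged.</p></div>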
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.2">Evaluation Materials</head><p>To test Prediction 3, we no longer need paired sentences (as was the case for Predictions 1 and 2), but instead simply need large sets of grammatical and ungrammatical sentences from which to compute the relevant metrics. We evaluate models on five English datasets that contrast ungrammatical and grammatical sentences: the three minimal-pair datasets used to test Prediction 1 (BLiMP, SCaMP, and SyntaxGym), the LI dataset used to test Prediction 2, and CoLA <ref type="bibr">(Warstadt et al., 2019)</ref>. <ref type="foot">6</ref> See Table <ref type="table">1a</ref> for a summary.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.3">Computing Separation</head><p>To quantify the degree of separation for a given linking function f , we first pool all grammatical and ungrammatical strings from the dataset into one flat set. We then compute all scores f (s) for each string s in this set, and compute a receiver operating characteristic (ROC) curve for these scores, treating grammatical strings as class 1 and ungrammatical as class 0. We use the area under the ROC curve (AUC) as our measure of separability, where AUC = 0.5 indicates no separability, and AUC = 1 indicates perfect separability. We evaluate the same two LMs used to evaluate the English datasets for Predictions 1 and 2: GPT-2 and Llama-3-70B (see Table <ref type="table">1b</ref>). To obtain token frequency measurements for SLOR, we sought to estimate the distribution of tokens in each model's training data.<ref type="foot">foot_7</ref> Since we do not have access to this data, we used the HuggingFace FineWeb dataset <ref type="bibr">(Penedo et al., 2024)</ref> as a representative sample of a high-quality Internet text corpus. We used each model to tokenize 1 million items from the sample-10BT sample of FineWeb, and then used this sample to estimate the token frequency distribution for each model.</p></div>
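<div xmlns="http://www.tei-c.org/ns/1.0"><p>The separability measure and the token-frequency estimate can be sketched as below. This is an illustrative implementation under assumptions, not the authors' pipeline: the AUC is computed through its equivalence with the probability that a randomly chosen grammatical string scores higher than a randomly chosen ungrammatical one (ties counted as one half), using a simple quadratic-time loop, and the unigram estimate is an unsmoothed relative frequency over already-tokenized documents. Function names are hypothetical.</p>
<p><![CDATA[
```python
import math
from collections import Counter


def roc_auc(grammatical_scores, ungrammatical_scores):
    """Area under the ROC curve with grammatical strings as the positive class.

    Equals P(score of a grammatical string > score of an ungrammatical one),
    with ties contributing 0.5.
    """
    wins = 0.0
    for g in grammatical_scores:
        for u in ungrammatical_scores:
            if g > u:
                wins += 1.0
            elif g == u:
                wins += 0.5
    return wins / (len(grammatical_scores) * len(ungrammatical_scores))


def unigram_log_probs(tokenized_docs):
    """Relative-frequency log probability of each token in a corpus sample.

    tokenized_docs: iterable of token sequences from a model's tokenizer.
    Tokens absent from the sample get no entry (no smoothing is applied).
    """
    counts = Counter(tok for doc in tokenized_docs for tok in doc)
    total = sum(counts.values())
    return {tok: math.log(c / total) for tok, c in counts.items()}
```
]]></p>
<p>For datasets of the size used here, a rank-based or library implementation of the AUC would be faster; the loop above is kept only for clarity.</p></div>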
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.4">Results</head><p>Figure <ref type="figure">4a</ref> shows the scores assigned by both models to grammatical and ungrammatical sentences, collapsed across datasets. There appears to be substantial overlap between both sentence groups, across all metrics. This is confirmed by the ROC curves, shown in Section B / Figure <ref type="figure">6</ref>, as well as the corresponding AUC scores (which do not exceed 0.75), shown in Figure <ref type="figure">4b</ref>. Overall, mean logprob and SLOR consistently outperform the other metrics, suggesting that length normalization helps improve separability somewhat.</p><p>We also note that the models achieve high accuracy (&#8764;80%) on standard minimal pair comparisons (Section B, Figure <ref type="figure">8</ref>). So, by a minimal pair standard, models are sensitive to the relevant grammatical manipulations, but this is not reflected in (simple transformations of) string probability, as predicted by our theoretical proposal.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7">Discussion</head><p>It is uncontroversial that string probability is not the same as grammaticality. But that does not mean that string probability cannot reveal information about a probabilistic model's underlying grammatical knowledge. Here, we argued that these probabilities are determined by a combination of the probability of the message and the grammaticality of the string. Our theoretical framework shows that some prior critiques of minimal-pair analysis, e.g., that probability does not robustly separate grammatical and ungrammatical strings <ref type="bibr">(Leivada et al., 2024a,b)</ref>, fall out of simple assumptions about the generative process underlying linguistic corpus data.</p><p>An offshoot of our analysis is that studies of grammaticality in LMs that use minimal pairs that are not tightly controlled (e.g., V&#225;zquez Mart&#237;nez et al., 2023) risk underestimating the grammatical competence of models by failing to control for the influence of M. In other words, a model could appear unable to differentiate between grammatical and ungrammatical strings if the messages across grammatical and ungrammatical strings are not well controlled. As we have argued here, this does not necessarily imply that the model has not learned generalizations about grammatical rules. If we are interested in isolating the model's sensitivity to grammaticality, we have to use carefully designed procedures to factor out M.</p><p>One such procedure is minimal pair string comparisons, where the members of the minimal pair are closely matched in M but are hypothesized to differ in G. While controlled minimal pair comparisons are hardly new in NLP (e.g., <ref type="bibr">Marvin and Linzen, 2018;</ref><ref type="bibr">Futrell et al., 2019)</ref>, we have provided new theoretical grounding for these practices. 
In addition, our work lays the foundation for using computational techniques to isolate the effects of M and G, which has been explored in recent work <ref type="bibr">(Sta&#324;czak et al., 2024)</ref>.</p><p>Our analyses also raise new questions regarding LMs' grammatical knowledge. The poor separation achieved by state-of-the-art LMs in Section 6 feels counterintuitive: If LMs virtually always produce grammatical strings (under standard sampling procedures), then why is there so much overlap between the probabilities assigned to grammatical and ungrammatical strings? This tension between discriminative failures and generative abilities could be seen as a specific realization of the ''generative AI paradox'' <ref type="bibr">(West et al., 2024)</ref>, and also connects to recent work demonstrating that language identification is impossible except in highly constrained cases <ref type="bibr">(Gold, 1967;</ref><ref type="bibr">Angluin, 1980)</ref>, whereas language generation is possible for any countable list of languages (Kleinberg and Mullainathan, 2024).</p><p><ref type="bibr">Leivada et al. (2024b)</ref> argue that comparing isolated acceptability judgments in humans against minimal-pair probability differences in models is not comparing ''apples with apples'', and thus unfair. We hope that our theoretical model can bring greater clarity to the issue of what makes a fair comparison. When comparing the (cognitive) abilities of two groups (humans and models, younger and older children, or even two different animal species), we maintain that the researcher must design assessments with that group's (cognitive) computational architecture in mind. Applying the same evaluation method, which might impose auxiliary challenges on one group but not the other, can artificially increase apparent intergroup differences <ref type="bibr">(Firestone, 2020;</ref><ref type="bibr">Lampinen, 2024;</ref><ref type="bibr">Hu and Frank, 2024)</ref>. 
For example: The intelligence of a squirrel should not be judged based on its ability to solve a Rubik's cube.<ref type="foot">8</ref> Similarly, using metalinguistic judgments or isolated string probability as a window into grammaticality ignores the reality of what LMs are, and what they are trained to do, namely, to maximize the probability of strings in a corpus. More broadly, linguistic theory continues to suggest new ways to evaluate LMs, just as modern LMs provide new tools for studying the relationship between probability and grammaticality.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A Derivation of Predictions</head><p>Below we provide full derivations of each prediction from our framework. We first lay out several assumptions we make use of in our derivations. We believe these assumptions are generally plausible; moreover, cases where they do not apply might be informative for understanding LM behavior.</p><p>Assumption 2. Most realizations of intended messages are grammatical.</p><p>Formally, P(G = 0 | m) is relatively small.</p><p>Assumption 3. Probability of grammatical realization does not vary dramatically with intended message.</p><p>Formally, we will require that the covariance of log P(G = 0 | m) and log P(m) is smaller than the variance of log P(m).</p><p>Taking Assumptions 2 and 3 together, we will write P(G = 0 | m) &#8776; &#1013;, and thus P(G = 1 | m) &#8776; (1 &#8722; &#1013;), for some &#1013; &gt; 0. This leaves implicit that &#1013; potentially varies with m,<ref type="foot">foot_8</ref> but this variability is relatively small, compared to variability in the overall probability of m. We will additionally assume that &#1013; &#8810; (1 &#8722; &#1013;).</p><p>Assumption 4. The strings of interest are not in a dense ''error neighborhood'' of strings that grammatically encode messages of considerably higher probability than the strings' preferred messages.</p><p>For any string of interest s, let M_d denote the set of messages {m_i : D(s*(m_i) &#8594; s | m_i) = d}, let m* = arg max_m P(m | s), and let P(M_d) = &#8721;_{m &#8712; M_d} P(m). Formally, we will require that</p><p>Assumption 5. Regions distant in ''error space'' are not dramatically higher in message probability relative to their error distance.</p><p>Formally, we require that P (M d ) K d does not grow exponentially fast with rate 1/&#1013;.</p><p>Assumption 6. For minimal pairs, intended message probabilities for grammatical strings of interest do not vary dramatically in how characteristic they are of message probabilities in the immediate error neighborhood of the ungrammatical string in the minimal pair.</p><p>Formally, for a minimal pair (s, s&#8242;) with intended message m* = arg max_m P(m | s), P (M 1 ) P (m * ) can be treated as constant.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="A.1">Prediction 1</head><p>Prediction 1. Correlation between the log-probability of grammatical and ungrammatical strings within a minimal pair after controlling for meaning.</p><p>Let (s, s&#8242;) be a minimal pair, where m* = arg max_m P(m | s). The probability of a string is given by Equation (1), reproduced below:</p><formula notation="TeX">P(s) = \sum_{m \in M,\; g \in \{0,1\}} P(s \mid m, g)\, P(g \mid m)\, P(m) \qquad (10)</formula></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_0"><p>Downloaded from http://direct.mit.edu/tacl/article-pdf/doi/10.1162/TACL.a.611/2575283/tacl.a.611.pdf by guest on 11 March 2026</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_1"><p>In this case, the intended message could be ''She went to the store herself'' or ''He went to the store himself''. We discuss intended messages in more detail in Section 4.2.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_2"><p>Equation</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_3"><p>ignores the possibility that multiple error sequences from m might lead to the same s, which would increase P (s|m, G = 0) in a way that is ultimately canceled out in the derivations presented in our Appendix.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_4"><p>We use the subset of SyntaxGym compatible with sentence scoring.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_5"><p>We used the sentence-transformers model all-mpnet-base-v2 to embed English sentences and uer/sbert-base-chinese-nli for Chinese.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_6"><p>Here, we focus on English due to the computational demands of token-frequency estimation for SLOR.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="7" xml:id="foot_7"><p>Our SLOR metric is computed over tokens, not words.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="9" xml:id="foot_8"><p>For example, a message that would be realized by a triply-center-embedded sentence might be more likely to involve errorful realizations.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_9"><p>Figure 5: Score distributions for grammatical and ungrammatical sentences from each English dataset.</p></note>
		</body>
		</text>
</TEI>
