<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Discourse on measurement</title></titleStmt>
			<publicationStmt>
				<publisher>National Academy of Sciences</publisher>
				<date>02/04/2025</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10632542</idno>
					<idno type="doi">10.1073/pnas.2401229121</idno>
					<title level='j'>Proceedings of the National Academy of Sciences</title>
<idno type="issn">0027-8424</idno>
<biblScope unit="volume">122</biblScope>
<biblScope unit="issue">5</biblScope>					

					<author>Arthur Paul Pedersen</author><author>David Kellen</author><author>Conor Mayo-Wilson</author><author>Clintin P Davis-Stober</author><author>John C Dunn</author><author>M Ali Khan</author><author>Maxwell B Stinchcombe</author><author>Michael L Kalish</author><author>Katya Tentori</author><author>Julia Haaf</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[<p>Measurement literacy is required for strong scientific reasoning, effective experimental design, conceptual and empirical validation of measurement quantities, and the intelligible interpretation of error in theory construction. This discourse examines how issues in measurement are posed and resolved and addresses potential misunderstandings. Examples drawn from across the sciences are used to show that measurement literacy promotes the goals of scientific discourse and provides the necessary foundation for carving out perspectives and carrying out interventions in science.</p>]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>its goals, requirements, and problems. Among the most important of its problems is the problem of its justification, which we turn to first. Measurement literacy gives the wayfaring scientist the foundation to establish trade routes through uncharted seas, connecting landmarks with new lands. It also provides the scientist with means necessary for understanding and reasoning about known and new seaways and ports, and chronicling it all in a way that can be acted upon and believed. While so too is literacy in measurement crucial to charting passage through the high seas, seafaring is a dangerous business; currents change and landmarks sink. Risk and error abound at sea.</p><p>Many important questions about measurement are omitted from consideration in this discourse. For example, problems for the development and application of measurement methods and techniques are not covered here; problems for the design and deployment of instruments and tools of metrology are also beyond the scope of this discourse. Similarly, no attempt is made to address important but technical questions about the relationship between measurement and, say, statistical inference or causal discovery algorithms.</p><p>This discourse also omits a mathematical treatment of measurement. The study of measurement is already fortunate enough to possess an impressive library of technical treatises on the subject (see, e.g., refs. <ref type="bibr">[14]</ref><ref type="bibr">[15]</ref><ref type="bibr">[16]</ref><ref type="bibr">[17]</ref><ref type="bibr">[18]</ref><ref type="bibr">[19]</ref><ref type="bibr">[20]</ref><ref type="bibr">[21]</ref><ref type="bibr">[22]</ref>. No one wants another one. While formal methods play an important part in the study of measurement, they do not define its problems.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Measurement, In Brief</head><p>To measure something is, in one way or another, to represent it. What is unfamiliar and perhaps unwieldy is represented by what is familiar and convenient. Historically, the real number system has spoken to the desire for familiarity and convenience in carrying out the business of science. It equips scientists with a powerful medium for transforming and communicating information (cf. <ref type="bibr">23, p. 60; 15, p. 50)</ref>.</p><p>Consider a set of rigid rods. Cursory inspection is sufficient to determine that some of the rods are longer than others when placed side-by-side. Associate to each rod a real number representing its length. What is obtained is a measurement scale.</p><p>The scale is hardly unique. A rod can be measured in inches, yards, or miles-or even, say, centimeters, meters, or kilometers. But not any assignment of numbers will do when it is length that is being measured. The scale for measuring length is but one from among a family of scales for the rigid rods related to each other by the type of requirements that length imposes on its representation.</p><p>What is unique-and what historically has been the subject of intense systematic study-is a measure's scale type. It is the common denominator, or defining property, among all representing measurement scales. For attributing length of the rigid rods, the common denominator requires ratios between every pair of rods to be invariant across all representing scales-and so the type of scale for measuring length is called a ratio scale. Each scale can be obtained from any other by a positive linear transformation-and so 2.57 centimeters rings up at 1 inch, 12 inches at 1 foot, and so forth. Thus, up to multiplication by a positive real-valued constant, the scale for measuring the length of rigid rods is unique. 
Put concisely, the measurement scale is unique up to choice of unit of measurement.</p><p>When it is not length being measured, but some other attribute, the requirements that the attribute's measurement imposes on its representation might change. Any scale obtained by measuring the attribute would therefore be subject to the requirements of a distinct scale type. Inspect the rods once more. Plain to the touch is that some rods feel warmer than others. Placing the rods in rank order of warmth forms an ordinal scale. Its common denominator is uniqueness up to any transformation preserving the relative ordering of rods ranked by warmth.</p><p>When the attribute of interest is, say, the manufacture date of the rods, then any assignment of numbers for measuring dates of the rigid rods forms an interval scale. Uniqueness up to multiplication by a positive scalar and addition of a real-valued constant is the common denominator of any scale measuring the dates of the rods (fancy talk: up to positive affine transformation). And so on. Table <ref type="table">1</ref> summarizes the traditional classification of scale types credited to Stevens <ref type="bibr">(24)</ref> and subsequently developed extensively over the second half of the 20th century.</p><p>The scale type of an attribute like length is determined by abstract requirements that the attribute's measurement imposes on its representation. But what grounds can be given for ascribing a scale type to an attribute in the first place? What endows length with a ratio scale? The question of a scale type's justification is a burning question for the working scientist. We enter into thorny territory. Clear thinking will clear the way.</p></div>
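The uniqueness claims above can be checked computationally. The following Python sketch (with illustrative rod lengths chosen here, not taken from the discourse) verifies that length ratios survive a change of unit, the admissible transformation for a ratio scale, but not an affine change of the kind an interval scale would permit:

```python
# Hypothetical rod lengths in centimeters (illustrative values only).
lengths_cm = [2.54, 30.48, 91.44]

def to_inches(x):
    # Admissible for a ratio scale: x -> a*x with a > 0 (here a = 1/2.54).
    return x / 2.54

def affine(x):
    # Admissible for an interval scale, but NOT for a ratio scale,
    # because it moves the zero point: x -> a*x + b.
    return 1.8 * x + 32.0

# Ratios between rods are invariant under the change of unit...
r_cm = lengths_cm[1] / lengths_cm[0]
r_in = to_inches(lengths_cm[1]) / to_inches(lengths_cm[0])
assert abs(r_cm - r_in) < 1e-9

# ...but not under the affine map, so a ratio statement would change
# truth value across admissible interval scales: it fails to be meaningful.
r_affine = affine(lengths_cm[1]) / affine(lengths_cm[0])
print(r_cm, r_in, r_affine)
```

The same invariance test, with the transformation family swapped out, distinguishes each scale type in Table 1.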
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Measuring It</head><p>Track changes of liquid volume inside a mercury thermometer by tick markings along its side. Two entries are logged: one upward change from 30 &#8226; F to 31 &#8226; F and one downward change from 45 &#8226; F down to 44 &#8226; F. What grounds are required to claim that in the two cases the temperature has changed by the same amount? For that matter, what is the basis for the claim's presumption that one and the same attribute is being measured in the first place? These are hard questions to answer in any completely satisfactory way. But these questions, and others like them, are among those that working scientists deal with on a routine basis, whether they come to terms with them or not.</p><p>These questions, and others like them, concern the problem of measurement "validity" (e.g., ref. <ref type="bibr">25)</ref>, sometimes referred to as "nomic measurement" <ref type="bibr">(5)</ref> or "coordination" <ref type="bibr">(26)</ref>. This is the problem of justifying the existence and form of a functional relationship between indices obtained by performing a procedure and the magnitudes of an attribute purportedly being measured. It is to be distinguished from purely formal problems concerned with the mathematical description of measurement scales (e.g., ref. <ref type="bibr">18)</ref>.</p><p>It is obvious that trying to establish that some structured index is somehow the true measure for an attribute by direct verification is out of the question. In practice, coordination is justified through an iterative process that leverages various theoretical, practical, and empirical arguments against each</p><p>Table 1. 
Scale types, common admissible transformations, and examples Scale Transformation Examples Absolute x &#8594; x Relative frequency, count Ratio x &#8594; x, &gt; 0, real number Duration, length, mass, dosage, reaction rate, electric current Interval x &#8594; x + , &gt; 0, , real numbers Calendar date, temperature, potential energy, cardinal utility Ordinal x &#8594; (x), strictly increasing on real numbers Letter grades, triage rank, air quality, social dominance Nominal x &#8594; (x), bijection on real numbers Treatment groups, species, genotype other (e.g., refs. 5 and 27</p><p>). The historical case of thermometry provides a crisp illustration of how such a process can unfold <ref type="bibr">(5)</ref>.</p><p>A presumption of quantifiability requires reasons. The scientist bears the burden of establishing a basis for explaining how the attribute is, or could be, related to other established attributes or measurement practices, and in some cases, of demonstrating how its application contrasts with its use in ordinary language. To return to the case of length, there are many procedures for its measurement. The relationships among these procedures are well known by scientists. Physical theory specifies how they are related to other physical quantities such as acceleration. And so on.</p><p>But sometimes the conceptual issues are not so clear. Consider "extroversion"-or "extraversion," according to Carl Jung-one of the Big five factors of personality <ref type="bibr">(28)</ref>. What warrants its current numerical representation on a ratio scale? Looking at the ordinary-language understanding and use of the term, comparative statements such as, "Anna is more extroverted than Debbie," comport with its use in third-person ascriptions and first-person avowals (see refs. <ref type="bibr">29 and 30)</ref>. But a proclamation that "Anna is ten times more extroverted than Debbie" runs afoul of ordinary usage. 
The guidance provided by ordinary usage can be supplemented by introducing a technical definition of extroversion that enjoys value over and above tracking expressions typically attributed to extroversion, such as, "I like to go to parties," "I like people," and the like. In this way, cogent grounds for treating extroversion as a ratio-scaled attribute or dimension might be established <ref type="bibr">(30)</ref><ref type="bibr">(31)</ref><ref type="bibr">(32)</ref><ref type="bibr">(33)</ref>.</p><p>To illustrate the design of a technical concept, consider the measurement of competitive ability among organisms-fitness. Nebulous and tautological conceptions of fitness (see ref. <ref type="bibr">34</ref>, chapter 2) can be sharpened into a ratio-scaled representation of relative population frequencies. This representation turns out to facilitate the derivation of well-known selection equations <ref type="bibr">(35)</ref> and the formulation of precise definitions of phenomena such as gene-gene interactions (for a detailed discussion, see ref. <ref type="bibr">36)</ref>.</p><p>Conceptual considerations bear on empirical merits of theories and models that postulate the measured attribute. The study of measurement has led to the identification of nontrivial constraints that can be used to put to the test the claim that a given attribute is amenable to a ratio-scale representation (e.g., refs. <ref type="bibr">[37]</ref><ref type="bibr">[38]</ref><ref type="bibr">[39]</ref><ref type="bibr">[40]</ref><ref type="bibr">[41]</ref><ref type="bibr">[42]</ref><ref type="bibr">[43]</ref>). Some of these tests will be considered later on in the Error section. A well-known example from psychology is signal detection theory (SDT, 44). Its ratio-scaled attributes of discriminability and response bias have been validated by its ability to successfully describe and predict people's judgments in a wide variety of domains (e.g., recognition memory; see refs. 
<ref type="bibr">45 and 46)</ref>. Another well-known example, this time coming from the intersection of psychology and economics, is helpful to understand the difference between conceptual and empirical considerations. Prospect Theory (47) postulates a "loss aversion" attribute defined in terms of people's appreciation for lotteries including both gain and loss outcomes <ref type="bibr">(48,</ref><ref type="bibr">49)</ref>. The sharpness of this definition notwithstanding, Prospect Theory is often found to underperform relative to rival theories that do not include loss aversion as an attribute (e.g., ref. <ref type="bibr">50)</ref>.</p><p>Predictive success is generally acknowledged to be but one of many factors that can figure into a theory's support. This includes establishing the validity of measurements. Quantities that predict might not be measures of anything. To see this, consider a survey consisting of one question,"How many records by the Beach Boys have you purchased?" No doubt a pronounced correlation would be established between the response variable for the survey question and other quantities for attributes of survey subjects, like age or weight, and so could be used to predict negative health outcomes like arthritis, heart disease, or senility. Yet, it stands to reason that there is no such thing as the Beach-Boy-Health Index, however it is you stipulate the form of the correlation.</p><p>Next, turn to the assertion that ratings of life satisfaction on a 10-point scale are predictive of significant life events, such as quitting a job, ending a romantic relationship, and so on. It is on this basis that (3) make the striking claim quoted at the outset of this article, namely, that humans "sense within themselves-and can communicate-a reliable numerical scale for their feelings." This conclusion is unwarranted.</p><p>Also unwarranted is the presumption that a reliable correlation between two indices provides evidence that they measure the same thing. 
Consider, for example, using a balance to measure the mass of several stacks of identical cubical blocks. You will notice that the results are linearly related to the heights of those stacks as measured by a ruler. Yet rulers and balances measure different things. The same obviously applies to the indices coming from Beach-Boy-fanhood and health surveys. Guttman summarized it best when stating that "Correlation does not imply content" <ref type="bibr">(51)</ref>. Being able to avoid unwarranted conclusions of this kind is crucial in the current era of Big Data, as the vast majority of correlations found in large datasets are spurious <ref type="bibr">(52)</ref>.</p><p>That said, Guttman's aphorism does not extend to "negative" claims that distinct indices measure different things. In fact, there is a long-standing practice of using contrastive measures (e.g., single and double dissociations in factorial experimental designs) in the localization of mental functions in the brain <ref type="bibr">(53)</ref>. But obtaining diagnostic results this way can be easier said than done, as some conclusions can only be drawn when representing attributes in certain ways (e.g., refs. <ref type="bibr">[54]</ref><ref type="bibr">[55]</ref><ref type="bibr">[56]</ref>). This ambiguity will be discussed in greater detail in the upcoming Meaningfulness section.</p><p>At least in the social and behavioral sciences, the lineage of some of the unwarranted claims discussed so far can be traced back to the objective of construct validity as popularized by Cronbach and Meehl in the 1950s (for reviews, see refs. <ref type="bibr">25 and 57)</ref>. Constructs are at once conceived to be abstractions (that "describe," "summarize," "encapsulate") and themselves objects for scientific inquiry (to be "detected," "explored," or "manipulated;" see refs. 
<ref type="bibr">[58]</ref><ref type="bibr">[59]</ref><ref type="bibr">[60]</ref>.</p><p>Construct validity is attractive to because it speaks to scientists' general desire for measures to be as theory-agnostic as possible. But the extent to which measurements are theory-laden is sometimes a point of contention (e.g., the famous Koopmans-Vining debate of the 1940s; see ref. <ref type="bibr">61)</ref>. Regardless, there is a real risk for this desire to devolve into a poor understanding of the measures being used, which can lead both scientists and policymakers astray and likely lead to social harm.</p><p>Recent developments in the study of eyewitness identification help to illustrate the stakes. For several decades now, researchers and policy-makers have been preoccupied with the effectiveness of lineup procedures performed by police departments and their contribution to the risk of wrongful conviction. Numerous studies comparing different lineup procedures, using measurements of so-called "probative value," drew the conclusion that lineups formed sequentially are superior to lineups in which individuals are presented side-by-side <ref type="bibr">(62)</ref>. In time, these findings shaped the guidelines issued by the U.S. Department of Justice (see ref. <ref type="bibr">63</ref>, p. 594), and by 2013, one-third of U.S. police precincts used sequential lineups <ref type="bibr">(64)</ref>. Later work revealed the theoretical and empirical shortcomings of measurements of "probative value" (e.g., refs. 64 and 65). Crucially, subsequent research relying on SDT-based measures overturned the claimed superiority of sequential lineups <ref type="bibr">(64)</ref>.</p><p>A firm grasp of what goes into justifying measurement claims also helps scientists to have a clear reading on what can be properly justified. 
In the same way that the evidence obtained can only go so far in pinning down the details of existing theories, there are limitations when it comes to establishing a scale type. How to navigate these challenges will be the focus of the next section.</p></div>
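Guttman's point that correlation does not imply content, and the related warning about spurious correlations in large datasets, are easy to reproduce numerically. The toy simulation below (with parameters chosen here purely for illustration) draws a few hundred mutually independent "survey items" and counts how many pairs nonetheless show a pronounced sample correlation, the sort of pattern that invites Beach-Boy-Health-Index reasoning:

```python
import random
import statistics

random.seed(1)

def pearson(xs, ys):
    """Sample Pearson correlation coefficient."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# 200 unrelated "survey items," each answered by only 30 subjects.
# By construction, no two items measure a common attribute.
n_items, n_subjects = 200, 30
items = [[random.gauss(0, 1) for _ in range(n_subjects)]
         for _ in range(n_items)]

# Count item pairs whose sample correlation nonetheless looks "pronounced."
pronounced = sum(
    1
    for i in range(n_items)
    for j in range(i + 1, n_items)
    if abs(pearson(items[i], items[j])) > 0.35
)
total_pairs = n_items * (n_items - 1) // 2
print(f"{pronounced} of {total_pairs} independent pairs correlate spuriously")
```

Hundreds of pairs clear the threshold despite sharing no content, which is why a pronounced correlation, by itself, licenses no measurement claim.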
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Meaningfulness</head><p>Measurement literacy helps scientists in detecting misleading, imprecise, or unsubstantiated quantitative claims. It provides standards for scientific discourse and in particular for meaningful communication-a technical qualification to be given sense in what follows.</p><p>Consider the striking statement about happiness from the outset of this discourse, according to which residents of lower-income countries who are given USD $10,000 gained "three times more happiness than those in higher-income countries" (2). That statement is striking, not because it implies happiness is amenable to scientific examination, nor because it implies that happiness can be measured and quantified. It is striking because it presumes that happiness admits quantification on a ratio scale. That would mean that happiness is a quantity like length.</p><p>While various types of evidence might be proffered to ground the claim that happiness conforms to a ratio scale, no such evidence exists as of the writing of this discourse. There is no evidence that the many measures of happiness tabulated by Veenhoven <ref type="bibr">(66)</ref> are related by positive affine transformations. Even analytic arguments are wanting; it is hardly self-evident that happiness is a singular attribute admitting representation on a ratio scale, and one that can be compared between two groups of individuals or even by one individual at different moments of life (cf. 67). These challenges are not unique to the study of happiness and confront attempts to quantify well-being at individual and societal levels more widely (e.g., ref. <ref type="bibr">68)</ref>.</p><p>If happiness fails to be ratio-scaled, then the striking statement's publication has the potential to mislead even those with the best intentions. Consider a scientist acting as advisor for a policy-maker. Suppose the policy-maker wants to redistribute economic aid between two countries. 
Based upon the reporting in ref. <ref type="bibr">2</ref>, the scientist would reasonably assume any reduction in happiness in the former country would be outweighed by a corresponding three-fold increase in happiness in the latter.</p><p>Yet, should happiness fail to be quantifiable on a ratio scale, follow-up studies that quantify happiness by "rescaling" the unit of measure might very well find the implementation of the policy-maker's program to be a foreign policy failure of the third degree, not only because it did not increase happiness three-fold in the target country but also because it drove down overall happiness in the target region. The policy-maker's predicament is not out of the ordinary; consider, for example, policy planning and evaluation in environmental conservation <ref type="bibr">(69,</ref><ref type="bibr">70)</ref> or border security <ref type="bibr">(71)</ref>.</p><p>The statement runs afoul of a standard that scientific publications are ordinarily expected to observe as harbingers of knowledge. More specifically, the statement about happiness from the outset of this discourse falls short of the technical criterion of meaningfulness because, in fact, its truth value can vary with the scale that is chosen for measuring happiness. In rough yet somewhat abstract terms, a statement is said to be meaningful just in case its truth or falsity is independent of what scale is chosen to measure the target attribute from among those related by scale type. For example, the assertion that diamond is one hundred times harder than gold is not meaningful. Although the assertion may be true given one scale for hardness (e.g., the Knoop scale), it is false on others (e.g., the Mohs scale). 
Similarly, the statement that a patient's temperature at noon is 2% higher than it was at noon yesterday is not meaningful.</p><p>Scientists and policymakers alike trust that scientific publications implement reliable checks and balances protecting against promulgation of shaky science. A published hypothesis that fails to be meaningful can fail to replicate if the measurement scale that is used is different from the one used for the publication-and so threatens to mislead scientists and policymakers and thereby needlessly expose the public and even the international order at large to harm and discord. Yet a published hypothesis that successfully replicates can fail to be meaningful, a situation that, if handled inappropriately, can propagate and perpetuate misconceptions. Measurement literacy is requisite for scientists to be responsible and trusted ambassadors of knowledge.</p><p>The difference between statements that are meaningful and those that are not can be subtle. One subtlety concerns the "problem of coordination" discussed earlier in the Measuring It section: the unknown functional relationship between an index and the attribute that it is purported to measure. Because of this gap, a statement about the index can be meaningful at the same time that a counterpart statement about the attribute is not.</p><p>Consider a memory experiment where participants study a list of words under two learning regimens-call them "high learning" and "low learning." Later, after different retention intervals-"short" for some and "long" for others-they are asked to recall what they have learned. The average recall rates obtained with this 2 &#215; 2 factorial design, illustrated in the Right panel of Fig. <ref type="figure">1</ref>, show an interaction effect in the ANOVA sense. 
More specifically, the difference between effects of short and long retention intervals on recall is smaller in the high-learning condition (0.16 difference) than it is in the low-learning condition (0.30 difference). This difference measure lies on an absolute scale, and so is unique in the strongest sense that it is the only scale belonging to its scale type (Table <ref type="table">1</ref>). Therefore, there is no problem with statements like, "Recall rates decreased faster in the low-learning condition than they did in the high-learning condition."</p><p>Broadly speaking, recall rates are of interest to cognitive scientists because they help to illuminate the cognitive processes associated with mnemonic faculties (e.g., refs. 72 and 73). But while scientists are generally willing to postulate "memory strength" attributes belonging to an interval scale type (e.g., refs. 45, 74, and 75), presuming anything more than an ordinal scale about their relationship with something like recall rates would be contentious at best <ref type="bibr">(76;</ref> but see also refs. <ref type="bibr">77 and 78)</ref>.</p><p>Hence, in the present case, to say that some memory-strength attribute decreased faster in the low-learning condition than it did in the high-learning condition would, by definition, fail to be meaningful. Fig. <ref type="figure">1</ref> illustrates why. The center panel shows the outcome of a nonlinear monotonic transformation of memory strengths into recall probabilities. In turn, the Left panel illustrates the average memory strengths obtained across the different experimental-manipulation conditions. The effects of these experimental manipulations are additive (i.e., there is no interaction). But when transformed into recall probabilities, these effects are no longer additive (i.e., there is an interaction). 
Recent literature reviews in psychology show that researchers are generally unaware of these subtleties <ref type="bibr">(79)</ref> and are often found drawing conclusions that fail to be meaningful <ref type="bibr">(80)</ref>. In part, there seems to be a confusion between the replicability of an outcome, such as the interaction effect in Fig. <ref type="figure">1</ref>, and the meaningfulness of the measurement statements surrounding it. In truth, successfully replicating an effect has no bearing on the meaningfulness of a measurement statement (see refs. <ref type="bibr">64 and 81)</ref>.</p><p>Inattention to measurement basics can compromise effective use of aggregate indices in scientific discourse and policy-making. Statements that compare arithmetic means of measures for an attribute on an ordinal scale might fail to be meaningful if the rank ordering given by the arithmetic means fails to be preserved by some nonlinear strictly increasing transformation of the scale. Thus, ranking research grant applications by their averages might depend on the choice of ordinal scale used for rating them. By contrast, using geometric means or medians would be independent of the choice of ordinal rating scale, as such rankings would be preserved by any nonlinear strictly increasing transformation of the rating scale <ref type="bibr">(82,</ref><ref type="bibr">83)</ref>. In this context, measurement literacy affords critical insight into the conditions and uses for combining measures to direct intelligent reasoning and decision-making (for a detailed discussion, see ref. <ref type="bibr">84)</ref>. 
Measurement literacy likewise provides guidance for using parametric and nonparametric methods in hypothesis testing <ref type="bibr">(83)</ref><ref type="bibr">(84)</ref><ref type="bibr">(85)</ref>.</p><p>While we have been content in this discourse with an informal treatment of the criterion of meaningfulness in measurement, there is an extensive literature giving rigorous treatment to its precise formulation <ref type="bibr">(13,</ref><ref type="bibr">15,</ref><ref type="bibr">16,</ref><ref type="bibr">20,</ref><ref type="bibr">(86)</ref><ref type="bibr">(87)</ref><ref type="bibr">(88)</ref><ref type="bibr">(89)</ref><ref type="bibr">(90)</ref><ref type="bibr">(91)</ref><ref type="bibr">(92)</ref><ref type="bibr">(93)</ref>. Sometimes treatments of meaningfulness use the term "meaningless" for statements that fail to be meaningful in the technical measurement-theoretic sense. We have refrained from this terminology in view of the historical use of the word and its variants by logical positivists as a slur.</p><p>A more controversial aspect of meaningfulness concerns its relationship with statistical inference. When proposing his famous classification of scale types shown in Table <ref type="table">1</ref>, Stevens argued that the scale of one's data determined which statistical methods are "permissible." According to Stevens' doctrine, the analysis of nominal and ordinal data requires nonparametric methods (provided that they are available, which is not guaranteed for experimental designs of even modest complexity; e.g., see ref. <ref type="bibr">55)</ref>, whereas the analysis of interval and ratio-scaled data permits use of parametric methods <ref type="bibr">(21,</ref><ref type="bibr">(94)</ref><ref type="bibr">(95)</ref><ref type="bibr">(96)</ref>. Since its inception, Stevens' doctrine has been subject to intense debate. 
Its critics live by Lord's Word, "the numbers don't know where they came from" and therefore deny that measurement scale types place any limitations on statistical inference whatsoever. To put it differently, whether or not certain numerical assignments are measures of anything has no bearing on their legitimacy as data <ref type="bibr">(97)</ref><ref type="bibr">(98)</ref><ref type="bibr">(99)</ref><ref type="bibr">(100)</ref><ref type="bibr">(101)</ref>.</p><p>The rigidity of this doctrine is hard to miss. Take intelligence measurement. Consider the hypothesis that the distributions of intelligence of two groups differ. Assume that intelligence admits quantification by scores on an ordinal scale and that the measurement scale is valid for the different groups of individuals that it is applied to. Under these assumptions, nonparametric tests can be used to evaluate the differences between the distributions, on the basis of which the conclusion may be validly drawn that the distributions do or do not differ, as the case may be. But, in general, it is invalid to conclude that, on average, one group is more or less intelligent than another group on the basis of differences in the arithmetic means. For such a conclusion to be meaningful, the scale must have properties stronger than being ordinal (see ref. <ref type="bibr">102)</ref>.</p></div>
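The contrast drawn in this section between comparing ordinal measures by arithmetic means and by medians can be made concrete. In the Python sketch below (ratings invented here for illustration), a single admissible transformation of an ordinal scale, squaring, reverses which group has the higher arithmetic mean while leaving the comparison of medians intact:

```python
import statistics

# Two groups of hypothetical ratings on a 10-point ordinal scale.
group_a = [1, 1, 2, 9, 9]   # mean 4.4, median 2
group_b = [4, 5, 5, 5, 6]   # mean 5.0, median 5

# Any strictly increasing map is admissible for an ordinal scale;
# squaring is one such map on positive ratings.
stretch = lambda x: x ** 2

# Does the mean comparison change truth value under the transformation?
mean_flips = (
    (statistics.mean(group_a) < statistics.mean(group_b))
    != (statistics.mean(map(stretch, group_a))
        < statistics.mean(map(stretch, group_b)))
)
# Does the median comparison survive it?
median_holds = (
    (statistics.median(group_a) < statistics.median(group_b))
    == (statistics.median(map(stretch, group_a))
        < statistics.median(map(stretch, group_b)))
)
print(mean_flips, median_holds)
```

The mean-based comparison is thus not meaningful for ordinal data, while the median-based comparison is, which is the nub of the intelligence example above.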
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Design</head><p>One subtlety in the memory example discussed earlier (Fig. <ref type="figure">1</ref>) is that effects like interactions do not stand on their own. Rather, their standing depends on all the other effects that, together, characterize a data pattern (e.g., ref. <ref type="bibr">103</ref>).</p><p>In the case of ANOVA, it turns out that the claim that the smallest of its effects (main effects, interactions) exists (i.e., that it is nonzero) does not hold true across all possible ordinal scales. In other words, it fails to be meaningful (see ref. <ref type="bibr">104</ref>). The immediate implication from this insight is that the meaningfulness of an ANOVA effect can be guaranteed by making sure that it is not the smallest one. This can be achieved by fashioning the design of the experiment accordingly. For instance, one could select the experimental factors, say the learning regimen, so that one of the main effects is the smallest.</p><p>This example illustrates one of the many ways in which measurement literacy can contribute to the development of successful study design and successful research programs more generally. Yet it is our contention that scientists generally underestimate the importance of deliberating about measurement in study design.</p><p>Take the case of random sampling (whether simple or stratified). Textbook introductions discuss results showing that random sampling guarantees unbiased estimators for many quantities of interest, such as population means, population totals/sums, and differences between group means <ref type="bibr">(105,</ref><ref type="bibr">106)</ref>. But these results, as useful as they might be, are limited in scope. They apply only to quantities-and of a specific type, namely, means and sums, rather than medians, minima, or maxima, say. 
If the measurement scale type of the attribute of interest is at best ordinal, then sums and averages are unlikely to be quantities of direct interest. But changing the quantities to be estimated is not as simple as it looks, for the simple reason that there is no sampling methodology that can guarantee their unbiased estimation <ref type="bibr">(107)</ref>.</p><p>Just as random sampling guarantees that there is an unbiased estimator of the population mean, so too do randomized experiments guarantee that there are unbiased estimators of average-treatment effects <ref type="bibr">(108)</ref>. But again, the focus on quantities, and on mean treatment effects in particular, is critical. If the outcome of interest is not a quantity at all, or if it is a merely ordinal or nominal variable, then a researcher might be interested in some other form of causal effect that is not best estimated via a randomized experiment. This lesson is important because there is a growing trend to prioritize randomized experiments over observational studies in the social and biomedical sciences <ref type="bibr">(108)</ref>.</p><p>For examples of how measurement-scale considerations can inform study design, look no further than applications of model-based sampling. This consists of using domain-specific knowledge to build a statistical model of the population and then choosing a sampling scheme that permits the estimation of the parameters of that model. This approach has been successfully employed in agriculture, medicine, and ecology, among other fields (e.g., see refs. <ref type="bibr">109 and 110)</ref>.</p><p>Imagine someone trying to plan a new study on the impact of exposure to lead on IQ <ref type="bibr">(1)</ref>. A model-based approach might use past data suggesting that the relationship between blood-lead levels and IQ is roughly log-linear <ref type="bibr">(111)</ref>. The regression coefficient in that model could then be used to estimate total IQ loss.
If the goal is to estimate that coefficient (and hence, total IQ loss) precisely, one should systematically sample Americans with the highest blood-lead levels, not sample randomly (see ref. <ref type="bibr">112)</ref>.</p><p>This sampling solution would need to be reconsidered, though, if the goal is to rely on IQ to make claims about intelligence. Given that IQ is at best an ordinal index of intelligence <ref type="bibr">(113)</ref>, hypotheses about population medians or minima might be of greater interest to researchers than hypotheses about population means or sums, as the former involve meaningful measurement claims whereas the latter do so only in specific instances. In this context, scientists would benefit from considering statistical models (e.g., ordinal regression) that will help them make meaningful estimates with respect to the appropriate measurement scale type.</p></div>
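The rationale for targeting the upper tail can be sketched in a few lines, under the simplifying assumptions of a log-linear regression with homoskedastic noise; the blood-lead values below are hypothetical, not drawn from any study. In ordinary least squares, the variance of the slope estimate is proportional to 1 / sum((z - zbar)^2), so a design that spreads out the predictor (here, log blood-lead level) tightens the estimate.

```python
import math

def slope_se_factor(levels):
    """In simple OLS, Var(slope_hat) = sigma^2 / sum((z - zbar)^2),
    with z = log(blood-lead level) under the log-linear model.
    A smaller returned value means a more precise slope estimate."""
    z = [math.log(x) for x in levels]
    zbar = sum(z) / len(z)
    return 1.0 / sum((zi - zbar) ** 2 for zi in z)

# Hypothetical blood-lead levels (ug/dL): a representative draw from a
# mostly-low-exposure population vs. a design that targets the upper tail.
representative = [1, 1, 2, 2, 3, 3, 4, 5]
targeted       = [1, 2, 3, 5, 8, 12, 16, 20]

# The targeted design yields a smaller variance factor for the slope.
print(slope_se_factor(targeted) < slope_se_factor(representative))  # True
```

The design choice here is driven entirely by the assumed model: it is the spread of the predictor implied by the model, not representativeness of the sample, that governs the precision of the coefficient.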
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Error</head><p>Measurement literacy is crucial for navigating measurement errors intelligibly and, in turn, leveraging them effectively in testing and developing scientific theories.</p><p>Talk of error is present in pretty much every area of science. But error talk does not speak for itself; it requires some standard to be in place. The proverbial table leg proclaimed to be off by one inch is presumed to have some definite length by which the measure errs. But what of the case where the length of a table leg is measured multiple times throughout the day? To attribute error to each recorded measurement is to presume one length of the leg. Nothing stands in the way of interpreting any and all discrepancies to be true expressions of the attribute's "natural variation," that is, of imputing many lengths to the leg. True, nothing does; but pursuing this would not be helpful <ref type="bibr">(114)</ref>.</p><p>The table leg example illustrates an important insight, namely, that in any context in which measurement is said to take place, there are accepted background assumptions that set general rules on how to attribute error <ref type="bibr">(115)</ref>. To be clear, errors of measurement are not exceptions to these general rules; they are stipulations about the veracity or fidelity of observational reports. Theories include assumptions (e.g., Newton's third law of motion, the law of linear thermal expansion) which stipulate that any observational report at odds with them is in error. To fix ideas, consider a case where leg a is reported to be longer than leg b, followed by another report that leg b is longer than leg a. These two reports are at odds with each other, with at least one of them in error, only if one assumes (i) that length is invariant over time and (ii) that the relation expressed by "longer than" is asymmetric.
For more vivid examples of error adjudication, consider the field of paleontology, where models are routinely relied on to correct or debias fossil records <ref type="bibr">(116)</ref>.</p><p>Distinct theories rest on different assumptions and therefore might very well disagree over what counts as an error. Consider how classical and quantum theories differ over the way in which attributes are treated and their relationship with measurement procedures. Classical physics frames measurement as a process in which the true magnitude of an attribute becomes known. By contrast, quantum theory holds that there is no true measure by which we err; magnitudes are "created" by the taking of measurements <ref type="bibr">(117)</ref>. Outcomes incompatible with one theory's assumptions are acceptable according to the other. In other words, the appeals to error made by the two theories are very different.</p><p>Measurement errors are obtained by engaging in a process of reconciliation between the observations and the assumptions being upheld <ref type="bibr">(118)</ref>. Return to the table leg example. Assuming that its length is the same at all times that measurements were taken, errors can be estimated by adjusting each recorded value so that they perfectly agree on some quantity L. Here, L is no longer being treated as directly observable, but as a nonobservable quantity whose estimation is a function of the reconciliation process (e.g., minimization of squared errors; see refs. <ref type="bibr">114, 118, and 119)</ref>.</p><p>The quantification of error provided by such a reconciliation provides important grist for the intellectual mill. When comparing distributions of errors, some might be perceived as negligible or tolerable, whereas others might be too large or too systematic to be left wanting an explanation.
Take the case of Laplace's study of the solar system, in particular the motions of Jupiter and Saturn.</p><p>When observing irregularities, Laplace weighed the merits of attributing them to error vis-&#224;-vis unaccounted-for causes. When errors were deemed too large, he would pursue the latter account. In the cases of Jupiter and Saturn, Laplace was able to explain the observed irregularities in their motion by appealing to the mutual gravitational attraction of the two planets <ref type="bibr">(120)</ref>.</p><p>This appraisal of theories and hypotheses through the quantification of error brings us to a point touched upon earlier in the Measuring It section, namely the possibility of conducting tests that speak to the basal hypothesis that a given attribute is amenable to a ratio-scale representation. For the longest time, the deployment of these tests was frustrated by a deficit of work integrating errors into statistical-inferential machinery (be it frequentist, Bayesian, or otherwise). But without the possibility of error, a single recalcitrant observation is enough to undermine the presumed measurability of an attribute. Fortunately, recent developments have resolved many of these challenges, creating new opportunities for theory testing and development (e.g., refs. <ref type="bibr">41 and 121-123)</ref>.</p><p>The success of theories can be determined by testing the constraints imposed by the attributes they postulate. Take the notion of strength of preference or utility that underlies a large family of theories, including notable members such as Prospect Theory <ref type="bibr">(47)</ref>. According to this family of theories, people's preferences conform to a number of constraints, the requirement of transitivity being one of them <ref type="bibr">(124)</ref><ref type="bibr">(125)</ref><ref type="bibr">(126)</ref>.
Under the appropriate experimental designs, the different constraints that preferential choices must satisfy become very strict, introducing the possibility of strong-inference testing <ref type="bibr">(127)</ref>. To reject these constraints is to reject a large family of theories altogether <ref type="bibr">(124)</ref><ref type="bibr">(125)</ref><ref type="bibr">(126)</ref>.</p><p>One attractive feature of these tests is that they offer the possibility of casting routine hypotheses under weaker scaling assumptions. For instance, instead of assuming a linear relationship between experimental factors and the data, as done when using off-the-shelf methods such as ANOVA, one can merely assume that there is a monotonic relationship (see ref. <ref type="bibr">55)</ref>. This reliance on weaker assumptions such as monotonicity directly addresses concerns with the meaningfulness of effects, as illustrated earlier in our memory example (Fig. <ref type="figure">1</ref>).</p><p>In some cases, due care in the handling of errors includes offering a principled distinction between measurement error and the natural variability or stochasticity of attributes <ref type="bibr">(128,</ref><ref type="bibr">129)</ref>. As an illustration, consider a scenario where a person expresses a preference for a over b at one point in time, but later claims to prefer b over a. These discrepant observations can plausibly be said to reflect a change in that person's (true) preferences (e.g., refs. <ref type="bibr">124 and 130)</ref>. Now, contrast this scenario with the earlier table leg example. Because length is presumed to be an invariant attribute in most applications, one should expect an analogous set of observations to be attributed to the presence of error (e.g., ref. <ref type="bibr">119)</ref>.</p><p>Failing to acknowledge the errors in our ways can lead to mischaracterizations (e.g., refs. <ref type="bibr">130 and 131)</ref>. But scientists can also fail to keep error in its place.
Take the widespread practice in the social, behavioral, and health sciences of gathering so-called measurements of human feelings. Despite claims that people somehow manage to reliably operationalize their feelings over numerical scales (3), there is no sense in which these measurements can be in error. Can someone be said to be mistaken about how sad they are, or about how much their head aches? Leaving aside cases of self-deception, and regardless of the fickleness of these feelings and sensations, the answer is arguably in the negative (for discussions, see ref. <ref type="bibr">132,</ref><ref type="bibr">Chapters 5 and 8)</ref>.</p></div>
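The kind of constraint testing discussed above can be illustrated with a minimal sketch of a check for weak stochastic transitivity on pairwise choice proportions. The proportions below are hypothetical, and this is only one of several transitivity conditions studied in the literature.

```python
from itertools import permutations

def violates_weak_stochastic_transitivity(p):
    """p[(x, y)] = observed proportion of trials on which x was chosen over y.
    Weak stochastic transitivity requires that p(a,b) >= .5 and p(b,c) >= .5
    jointly imply p(a,c) >= .5 for every triple; return the violating triples."""
    items = sorted({x for pair in p for x in pair})
    return [
        (a, b, c)
        for a, b, c in permutations(items, 3)
        if p[(a, b)] >= 0.5 and p[(b, c)] >= 0.5 and p[(a, c)] < 0.5
    ]

# Hypothetical choice proportions over three gambles a, b, c.
choices = {
    ("a", "b"): 0.8, ("b", "a"): 0.2,
    ("b", "c"): 0.7, ("c", "b"): 0.3,
    ("a", "c"): 0.4, ("c", "a"): 0.6,  # closes an intransitive cycle
}

# Each rotation of the a > b > c > a cycle is flagged.
print(violates_weak_stochastic_transitivity(choices))
# [('a', 'b', 'c'), ('b', 'c', 'a'), ('c', 'a', 'b')]
```

Because the check operates on choice proportions rather than on numerically scaled utilities, a rejection here speaks against the whole family of theories that entail transitive preferences, without presupposing anything stronger than ordinal structure.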
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Conclusion</head><p>Measurement literacy is crucial for effectively navigating and advancing scientific discourse. A working understanding of its problems, requirements, and goals affords the working scientist the foundation necessary for thinking things through, from problems in validity, inference, experimental design, and error to policy-making and communication.</p><p>In recent years, discourse in science, especially in the social and behavioral sciences, has weathered numerous crises, from the reproducibility crisis (e.g., ref. <ref type="bibr">133)</ref> to the theory crisis (e.g., ref. <ref type="bibr">134)</ref>, as well as myriad attempts to address them, from mandating preregistration to calls for more theory-driven hypothesizing. To confront such crises and to evaluate the proposals meant to resolve them, measures must be taken to reinvigorate measurement literacy in discourse on science.</p><p>This discourse is a call to action.</p><p>Data, Materials, and Software Availability. There are no data underlying this work.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0"><p>To whom correspondence may be addressed. Email: app@arthurpaulpedersen.org. Published January 27, 2025.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_1"><p>PNAS 2025 Vol. 122 No. 5 e2401229121. https://doi.org/10.1073/pnas.2401229121.</p></note>
		</body>
		</text>
</TEI>
