<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Stealth Literacy Assessments via Educational Games</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>07/01/2023</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10432764</idno>
					<idno type="doi">10.3390/computers12070130</idno>
					<title level='j'>Computers</title>
<idno>2073-431X</idno>
<biblScope unit="volume">12</biblScope>
<biblScope unit="issue">7</biblScope>					

					<author>Ying Fang</author><author>Tong Li</author><author>Linh Huynh</author><author>Katerina Christhilf</author><author>Rod D. Roscoe</author><author>Danielle S. McNamara</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Literacy assessment is essential for effective literacy instruction and training. However, traditional paper-based literacy assessments are typically decontextualized and may cause stress and anxiety for test takers. In contrast, serious games and game environments allow for the assessment of literacy in more authentic and engaging ways, which has some potential to increase the assessment’s validity and reliability. The primary objective of this study is to examine the feasibility of a novel approach for stealthily assessing literacy skills using games in an intelligent tutoring system (ITS) designed for reading comprehension strategy training. We investigated the degree to which learners’ game performance and enjoyment predicted their scores on standardized reading tests. Amazon Mechanical Turk participants (n = 211) played three games in iSTART and self-reported their level of game enjoyment after each game. Participants also completed the Gates–MacGinitie Reading Test (GMRT), which includes vocabulary knowledge and reading comprehension measures. The results indicated that participants’ performance in each game as well as the combined performance across all three games predicted their literacy skills. However, the relations between game enjoyment and literacy skills varied across games. These findings suggest the potential of leveraging serious games to assess students’ literacy skills and improve the adaptivity of game-based learning environments.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Literacy can be broadly defined as "the ability to understand, evaluate, use, and engage with written texts to participate in society, to achieve one's goals, and to develop one's knowledge and potential" <ref type="bibr">[1]</ref>. Literacy skills are critical to success in school and in life; however, large-scale reading assessment data reveal that many students and adults struggle with reading comprehension. The most recent National Assessment of Educational Progress reported that 27% of 8th grade students in the United States performed below the basic level of reading comprehension and 66% did not reach the proficient level. Similarly, 30% and 63% of 12th graders did not reach the basic and proficient reading levels, respectively <ref type="bibr">[2]</ref>. As might be expected, deficits in reading skills often continue into adulthood. According to the 2017 Program for the International Assessment of Adult Competencies assessment, 19% of U.S. adults aged 16 or older performed at or below the lowest literacy level <ref type="bibr">[3]</ref>.</p><p>Literacy assessments are a key component of any effort to improve students' literacy or remediate potential gaps in education. A good understanding of learners' current skills reveals the types and amounts of instruction they will need to grow. However, traditional literacy assessments (e.g., standardized tests) typically require a significant amount of time to administer, score, and interpret. Additionally, these tests usually occur before or after learning, making it difficult to provide timely feedback to guide teaching and learning <ref type="bibr">[4]</ref><ref type="bibr">[5]</ref><ref type="bibr">[6]</ref>. 
Moreover, traditional assessments may cause stress and test anxiety, which may in turn negatively impact students' test-taking experiences <ref type="bibr">[5,</ref><ref type="bibr">7]</ref>, and change how students respond to the assessments <ref type="bibr">[8]</ref>.</p><p>In contrast to traditional literacy assessments, stealth assessment offers an innovative approach by implementing literacy assessments in computer-based learning environments. These assessments take place during learning activities, instead of summative or "checkpoint" assessments. In addition, the assessments are based on students' natural behaviors and performance rather than being presented as "tests." As such, stealth literacy assessments can evaluate student reading skills unobtrusively and dynamically, and provide timely feedback throughout the learning process. In this innovative context, serious games have strong potential to be more motivating and enjoyable than traditional reading assessments. Thus, in this study, we investigated the feasibility of game-based stealth assessment to predict literacy skills, specifically reading comprehension and vocabulary knowledge.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.1.">Stealth Assessment within Games</head><p>Stealth assessment refers to performance-based assessments that are seamlessly embedded in gaming environments without the awareness of students who are being evaluated <ref type="bibr">[4,</ref><ref type="bibr">9]</ref>. Stealth assessment was initially proposed and explored to assess higher-order competencies such as persistence, creativity, self-efficacy, openness, and teamwork, primarily because these competencies substantially impact student academic achievement, but also because traditional methods of assessment often neglect these abilities <ref type="bibr">[10,</ref><ref type="bibr">11]</ref>. As such, stealth assessments that analyze how students use knowledge and skills during gameplay have been embedded in serious games to unobtrusively assess those competencies <ref type="bibr">[12]</ref><ref type="bibr">[13]</ref><ref type="bibr">[14]</ref><ref type="bibr">[15]</ref>. Game environments may make assessment less salient or less visible, and thus students feel that they are merely "playing" rather than "being tested".</p><p>In stealth assessment, traditional test items are replaced by authentic, real-world scenarios or game tasks. Since stealth assessment items can be contextualized and potentially connected to the real world, students' skills, behaviors, and competencies may be more validly demonstrated through these game activities than in traditional assessments <ref type="bibr">[16,</ref><ref type="bibr">17]</ref>. Within stealth assessments, students generate rich sequences of performance data (e.g., choices, actions, and errors) when they perform the tasks, which can serve as evidence for knowledge and skills assessment. 
When students are assessed without the feeling of being tested, it can reduce their stress and anxiety, which can in turn increase the reliability of the assessment <ref type="bibr">[18,</ref><ref type="bibr">19]</ref>.</p><p>Serious games in education are designed to enhance students' learning experience by providing a more fun way to acquire knowledge <ref type="bibr">[20,</ref><ref type="bibr">21]</ref>, which could be ideal for stealth assessments because they further separate students' experiences of play and enjoyment from experiences of testing and measurement. In a well-designed game, students are immersed in game scenarios and motivated to proceed through the challenges and meet learning goals, which might not feel like a learning or testing experience at all <ref type="bibr">[22]</ref>. For example, Physics Playground is a game that emphasizes 2-D physics simulations. The game implements stealth assessments to evaluate students' physics knowledge, persistence, and creativity <ref type="bibr">[15,</ref><ref type="bibr">23]</ref>. When students interact with the game, they produce a dense stream of performance data, which is recorded by the system in a log file and analyzed using Bayesian networks to infer students' knowledge and skills. The system then provides ongoing automated feedback to teachers and/or students, based on the assessment, to support student learning. Another example of stealth assessment is a game-based learning environment named ENGAGE that was designed to promote computational thinking skills. Students' behavioral data were collected during their gameplay and then analyzed using machine learning methods to infer their problem-solving skills and computational knowledge <ref type="bibr">[13,</ref><ref type="bibr">24]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.2.">Stealth Reading Assessment via Games</head><p>Prior studies have explored stealth assessment via games to assess students' higher-order skills and competencies, such as problem solving, creativity, and persistence, along with scientific knowledge <ref type="bibr">[14,</ref><ref type="bibr">17,</ref><ref type="bibr">23,</ref><ref type="bibr">25]</ref>. Only a few studies have investigated using stealth assessment to assess literacy <ref type="bibr">[26]</ref><ref type="bibr">[27]</ref><ref type="bibr">[28]</ref>. These studies primarily leveraged natural language processing (NLP) techniques to extract linguistic properties of constructed responses (e.g., essays and explanations) to make predictions about students' reading skills. For example, Allen and McNamara analyzed the lexical properties of students' essays to predict students' vocabulary test scores. Two lexical indices associated with the use of sophisticated and academic words accounted for 44% of the variance in vocabulary knowledge scores <ref type="bibr">[26]</ref>. Fang et al. predicted students' comprehension test scores using the linguistic properties of the self-explanations generated during their practice in a game-based learning environment. Five linguistic features accounted for around 20% of the variance in comprehension test scores across datasets. The studies collectively demonstrate that linguistic features at multiple levels of language (e.g., lexical, syntactic, and semantic) have strong potential to serve as proxies of reading skills, supporting the feasibility of stealth reading assessment using NLP <ref type="bibr">[27]</ref>.</p><p>However, the use of NLP for reading assessment is not without challenges. These methods rely on machine learning algorithms, which require a large amount of data for training and testing to improve analysis accuracy <ref type="bibr">[29]</ref>. 
Additionally, some NLP methods rely on extensive computational resources to support the complex calculations, which makes them difficult for development teams without those resources to adopt <ref type="bibr">[29]</ref>. Most importantly, NLP is language-specific, rendering it challenging to generalize algorithms across languages.</p><p>An alternative to NLP is the analysis of students' performance data from games implementing multiple-choice questions. Because multiple-choice questions provide a short list of answers for students to choose from, evaluating students' answers does not require complex data analysis. For example, Fang et al. investigated the association between students' reading skills and their performance in a single vocabulary game (i.e., Vocab Flash). The analysis was based on students' performance data from the game, which implements what are essentially multiple-choice questions in disguise. The results of the study supported the value of using a simple vocabulary game to assess reading comprehension <ref type="bibr">[30]</ref>.</p><p>The current study examines both vocabulary and main idea games. In the following section, we introduce how literacy skills are reflected by the ability to identify word meaning and text main ideas, providing the theoretical grounds for leveraging vocabulary and main idea games to assess reading skills.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.3.">Assessing Reading Skills through Vocabulary and Main Idea Games</head><p>Reading skills are at least partially reflected by students' vocabulary knowledge and ability to identify main ideas of passages. Readers must be able to process the basic elements of the text, including the individual words and the syntax, to understand and gain meaning from texts. From those elements, the reader can construct an understanding of the meanings behind phrases and sentences. Readers with more vocabulary knowledge tend to have better reading comprehension skills <ref type="bibr">[31,</ref><ref type="bibr">32]</ref>. When a reader is unfamiliar with certain keywords in a text, this can slow down or fully impair the processing of key points in the text. Although in some cases the meaning of words can be understood via contextual cues, comprehension becomes more challenging for readers with lower vocabulary knowledge <ref type="bibr">[33]</ref>. This effect is exacerbated when the reader also has insufficient prior knowledge about the topic through which to build context <ref type="bibr">[34]</ref>. Hence, when assessing texts for readability, educators and researchers have found that texts containing many low-frequency and sophisticated words are more difficult to read <ref type="bibr">[35]</ref>. For second-language learners, vocabulary is one of the most critical factors in determining how well students can comprehend texts <ref type="bibr">[36,</ref><ref type="bibr">37]</ref>.</p><p>Comprehending text requires not only knowledge of individual word meanings, but also the skills required to deduce relations between ideas and, in turn, the main ideas <ref type="bibr">[38]</ref>. Identifying topic sentences is a comprehension strategy that requires students to recognize the main ideas within a text while dismissing information that is irrelevant or redundant <ref type="bibr">[39]</ref>. 
Being able to distinguish main versus supporting ideas can foster deep comprehension because it encourages students to attend to the higher-level meaning and the global organization of information across texts <ref type="bibr">[40]</ref>. Consequently, Bransford et al. suggested that identifying main ideas within a text can lead to enhanced comprehension and retention of the text content <ref type="bibr">[40]</ref>. According to Wade-Stein and Kintsch <ref type="bibr">[41]</ref>, this task not only promotes students' construction of factual knowledge, but also of conceptual knowledge, as the process of identifying main ideas within a text reinforces students' memory representations of its content.</p><p>Deducing the gist of a text can be challenging <ref type="bibr">[42,</ref><ref type="bibr">43]</ref>. Low-knowledge readers can find it hard to differentiate between the main arguments and supporting ideas of a text <ref type="bibr">[44]</ref>. Likewise, Wigent found that students with reading difficulties focused more on details rather than identifying main ideas, subsequently recognizing and recalling fewer topic sentences compared to average readers <ref type="bibr">[45]</ref>. By contrast, strategic and skilled readers are more likely to grasp the main ideas from text compared to less skilled readers <ref type="bibr">[46]</ref><ref type="bibr">[47]</ref><ref type="bibr">[48]</ref><ref type="bibr">[49]</ref>.</p><p>Students' reading comprehension skills, vocabulary knowledge, and ability to recognize main ideas are closely associated. Therefore, this study employed three distinct games that emphasized vocabulary and main idea identification. First, Vocab Flash is a game that requires players to select appropriate synonyms for words. It is an adaptive game designed to measure vocabulary for variously skilled participants, making it ideal for stealth assessment of vocabulary. 
Second, Adventurer's Loot is a game that asks players to read a text and select the main ideas. Participants must be careful to select only the main ideas, and not any extraneous details. Finally, Dungeon Escape asks players to read a passage and imagine they are about to write a summary of it. They must select the best topic sentence for the summary. To pick the best topic sentence, participants must locate and integrate the main ideas of the passage. Such integration may require greater reading comprehension skill than Adventurer's Loot demands, which is why two main idea games are included.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.4.">Current Study</head><p>The goal of the current study is to investigate the feasibility of using games for stealth assessment of reading skills. Specifically, this study examines the predictive value of three distinct games, namely one game that targets vocabulary knowledge and two games that target main idea identification. We not only examine students' performance in each game individually, but also explore the value of combining performance data across all three games. The goal is to assess the extent to which students' performance in the three games is indicative of their reading skills as measured by standardized reading tests.</p><p>In addition to game performance, we consider students' subjective game experience. Based on the literature regarding serious games <ref type="bibr">[50]</ref><ref type="bibr">[51]</ref><ref type="bibr">[52]</ref>, we expect that most students will have overall positive attitudes toward playing the games. However, it is unknown whether students' game enjoyment will influence the validity of the stealth assessment. To that end, we address the following research questions in this study:</p><p>1. To what extent does students' performance in the three games predict their reading skills (i.e., vocabulary knowledge and reading comprehension)?</p><p>2. Does students' enjoyment of the games moderate the relations between game performance and reading skills?</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Methods</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Experimental Environment: iSTART</head><p>The Interactive Strategy Training for Active Reading and Thinking (iSTART) is a game-based intelligent tutoring system (ITS) designed to help students improve reading skills through adaptive instruction and training. iSTART was developed based upon Self-Explanation Reading Training (SERT) <ref type="bibr">[53]</ref>, a successful classroom intervention that taught students to explain the meaning of texts while reading (i.e., self-explain) through the use of comprehension strategies (i.e., comprehension monitoring, paraphrasing, predicting, bridging, and elaborating). The current version of iSTART includes three training modules focusing on self-explanation, summarization, and question asking <ref type="bibr">[54]</ref>.</p><p>iSTART learning materials consist of video lessons and two types of practice: regular and game-based practice. Video lessons provide students with information about comprehension strategies and prepare them for the practice. During the regular practice, students complete given tasks, and the system provides immediate feedback on students' performance. For example, when students generate a self-explanation on a target sentence, the NLP algorithms implemented in iSTART automatically analyze the self-explanation and provide real-time feedback. The feedback includes a holistic score on a scale of 0 ("poor") to 3 ("great") and actionable feedback to help students improve the self-explanation when the score is below a certain threshold <ref type="bibr">[55]</ref>. Studies that investigate the effectiveness of iSTART indicate that iSTART facilitates both comprehension strategy learning and comprehension skills <ref type="bibr">[53,</ref><ref type="bibr">54,</ref><ref type="bibr">56]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.1.">iSTART Games</head><p>iSTART implements two forms of game-based practice to increase learners' motivation and engagement: generative and identification games <ref type="bibr">[54,</ref><ref type="bibr">56]</ref>. In generative games, students are asked to construct verbal responses such as self-explanations. NLP-based algorithms assess these constructed responses to determine the quality and/or the use of specific strategies. In contrast, identification games ask students to review short example stimuli or prompts, and then to choose one or more responses that correctly identify strategy use or follow from the prompts. For example, students might read an example self-explanation and then indicate whether the excerpt demonstrates "paraphrasing" or "elaborating." In a vocabulary game, students may be given a prompt term and then must choose a correct synonym from several choices. Importantly, alternatives typically comprise carefully generated foils, such that incorrect answers are diagnostic of student misunderstanding.</p><p>iSTART games also include narrative scenarios and other challenges to further motivate reading strategy practice. For instance, students are rewarded with "iBucks" during gameplay, which can be "spent" to unlock additional game backgrounds or to customize personal avatar characters <ref type="bibr">[54]</ref>. In addition, students receive immediate feedback during or after gameplay to support their self-monitoring and engagement <ref type="bibr">[52]</ref>. For example, in Showdown, students compete against a computer-controlled player to explain target sentences in given texts. At the end of each round, the system evaluates students' answers and informs them of their performance ("poor", "fair", "good", or "great"). 
Meanwhile, the performance scores of students are compared with the computer-controlled player to determine who wins the round (see Figure <ref type="figure">1</ref>).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.2.">Adaptivity Facilitated by Assessments in iSTART</head><p>iSTART implements both inner-loop and outer-loop adaptivity to customize instruction to individual students. Inner-loop adaptivity refers to the immediate feedback students are given when they complete an individual task, and outer-loop adaptivity refers to the selection of subsequent tasks based on students' past performance <ref type="bibr">[57]</ref>.</p><p>Regarding inner-loop adaptivity, the generative games utilize NLP (e.g., LSA) and machine learning algorithms to assess constructed responses, and then provide holistic scores and actionable, individualized feedback <ref type="bibr">[55,</ref><ref type="bibr">56]</ref>. Within identification games, answers are assessed by matching students' selections against predetermined answers. The system then provides timely feedback including response accuracy, explanations of why the responses are correct or incorrect, and game performance scores.</p><p>To further promote skill acquisition, iSTART complements the inner-loop with outer-loop adaptations, which select practice texts based on the student model and the instructional model. An ITS typically employs three elements to assess students and select appropriate tasks: the domain model, the student model, and the instructional model <ref type="bibr">[57]</ref><ref type="bibr">[58]</ref><ref type="bibr">[59]</ref>. The domain model represents ideal expert knowledge and may also address common student misconceptions. The domain model is usually created using detailed analyses of the knowledge elicited from subject matter experts. The student model represents students' current understanding of the subject matter, and it is constructed by examining student task performance in comparison to the domain model. Finally, the instructional model represents the instructional strategies. 
It is used to select instructional content or tasks based on inferences about student knowledge and skills. iSTART creates student models using students' self-explanation scores and scores on multiple-choice measures. The instructional model then determines the features of each presented task (i.e., text difficulty and scaffolds to support comprehension) using the evolving student model. For example, subsequent texts become more difficult if students' self-explanation quality on prior texts is higher. Conversely, when students' self-explanation quality is lower, the subsequent texts become easier <ref type="bibr">[54]</ref>.</p><p>Current adaptation facilitated by assessments in iSTART is at the micro-level. Specifically, the assessments are task-specific and may not transfer to the games implementing different types of tasks. For example, students' scores on self-explanation games may not reflect their performance in question-asking games. As such, students' self-explanation scores are not leveraged to guide the adaptivity (e.g., learning material selection) in the question-asking module. We anticipate that task-general assessments, such as reading skill assessments, have strong potential to supplement current assessments to guide the macro-level adaptation across modules.</p></div>
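<div xmlns="http://www.tei-c.org/ns/1.0"><p>The outer-loop rule described above (harder texts after higher self-explanation quality, easier texts after lower quality) can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration: the five-level difficulty scale, the two thresholds, the three-score window, and the names StudentModel and next_text_level are hypothetical and are not taken from iSTART.</p><ab><![CDATA[
```python
from dataclasses import dataclass

# Hypothetical values for illustration only; iSTART's actual thresholds
# and difficulty scale are not given in the text.
MIN_LEVEL, MAX_LEVEL = 1, 5
RAISE_AT, LOWER_AT = 2.0, 1.0  # on the 0-3 self-explanation scale


@dataclass
class StudentModel:
    """Running record of self-explanation scores (0 = poor ... 3 = great)."""
    scores: list

    def recent_quality(self, window: int = 3) -> float:
        """Mean of the most recent scores (assumes at least one score)."""
        recent = self.scores[-window:]
        return sum(recent) / len(recent)


def next_text_level(current_level: int, student: StudentModel) -> int:
    """Outer-loop step: choose the difficulty of the next practice text."""
    if not student.scores:
        return current_level  # nothing observed yet, keep the level
    quality = student.recent_quality()
    if quality >= RAISE_AT:
        return min(current_level + 1, MAX_LEVEL)  # harder text
    if LOWER_AT >= quality:
        return max(current_level - 1, MIN_LEVEL)  # easier text
    return current_level
```
]]></ab><p>Under these assumed thresholds, a student at level 2 with recent scores of 3, 3, and 2 would move to level 3, whereas recent scores of 0, 1, and 1 would move the student to level 1.</p></div>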
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Participants</head><p>Participants were 246 adults recruited on Amazon Mechanical Turk (an online platform). Thirty-five participants were excluded from the study due to failing an attention check, which resulted in the final sample of 211 participants (98 female, 113 male). In the final sample, 77.2% of participants identified as Caucasian, 10.9% as African American, 6.6% as Hispanic, 4.3% as Asian, and 1.0% as another race/ethnicity. Participants were 37.2 years old, on average, with a range of 17 to 68. Most participants (81.5%) reported holding a Bachelor's or advanced degree.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3.">Procedure, Materials, and Measures</head><p>Participants first responded to a demographic questionnaire, and then played three iSTART games: Vocab Flash, Dungeon Escape, and Adventurer's Loot. Game order was counterbalanced. After every game, participants completed a brief questionnaire regarding their enjoyment. In the final step of the study, participants completed the Gates-MacGinitie Reading Test (GMRT), which included a vocabulary subtest and a comprehension subtest.</p><p>Gates-MacGinitie Reading Test (GMRT). Participants' reading skills were measured by the Gates-MacGinitie Reading Test (GMRT) level 10/12 form S. The GMRT is an established and reliable measure of reading comprehension (&#945; = 0.85-0.92) <ref type="bibr">[60]</ref>, which comprises both vocabulary and comprehension subtests. The vocabulary subtest (10 min) includes 45 multiple-choice questions that ask participants to choose the correct definition of target words in the given sentences. The comprehension subtest (20 min) consists of a series of textual passages with two to six multiple-choice comprehension questions per passage. There are a total of 48 questions.</p><p>Vocabulary and reading comprehension skills were operationally defined as the total number of correct answers on the vocabulary and reading comprehension GMRT subtests, respectively.</p><p>Vocab Flash. In this game, students read a target word and must choose a synonym out of four alternatives (i.e., one correct choice and three incorrect foils). Students are allotted 5 min to respond to as many terms as possible. For each target word, students are only allowed one attempt to select the answer. After students submit the answer, they receive feedback that (a) indicates whether their answer is correct and (b) clearly highlights the correct response. One key feature of the game is its adaptivity. 
The target words are classified into nine different levels of difficulty based on their frequency rating in the Corpus of Contemporary American English (COCA) <ref type="bibr">[61]</ref>. Uncommon words are typically more challenging. The game begins with the easiest words and progresses to higher levels of difficulty as students answer correctly. However, students can also return to easier levels after repeated errors. As in computer-adaptive testing <ref type="bibr">[62]</ref>, students can fluctuate between levels of difficulty, but more skilled students will generally encounter more difficult items. Game performance in Vocab Flash was measured by the proportion of correctly answered questions.</p><p>Dungeon Escape. Dungeon Escape is an iSTART game in which students are knights trapped in a dungeon. The way to escape is to earn points by selecting topic sentences of given texts. Each student must complete six texts that are randomly selected from the game's science text pool. Four alternative sentences (i.e., one correct answer and three incorrect foils) are provided for each text. Students are allowed multiple attempts for each question, and they proceed to the next text by selecting the correct answer. Performance in Dungeon Escape was measured by the proportion of correct answers on students' first attempts only, because students could otherwise game the system (e.g., try all of the answers sequentially).</p><p>Adventurer's Loot. Adventurer's Loot is an iSTART game in which students are asked to discover the hidden treasures on a map by selecting the main ideas of given texts. There are eight sites on the map, and each site corresponds to a specific text. Students can select a site from the map to explore and work on the corresponding text. Students are allowed multiple attempts on a text. The only way to proceed to the next text is by answering the question correctly, namely, selecting all the main ideas. 
Importantly, the number of correct answers (i.e., main ideas) in this game varies between texts. For texts with multiple correct answers, an incorrect response may either omit main ideas or include distractors. To be sensitive to different error types and potential user attempts to game the system, d prime was used as the performance measure of Adventurer's Loot. It was based on (1) the proportion of correctly identified main ideas in the first attempt, which was computed as the number of correctly selected answers divided by the number of correct answers, and (2) the proportion of incorrectly selected distractors in the first attempt, which was computed as the number of incorrectly selected answers (i.e., selected distractors) divided by the number of distractors. D prime was calculated as the z score of (1) minus the z score of (2).</p><p>Game Survey. After each game, participants responded to six items pertaining to their subjective game enjoyment. These questions were derived from measures implemented in prior studies: (1) This game was fun to play; (2) This game was frustrating; (3) I enjoyed playing this game; (4) This game was boring; (5) The tasks in this game were easy; and (6) I would play this game again. Participants rated their agreement with these statements on a 6-point Likert scale ranging from "1" (strongly disagree) to "6" (strongly agree). Student enjoyment of the game was operationalized as the average score of the six items.</p></div>
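<div xmlns="http://www.tei-c.org/ns/1.0"><p>The d prime measure described for Adventurer's Loot can be sketched as follows. This is a minimal illustration and not the authors' analysis code: the function names are hypothetical, the counts are assumed to be pooled over a participant's first attempts, and the handling of proportions of exactly 0 or 1 (where the z score is infinite) is an assumed convention, since the text does not state how such cases were treated.</p><ab><![CDATA[
```python
from statistics import NormalDist


def clipped_rate(count: int, total: int) -> float:
    """Proportion count/total, kept strictly between 0 and 1.

    Assumption: rates of exactly 0 or 1 are replaced by 1/(2*total) and
    1 - 1/(2*total), a common convention, because the z score is
    infinite at those values. The source does not specify this step.
    """
    rate = count / total
    floor = 1 / (2 * total)
    return min(max(rate, floor), 1 - floor)


def d_prime(hits: int, n_main_ideas: int,
            false_alarms: int, n_distractors: int) -> float:
    """z(proportion of main ideas selected) - z(proportion of distractors selected)."""
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    hit_rate = clipped_rate(hits, n_main_ideas)
    false_alarm_rate = clipped_rate(false_alarms, n_distractors)
    return z(hit_rate) - z(false_alarm_rate)
```
]]></ab><p>For example, selecting 3 of 4 main ideas and 1 of 4 distractors gives z(0.75) - z(0.25), or about 1.35, whereas selecting every option indiscriminately yields a d prime of 0, which is why the measure is robust to trying all answers at once.</p></div>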
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.4.">Statistical Analyses</head><p>Internal consistency between survey items (i.e., reliability) was measured using Cronbach's alpha calculated with the following formula:</p><p><formula notation="TeX">\alpha = \frac{N c}{v + (N - 1) c}</formula></p><p>where N = number of items, c = mean covariance between items, and v = mean item variance. Two items, "This game was frustrating" and "This game was boring," were reverse coded before the calculation such that all of the items indicated positive attitudes toward the games.</p><p>Hierarchical linear regressions were conducted to determine whether student game performance and enjoyment predict reading test performance. More specifically, game performance scores and enjoyment scores for each game (Vocab Flash, Dungeon Escape, and Adventurer's Loot) were used to predict vocabulary test scores and comprehension test scores.</p><p>Finally, hierarchical linear regressions were conducted to examine whether participants' performance and enjoyment combined across all three games were better predictors of reading skills than performance and enjoyment of each individual game.</p></div>
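<div xmlns="http://www.tei-c.org/ns/1.0"><p>As an illustration of the reliability calculation, the sketch below computes Cronbach's alpha as N c / (v + (N - 1) c), using the quantities defined above, after reverse coding negatively worded items on the 1 to 6 scale. The function names and the data layout (one list of ratings per item) are assumptions made for the example, not the authors' implementation.</p><ab><![CDATA[
```python
from itertools import combinations
from statistics import mean, variance

SCALE_MIN, SCALE_MAX = 1, 6  # 6-point Likert scale used in the survey


def reverse_code(rating: int) -> int:
    """Flip a rating so that 1 becomes 6, 2 becomes 5, and so on."""
    return SCALE_MIN + SCALE_MAX - rating


def covariance(x, y):
    """Sample covariance of two equally long rating lists."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)


def cronbach_alpha(items):
    """items: one list of ratings per survey item, same respondents in each."""
    n = len(items)                                   # N, number of items
    v = mean(variance(col) for col in items)         # mean item variance
    c = mean(covariance(a, b)                        # mean inter-item covariance
             for a, b in combinations(items, 2))
    return (n * c) / (v + (n - 1) * c)
```
]]></ab><p>A negatively worded item such as "This game was boring" would be passed through reverse_code before being included in items, for example [reverse_code(r) for r in boring_ratings], where boring_ratings is a hypothetical list of raw responses.</p></div>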
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Results</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Survey Item Internal Consistency</head><p>Cronbach's alpha values for the six survey items were 0.68 for Vocab Flash, 0.73 for Dungeon Escape, and 0.82 for Adventurer's Loot. A generally accepted rule is that an alpha of 0.6-0.7 indicates an acceptable level of reliability, and 0.8 or greater a good level <ref type="bibr">[63]</ref>. Therefore, the scores indicated acceptable to good internal consistency between the survey items.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Descriptive Statistics of Predictor and Predicted Variables</head><p>Table <ref type="table">1</ref> provides descriptive statistics of participants' performance on the vocabulary and comprehension subtests, as well as their performance and enjoyment scores for the three games. Reading test performance scores were calculated as the proportion of correct answers in each subtest. Game performance scores were calculated using either the proportion of correct answers or the proportions of correct and incorrect answers, depending on the game. Game enjoyment scores were calculated as the average of participants' ratings on the game enjoyment survey. As shown in Table <ref type="table">1</ref>, participants' vocabulary and comprehension test scores were strongly and positively correlated (r = 0.76). The correlations between game performance scores and reading test scores differed across games. Participants tended to enjoy playing the games, particularly Vocab Flash (M = 4.34). However, the strength and direction of the correlations between game enjoyment and participants' reading test scores varied between games.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Predicting Vocabulary Knowledge with Individual Game Performance</head><p>Using hierarchical linear regression analyses, we explored whether vocabulary test scores could be predicted based on game performance and enjoyment for each game. Game performance measures were entered as predictors in Model 1, and then both game performance and enjoyment were entered as predictors in Model 2. More specifically, performance measures refer to the proportion of correct answers in Vocab Flash, the proportion of correct answers for the first attempts in Dungeon Escape, and d prime scores for the first attempts in Adventurer's Loot (the calculation of d prime scores is introduced in Section 2.3). Enjoyment refers to the average of participants' self-reported scores on the survey items. As shown in Table <ref type="table">2</ref>, participants' performance in Vocab Flash was a strong predictor and explained 57% of the variance in their vocabulary test scores. Their enjoyment of the game did not account for additional variance. For Dungeon Escape and Adventurer's Loot, participants' performance scores were again significant predictors of their vocabulary test scores, accounting for 20% of the variance for each game. The additional variance explained by game enjoyment was higher in Adventurer's Loot (24%) than in Dungeon Escape (6%).</p></div>
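<div xmlns="http://www.tei-c.org/ns/1.0"><p>The two-step comparison used in the hierarchical regressions (performance alone in Model 1, performance plus enjoyment in Model 2) amounts to comparing the explained variance of two nested ordinary least squares models. The sketch below illustrates this with plain Python. The helper names and the toy data are hypothetical, and the study's actual analysis software is not reported in this section, so this is not a reproduction of the reported models.</p><p><code lang="python"><![CDATA[
```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def r_squared(predictors, y):
    """R^2 of an OLS fit of y on the predictor columns plus an intercept."""
    rows = [[1.0] + list(vals) for vals in zip(*predictors)]
    k = len(rows[0])
    # Normal equations: (X'X) beta = X'y.
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(b * x for b, x in zip(beta, r)) for r in rows]
    y_mean = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - y_mean) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Hypothetical scores for six participants in one game.
performance = [0.2, 0.4, 0.5, 0.7, 0.9, 0.6]
enjoyment = [5.0, 4.5, 3.0, 4.0, 2.5, 3.5]
vocabulary = [0.30, 0.45, 0.60, 0.65, 0.90, 0.62]

r2_model1 = r_squared([performance], vocabulary)             # Model 1
r2_model2 = r_squared([performance, enjoyment], vocabulary)  # Model 2
delta_r2 = r2_model2 - r2_model1  # additional variance explained by enjoyment
print(f"Model 1 R^2 = {r2_model1:.3f}, Model 2 R^2 = {r2_model2:.3f}, "
      f"delta R^2 = {delta_r2:.3f}")
```
]]></code></p><p>Because Model 1 is nested in Model 2, the difference in R squared cannot be negative; whether it is statistically significant would be assessed separately, for example with an F test on the change.</p></div>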
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4.">Predicting Comprehension Test Scores with Individual Game Performance</head><p>As with vocabulary, hierarchical linear regression analyses sought to predict comprehension test scores based on game performance and enjoyment for Vocab Flash, Dungeon Escape, and Adventurer's Loot. In Model 1, game performance was entered as the sole predictor of comprehension test scores. Specifically, performance was measured by participants' proportion of correct answers in Vocab Flash, proportion of correct answers for the first attempts in Dungeon Escape, and d prime scores for the first attempts in Adventurer's Loot. In Model 2, both performance and enjoyment were entered as predictors of comprehension. For Vocab Flash, participants' performance scores explained 36% of the variance in comprehension test scores, and enjoyment explained no additional variance. For Dungeon Escape and Adventurer's Loot (see Table <ref type="table">3</ref>), participants' performance scores were significant predictors, accounting for 25% and 18% of the variance in the comprehension test scores, respectively. Enjoyment accounted for an additional 2% of the variance in Dungeon Escape and 18% in Adventurer's Loot.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.5.">Predicting Vocabulary Knowledge from Performance Combined across Games</head><p>In addition to the analyses examining each game separately, a hierarchical linear regression was conducted to predict vocabulary test scores based on participants' performance and enjoyment across all three games combined. The performance measures in Vocab Flash, Dungeon Escape, and Adventurer's Loot were predictors of vocabulary test scores in Model 1. More specifically, the predictors were participants' proportion of correct answers in Vocab Flash, proportion of correct answers for the first attempts in Dungeon Escape, and d prime scores for the first attempts in Adventurer's Loot. Performance scores in all three games were significant predictors of vocabulary test scores and together explained 65% of the variance. The explained variance was higher than that explained by the performance measure of any individual game. Model 2 predicted vocabulary test scores with both performance and enjoyment scores in the three games. Game enjoyment scores added 9% of explained variance in vocabulary test scores (see Table <ref type="table">4</ref>).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.6.">Predicting Comprehension Test Scores from Performance Combined across Games</head><p>A second hierarchical linear regression sought to predict comprehension test scores based on participants' game performance and enjoyment across all three games. Model 1 included only the game performance measures in the three games as predictors. Results indicated that the performance scores of all three games were significant predictors of comprehension test scores and together accounted for 49% of the variance. Model 2 included both game performance and enjoyment measures in the three games as predictors. The significant predictors of comprehension test scores were participants' performance in Vocab Flash and Dungeon Escape and their enjoyment of Adventurer's Loot. The additional variance explained by game enjoyment beyond the performance scores was 6% (see Table <ref type="table">5</ref>).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Discussion</head><p>In the current study, we investigated the feasibility of game-based stealth literacy assessment using games from the iSTART ITS. Specifically, we explored the degree to which learners' game performance and enjoyment in three games (i.e., Vocab Flash, Dungeon Escape, and Adventurer's Loot) predicted their vocabulary knowledge and reading comprehension skills. In addition, we examined whether the associations between reading skills and game performance were moderated by participants' enjoyment of the games.</p><p>Our results suggest that game performance was predictive of participants' reading skills. Specifically, performance in Vocab Flash, Dungeon Escape, and Adventurer's Loot accounted for 57%, 20%, and 20%, respectively, of the variance in participants' vocabulary knowledge. The explained variance increased to 65% when the combined performance of all three games was used as a predictor. Performance in Vocab Flash, Dungeon Escape, and Adventurer's Loot explained 36%, 25%, and 18%, respectively, of the variance in participants' comprehension. The explained variance increased to 49% when using the combined performance across all three games as a predictor. These findings demonstrate that student performance in relatively well-designed reading games can provide valid measures of reading skills. As such, reading games may be a viable alternative to standardized reading tests, which can render the testing experience more enjoyable, motivating, and engaging <ref type="bibr">[5,</ref><ref type="bibr">52,</ref><ref type="bibr">64]</ref>. Another benefit of using games for assessments is that students can be assessed during gameplay, without being interrupted or feeling "tested". 
This stealthy approach may reduce test anxiety, and in turn increase the reliability of the assessment <ref type="bibr">[18,</ref><ref type="bibr">19,</ref><ref type="bibr">65]</ref>.</p><p>One approach to assessing reading skills in prior research has been in the context of constructed responses wherein students generate self-explanations or essays. The linguistic features of those responses were found to be indicative of students' vocabulary knowledge and comprehension skills, which suggests the feasibility of stealth reading assessment using games that embed open-ended questions <ref type="bibr">[26,</ref><ref type="bibr">27]</ref>. This study took a different approach by focusing on games implementing multiple-choice questions. Notably, students are less likely to perceive the tasks as multiple-choice questions because they are presented in the context of games. Our results indicate that such games can also provide a means to stealthily assess reading skills, which complements the use of NLP methods for reading assessment. In the context of serious games and ITSs, stealth assessment affords ways to evaluate students' literacy skills and update student models as they naturally interact with the software. The stealth assessment of students' reading skills can augment the macro-level adaptation of an ITS, such as guiding students' practice across modules. For example, the system may recommend that students with lower reading skills play more summarization games, but direct students with higher reading skills to engage in more difficult practice within the self-explanation module.</p><p>Another focus of this study was game enjoyment. Although participants tended to enjoy the games, game enjoyment was associated with reading skills differently in the three games. Reading skill and enjoyment of Vocab Flash were not correlated. 
However, reading skill was negatively correlated with enjoyment of Dungeon Escape and Adventurer's Loot: participants who enjoyed these two main idea games more had lower scores on the reading skills tests. This result is consistent with the notion that those who are more likely to perform poorly and potentially be frustrated or anxious during a traditional test are also more likely to appreciate playing a game rather than taking a test. Conversely, participants with higher reading skills may have had more positive experiences taking and succeeding on traditional tests, and thus had less appreciation for the games.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Conclusions and Implications</head><p>Literacy assessment is a key component of any effort to improve learners' literacy or remediate potential gaps in education. However, such assessments can be slow, disconnected from learning experiences, anxiety-inducing, or boring. The results of this study have an important practical implication: they provide a means to measure learners' literacy skills in real time via games. Stealth assessment via serious games can also inform adaptive instructional paths for students. Serious games and intelligent tutors may act as scaffolding through which less skilled readers receive more personalized instruction to enhance their skills. Furthermore, game-based assessment can replace traditional paper-based literacy measures, which are decontextualized, cause stress and anxiety for test takers, and in turn negatively impact the reliability of these assessments <ref type="bibr">[66]</ref><ref type="bibr">[67]</ref><ref type="bibr">[68]</ref>.</p><p>Notably, the games used for stealth assessment in this study were relatively simplistic. Thus, our findings indicate that simple, relatively inexpensive games can be leveraged to assess skills. Nonetheless, more elaborate, immersive games with comparable embedded pedagogical features have strong potential to augment the power of stealth assessment. Note that most participants in this study had Bachelor's or advanced degrees. Future research will involve a broader range of participants with respect to prior education, which will enable assessment of the generalizability of the current findings.</p></div></body>
		</text>
</TEI>
