<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Examining Limits of Small Multiples: Frame Quantity Impacts Judgments with Line Graphs</title></titleStmt>
			<publicationStmt>
				<publisher>IEEE Transactions on Visualization and Computer Graphics</publisher>
				<date>01/01/2024</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10503942</idno>
					<idno type="doi">10.1109/TVCG.2024.3372620</idno>
					<title level='j'>IEEE Transactions on Visualization and Computer Graphics</title>
<idno>1077-2626</idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Helia Hosseinpour</author><author>Laura E. Matzen</author><author>Kristin M. Divis</author><author>Spencer C. Castro</author><author>Lace Padilla</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Small multiples are a popular visualization method, displaying different views of a dataset using multiple frames, often with the same scale and axes. However, there is a need to address their potential constraints, especially in the context of human cognitive capacity limits. These limits dictate the maximum information our mind can process at once. We explore the issue of capacity limitation by testing competing theories that describe how the number of frames shown in a display, the scale of the frames, and time constraints impact user performance with small multiples of line charts in an energy grid scenario. In two online studies (Experiment 1 n = 141 and Experiment 2 n = 360) and a follow-up eye-tracking analysis (n = 5), we found a linear decline in accuracy with increasing frames across seven tasks, which was not fully explained by differences in frame size, suggesting visual search challenges. Moreover, the studies demonstrate that highlighting specific frames can mitigate some visual search difficulties but, surprisingly, not eliminate them. This research offers insights into optimizing the utility of small multiples by aligning them with human limitations.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Fig. <ref type="figure">1</ref>: Example of the small multiple stimuli used in Experiment 1 that varied in frame quantity from 2 to 70, incremented by four frames. The stimuli depicted power (in megawatts) over time (one year per frame).</p><p>Abstract: Small multiples are a popular visualization method, displaying different views of a dataset using multiple frames, often with the same scale and axes. However, there is a need to address their potential constraints, especially in the context of human cognitive capacity limits. These limits dictate the maximum information our mind can process at once. We explore the issue of capacity limitation by testing competing theories that describe how the number of frames shown in a display, the scale of the frames, and time constraints impact user performance with small multiples of line charts in an energy grid scenario. In two online studies (Experiment 1 n = 141 and Experiment 2 n = 360) and a follow-up eye-tracking analysis (n = 5), we found a linear decline in accuracy with increasing frames across seven tasks, which was not fully explained by differences in frame size, suggesting visual search challenges. Moreover, the studies demonstrate that highlighting specific frames can mitigate some visual search difficulties but, surprisingly, not eliminate them. This research offers insights into optimizing the utility of small multiples by aligning them with human limitations.</p><p>Index Terms: Small multiples, time-series data, cognition</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>I. INTRODUCTION</head><p>Small multiples have been a popular visualization technique since the late 1800s <ref type="bibr">[1]</ref>. They present different views of a dataset through multiple small frames <ref type="bibr">[2]</ref>. These frames maintain a consistent scale and axes and are typically arranged in a two-dimensional grid layout <ref type="bibr">[3]</ref>- <ref type="bibr">[5]</ref>. 
Visualizations from displays of election data to population demographic shifts utilize small multiples <ref type="bibr">[6]</ref>, <ref type="bibr">[7]</ref>. Their popularity can likely be attributed to extensive research demonstrating their effectiveness in conveying complex data trends <ref type="bibr">[8]</ref>- <ref type="bibr">[12]</ref>.</p><p>Although previous research has demonstrated small multiples' effectiveness <ref type="bibr">[8]</ref>, their potential constraints remain under-investigated. Notably, human cognitive capacity limits and visual search capabilities could inform guidelines for designing small multiples for large datasets. Cognitive capacity limits refer to the maximum amount of information the mind can effectively process and retain at any given moment <ref type="bibr">[13]</ref>. Visual search capabilities are how the visual system completes pattern recognition and identifies task-relevant items in an array <ref type="bibr">[14]</ref>. Cognitive capacity limits and visual search capabilities are two theories that could inform when and how small multiples become less effective. However, no work has evaluated how cognitive processes drive the use of small multiples.</p><p>Missing guidance about the cognitive underpinning of small multiples becomes apparent in environments such as energy grid control rooms (see Figure <ref type="figure">2</ref>). Energy grid control displays might feature over 100 small multiple frames, potentially challenging cognitive limits, and visual search capabilities. Without guidance on the impact of frame quantity on performance, designers may produce displays that cognitively overload analysts, potentially leading to errors, such as data misinterpretation or oversight. 
Understanding the relationship between cognitive mechanisms and small multiples is even more critical when analysts face time <ref type="bibr">[15]</ref> or display constraints <ref type="bibr">[16]</ref>, which underscores the need for aligning small multiple design guidelines with human cognitive abilities.</p><p>Fig. <ref type="figure">2</ref>: Pennsylvania-New Jersey-Maryland (PJM) Interconnection control room, using numerous small multiples <ref type="bibr">[17]</ref>.</p><p>the scale of the frames and the presence of time constraints. Finally, in a small follow-up eye-tracking investigation, we explored participants' visual search strategies during the tasks.</p><p>Overview and Contributions: This work contributes the first empirical evidence that increasing the number of frames in small multiples results in a linear decline in accuracy. Although the notion that larger frame quantities are more difficult to use may seem obvious, this observation challenges the practice of using small multiples in vast datasets without adding appropriate cognitive supports. For example, we found that highlighting specific frames for comparison can mitigate the negative impact of increased frame quantities on accuracy. The second significant contribution of this work is converging evidence that visual search strategies are an essential driver of small multiples usage. This finding implies that interventions that make the visual search process easier may provide the most benefit for small multiples. Interventions such as zooming, panning, and filtering <ref type="bibr">[18]</ref> would allow users to explore specific frames selectively, reducing the reliance on visual search. 
This work also provides an example of using converging evidence to test competing cognitive theories, which is a practice that could strengthen theory development in visualization research <ref type="bibr">[19]</ref>- <ref type="bibr">[21]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>II. RELATED WORK A. Small Multiples</head><p>Small multiples can mitigate the risk of overlooking critical information patterns by providing a broader view of the data <ref type="bibr">[12]</ref>. Their application includes multifaceted exploration <ref type="bibr">[22]</ref>, pattern exploration <ref type="bibr">[23]</ref>, and general data exploration <ref type="bibr">[12]</ref>. Small multiples also facilitate direct visual comparisons, particularly for comparing, exploring, and analyzing time series data <ref type="bibr">[8]</ref>, <ref type="bibr">[9]</ref>, <ref type="bibr">[24]</ref>- <ref type="bibr">[27]</ref>. Furthermore, they facilitate data parameterization <ref type="bibr">[28]</ref>, <ref type="bibr">[29]</ref>.</p><p>Researchers have examined various tasks with small multiples, including topology-based, adjacency, and connectivity tasks <ref type="bibr">[11]</ref>; trend comparison <ref type="bibr">[8]</ref>, <ref type="bibr">[10]</ref>; maximum, slope, and discrimination commonly used for temporal visualizations <ref type="bibr">[9]</ref>; and visual exploration such as identify, correlate, compare, and cluster tasks <ref type="bibr">[12]</ref>. These tasks often involve questions with correct answers, allowing for accuracy assessment.</p><p>A study examining the effectiveness of trend animation in simulated presentation and analysis scenarios found that small multiples result in more accurate visualization analysis than trend animation <ref type="bibr">[8]</ref>. Additionally, studies have found that small multiples lead to expedited task completion (although results can be task dependent), fewer errors, and improved accuracy when contrasted with trend animation <ref type="bibr">[10]</ref>, <ref type="bibr">[11]</ref>, <ref type="bibr">[30]</ref>. 
Scholars recommend maintaining the same encoding across frames <ref type="bibr">[31]</ref>- <ref type="bibr">[33]</ref> and chart type arrangements <ref type="bibr">[34]</ref>. Another study addressed small multiples' success with bar and line encodings across resolutions and numbers of displays <ref type="bibr">[35]</ref>.</p><p>Finally, previous research explored different quantities of frames. For instance, in the study comparing trend animation, a static image, and small multiples, researchers performed tests with small (i.e., 18) and large (i.e., 80) quantities of frames, finding that participants made fewer errors with the smaller dataset <ref type="bibr">[8]</ref>. Across the spectrum of studies, frame quantities tested have included 2 <ref type="bibr">[27]</ref>, <ref type="bibr">[34]</ref>, 4 <ref type="bibr">[9]</ref>, 12 <ref type="bibr">[36]</ref>, 16 <ref type="bibr">[10]</ref>, with some studies capping the number at 25 <ref type="bibr">[12]</ref>, 48 <ref type="bibr">[37]</ref>, or even 1,178 frames <ref type="bibr">[4]</ref>. However, questions remain regarding the performance of different task types over different quantities of frames, especially within the framework of cognitive capacity.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Cognitive Effort vs. Visual Search Theories</head><p>To investigate the relationship between the number of frames and task performance in small multiples, we conducted a series of studies designed to test two competing cognitive theories. These theories make distinct predictions about how accuracy varies with different frame quantities.</p><p>The first theory pertains to cognitive capacity limits <ref type="bibr">[38]</ref>. It suggests that for tasks where users need to retain information mentally, there is a threshold beyond which they cannot or will not hold more information <ref type="bibr">[39]</ref>. Previous research in cognitive science has shown that performance declines once users reach their capacity limit <ref type="bibr">[40]</ref>. If the cognitive capacity limit theory describes performance with small multiples, there should be a point at which performance dramatically declines for tasks requiring users to store data points in their minds and compare across frames. This pattern would resemble an inverse sigmoidal curve as seen in past work (e.g., <ref type="bibr">[40]</ref>, <ref type="bibr">[41]</ref>). Understanding the specific number of frames at which performance deteriorates would be crucial for designers to know and account for in their designs.</p><p>Another viable theory suggests that working with small multiples does not rely heavily on cognitive capacity. Instead, users might rely on their visual system for pattern recognition and identification of specific data points. Visual search tasks use cognitive effort, but the effort does not increase drastically with larger search arrays <ref type="bibr">[14]</ref>. Based on the visual search theory, we would expect a linear decline in performance as the number of frames increases <ref type="bibr">[14]</ref>, <ref type="bibr">[42]</ref>. 
Such a decline could imply that the main challenge of adding more frames is that it becomes more time-consuming and requires greater effort for users to identify the interesting frames. Determining which theory better describes performance will help designers create visual supports that cognitively assist their users.</p><p>participants' accuracy. Experiment 2 introduced controlled conditions for frame scale and time constraints. Finally, we conducted a supplemental eye-tracking study to examine participants' visual search patterns.</p><p>Experiment 1: In this experiment, we evaluated how an increase in small multiple frames impacted the accuracy of participants' responses. We created a set of small multiples with full-scale frames that varied in frame quantity (see Figure <ref type="figure">1</ref>). Scale refers to the size of the frames, and full-scale means that the frames were scaled to fill the screen.</p><p>Experiment 2: For the second experiment, we added two conditions to control the frames' scale and the time taken to complete the tasks. We created a set of fixed-scale stimuli for which all frames remained the same size regardless of frame quantity (Figure <ref type="figure">4</ref> shows examples). We also controlled for the time taken to complete each task. We presumed that with larger frame quantities, participants would take longer to complete the tasks, which would interact with accuracy. We included a condition in which participants had unlimited time and one in which they had 40 seconds to complete each task. We chose 40 seconds by testing the study with our five research assistants and averaging their times.</p><p>Follow-up eye-tracking: Finally, we conducted an exploratory eye-tracking study to examine how different tasks and frame sizes impacted participants' visual search strategies.</p></div>
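<div xmlns="http://www.tei-c.org/ns/1.0"><p>The two competing predictions from Section II-B can be contrasted with a minimal Python sketch. It is an illustration only, not the authors' analysis (which was run in R): the function names and every parameter value (the capacity limit of 30 frames, the steepness, the floor and ceiling, and the linear slope) are hypothetical and chosen only to show the shape of each curve over the Experiment 1 frame quantities.</p>
<eg><![CDATA[
```python
import math


def capacity_limit_accuracy(frames, limit=30.0, steepness=0.3,
                            floor=0.2, ceiling=0.95):
    """Inverse sigmoid: near-ceiling accuracy that drops sharply around `limit`.

    All default values are hypothetical, not estimates from the study.
    """
    drop = 1.0 / (1.0 + math.exp(-steepness * (frames - limit)))
    return ceiling - (ceiling - floor) * drop


def visual_search_accuracy(frames, intercept=0.95, slope=-0.01, floor=0.0):
    """Linear decline: every added frame costs the same amount of accuracy.

    The intercept and slope are hypothetical, not estimates from the study.
    """
    return max(floor, intercept + slope * frames)


# Frame quantities used in Experiment 1: 2 to 70 in steps of four.
frame_quantities = list(range(2, 71, 4))

if __name__ == "__main__":
    for n in frame_quantities:
        print(n,
              round(capacity_limit_accuracy(n), 3),
              round(visual_search_accuracy(n), 3))
```
]]></eg>
<p>Under these assumed parameters, the capacity-limit curve stays flat and then collapses near the assumed limit, whereas the visual-search curve loses the same amount of accuracy with each step in frame quantity. The shape of the observed decline is what lets the two accounts be told apart.</p></div>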
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Tasks, Design, and Procedures</head><p>To complete the tasks, the participants were asked to assume the role of an energy grid operator, and their job was to monitor the energy output. We used Brehmer and Munzner's <ref type="bibr">[43]</ref> multilevel task typology to guide the selection of the following seven tasks under the Identify, Compare, and Summarize query phases.</p><p>-Identify 1: Click on one graph with the highest peak power.</p><p>-Identify 2: Click on one graph with both the biggest change and the highest average power.</p><p>-Identify 3: Click on all graph(s) where the blue line goes above the dashed gray line.</p><p>-Compare 1: Of these two graphs highlighted in blue, click on one graph with the highest peak power.</p><p>-Compare 2: Of these two graphs highlighted in blue, click on one graph with both the biggest change and the highest average power.</p><p>-Summarize 1: Is the general trend in the graphs going up or down?</p><p>-Summarize 2: What is the average power across all plots?</p><p>The Identify and Compare tasks required participants to click on the frame(s). The Summarize 1 task was a multiple-choice question, and Summarize 2 was an open-ended question for which the participants gave numerical answers. Each task had a correct answer, detailed in Section IV. The accuracy of participants' responses was evaluated using the correct answers, but they did not receive performance feedback.</p><p>We designed each task to examine different aspects of the two competing theories we were interested in testing in this work. We describe our detailed predictions in the following section (III-C).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Hypothesis</head><p>To test the competing theories, we developed a series of tasks aimed at probing aspects of each theory. We designed the Identify 1 and 3 tasks to rely primarily on visual search, where participants could scan the full display and find the frame with the highest power or those that crossed a threshold.</p><p>For the Identify 2 task, participants had to hold two pieces of information in their minds (the biggest change and the highest average power) and update them as they compared the frames. Holding several pieces of information in the mind requires mental storage and, therefore, non-negligible amounts of cognitive capacity. Similarly, both Summarize tasks required participants to mentally average the data across all the frames. We expected this activity to be highly cognitively demanding and to become more demanding with more frames. We also did not want to test conditions that confirmed only our predictions. Therefore, we designed the Compare tasks not to require any serious visual search or cognitive demands as a control.</p><p>We designed the tasks to explore aspects of the various theories, but we did not know which theory was correct. Therefore, we preregistered hypotheses more generally about overall accuracy across the tasks on the Open Science Framework (OSF), which were:</p><p>&#8226; H1A: For the Identify and Summarize tasks, we hypothesized that accuracy would worsen with increased frame quantity.</p><p>&#8226; H2A: For the Compare tasks, we hypothesized that there would be no change in performance accuracy as frame quantity increased because participants were comparing two frames.</p><p>We conducted Experiment 2 to control for some possible confounds in Experiment 1. In particular, as the number of frames increased, the size of the frames became smaller. 
It was unclear from the findings of the first experiment if our results were due to the increasing frame quantity or the decreasing resolution of each frame. The other possible confound in Experiment 1 was that participants could take longer to scan the small multiples with larger frame quantities. To control for both, we ran Experiment 2, in which we compared small multiples with fixed-sized frames to frames scaled to fill the screen. We also included conditions in which participants had either unlimited time or a time constraint. For the tasks that we designed to rely on visual search and cognitive capacity, we hypothesized that introducing time constraints would magnify the effects in those conditions by not allowing participants to improve their accuracy by simply taking longer. We preregistered the following hypotheses regarding time pressure.</p><p>&#8226; H1B: For the Identify and Summarize tasks, we hypothesized that there would be differences in performance accuracy between participants with or without time constraints.</p><p>&#8226; H2B: For Compare tasks, we hypothesized that there would be no difference in performance accuracy between participants with or without time constraints.</p><p>Design: Experiment 1 used a 7 (Identify 1, Identify 2, Identify 3, Compare 1, Compare 2, Summarize 1, and Summarize 2 tasks) &#215; 18 (2 to 70 frames, incremented by four) &#215; 20 (randomly generated seeds) mixed study design. The tasks were between-subjects, and we randomly assigned participants to one of the task groups. Frame quantities (see Figure <ref type="figure">1</ref>) and randomly generated seeds were within subjects. Frame quantities were blocked, so participants completed the 20 trials for each frame quantity together. We randomized the order in which they completed the blocks. We also presented the 20 trials within each block in a randomized order; each trial used a different seed to generate different plots of the same frame quantity. 
Each participant completed a total of 360 trials.</p><p>Experiment 1 was intentionally long and served the purpose of creating a normed dataset for determining accuracy across various stimuli, a common practice in psychological sciences (e.g., <ref type="bibr">[44]</ref>). This practice informed our selection of representative stimuli for Experiment 2, ensuring that outliers due to data generation aspects were avoided.</p><p>Experiment 2 was also a mixed 4 (full-scale unlimited time, full-scale time-constrained, fixed-scale unlimited time, and fixed-scale time-constrained) &#215; 7 (Identify 1, Identify 2, Identify 3, Compare 1, Compare 2, Summarize 1, and Summarize 2 tasks) &#215; 5 (2, 6, 10, 30, and 70 frames) design. The scale and time constraints were between-subjects, and we randomly assigned participants to one of four between-subjects groups. Tasks and frame quantities were within subjects. Tasks were blocked, so participants completed the five trials for each task together. We randomized the order in which they completed the blocks. We presented participants with the five trials of varying frame quantities within each block in a randomized order, totaling 35 trials. We selected a subset of the frame quantities to be tested with the seven tasks because the results of Experiment 1 revealed a linear trend in accuracy. We, therefore, needed up to only five frame quantities to show the trend again. Additionally, we aimed to mitigate participant fatigue, which was a concern in the lengthy Experiment 1. Experiment 2 was preregistered on OSF (link).</p><p>In the exploratory eye-tracking study, participants completed nine trials using a small subset of the stimuli from the prior experiments. The participants completed each of the seven tasks once using a 70-frame stimulus. These trials assessed how the participants' patterns of eye movements changed for the different tasks, using the maximum set size to elucidate the differences between tasks. 
The other two trials used full-scale and fixed-scale versions of a six-frame stimulus, which participants used to complete the Identify 3 task. The Identify 3 task was selected because Experiment 2 showed that changing the frame size had a significant impact on accuracy for this task. We used the six-frame stimuli because they had enough frames to require visual search and also had a substantial size difference between the fixed-and full-size frames. The nine trials were presented in a different random order for each participant.</p><p>Procedures: The first two experiments were conducted using the online survey software Qualtrics <ref type="bibr">[45]</ref>. Participants provided IRB-informed consent and were required to have a screen size of at least 9.4 &#215; 6.6 inches. They received instructions about how to zoom their browser window to 100% and had to confirm that they completed the steps. Next, participants received instructions about the experiment context, how to read the visualizations, and how to do the tasks. Then, they practiced clicking on a frame, including instructions on deselecting it if they wanted to change their answer (full instructions on OSF). Following the instructions, participants completed a prescreening attention check, and only those who passed proceeded to the study. Demographic data were collected at the end of the experiments.</p><p>Participants in Experiment 2 with time constraints were given 40 seconds per question, and a practice question was included to ensure readiness. After the time limit, the screen auto-progressed with different messages based on the participants' performance. If they answered the questions in time, they were allowed to progress. 
If they did not answer the question in time, the system warned them and prompted them to click the next button when they were prepared.</p><p>During the eye-tracking study, participants completed nine trials while their eye movements were tracked with a Tobii Spectrum eye tracker recording at 1200 Hz. They were seated with their eyes approximately 60 cm from the computer monitor but could move their heads freely. The stimuli were presented on a monitor with 1920 &#215; 1080 resolution and were 26.5 cm by 19.5 cm on the screen or approximately 25 by 18.5 degrees of visual angle. The individual frames subtended approximately 2.4 degrees of visual angle for the fixed-scale stimuli and 7.6 degrees of visual angle for the full-scale stimuli. Prior to the task, the eye tracker underwent a nine-point calibration process. At the start of each trial, participants received task instructions on the screen and initiated the task by clicking the mouse. Next, a fixation cross was presented in the center of the screen for one second, followed by the stimulus, which remained on the screen until the participant finished the task for that trial.</p></div>
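<div xmlns="http://www.tei-c.org/ns/1.0"><p>The reported visual angles follow from the stimulus size and the viewing distance through the standard relation angle = 2 atan(size / (2 x distance)). The short Python sketch below reproduces that arithmetic from the values stated above (26.5 cm by 19.5 cm at roughly 60 cm). The function names are ours, and the frame sizes in centimeters are derived here from the reported angles rather than taken from the paper.</p>
<eg><![CDATA[
```python
import math


def visual_angle_deg(size_cm, distance_cm):
    """Visual angle (degrees) subtended by an object of a given size."""
    return math.degrees(2.0 * math.atan(size_cm / (2.0 * distance_cm)))


def size_cm_from_angle(angle_deg, distance_cm):
    """Inverse relation: physical size (cm) that subtends a given angle."""
    return 2.0 * distance_cm * math.tan(math.radians(angle_deg) / 2.0)


VIEWING_DISTANCE_CM = 60.0  # approximate distance stated in the procedures

stimulus_width_deg = visual_angle_deg(26.5, VIEWING_DISTANCE_CM)
stimulus_height_deg = visual_angle_deg(19.5, VIEWING_DISTANCE_CM)

# Frame sizes implied by the reported angles (derived, not reported).
fixed_frame_cm = size_cm_from_angle(2.4, VIEWING_DISTANCE_CM)
full_frame_cm = size_cm_from_angle(7.6, VIEWING_DISTANCE_CM)

if __name__ == "__main__":
    print(f"stimulus: {stimulus_width_deg:.1f} x {stimulus_height_deg:.1f} deg")
    print(f"frames: {fixed_frame_cm:.2f} cm (fixed), {full_frame_cm:.2f} cm (full)")
```
]]></eg>
<p>The computed stimulus angles agree with the reported values of approximately 25 by 18.5 degrees, which is a quick consistency check on the viewing geometry.</p></div>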
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D. Stimuli</head><p>For Experiment 1, we generated small multiple line charts with full-scale frames. These stimuli consisted of 18 frame quantities ranging from 2 to 70, shown in Figure <ref type="figure">1</ref>, with each quantity having 20 versions. The small multiples depicted energy output, with wattage on the y-axis and time (one year per frame) on the x-axis across different regions (each frame represented one region). We simulated 20 datasets for each frame quantity using R <ref type="bibr">[46]</ref> by generating time series data, allowing the trend lines' slope, variance, and intercept to vary. The datasets included columns for time, power, and names. We repeated the date sequence for the required quantity of frames and used random numbers (seeded) from a normal distribution to generate power values for each frame. We then combined the columns to create 20 datasets per frame quantity and stored them in a list (stimuli available on OSF).</p><p>The frame array and line charts for each small multiple were generated using the facet_wrap function within the ggplot2 package <ref type="bibr">[47]</ref> in R. Additionally, we incorporated a dashed gray line into each frame, indicating a fictional critical energy threshold. We included the threshold lines primarily for our Identify 3 task, in which participants were asked to identify the frames in which the blue line exceeded the dashed gray line. We set the threshold value so that only two frames in each small multiple surpassed it. Further, for Compare tasks, we highlighted the frames we wanted the participants to compare (see Figure <ref type="figure">3</ref>). We independently selected the highlighted frames' locations for each quantity to ensure randomness. However, we maintained the distance between frames, ensuring that participants never had to compare widely spaced frames. 
The distance between the highlighted frames was determined based on the two-frame small multiple and consistently applied to all other small multiples. For each small multiple, the same two frames were highlighted for all participants.</p><p>In Experiment 2, we tested five quantities of the small multiples from Experiment 1 (2, 6, 10, 30, and 70) because the results of the first experiment revealed a linear relationship between accuracy and frame number. We selected 2, 6, and 10 as the first three intervals, 30 as the midpoint, and 70 as the last point. We tested both their full- and fixed-scale versions (also created in R) using the same data (see examples in Figure <ref type="figure">4</ref>).</p><p>The follow-up eye-tracking study used one 70-frame stimulus from Experiment 1 and the fixed- and full-size versions of a six-frame stimulus from Experiment 2.</p></div>
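<div xmlns="http://www.tei-c.org/ns/1.0"><p>The stimulus-generation steps described in this section can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the authors' R code (their stimuli are on OSF). The twelve points per frame, the parameter ranges for intercept, slope, and noise, the region names, and the rule used to place the threshold (midway between the second- and third-highest frame peaks) are all hypothetical choices made here; the paper states only that slope, variance, and intercept varied, that values were drawn from a seeded normal distribution, and that only two frames exceeded the threshold.</p>
<eg><![CDATA[
```python
import random


def simulate_frame(rng, n_points=12):
    """One frame: a linear trend plus normally distributed noise.

    n_points and all parameter ranges are hypothetical.
    """
    intercept = rng.uniform(40.0, 60.0)
    slope = rng.uniform(-2.0, 2.0)
    noise_sd = rng.uniform(1.0, 5.0)
    return [intercept + slope * t + rng.gauss(0.0, noise_sd)
            for t in range(n_points)]


def simulate_small_multiple(n_frames, seed):
    """Seeded dataset with one power series per (hypothetical) region name."""
    rng = random.Random(seed)
    return {f"Region {i + 1}": simulate_frame(rng) for i in range(n_frames)}


def threshold_for_two(frames):
    """Pick a threshold that exactly two frames exceed (assumed rule)."""
    peaks = sorted((max(values) for values in frames.values()), reverse=True)
    if len(peaks) == 2:
        # With only two frames, both must exceed the threshold.
        return min(peaks) - 1.0
    # Otherwise sit midway between the second- and third-highest peaks.
    return (peaks[1] + peaks[2]) / 2.0


if __name__ == "__main__":
    frames = simulate_small_multiple(n_frames=6, seed=1)
    threshold = threshold_for_two(frames)
    above = [name for name, values in frames.items() if max(values) > threshold]
    print(round(threshold, 2), above)
```
]]></eg>
<p>Seeding the generator per dataset mirrors the role of the 20 seeds in the design: the same seed always yields the same small multiple, so every participant sees identical plots for a given trial.</p></div>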
<div xmlns="http://www.tei-c.org/ns/1.0"><head>E. Participants</head><p>We conducted Experiments 1 and 2 online and recruited participants from Prolific <ref type="bibr">[48]</ref> who were paid the California minimum wage ($15 per hour). Experiment 1 took approximately 1.5 hours to complete, and Experiment 2 took approximately 15 minutes. All participants were prescreened to be at least 18 years old, residing in the United States, and limited to participation in only one of the two online experiments.</p><p>A total of 141 participants completed Experiment 1. There were 20 participants in each task condition, with the exception of the Summarize 1 task, which had 21 participants. This sample size was adequate due to the large number of trials (360 trials) in Experiment 1 <ref type="bibr">[49]</ref>, <ref type="bibr">[50]</ref>. There were 70 male, 65 female, and 4 nonbinary/third gender participants. One participant did not indicate their gender. The mean age of participants was 33.48 (SD = 11.72). There were 360 participants in Experiment 2, with 90 in each of the four between-subject conditions. We determined the required sample size (90 participants per group) based on an anticipated effect size calculated from prior work (f&#178; = .09) using the pwr <ref type="bibr">[51]</ref> package in R <ref type="bibr">[46]</ref>. The participant breakdown was as follows: 171 male, 174 female, 10 nonbinary/third gender, and 2 preferred not to indicate their gender. The mean age of participants was 25.64 (SD = 15.4).</p><p>The eye-tracking data were collected in person at Sandia National Laboratories. The five participants (two male and three female) were Sandia employees who were compensated for their time at their normal hourly rate. It took the participants approximately 8 minutes to complete the eye-tracking study, including the calibration phase.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>IV. RESULTS</head><p>We analyzed the participants' accuracy for each task via multilevel logistic and linear regression models using R <ref type="bibr">[46]</ref> with the tidyverse <ref type="bibr">[52]</ref> package, and we fitted the mixed-effects models with the lme4 <ref type="bibr">[53]</ref> package. We generated the visualizations using the ggplot2 <ref type="bibr">[47]</ref> and ggdist <ref type="bibr">[54]</ref> packages. In six of the tasks, we implemented binomial logistic regressions to model accuracy with coded levels of 0 = incorrect and 1 = correct. Participants responded to the seventh task (i.e., Summarize 2) with an open-ended numerical answer, and we modeled the continuous absolute error with a linear regression equation. The data and analysis are in the supplemental materials (OSF link).</p><p>Calculations were made to determine the correct responses for each task. For Identify 1, the correct response was the frame with the highest value. Identify 2's correct responses were selected by multiplying the average power value by the range of power values for each frame and determining the largest number. The coded correct responses for Identify 3 were all the frames for which the power value was larger than the horizontal line value. False positives (i.e., selecting an extra frame) and false negatives (i.e., failing to select a frame) both resulted in an incorrect response. For Summarize 1, if the starting power value was smaller than the ending power value, then the response up was coded as correct and vice versa. The Summarize 2 task required participants to respond with a value they believed to be the average power across all frames. This value was compared to the true value to determine the absolute error, which was then scaled as a proportion of the possible values based on the range of power values displayed.</p></div>
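<div xmlns="http://www.tei-c.org/ns/1.0"><p>The scoring rules described above can be written out as a short Python sketch. It is an illustration of the stated rules, not the authors' R analysis code; the function names and the three-frame toy dataset are hypothetical, and the Summarize 1 helper assumes the caller supplies the starting and ending power values that the rule compares.</p>
<eg><![CDATA[
```python
def identify1_answer(frames):
    """Identify 1: the frame containing the highest power value."""
    return max(frames, key=lambda name: max(frames[name]))


def identify2_answer(frames):
    """Identify 2: the frame with the largest (mean power x power range)."""
    def score(values):
        mean_power = sum(values) / len(values)
        return mean_power * (max(values) - min(values))
    return max(frames, key=lambda name: score(frames[name]))


def identify3_answer(frames, threshold):
    """Identify 3: every frame whose power exceeds the threshold line."""
    return {name for name, values in frames.items() if max(values) > threshold}


def identify3_correct(selected, frames, threshold):
    """Correct only on an exact match: no false positives or negatives."""
    return set(selected) == identify3_answer(frames, threshold)


def summarize1_answer(start_power, end_power):
    """Summarize 1: 'up' if the series starts lower than it ends."""
    return "up" if start_power < end_power else "down"


if __name__ == "__main__":
    # Hypothetical toy data: three frames with three time points each.
    toy = {"A": [10, 20, 30], "B": [25, 26, 27], "C": [5, 32, 5]}
    print(identify1_answer(toy))
    print(identify2_answer(toy))
    print(sorted(identify3_answer(toy, threshold=29)))
    print(identify3_correct(["A"], toy, threshold=29))
```
]]></eg>
<p>In the toy data, frame C holds the single highest value, while frame A has the largest product of mean and range, which shows how Identify 1 and Identify 2 can have different correct answers on the same display.</p></div>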
<div xmlns="http://www.tei-c.org/ns/1.0"><p><formula notation="TeX">\text{Accuracy} = 1 - \frac{\text{Absolute Error}}{\text{Maximum Power Value}}</formula></p><p>Of the 7,200 trials, two were not recorded by the Qualtrics online survey software <ref type="bibr">[45]</ref>. Four of the 7,198 total responses were removed as they were above the maximum possible power value, resulting in 7,194 total observations.</p></div>
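<div xmlns="http://www.tei-c.org/ns/1.0"><p>The scaled accuracy for the Summarize 2 task and the exclusion of out-of-range responses can be sketched in a few lines of Python. This is an illustration of the stated formula and exclusion rule, not the authors' R code; the function names and the example numbers are hypothetical.</p>
<eg><![CDATA[
```python
def summarize2_accuracy(response, true_mean, max_power):
    """Accuracy = 1 - |response - true mean| / maximum power value."""
    absolute_error = abs(response - true_mean)
    return 1.0 - absolute_error / max_power


def keep_valid_responses(responses, max_power):
    """Drop responses above the maximum possible power value."""
    return [r for r in responses if r <= max_power]


# Trial bookkeeping reported in the text.
planned_trials = 7200
recorded_trials = planned_trials - 2    # two trials were not recorded
analyzed_trials = recorded_trials - 4   # four responses exceeded the maximum

if __name__ == "__main__":
    # Hypothetical example: true mean 50 on a display topping out at 100.
    print(summarize2_accuracy(response=40, true_mean=50, max_power=100))
    print(keep_valid_responses([40, 55, 130], max_power=100))
    print(analyzed_trials)
```
]]></eg>
<p>Dividing by the maximum power value puts errors from displays with different ranges on a common zero-to-one scale, so a perfect response scores 1 regardless of the display.</p></div>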
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Experiment 1</head><p>Experiment 1 examined the relationship between accuracy and frame quantity per task. We predicted that an increase in frame quantity would lead to a decrease in accuracy for the Identify and Summarize tasks (H1A) but not for the Compare tasks (H1B). To test H1A, we conducted six multilevel logistic regressions. We used frame quantity (2-70 incremented by four) as a predictor for accuracy (the outcome variable) in the Identify and Summarize 1 tasks, with random intercepts for the seeds (20 repeated trials of small multiples) and participants (also 20). Accuracy was modeled using this equation: <formula notation="TeX">\operatorname{logit}(p) = \beta_0 + \beta_1 \cdot \text{Frame Quantity} + b_{\text{Trial}} + b_{\text{Participant}}</formula></p><p>where logit(p) represents the log-odds of the probability (p) of the binary outcome between 0 and 1. &#946;_0 is the intercept coefficient, &#946;_1 is the slope of frame quantity, which is the main predictor, and b_Trial and b_Participant are the random intercepts associated with the grouping variables Trial and Participant, respectively. We performed a similar analysis for Summarize 2 by fitting a linear model to the data to predict the continuous untransformed error. We considered an effect significant if the p-value (p) was below 0.05 and the confidence interval (CI) did not include 0 <ref type="bibr">[50]</ref>. For the Identify and Summarize tasks (shown in Figure <ref type="figure">5</ref> with blue solid regression lines), the frame quantity significantly reduced accuracy and increased error in each task (H1A confirmed). The model output is shown in Table <ref type="table">I</ref>. To test H1B, we conducted the same analysis on the Compare tasks and found that frame quantity also significantly predicted accuracy in Compare 1 and 2 (H1B not confirmed). It was surprising that accuracy still decreased when the two relevant frames were highlighted for the participants. 
Although this decline was significant for the Compare tasks, it was less pronounced than in the Identify tasks, which asked the same question but required participants to assess all the frames instead of focusing on two. We plotted the Compare tasks (orange) with the corresponding Identify tasks (blue) in Figure <ref type="figure">5</ref> (panels 1 and 2). As conveyed by the differences between the slopes, highlighting the relevant frames reduced the impact of frame quantity, but did not eliminate it.</p><p>Follow-up analysis: Identify vs. Compare. We conducted a follow-up analysis to compare the effect of frame quantity in Identify 1 to Compare 1 and Identify 2 to Compare 2. The goal was to determine if there was a significant difference between the slopes of the Identify and Compare tasks. We used the same modeling procedures as previously described but with accuracy predicted by task (Identify vs. Compare), frame quantity, and an interaction between the two. In both models (Identify 1 vs. Compare 1, and Identify 2 vs. Compare 2), the results revealed significant interactions between task and frame quantity, suggesting that the effect of frame quantity is significantly different for the two tasks. For the model comparing Identify 1 to Compare 1 the slope of Identify 1 (b = -.026) was significantly steeper than Compare 1 (b = -.015) (b = .01, z(14,396) = 3.9, p &lt; .001, CI[.005, .017]). A significant interaction also indicated that Identify 2 (b = -.01) had a meaningfully steeper slope than Compare 2 (b = -.005) (b = .005, z(14,396) = 3.08, p = .002, CI[.002, .009]). These findings are annotated in Figure <ref type="figure">5</ref>. Although H1B was not strictly confirmed, highlighting the frames can meaningfully improve accuracy in the tasks we tested.</p><p>Follow-up analysis: task comparison. 
We used the same multilevel logistic regression equation described above, with accuracy predicted by an interaction between frame quantity and task, while changing the referent task in the model to compute comparisons between all tasks (see Table <ref type="table">I</ref>). We did not include Summarize 2 in this analysis because we did not feel that multiple-choice responses could fairly be compared to a continuous measure of accuracy. Our analysis revealed that the impact of frame quantity was significantly larger for Identify 1 (slope: -2.6% per 4 frames) than the five other tasks. These comparisons suggest that participants' ability to find the frame with the highest power (Identify 1) was the most difficult task for larger frame quantities in small multiple line charts in the context of the stimuli and tasks tested in this study.</p><p>Follow-up analysis: strategies for Identify 2 and Compare 2. In our analysis, we observed that Identify 2 and Compare 2 (second panel of Figure <ref type="figure">5</ref>) had the highest variability. These tasks required participants to select a frame with both the greatest change and the highest average power. Surprisingly, even in the case of Compare 2, with two highlighted frames, accuracy was low (M = 67%, SD = 47%). This low accuracy might be attributed to participants not performing both parts of the tasks concurrently. They instead picked the frame with either the greatest change or the highest average power. In our original analysis, the correct frame had the highest mean &#215; range value. Here, we separately identified the frame with the highest average wattage and the frame with the widest range to evaluate if the participants performed only one part of the task. Figure <ref type="figure">6</ref> illustrates the variations in accuracy calculations. 
For Identify 2 (top row of Figure <ref type="figure">6</ref>), there is no clear indication that participants exclusively picked the frame with the greatest range or the highest mean. However, for Compare 2 (second row), it does appear that participants more commonly selected the frame with the highest mean.</p><p>Discussion: The primary finding of Experiment 1 was that small multiples with more frames were more challenging to use for all tasks we tested. The consistent effect of frame quantity was surprising for the Compare tasks. However, the results showed that the effect of frame quantity was significantly smaller for them than for the Identify tasks. A limitation of Experiment 1 is that a confound between frame quantity and the size of the frames may impact these findings. In the stimuli for Experiment 1, we scaled the small multiples to fill up the screen, resulting in large sizes when fewer frames were presented and small sizes with increased frames (see Figure <ref type="figure">4</ref>). Another consideration with larger frame quantities is that participants might take longer searching through all the frames. To control for frame size and time to complete the tasks, we conducted Experiment 2.</p></div>
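<div xmlns="http://www.tei-c.org/ns/1.0"><p>To make the log-odds model for Experiment 1 more concrete, the following Python sketch converts log-odds into predicted accuracy for the two slopes reported for Identify 1 (b = -.026) and Compare 1 (b = -.015). It is an illustration only: the intercept is a hypothetical placeholder, the random intercepts are set to zero, and the slopes are assumed to apply per unit of the frame-quantity predictor as it was coded in the model.</p>
<p><code lang="python"><![CDATA[
```python
import math


def inverse_logit(log_odds):
    """Convert log-odds to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-log_odds))


def predicted_accuracy(frame_quantity, intercept, slope,
                       b_trial=0.0, b_participant=0.0):
    """logit(p) = intercept + slope * frame_quantity + b_trial + b_participant."""
    log_odds = intercept + slope * frame_quantity + b_trial + b_participant
    return inverse_logit(log_odds)


# Slopes reported in the text; the intercept is a hypothetical placeholder.
SLOPE_IDENTIFY_1 = -0.026
SLOPE_COMPARE_1 = -0.015
ASSUMED_INTERCEPT = 2.0

for frames in (2, 34, 70):
    p_identify = predicted_accuracy(frames, ASSUMED_INTERCEPT, SLOPE_IDENTIFY_1)
    p_compare = predicted_accuracy(frames, ASSUMED_INTERCEPT, SLOPE_COMPARE_1)
    print(f"{frames:>2} frames: Identify 1 = {p_identify:.3f}, "
          f"Compare 1 = {p_compare:.3f}")

# Multiplicative change in the odds of a correct response per one-unit
# increase in the frame-quantity predictor.
odds_ratio_identify_1 = math.exp(SLOPE_IDENTIFY_1)
odds_ratio_compare_1 = math.exp(SLOPE_COMPARE_1)
```
]]></code></p>
<p>The difference between the two slopes (about .011) is in line with the reported interaction coefficient (b = .01). With a shared intercept, the predicted accuracy for Identify 1 falls below that for Compare 1 at every frame quantity, and the gap widens as frames are added.</p></div>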
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Experiment 2</head><p>The goal of Experiment 2 was to determine the effects of scale and time to complete the tasks. We hypothesized that time constraints would impact accuracy in the Identify and Summarize tasks (H2A) but not the Compare tasks (H2B). To test these hypotheses, we conducted multilevel model analyses in which accuracy was predicted by frame quantity (2, 6, 10, 30, vs. 70 frames), time constraints (unlimited time vs. 40 seconds), and frame scale (full-scale vs. fixed-scale), with random slopes for each participant. In these models, the untimed and fixed-scale conditions were the referents.</p><p>This analysis revealed that time constraints significantly reduced accuracy for Identify 1 (b = -0.56, z(1,824) = -3, p = .003, CI[-.92, -.19]) and Identify 3 (b = -1.37, z(1,824) = -5.6, p &lt; .001, CI[-1.85, -.89]). The meaningful differences in accuracy for Identify 1 and 3 are annotated in Figure <ref type="figure">7</ref>. Although this finding supports H2B (no impact of time constraints for Compare tasks), it does not fully support H2A, which predicted time constraints to affect all Identify and Summarize tasks. This finding does, however, support the theory that the limiting factor of small multiples with larger frame quantities is the more extensive visual search they require rather than cognitive capacity limitations. The two tasks impacted by time constraints possibly involved a serial search process, in which participants scanned each frame to identify the highest point (Identify 1) or when data exceeded a threshold (Identify 3). Time constraints would hinder a thorough search and reduce accuracy in such cases. In contrast, Summarize tasks likely relied on ensemble processing, in which the visual system extracts the overall essence without an exhaustive search <ref type="bibr">[55]</ref>. 
To explore participants' search strategies, we conducted an eye-tracking study to analyze gaze patterns for each task.</p><p>The other main finding is annotated in Figure <ref type="figure">8</ref> where for Identify 3, the fixed-scale graphs (M = 80.7%, SD = 39.5%) had significantly lower accuracy than the full-scale graphs (M = 88.1%, SD = 32.4%) (b = .83, z(1,824) = 3.6, p = .0003, CI[.38, 1.29]).</p><p>Discussion: In Experiment 2, we found that a time constraint of 40 seconds significantly reduced participants' accuracy for the Identify 1 and 3 tasks. These tasks likely required a serial search, which time constraints would hinder. We further found that making all small multiple frames the same size regardless of the number of frames significantly reduced the accuracy of participants' responses for the Identify 3 task. With smaller-sized frames, the resolution is reduced, making it challenging to detect when the blue line crosses the dotted threshold (shown in Figure <ref type="figure">9</ref>). Interestingly, we did not find a significant impact of frame size for the other tasks. Despite this, we assume that the reduced resolution of small frames will affect any task that requires fine-grain discrimination of visual features, as in Identify 3. Further, there is certainly a point at which the frame would be too small for any task, but we did not attempt to find the minimum frame size in this work. To dig deeper into these findings, we conducted a follow-up eye-tracking study.</p></div>
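<div xmlns="http://www.tei-c.org/ns/1.0"><p>The referent coding used in the Experiment 2 models can be sketched as follows. This Python example enumerates the combinations of the three predictors and assigns 0 to each referent level (untimed, fixed-scale) and 1 to the other level. The condition labels and function names are hypothetical, and the sketch assumes a fully crossed design purely for illustration; it does not describe how conditions were assigned to participants.</p>
<p><code lang="python"><![CDATA[
```python
from itertools import product

FRAME_QUANTITIES = (2, 6, 10, 30, 70)
TIME_CONDITIONS = ("untimed", "timed_40s")         # referent listed first
SCALE_CONDITIONS = ("fixed_scale", "full_scale")   # referent listed first


def dummy_code(level, levels):
    """Treatment coding for a two-level factor: 0 for the referent, 1 otherwise."""
    return 0 if level == levels[0] else 1


def design_cells():
    """Enumerate every combination of the three predictors with dummy codes."""
    cells = []
    for frames, time, scale in product(
            FRAME_QUANTITIES, TIME_CONDITIONS, SCALE_CONDITIONS):
        cells.append({
            "frame_quantity": frames,
            "time": time,
            "scale": scale,
            "timed": dummy_code(time, TIME_CONDITIONS),
            "full_scale": dummy_code(scale, SCALE_CONDITIONS),
        })
    return cells


for cell in design_cells()[:4]:
    print(cell)
```
]]></code></p>
<p>Under this coding, a negative coefficient for the time predictor (such as b = -0.56 for Identify 1) indicates lower accuracy in the timed condition than in the untimed referent, and a positive coefficient for the scale predictor (such as b = .83 for Identify 3) indicates higher accuracy for full-scale than for fixed-scale frames.</p></div>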
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Eye-Tracking: Visual Analysis</head><p>Fixations were calculated using the Tobii I-VT fixation filter <ref type="bibr">[56]</ref> (for plots of the fixations, see Figure <ref type="figure">10</ref>). We conducted a visual analysis of the eye-tracking heatmaps and found that, for the Identify tasks, participants visually scanned across the entire array. Participants completed the most thorough visual search for the Identify 3 (mean number of fixations = 75.8) and Summarize 2 tasks (mean number of fixations = 116.4) (see Table <ref type="table">II</ref>). The Compare 1 and 2 tasks did not involve a thorough visual search (16.2 and 33.2 mean fixations, respectively), likely driving the reduced impact of increasing frame quantities on accuracy.</p><p>Participants generally oriented their foveae toward the middle of the array in the Summarize 1 task. This gaze pattern could be indicative of ensemble processing, in which the visual system computes summary statistics, or the gist, of a set (for reviews, see Whitney et al. <ref type="bibr">[55]</ref> and Szafir et al. <ref type="bibr">[57]</ref>). Our accuracy results revealed a small overall impact of frame quantity during the Summarize 1 task (slope = -.007), which was significantly smaller than the effects observed in the Identify 1, Identify 3, and Compare 1 tasks. Lastly, Summarize 2 showed higher fixation counts on both the frames and the y-axis, reflecting additional mental calculations and more use of the y-axis.</p><p>Impact of frame size on fixations: When the small multiples had only six frames, the participants typically fixated on all frames regardless of their size (see Figure <ref type="figure">11</ref>). On average, they had a higher number of fixations for the full-scale than for the fixed-scale frames in the Identify 3 task (see Table <ref type="table">II</ref>). 
The larger frames required a higher number of fixations because the features of interest (like the area around the threshold line) were also larger, and participants' fixations were concentrated in those areas. The larger size of the relevant features can also explain why accuracy was higher for the full-scale small multiples in the Identify 3 task, as it was easier for the participants to see the details.</p><p>Discussion: The purpose of the eye-tracking analysis was to investigate the variations in individuals' eye movements across different tasks. It is important to note that the study's small sample size does not allow for broad generalizations to the general population. However, when considering how individuals' visual search patterns change for each task, we see some preliminary patterns that may contextualize the behavioral findings in Experiments 1 and 2. For the small multiples with 70 frames, the participants often scanned across the full array, which is consistent with the visual search theory. The eye movement patterns for the Compare tasks showed that highlighting effectively focused the participants' attention on specific plots <ref type="bibr">[58]</ref>. Regarding frame size, when the frames were larger, the participants fixated more on the parts of the frame that were important for the task, such as the area above the threshold line. On the full-scale small multiples, these essential features were larger and easier to see than in the fixed-scale small multiples. Although the findings of the eye-tracking inquiry are consistent with the behavioral findings, more work is needed to determine if they can be generalized to a more representative population.</p></div>
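<div xmlns="http://www.tei-c.org/ns/1.0"><p>For readers unfamiliar with velocity-threshold fixation identification, the general idea behind I-VT filters can be sketched in Python as follows. This is a generic, simplified illustration and not Tobii's implementation, which applies additional processing. The function name, the sample format (time in seconds, gaze position in degrees of visual angle), and the threshold and minimum-duration values are all assumptions rather than settings reported in this study.</p>
<p><code lang="python"><![CDATA[
```python
import math


def ivt_fixations(samples, velocity_threshold=30.0, min_duration=0.06):
    """Group gaze samples into fixations with a basic velocity threshold.

    samples: list of (time_s, x_deg, y_deg) tuples sorted by time.
    Returns a list of (start_s, end_s, mean_x, mean_y) tuples.
    """
    fixations = []
    current = []

    def close_group():
        # Keep the group only if it lasts at least min_duration seconds.
        if len(current) >= 2 and current[-1][0] - current[0][0] >= min_duration:
            xs = [s[1] for s in current]
            ys = [s[2] for s in current]
            fixations.append((current[0][0], current[-1][0],
                              sum(xs) / len(xs), sum(ys) / len(ys)))
        current.clear()

    for prev, curr in zip(samples, samples[1:]):
        dt = curr[0] - prev[0]
        if dt <= 0:
            continue  # skip duplicated or out-of-order timestamps
        # Angular velocity between consecutive samples, in degrees per second.
        velocity = math.hypot(curr[1] - prev[1], curr[2] - prev[2]) / dt
        if velocity < velocity_threshold:
            if not current:
                current.append(prev)
            current.append(curr)
        else:
            close_group()  # a fast movement ends the current fixation
    close_group()
    return fixations


# Synthetic 100 Hz recording: two stable gaze positions separated by a jump.
synthetic = ([(i / 100, 0.0, 0.0) for i in range(10)]
             + [(i / 100, 5.0, 0.0) for i in range(10, 20)])
print(len(ivt_fixations(synthetic)))
```
]]></code></p>
<p>A per-task fixation count such as those reported in Table <ref type="table">II</ref> would then correspond to the number of fixations returned for each trial, averaged over trials and participants.</p></div>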
<div xmlns="http://www.tei-c.org/ns/1.0"><head>V. GENERAL DISCUSSION</head><p>This study examined the effect of varying frame quantities in small multiple line charts on participants' performance in seven visualization tasks. We also explored the impact of scale, time, accuracy, and participant strategies. A key discovery was that as the number of frames in small multiples increased, accuracy declined in a linear fashion (Experiment 1 results). The decline in accuracy was not fully explained by a reduction in resolution in smaller frame sizes or differences in task completion times (Experiment 2 results). Such a linear decrease in accuracy points more toward a visual search challenge <ref type="bibr">[14]</ref> rather than a cognitive limitation, which would have been revealed by an inverse sigmoid function of performance <ref type="bibr">[40]</ref>. We found further support for the visual search theory by demonstrating that we can diminish the impacts of large frame quantities by highlighting specific frames, eliminating the need for a full array search. Additional corroboration for this theory is demonstrated in the eye-tracking analysis that shows viewers scanning large portions of the array, except when two frames were highlighted (Compare 1 and 2). This finding reinforces the effectiveness of small multiples, showing that there was not a threshold at which the number of frames exceeded users' cognitive capacity in our test context. Our results suggest that the main limitation of small multiples is not cognitive but practical, in which an excess of frames on a constrained display might render individual frames unreadable or require a great deal of visual searching.</p><p>The observed decrease in accuracy with increased frame quantity in small multiples carries significant implications for user support. Our findings point to extensive visual searching as the primary factor contributing to reduced accuracy. 
To mitigate this issue, modifications to the user interface can be considered. For instance, the interface could allow users to select specific frames, with the display subsequently updating to showcase only those chosen frames. Techniques such as zooming, panning, filtering, and interactive piling <ref type="bibr">[59]</ref> offer ways for users to selectively explore specific frames <ref type="bibr">[18]</ref>.</p><p>Designers can leverage the insights from this study to determine the desired accuracy level and adjust the number of frames their system allows users to selectively display to achieve the desired accuracy. In this manner, this research can help optimize existing interaction methods to better align with human capabilities.</p><p>Experiment 2 demonstrated that the fixed-scale frames yielded poorer performance compared to the full-scale frames for the Identify 3 task, for which participants were required to pinpoint graphs in which the blue line surpassed the dashed gray line. The decline in accuracy with the smaller frames for this specific task is logical; as the frame size decreases, users' ability to discern instances in which the blue line barely exceeds the threshold becomes challenging. Intriguingly, we did not observe significant performance differences with the smaller frames for the other tasks. We theorize that this might be because the other tasks did not hinge as much on discerning fine details but rather on recognizing the overall trend of the line. It is crucial to note that our study did not encompass a broad spectrum of graph types, marks, or tasks. It is conceivable that specific designs are harder to interpret at reduced sizes, and numerous other tasks may also prove challenging when presented within smaller frames.</p><p>Experiment 2 also showed that time constraints reduced the accuracy of Identify 1 and 3 tasks. 
Identify 3 required participants to click on all graph(s) where the blue line exceeded the dashed gray line, the only task allowing for more than one frame selection. Identify 3 was also the only task for which participants did not know the correct number of frames. We ensured that there were only two correct frames, but participants were unaware of this. Even if the participants guessed that there were two correct frames, this task would be highly impacted by time pressure. Further, fixed-scale frames made it more challenging to discriminate small gaps between the threshold and power lines (shown in Figure <ref type="figure">9</ref>). The eye-tracking data contextualized these results, indicating that when the frames were larger in the Identify 3 task, the participants fixated more on the relevant parts of the frames.</p><p>Limitations and Future Directions: An unexplained finding is that accuracy was significantly impacted by frame quantity for the Compare tasks. Work in cognitive psychology indicates that visual identification tasks are harder to complete when surrounded by distractors <ref type="bibr">[60]</ref>, <ref type="bibr">[61]</ref>. More work is needed to determine if these findings can generalize to data visualizations or if there are other factors influencing the impact of frame quantity on Compare tasks.</p><p>Our study focused on line charts across seven specific tasks and resolutions, but further research is needed to validate our findings. Our observation of decreasing accuracy with increasing frame quantities may have broader applicability, but we acknowledge that our stimuli and tasks may not fully mirror the challenges faced by analysts, such as energy grid operators, in real-world settings. We prioritized experimental control over ecological validity, for example, by maintaining consistent frame distances in the Compare tasks. 
In these tasks, we chose to ensure that the only change in difficulty was from the number of frames, not the distance between the frames. However, this control may not represent the real-world use of small multiples. Future studies should explore the impact of more naturalistic displays, tasks, and a wider range of chart types.</p><p>Lastly, we encountered several logistical limitations in online Experiments 1 and 2. Firstly, Experiment 1 was lengthy without scheduled breaks, potentially leading to participant fatigue. We performed a follow-up analysis and found no significant effects for order of stimuli upon accuracy, which is in the supplemental materials (OSF link), but Experiment 1 was still quite long. Secondly, we were restricted to US participants due to our institutional guidelines. Future work should consider more diverse populations. Of the 500 participants, all but one met the required screen size criteria. Each participant self-reported whether they had zoomed their browser window to 100%. We cannot verify complete adherence to our zooming guidelines, but we are confident in the participants' integrity, especially since only one individual inaccurately reported having the correct screen size. Still, future work should consider in-person studies to ensure consistent browser window settings among participants. Additionally, conducting in-person studies with real analysts familiar with small multiples, such as energy grid operators, could address participant unfamiliarity and provide more practical insights.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>VI. CONCLUSIONS</head><p>The popularity of small multiples has led designers to utilize them in various applications, such as political forecasts and energy grid control rooms. This study demonstrates the effectiveness of small multiple line charts with large datasets for tasks that do not necessitate extensive visual search. On the other hand, we also present examples of tasks that pose significant challenges to users with large arrays of small multiples. Our findings highlight the importance of interaction methods such as zooming, panning, and filtering that can offload the need for extensive visual search of small multiples with larger numbers of frames. Like any visualization, employing small multiples requires careful design, particularly when dealing with large datasets, as there is no one-size-fits-all approach.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_0"><p>This article has been accepted for publication in IEEE Transactions on Visualization and Computer Graphics. This is the author's version which has not been fully edited and content may change prior to final publication. Citation information: DOI 10.1109/TVCG.2024.3372620 &#169; 2024 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information. Authorized licensed use limited to: Northeastern University. Downloaded on May 02,2024 at 00:19:22 UTC from IEEE Xplore. Restrictions apply.</p></note>
		</body>
		</text>
</TEI>
