<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Investigating Object Translation in Room-scale, Handheld Virtual Reality</title></titleStmt>
			<publicationStmt>
				<publisher>IEEE</publisher>
				<date>01/01/2024</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10545743</idno>
					<idno type="doi">10.1109/TVCG.2024.3456154</idno>
					<title level='j'>IEEE Transactions on Visualization and Computer Graphics</title>
<idno type="ISSN">1077-2626</idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Daniel Enriquez</author><author>Hayoun Moon</author><author>Doug A Bowman</author><author>Myounghoon Jeon</author><author>Sang Won Lee</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Fig. 1: (a) The environment of the user study with a user using the application. (b) A user superimposed into the VR environment. (c-d) A closeup of the user's perspective with the user performing a translation. (e-g) The object translation tasks that were evaluated in the user studies: (e) 3DSlide, (f) VirtualGrasp, (g) Joystick.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">INTRODUCTION</head><p>Virtual reality (VR) is a popular platform in which a head-mounted display (HMD) user can interact inside a 3D virtual environment (VE) for entertainment, education, or work. However, HMDs are not accessible to everyone: people with certain vision impairments, those susceptible to VR sickness, children <ref type="bibr">[30]</ref>, and those with kinky or coily hair <ref type="bibr">[39]</ref> may choose not to use VR HMDs. Handheld devices are a more inclusive alternative, allowing users to retain their peripheral vision and permitting many simultaneous users. Researchers have created various handheld VR systems that allow users to observe and participate in VR environments through a mobile device screen, in which the physical device position matches the position of a camera in a 3D VE with 6 degrees of freedom (DoF) <ref type="bibr">[16,</ref><ref type="bibr">19,</ref><ref type="bibr">27]</ref>. This implementation of handheld VR provides a more inclusive yet less immersive interface into a VE than HMD devices.</p><p>&#8226; Daniel Enriquez, Hayoun Moon, Doug A. Bowman, Myounghoon Jeon, and Sang Won Lee are with the Center for Human-Computer Interaction at Virginia Tech. E-mail: {denriquez | moonhy | dbowman | myounghoonjeon | sangwonlee}@vt.edu &#8226; * Equal Contribution. Manuscript received xx xxx. 201x; accepted xx xxx. 201x. Date of Publication xx xxx. 201x; date of current version xx xxx. 201x. For information on obtaining reprints of this article, please send e-mail to: reprints@ieee.org. Digital Object Identifier: xx.xxxx/TVCG.201x.xxxxxxx</p><p>
Prior work integrating spectators with handheld devices <ref type="bibr">[16,</ref><ref type="bibr">19,</ref><ref type="bibr">52]</ref> demonstrates strong examples of making VR more collaborative, social, and inclusive in room-scale physical environments such as classrooms, exhibitions, and informal settings.</p><p>Interaction methods are well studied in HMD-based VR <ref type="bibr">[33]</ref>, but far less so in handheld VR. Handheld VR is similar to handheld AR in terms of input, with similar means of interaction on handheld devices. Bambu&#353;ek et al. <ref type="bibr">[4]</ref>, for example, showed the potential of swapping between handheld AR and VR to reach otherwise unreachable objects, applying an AR-based UI in VR. Many handheld AR manipulation methods, such as 3DTouch, HOMER-S, and SlidAR <ref type="bibr">[41,</ref><ref type="bibr">47]</ref>, can be ported to handheld VR. However, handheld AR has typically been studied at a smaller scale, often on table-top surfaces <ref type="bibr">[18]</ref>. Given the emergence of handheld VR applications <ref type="bibr">[16,</ref><ref type="bibr">19,</ref><ref type="bibr">27,</ref><ref type="bibr">43,</ref><ref type="bibr">52]</ref>, it is unclear whether our understanding of manipulation techniques studied in table-top handheld AR transfers to the room-scale interactions of handheld VR. Therefore, evaluating handheld AR interaction techniques in handheld VR is crucial for exploring how the amplified distances and changed perspectives of VR affect their performance and usability.</p><p>Prior literature has modeled manipulation in 3D environments in both physical <ref type="bibr">[9]</ref> and virtual worlds <ref type="bibr">[10]</ref> by extending Fitts' law to accommodate the added dimension of depth. These models incorporate angle-related predictors and contribute to our understanding of 3D object manipulation. 
However, in handheld VR, users rely on a 2D display to manipulate objects in a 3D world. It therefore remains uncertain whether these 3D adaptations of Fitts' law apply to handheld VR. Another goal of this work is to evaluate whether room-scale object translations in handheld VR comply with these existing Fitts' law models.</p><p>To better understand performance differences in a room-scale environment, we conducted an experiment evaluating three types of translation methods designed to cover a wide range of handheld AR interface designs. 3DSlide is a combination of 3DTouch <ref type="bibr">[41]</ref> and SlidAR <ref type="bibr">[47]</ref>, in which, independent of a user's orientation, they choose a world-coordinate axis along which to translate an object. 3DSlide restricts translation to one axis at a time and serves to understand the level of control offered by this translation scheme. VirtualGrasp is partially derived from HOMER-S <ref type="bibr">[41]</ref>: users translate an object that mirrors the movement of their device, with a slider added for translation at farther distances. VirtualGrasp allows for the evaluation of device-based manipulation methods, as changes in manipulation scale may impose changes in physical effort, e.g., approaching the target and moving the device. Joystick is derived from the virtual joysticks that are popular and common in mobile phone gaming, serving as a local-coordinate, touch-based interaction technique. Joystick allows for an evaluation of a traditional 2D interface design situated in a 3D physical environment.</p><p>We also included other factors relevant to room-scale interaction, the first of which is mobility, i.e., whether a user can move in the room. We evaluated two mobility conditions through two user studies (User Study 1: stationary vs. User Study 2: movable). 
For each study, we evaluated the performance of three translation techniques (3DSlide, VirtualGrasp, Joystick) with two additional factors: 1) the distance between the target and the object (1 m, 2 m, 4 m), and 2) target size (Small, Medium, Large), analogous to width in Fitts' law, which affects how delicate the adjustment needs to be. We found that optimizing the UI for translation along a manipulation axis worked well, with participants favoring the Joystick method as an intuitive compromise between control and usability. Results also indicated that device-based methods such as VirtualGrasp can be physically demanding in prolonged-use scenarios when users are stationary.</p><p>Our contributions are as follows. First, we describe how translation methods studied in handheld AR can be adapted to handheld VR. Second, we provide findings from two user studies on how handheld VR translations are affected in room-scale VEs by various scale-related factors (distance, target size, and users' mobility), depending on each technique's UI features, and how this relates to their traditional, table-scale handheld AR counterparts. Lastly, we discuss the theoretical implications of applying Fitts' law to a 3D object translation task that uses a 2D input system.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">RELATED WORK</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Handheld Virtual Reality</head><p>Given the wide range of the reality-virtuality continuum, different platforms at varying levels of immersion require different interfaces to effectively interact inside a Virtual Environment (VE) <ref type="bibr">[40]</ref>. While HMDs offer high immersion, handheld devices offer alternatives for broader populations to consume VR content given HMDs' inaccessibility <ref type="bibr">[30,</ref><ref type="bibr">39]</ref>. Examples of this implementation are TransceiVR and WebTransceiVR <ref type="bibr">[35,</ref><ref type="bibr">52]</ref>, in which the HMD user's view is transmitted to other devices. Similarly, Drey et al. <ref type="bibr">[11]</ref> analyzed the effectiveness of symmetric (VR+VR) and asymmetric (VR+Tablet) learning, where teachers see into the VE using a teaching view on a handheld device. Another work, XRDirector <ref type="bibr">[43]</ref>, incorporates a handheld device that serves as a viewfinder in VR scenes, allowing users to walk with it and control camera movement in VR filmmaking. These works showcase distinct advantages of handheld VR over the industry standard of mirroring the egocentric view of a VR HMD user. While the literature provides various use cases of handheld VR at room scale, how handheld device users can manipulate objects in these much larger spaces remains underexplored. To address this gap, our work evaluates translation methods in room-scale handheld VR environments.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Object Manipulation Techniques</head><p>A broad range of manipulation techniques exists in immersive environments, and translation techniques are fundamental to spatial interaction. For HMD users, VR manipulation methods are extensively studied <ref type="bibr">[33]</ref>, with device-based manipulations, that is, manipulations based on the device's orientation, being the predominant means of manipulation in immersive VR <ref type="bibr">[7,</ref><ref type="bibr">29]</ref>. These insights help us understand how device-based translations function in HMD-based VR and how they may differ in handheld VR.</p><p>Similarly, handheld AR manipulation methods are also well studied, with three distinct classes of interaction being commonplace in the literature: touch-based, gesture-based, and device-based <ref type="bibr">[18]</ref>. Given the abundance of new works that operate in handheld VR contexts <ref type="bibr">[11,</ref><ref type="bibr">16,</ref><ref type="bibr">19,</ref><ref type="bibr">35,</ref><ref type="bibr">43,</ref><ref type="bibr">45,</ref><ref type="bibr">52]</ref>, it is clear that interaction needs to be better defined for handheld VR. Many manipulation methods transfer from handheld AR to handheld VR, given the "window-like" experience of glancing into another view. Insights from the extensive literature on handheld AR manipulation can be applied to handheld VR because of this similarity, as VR has been an effective means to understand AR <ref type="bibr">[8]</ref>. One example is Bambu&#353;ek et al. 
<ref type="bibr">[4]</ref>, in which a handheld AR environment is replicated in handheld VR because of reachability limitations; unreachable objects could be accessed by temporarily switching to VR.</p><p>Handheld AR techniques have typically been closely examined in the context of table-top interactions <ref type="bibr">[18]</ref>. In contrast, VEs tend to be larger in scale, typically encompassing an entire room <ref type="bibr">[16,</ref><ref type="bibr">19,</ref><ref type="bibr">43]</ref>, yet handheld AR techniques remain underexplored in such larger, room-scale environments; our work therefore examines interaction at room scale. One exception is Hellmuth et al. <ref type="bibr">[23]</ref>, in which device-based and touch-based interactions created and moved anchor points in room-scale handheld AR, with time and accuracy examined in a larger-scale environment. Motivated by and extending this work, we explore how factors such as mobility, target distance, and target size affect object translation in room-scale environments.</p><p>Often, interaction methods in handheld AR derive from VR HMD counterparts, such as HOMER-S <ref type="bibr">[41]</ref>, a derivative of HOMER <ref type="bibr">[7]</ref>, in which an object's position is mapped from the device. We adopted a similar implementation of HOMER-S for VirtualGrasp to allow for device-based translations in our evaluation.</p><p>Alternatively, adopting a touch-based approach, systems such as SlidAR <ref type="bibr">[47]</ref> and 3DTouch <ref type="bibr">[41]</ref> translate objects along a single axis at a time (x, y, or z). SlidAR translates along a fixed epipolar line, and 3DTouch translates according to the device's pose and touch input. 
Our work adopts a hybrid of these two methods to allow global-coordinate object translation and includes 3DSlide in our evaluation studies.</p><p>Our work also examines local-coordinate object translation through a directional joystick or gamepad, similar to those of gaming consoles and controllers. One implementation is ChildAR <ref type="bibr">[22]</ref>, in which school children manipulate a car's position to drive it in a handheld AR game. Another is Blaga and Gorgan <ref type="bibr">[6]</ref>, in which a drone is operated using two virtual joysticks. To see whether we can leverage the UI conventions established in mobile gaming, we developed and evaluated Joystick.</p><p>We considered other relevant interaction techniques but did not include them in our evaluation due to their limited applicability. For instance, gesture-based manipulation techniques were not evaluated because they require users to look at their own hand to determine its position while manipulating the object <ref type="bibr">[3,</ref><ref type="bibr">32,</ref><ref type="bibr">44]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3">Fitts' Law in 3D</head><p>Fitts' law is a fundamental predictive model in human-computer interaction (HCI) and ergonomics, rooted in the determination of an index of difficulty (ID), which exhibits a linear correlation with movement time (MT), increasing with target distance and decreasing with target size <ref type="bibr">[14,</ref><ref type="bibr">15]</ref>. Researchers have extended its application to higher dimensions as well as to similar tasks, including dragging <ref type="bibr">[17]</ref> and object manipulation <ref type="bibr">[20]</ref>. While pointing and manipulation are two distinct types of object interaction, their discrete phases show similarities: acquisition/grasping, transportation, and correction <ref type="bibr">[53]</ref>.</p><p>Higher-dimensional models were built upon two-dimensional (2D) pointing tasks employing mouse-and-monitor setups <ref type="bibr">[1,</ref><ref type="bibr">25,</ref><ref type="bibr">37]</ref> and three-dimensional (3D) tasks utilizing the spherical coordinate system <ref type="bibr">[9,</ref><ref type="bibr">10,</ref><ref type="bibr">42]</ref>. The spherical system defines a point's position using three numbers: the radial distance from a fixed point, the inclination angle from a fixed axis, and the azimuth angle of its projection on a fixed plane. These additional angles increase the predictive power of Fitts' law in 3D space.</p><p>Handheld VR presents a challenge for applying the extended Fitts' law models due to its inherent ambiguity between 2D and 3D settings. While the object manipulation involves depth and angle, akin to Fitts' law in 3D settings, the hand movement is constrained to a 2D display, with varying restrictions on concurrent DoF control. 
Intrigued by this discrepancy, the current study aimed to investigate which extended model of Fitts' law best represents the relationship between task design and movement time, particularly concerning dimensionality.</p></div>
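The relationship described above can be written compactly. This is the common Shannon formulation of Fitts' law; the extended 3D models cited add angle-related predictors, shown here only schematically since their exact forms vary across models:

```latex
MT = a + b \cdot ID, \qquad ID = \log_2\!\left(\frac{D}{W} + 1\right)
```

where MT is the movement time, D the target distance, W the target width, and a, b empirically fitted constants. The 3D extensions augment this with terms for the inclination angle &#952;1 and azimuth angle &#952;2, e.g., additional linear predictors alongside ID.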
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">DESIGN AND IMPLEMENTATION</head><p>It is necessary to understand how handheld VR's interface functions in a room-scale space, and object translation serves as a critical means of evaluating spatial interaction in 3D environments. Object rotation and scaling were not studied, as these two manipulations in a room-scale VE would be similar to those in a smaller VE. As such, different manipulation techniques were created to understand space interpretation and UI in handheld VR.</p><p>In the translation tasks, we used a red spherical ball as the movable object so that it looked the same from any angle. A semi-transparent sphere showed the target location for the translation, and the movable object turned green upon reaching the target to indicate a successful translation. No selection was required to move the movable object.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Translation Techniques</head><p>We implemented three translation techniques, encompassing previously studied handheld AR interaction techniques as well as familiar ones.</p><p>3DSlide is a touch-based interaction method that modifies 3DTouch <ref type="bibr">[41]</ref> and SlidAR <ref type="bibr">[47]</ref>. Instead of restricting movement to one axis based on device orientation or an epipolar line, it restricts movement based on user input. Users translate the object by selecting one of three axes from a set of on-screen buttons (three black two-directional arrows on the bottom right corner of the tablet screen in Figure <ref type="figure">2</ref>) and then sliding a finger anywhere on the screen along the 2D axis representing the direction of the translation. This method uses a global coordinate system, meaning the axis directions are constant regardless of a user's standing position. To aid users in translation, the VE displays arrows in different colors (green and orange) that match the direction of the selected axis. The global coordinate axes always align with the user's initial local coordinate axes; if a user moves away from the starting point, the sliders' orientations may no longer match the user's perspective.</p><p>VirtualGrasp is a device-based interaction method akin to HOMER-S <ref type="bibr">[41]</ref>, in which the object's position follows the device's position. This technique uses a "grab" button (on the bottom left corner of the screen, as shown in Figure <ref type="figure">3</ref>) that grasps the object regardless of distance. The original design of HOMER-S maintained a fixed distance between the grabbed object and the device, so a UI slider was added to control object distance (the slider UI on the right side of the screen in Figure <ref type="figure">3</ref>). 
Distance control is needed to reach beyond the user's physical movement range (e.g., beyond the physical wall) and to avoid re-grasping the object to adjust the distance between the object and the user.</p><p>Joystick is a touch-based interaction method that uses local coordinates to manipulate the object. There are two virtual joysticks, as seen in Figure <ref type="figure">4</ref>. One controls the object along the ground plane, moving it left, right, forward, and backward relative to the user's perspective. Therefore, the on-screen UI's orientation always matches the user's orientation (except for height), which is the fundamental difference from 3DSlide. The other joystick controls the height, moving the object up and down, as shown in Figure <ref type="figure">4b</ref>. The virtual joysticks are similar to those found in mobile games.</p><p>Each translation technique has varying restrictions on simultaneous DoF control: 3DSlide is restricted to 1 DoF at a time, Joystick to 2 DoF (lateral, depth) or 1 DoF (height), and VirtualGrasp allows 3 DoF. Each method has its own way of making the object move faster; for example, if the Joystick handle is at the edge of the circle, the object moves at maximum speed, while closer to the center it moves slowly. For VirtualGrasp, a user can move (i.e., walk) faster to increase the speed once the object is grabbed. For 3DSlide, swiping faster on the screen makes the object move faster. Each translation method was designed to translate objects at approximately 1.5 m/s. For VirtualGrasp, one can grab an object and walk at 1.5 m/s, slightly above average walking speed (1.33 m/s), to move the object at 1.5 m/s. For Joystick, placing the virtual knob at one end is mapped to 1.5 m/s. 
For 3DSlide, 1.5 m/s can be achieved by one swipe (5 cm) per second on the screen.</p><p>In terms of the layout of UI components, we distributed UIs to both sides of the screen so that users can easily reach them while holding the device with both hands, following traditional design norms where available (e.g., the virtual D-pad of a joystick is traditionally placed on the left side). While we did not adapt our design to users' handedness, this limitation reflects real-world situations where controllers or UIs are not configurable by default and controllers designed for left-handed populations are often not available <ref type="bibr">[38]</ref>.</p></div>
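The speed mappings above can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation; the function names and the screen-to-VE scaling implied by the 5 cm swipe are our assumptions.

```python
# Illustrative sketch of the three speed mappings described above
# (not the authors' code; constants taken from the text).

MAX_SPEED = 1.5  # m/s, the target translation speed for all three techniques

def joystick_velocity(deflection: float) -> float:
    """Joystick: knob deflection in [0, 1]; the rim maps to 1.5 m/s."""
    return max(0.0, min(deflection, 1.0)) * MAX_SPEED

def slide_velocity(swipe_cm_per_s: float) -> float:
    """3DSlide: one 5 cm swipe per second corresponds to 1.5 m/s, i.e.,
    screen motion is scaled by 1.5 / 0.05 = 30x into the VE (assumed linear)."""
    return (swipe_cm_per_s / 100.0) * (MAX_SPEED / 0.05)

def grasp_velocity(walk_speed: float) -> float:
    """VirtualGrasp: the grabbed object mirrors the device, so object
    speed equals the user's walking speed (1:1 mapping)."""
    return walk_speed
```

Under this sketch, walking at 1.5 m/s, a full knob deflection, and a 5 cm/s swipe all produce the same approximate 1.5 m/s object speed, matching the design goal stated above.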
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Virtual Environment Design</head><p>Depth cues were added to aid object translation, as seen in Figures <ref type="figure">2</ref>, <ref type="figure">3</ref>, and 4, as is common practice in many VEs <ref type="bibr">[13]</ref>. Lines were drawn on the floor of the VE to indicate 1 meter of physical and virtual space on the X (horizontal) and Z (depth) axes, and on the walls of the VE to indicate 1 meter of space on the Y (vertical) axis. Shadows were rendered directly beneath the objects to aid in positioning them along the ground plane and were kept consistent across all conditions.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">USER STUDIES</head><p>In this work, we conducted two user studies. In User Study 1, participants performed the translation task at a designated location (stationary), whereas in User Study 2, participants were encouraged to walk while performing the same task (movable). User Study 1 aimed to examine systematically how the independent variables affected performance and to understand translation when users do not walk (e.g., targets out of reach, seated use, or users with mobility impairments). User Study 2 aimed to investigate user experience in a more practical setting. Other than stationary versus mobile, all procedures and tasks were identical across the two studies. We analyzed each user study separately to mitigate the complexity of our study design, which incorporates three factors per study; using mobility as a within-subject factor would have required a larger sample size to achieve statistical power.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Experiment Design</head><p>For both user studies, we employed a 3 &#215; 3 &#215; 3 within-subject design involving three independent variables: translation technique, target size, and distance. Each participant experienced all three translation techniques (3DSlide, VirtualGrasp, and Joystick). Target size had three levels: large (0.25 m, equal to the movable object's radius), medium (0.125 m, 50% of the radius), and small (0.075 m, 30% of the radius). Additionally, we varied the distance between the movable object and the target destination across three levels: 1 m, 2 m, and 4 m.</p><p>While angles could be potentially interesting factors to explore, we opted to focus on distance as the primary research interest for this study due to statistical power considerations. Angles were, however, taken into account when designing trials: we designed a set of predefined target locations, aligning with the parameters outlined by Cha and Myung <ref type="bibr">[9]</ref>. Each experiment condition comprised six trials, varying across three inclination angles (&#952;1 = 30&#176;, 45&#176;, and 60&#176;) and two azimuth angles (&#952;2 = 0&#176; and 45&#176;). We included an upward directional movement (i.e., &#952;2 = 45&#176;), which can be particularly fatiguing <ref type="bibr">[9,</ref><ref type="bibr">42]</ref>, especially for device-based interaction methods (i.e., VirtualGrasp) involving above-shoulder movement. Based on this design, each participant conducted 162 trials (3 &#215; 3 &#215; 3 &#215; 6).</p></div>
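The trial count implied by this design can be verified by enumerating the full factorial combination of factors and angle pairs (a sketch; the condition labels are ours):

```python
from itertools import product

# Factor levels as described in the experiment design.
techniques = ["3DSlide", "VirtualGrasp", "Joystick"]
sizes_m = [0.25, 0.125, 0.075]   # large, medium, small target radii
distances_m = [1, 2, 4]
inclinations_deg = [30, 45, 60]  # theta_1
azimuths_deg = [0, 45]           # theta_2

# Each (technique, size, distance) condition has 3 x 2 = 6 angle
# combinations, giving 3 x 3 x 3 x 6 = 162 trials per participant.
trials = list(product(techniques, sizes_m, distances_m,
                      inclinations_deg, azimuths_deg))
assert len(trials) == 162
```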
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Measures</head><p>[Completion Time] Our primary aim was to examine the impact of different translation techniques and task-related factors on performance and user behavior. In each trial, we measured the time participants needed to translate the movable object from its initial position to its target, spanning from the moment the participant initiated the object's movement by touching the display to when the object reached its final position as determined by the participant. Other time measures, such as the time participants spent thinking or planning <ref type="bibr">[12]</ref>, were considered unnecessary given the simplicity of our task. Accuracy was not included either, as it would have relied on individuals' perception of the completion state as well as their strategies (e.g., prioritizing speed over precision or vice versa).</p><p>[Fine-tuning Time Percentage] We assessed the percentage of time the participant spent fine-tuning the object's position near the destination, defined as a sphere centered on the target location with a radius equal to 20% of the total distance between the object's initial position and the target location. The time within this region was divided by the completion time for the trial. The motivation behind evaluating the fine-tuning phase was to understand which control method requires the most time for precise control and which is appropriate for rough placement at a target location, i.e., spends the least time in the initial 80% of the total distance.</p><p>[Device Movement] Additionally, we measured device movement by calculating the device's speed, i.e., distance traveled per unit of time (m/s). The device's real-time position was sampled at 90 Hz to calculate the total trajectory length, which was then divided by the completion time. Device movement is a proxy for how dynamically a user moves. 
For User Study 1, the metric would indicate how much the participants moved their tablets at a fixed standing position.</p><p>[Subjective Measures] Upon completing each translation condition, participants evaluated its usability using HARUS <ref type="bibr">[49]</ref>, which measures usability through two subcategories, manipulability and comprehensibility, with eight question items. Participants rated their perceived workload using NASA-TLX <ref type="bibr">[21]</ref>. At the conclusion of the study, we collected participants' preferences among the translation techniques and solicited their comments explaining their choices.</p></div>
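The two derived measures above can be sketched as follows, assuming uniformly sampled positions. This is an illustration of the stated definitions, not the authors' code; `samples` and `positions` are hypothetical data structures.

```python
import math

def fine_tuning_percentage(samples, target, total_distance, completion_time):
    """Fraction of completion time spent in the fine-tuning region: a sphere
    around the target with radius = 20% of the initial object-to-target
    distance. `samples` is a list of (timestamp, (x, y, z)) object positions;
    uniform sampling is assumed so each sample accounts for one interval."""
    radius = 0.2 * total_distance
    dt = completion_time / max(len(samples) - 1, 1)
    inside = sum(1 for _, p in samples if math.dist(p, target) <= radius)
    return inside * dt / completion_time

def device_speed(positions, completion_time):
    """Average device speed (m/s): total trajectory length (sum of
    segment lengths between consecutive sampled positions, e.g., at 90 Hz)
    divided by completion time, mirroring the definition above."""
    path = sum(math.dist(a, b) for a, b in zip(positions, positions[1:]))
    return path / completion_time
```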
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Setup</head><p>The participants used a Samsung Galaxy Tab A7 Lite (8.7", 32 GB, released 2021), weighing 366 g <ref type="bibr">[48]</ref>, which ran a Unity application. The device's position was tracked through a VIVE tracker mounted on it <ref type="bibr">[26]</ref> and synchronized via Photon Unity Networking <ref type="bibr">[46]</ref> with a computer (Dell Alienware Aurora R12 <ref type="bibr">[2]</ref>) running Unity v2018.3.27f1 <ref type="bibr">[54]</ref> and SteamVR <ref type="bibr">[51]</ref>. The system updated at 90 frames per second.</p><p>For the physical environment, a carpeted, wide-open space, with a tracked area measuring 4 &#215; 4 m&#178; and a non-tracked area of 5 &#215; 5 m&#178;, was used throughout both experiments. There were no obstacles inside the tracked area, and two VIVE base stations <ref type="bibr">[26]</ref> were mounted at opposite corners. Participants began in the middle of the room, facing forward, at the start of each trial.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4">Procedure</head><p>Each study lasted approximately 60 minutes. Participants were compensated with a $15 Amazon gift card, plus an additional $5 per 30 minutes if the study exceeded 90 minutes; only 4 of the 60 participants took more than 90 minutes. Before the study, the participants read a consent form approved by the Institutional Review Board, and the researcher obtained verbal consent. They then filled out a demographic form asking about their age and prior experience with video games and mobile games, each on a scale from 1 to 7 (1 = "no experience", 7 = "very experienced"). For each translation method, the participants conducted ten training trials to become familiar with it, followed by 54 actual trials with randomly selected target sizes, distances, and angles. The order of the translation methods was counterbalanced using a complete Latin square. Participants were informed that their goal was to translate the ball to make it touch the target object as quickly as possible. After a successful translation, the system played an auditory cue (a chime) and enabled a button labeled "Next", allowing the user to begin the next trial immediately. For every trial, the movable object, a ball with a radius of 0.25 m, appeared at the same position (as seen in Figure <ref type="figure">3a</ref>). After each trial, participants could take an optional break if they felt fatigued. Participants had a maximum of 90 seconds to complete a given trial.</p><p>For User Study 1, participants stood in a marked position and were told not to walk throughout the study. 
For User Study 2, participants were asked to begin each trial at a marked position and were encouraged to walk, as suggested by the researcher during the training trials when explaining each translation method.</p><p>After completing each translation method, participants filled out a questionnaire assessing the usability and perceived workload of that method before moving on to the next. After completing all three translation methods, participants filled out a questionnaire about their overall preferences, followed by a brief interview in which they explained which method they found most easy to learn, intuitive, and enjoyable.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.5">Participants</head><p>For User Study 1, we recruited 30 participants (18 males, 9 females, 3 preferred not to say / 23 right-handed, 3 left-handed, 2 ambidextrous, 2 unknown) with ages ranging from 19 to 36, with an average of 26.33 and a standard deviation of 4.33. The participants rated their video game experience at an average of 5.47 (&#963; = 1.94) and their mobile game experience at 4.4 (&#963; = 1.59) on a 1 to 7 scale.</p><p>For User Study 2, we recruited another 30 participants (22 males, 5 females, 3 preferred not to say / 24 right-handed, 4 left-handed, 2 unknown) with ages ranging from 18 to 36, with an average of 25.67 and a standard deviation of 4.44. The participants averaged a rating of 5.27 (&#963; = 2.00) for video games and 4.30 (&#963; = 1.58) for mobile games. No one participated in both User Studies 1 and 2.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">RESULTS</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">Objective Measures</head><p>None of the three objective measures was normally distributed. Therefore, we applied the nonparametric aligned rank transform (ART) procedure <ref type="bibr">[56]</ref> before conducting repeated measures ANOVA. While we noticed potential outliers with longer completion times, we did not exclude them from the analysis, since the statistical analyses were based on ranks rather than raw values; the results were consistent across all conditions, suggesting that the outliers had no significant impact on the study results. When the sphericity assumption was violated, we applied the Greenhouse-Geisser correction for &#949; &lt; 0.75 and the Huynh-Feldt correction for &#949; &#8805; 0.75. For pairwise comparisons, we used the Bonferroni correction with an adjusted alpha level of 0.0167 (&#945; = 0.05/3). All statistical analyses were performed using SPSS V29.0.1.0 and RStudio V2023.09.1+494.</p><p>Overall, User Studies 1 and 2 showed consistent trends in the effects of translation technique, target size, and distance on completion time, fine-tuning phase, and device movement, with slightly more active movement across all conditions in User Study 2, where participants were encouraged to walk. The statistical significance of each factor on each measure is reported in Table <ref type="table">1</ref>.</p></div>
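The ART-then-ANOVA procedure described above can be sketched in code. The following is an illustrative Python sketch of the alignment and ranking steps for a single main effect in a two-factor design, following the general formulation behind the ART procedure [56]; the function name, factor labels, and toy data are hypothetical, and this is not the authors' actual SPSS/RStudio pipeline.

```python
import numpy as np
from scipy.stats import rankdata

def art_align_rank(y, a, b):
    """Aligned rank transform for the main effect of factor `a` in a
    two-factor design: subtract cell means (stripping all effects),
    add back the estimated main effect of `a`, then rank the aligned
    values. The ranks are then fed to a conventional ANOVA."""
    y, a, b = np.asarray(y, float), np.asarray(a), np.asarray(b)
    grand = y.mean()
    # mean response for every (a, b) cell
    cell = {k: y[(a == k[0]) & (b == k[1])].mean()
            for k in {(x, z) for x, z in zip(a, b)}}
    resid = y - np.array([cell[(x, z)] for x, z in zip(a, b)])
    # estimated main effect of `a`: marginal mean minus grand mean
    eff = np.array([y[a == x].mean() - grand for x in a])
    return rankdata(resid + eff)  # average ranks for ties

# toy example: completion times under 2 techniques x 2 target sizes
time = [9.1, 8.7, 12.4, 11.9, 5.2, 5.0, 7.8, 7.5]
tech = ["slide"] * 4 + ["joystick"] * 4
size = ["small", "small", "large", "large"] * 2
ranks = art_align_rank(time, tech, size)
```

Because the subsequent ANOVA runs on these ranks, extreme raw values influence the analysis only through their rank order, which is why the outliers mentioned above could be retained.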
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.1">Completion Time</head><p>For both user studies, all three studied variables had significant main effects on completion time: translation technique, target size, and distance. Both user studies showed similar trends. Among the translation techniques, 3DSlide took significantly longer to complete tasks (see Figure <ref type="figure">5a</ref> and <ref type="figure">5j</ref>). In terms of target size and distance, completion time increased when the target was smaller and located at longer distances, following the relationship defined by Fitts' law (see Figure <ref type="figure">5d</ref>, <ref type="figure">5g</ref>, <ref type="figure">5m</ref>, and <ref type="figure">5p</ref>). An interaction effect between translation technique and size was observed in both user studies (see Figure <ref type="figure">6a</ref> and <ref type="figure">6d</ref>). While the overall trends held across all sizes, namely that smaller targets took longer for every technique and that 3DSlide was the slowest, the statistical significance of the pairwise differences varied. For example, unlike VirtualGrasp and Joystick, 3DSlide had relatively small differences in completion time across target sizes. An interaction effect between translation technique and distance was also observed in both user studies (see Figure <ref type="figure">7a</ref> and <ref type="figure">7d</ref>). Again, the overall trend of longer distances taking longer to complete held for all techniques, and 3DSlide remained the slowest across all distances.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.2">Fine-Tuning Phase</head><p>For both user studies, translation technique and target size showed significant main effects on the time spent fine-tuning. This metric represents the portion of the total completion time spent within the last 20% of the total distance to the target. A value close to 20% indicates that the movement speed was relatively constant and linear, while a value exceeding 20% indicates slower movement near the target location. Among the translation techniques, VirtualGrasp required a significantly greater proportion of time to fine-tune the object before it reached the target (see Figure <ref type="figure">5b</ref> and <ref type="figure">5k</ref>). While 3DSlide and Joystick spent around 20-30% of the total time in this region, VirtualGrasp spent more than 40%. Additionally, the final adjustment required greater effort for smaller targets (see Figure <ref type="figure">5e</ref> and <ref type="figure">5n</ref>). In terms of target distance, a slightly longer portion of time was spent on final adjustments when the target was at its furthest distance of 4 m, but only in User Study 2 (see Figure <ref type="figure">5h</ref> and <ref type="figure">5o</ref>).</p><p>For both user studies, an interaction effect between translation technique and target size was observed (see Figure <ref type="figure">6b</ref> and <ref type="figure">6e</ref>). The general trend of smaller targets requiring a larger fine-tuning phase was consistent across translation techniques, and VirtualGrasp required the largest fine-tuning phase regardless of target size. However, there were variations in whether these differences were statistically significant. 
Specifically, for both VirtualGrasp and Joystick, smaller targets required a significantly larger fine-tuning phase than larger targets, with all pairwise differences reaching statistical significance. In the case of 3DSlide, the small target required a significantly larger fine-tuning phase than the large target, but the differences between the small and medium targets, and between the medium and large targets, did not reach statistical significance.</p><p>While target distance had minimal effects on the fine-tuning phase, an interaction effect between target distance and translation technique was identified in both user studies (see Figure <ref type="figure">7b</ref> and <ref type="figure">7e</ref>). Although the differences were not substantial, 3DSlide tended to spend similar or more time fine-tuning at closer distances, whereas Joystick and VirtualGrasp tended to spend similar or more time fine-tuning at further distances.</p></div>
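To make the fine-tuning metric concrete, its computation can be sketched as follows, assuming each trial's trajectory is logged as a chronological list of (timestamp, remaining distance to target) samples; the function name and data layout are illustrative assumptions, not the study's actual logging format.

```python
def fine_tuning_fraction(samples, threshold=0.2):
    """Fraction of total completion time spent within the last
    `threshold` (here 20%) of the initial distance to the target.
    `samples` is a chronological list of (time_s, distance_m) pairs."""
    t_start, d_start = samples[0]
    total_time = samples[-1][0] - t_start
    cutoff = threshold * d_start          # the fine-tuning region
    inside = 0.0
    for (t_prev, _), (t_cur, d_cur) in zip(samples, samples[1:]):
        if d_cur <= cutoff:               # interval ends inside the region
            inside += t_cur - t_prev
    return inside / total_time

# constant-speed approach over 10 s from 4 m away: the object spends
# roughly 20% of the time in the last 20% of the distance
trajectory = [(i * 0.1, 4.0 * (1 - i / 100)) for i in range(101)]
frac = fine_tuning_fraction(trajectory)
```

A value well above the threshold, as observed for VirtualGrasp, therefore indicates deceleration and corrective movement near the target rather than a constant-speed approach.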
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.3">Device Movement</head><p>For both user studies, all three variables showed significant main effects on the level of device movement: translation technique, target size, and distance. VirtualGrasp resulted in the most active movement, followed by Joystick and then 3DSlide (see Figure <ref type="figure">5c</ref> and <ref type="figure">5l</ref>). Regarding target size and distance, participants moved the device less on average for smaller targets and at longer distances (see Figure <ref type="figure">5f</ref>, 5i, 5o, and 5r).</p><p>Interaction effects between translation technique and size were observed in both user studies (see Figure <ref type="figure">6c</ref> and <ref type="figure">6f</ref>). In both studies, the amount of movement did not differ significantly across target sizes for 3DSlide and Joystick. For VirtualGrasp, however, both user studies consistently showed lower activity levels for smaller targets.</p><p>Interaction effects between translation technique and distance were also identified in both user studies (see Figure <ref type="figure">7c</ref> and <ref type="figure">7f</ref>). The overall trend was consistent with the main effects of translation technique and distance: VirtualGrasp was consistently associated with the highest level of activity, while targets at greater distances resulted in lower average movement. However, there was no significant difference in activity levels across distances for 3DSlide. Also, movement level increased drastically for VirtualGrasp at closer distances, leaving no significant difference between 1 m and 2 m.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">Subjective Measures</head><p>For the usability and workload data, we applied the nonparametric aligned rank transform (ART) procedure <ref type="bibr">[56]</ref> to address the non-normal distributions before conducting repeated measures ANOVA. When the sphericity assumption was violated, we applied the Greenhouse-Geisser correction for &#949; &lt; 0.75 and the Huynh-Feldt correction for &#949; &#8805; 0.75. For the preference data and the votes from exit interviews, a chi-square goodness-of-fit test was used to confirm statistically significant variance. Comments collected during the exit interviews were categorized using an affinity diagram <ref type="bibr">[31]</ref> (see the supplementary material of this paper) and are referenced in the Discussion section.</p></div>
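The chi-square goodness-of-fit test applied to the vote data compares observed vote counts against a uniform expected distribution over the three techniques. A minimal SciPy sketch, using hypothetical vote counts rather than the study's data:

```python
from scipy.stats import chisquare

# hypothetical first-place votes for Joystick, VirtualGrasp, 3DSlide
votes = [18, 8, 4]
# null hypothesis: the 30 votes are spread uniformly (10 per technique);
# chisquare defaults to a uniform expectation when `f_exp` is omitted
stat, p = chisquare(votes)
print(f"chi2({len(votes) - 1}) = {stat:.2f}, p = {p:.4f}")
# chi2(2) = 10.40, p ~ 0.0055: the distribution deviates from uniform
```

A significant result here only says the three counts are not equal; identifying which technique drew the votes still requires inspecting the counts themselves.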
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.1">Usability</head><p>Translation technique had a significant effect on manipulability in both user studies (see Figure <ref type="figure">8a</ref>). Manipulability assessed the physical comfort and ease of operation of the device and UI. In both studies, whether participants were stationary or free to move, Joystick was consistently rated as the most comfortable and straightforward to use. In User Study 1, 3DSlide received the second-highest score, followed by VirtualGrasp, with statistically significant differences between all pairs. In User Study 2, the difference between VirtualGrasp and 3DSlide was not statistically significant.</p><p>Comprehensibility assessed factors such as response speed, ease of reading and understanding, and consistency of the interface. Overall, all three translation techniques received scores above neutral, indicating positive marks for comprehensibility. Once again, Joystick received the highest score, while 3DSlide received the lowest. The difference between 3DSlide and Joystick was significant only in User Study 2 (see Figure <ref type="figure">8c</ref>), not in User Study 1.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.2">Workload</head><p>The graphs in the second column of Figure <ref type="figure">8</ref> show participants' perceived workload, assessed using the NASA-TLX questionnaire, which consists of six subscales: mental demand, physical demand, temporal demand, performance, effort, and frustration. Across all subscales and both user studies, VirtualGrasp consistently received the highest scores for demand and the lowest score for perceived performance. In User Study 1, VirtualGrasp was rated as significantly more complex (mental demand), strenuous (physical demand), demanding (effort), and stressful (frustration) in comparison to one or both of the other translation techniques. Additionally, participants expressed lower satisfaction with their performance with VirtualGrasp. Temporal demand, which gauged the perceived level of time pressure, did not exhibit any significant differences among translation techniques, with scores indicating neutral responses.</p><p>User Study 2 yielded a similar trend, with VirtualGrasp evaluated as significantly more strenuous (physical demand), demanding (effort), and stressful (frustration), and with lower satisfaction in performance, compared to one or both of the other translation techniques. In contrast to User Study 1, VirtualGrasp was no longer significantly more demanding than the other two methods in terms of complexity (mental demand), and the pairwise comparison on frustration did not reach significance under the conservative alpha level.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.3">Preference</head><p>The graphs in the first column of Figure <ref type="figure">9</ref> show the distribution of preference ranks for the translation techniques. A chi-square goodness-of-fit test showed statistically significant discrepancies in the number of votes in both User Study 1 and User Study 2. In User Study 1, Joystick received the highest number of votes as the most preferred translation technique. While Joystick remained the most preferred technique overall, it is noteworthy that VirtualGrasp received far more votes in User Study 2 than in User Study 1, where participants were restricted to a stationary location.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.4">Exit Interview</head><p>In addition to general preference, participants were asked to vote for the method they found the most intuitive, the easiest to learn, and the most fun (see Figure <ref type="figure">9</ref>-Right). In User Study 1, all three components (easy to learn, intuitive, fun) showed statistically significant variance in the distribution of votes, with Joystick receiving the highest number of votes. Following Joystick, VirtualGrasp received more votes than 3DSlide for being intuitive and fun, while VirtualGrasp was rarely voted the easiest to learn.</p><p>User Study 2 showed results distinct from User Study 1. For ease of learning, Joystick received the most votes, but the distribution did not differ significantly enough to reject the null hypothesis of an equal distribution across all techniques. Also, VirtualGrasp received the same number of votes as Joystick for being the most intuitive, and one more vote than Joystick for being the most fun.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3">Fitts' Law</head><p>To apply Fitts' law and its extended models to our data, we calculated the average completion time across participants from User Study 1 for each unique combination of variables: target size, distance, azimuth angle, and inclination angle. This process provided 54 data points for each translation technique. We employed five models: Fitts' <ref type="bibr">[14]</ref>, Shannon's <ref type="bibr">[50]</ref>, Hoffman's <ref type="bibr">[24,</ref><ref type="bibr">36]</ref>, Murata and Iwase's <ref type="bibr">[42]</ref>, and Cha and Myung's <ref type="bibr">[9]</ref>. For Fitts' and Shannon's, the models aimed to predict completion time (MT) based on the index of difficulty (ID) determined by target size (W) and distance (A). Hoffman's model incorporated an additional factor, finger pad size (F); in this study, we used the size of the movable object for this factor. Murata and Iwase's model employed the same definition of ID as Shannon's but included the sine of the azimuth angle as an additional factor. Lastly, Cha and Myung's model integrated both the sine of the azimuth angle and the inclination angle, and employed the same ID as Hoffman's. For each translation technique, the same data points were used for all models to ensure a fair comparison, given that the number of repetitions embedded in a single data point significantly impacts the explanatory power (R²) of the model, as discussed by Triantafyllidis and Li <ref type="bibr">[53]</ref>. The fitted graphs of each model are available in the supplementary material of this paper.</p><p>Fig. 9: The first column shows preference rankings of the translation techniques, while the second column shows votes on technique characteristics from the exit interviews.</p><p>As shown in Table <ref type="table">2</ref>, Cha and Myung's model demonstrated the best fit for all three translation techniques, while Fitts' and Shannon's models yielded the lowest. 
Specifically, for VirtualGrasp and Joystick, the models that incorporated angles proved more effective at explaining the data than the 2D models. Interestingly, for 3DSlide, Hoffman's model emerged as the second-best performer, reflecting that technique's unique characteristics in this context.</p><p>Table 2: R² values for Fitts' law and its extended models.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><table><row role="label"><cell>Model</cell><cell>3DSlide</cell><cell>VirtualGrasp</cell><cell>Joystick</cell></row><row><cell>Fitts' <ref type="bibr">[14]</ref></cell><cell>0.54</cell><cell>0.60</cell><cell>0.56</cell></row><row><cell>Shannon's <ref type="bibr">[36,</ref><ref type="bibr">50]</ref></cell><cell>0.54</cell><cell>0.60</cell><cell>0.56</cell></row><row><cell>Hoffman's <ref type="bibr">[24]</ref></cell><cell>0.71</cell><cell>0.65</cell><cell>0.66</cell></row><row><cell>Murata and Iwase's <ref type="bibr">[42]</ref></cell><cell>0.58</cell><cell>0.69</cell><cell>0.68</cell></row><row><cell>Cha and Myung's <ref type="bibr">[9]</ref></cell><cell>0.75</cell><cell>0.76</cell><cell>0.83</cell></row></table></div>
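The R² values above come from regressing mean completion time on each model's index of difficulty. A sketch of this procedure for the Shannon formulation, ID = log2(A/W + 1); the data points below are synthetic stand-ins for the study's 54 per-technique means, and the helper name is ours.

```python
import numpy as np

def fit_shannon(A, W, MT):
    """Fit MT = a + b * ID with Shannon's ID = log2(A/W + 1),
    returning the intercept a, slope b, and R^2 of the fit."""
    A, W, MT = (np.asarray(x, float) for x in (A, W, MT))
    ID = np.log2(A / W + 1)
    b, a = np.polyfit(ID, MT, 1)          # slope, then intercept
    pred = a + b * ID
    ss_res = np.sum((MT - pred) ** 2)
    ss_tot = np.sum((MT - MT.mean()) ** 2)
    return a, b, 1.0 - ss_res / ss_tot

# synthetic data: distances (m), target widths (m), and completion
# times generated to follow the model exactly, so R^2 should be ~1
A = [1.0, 2.0, 4.0, 1.0, 2.0, 4.0]
W = [0.1, 0.1, 0.1, 0.5, 0.5, 0.5]
MT = [0.5 + 0.3 * np.log2(d / w + 1) for d, w in zip(A, W)]
a, b, r2 = fit_shannon(A, W, MT)
```

The 3D extensions (Murata and Iwase's, Cha and Myung's) add sine terms for the azimuth and inclination angles as extra regressors, turning the fit into a multiple linear regression over the same data points.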
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">DISCUSSION</head><p>We designed an experiment comparing three translation techniques for object translation in large, room-scale virtual environments, evaluating each technique across different target distances and target sizes. Results show that translations were completed fastest with Joystick, while 3DSlide required the least fine-tuning and VirtualGrasp required the most device movement. Participants overall preferred Joystick the most. However, VirtualGrasp was significantly more preferred in User Study 2, when movement was encouraged, than in User Study 1, when user movement was not allowed.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.1">Variance of trade-off between speed and precision</head><p>Device movement was significantly higher for VirtualGrasp than for the other two methods, and VirtualGrasp also spent the longest amount of time in the fine-tuning phase. This indicates that VirtualGrasp fosters dynamic, physical behavior more effectively than Joystick, at the price of limited accuracy. When movement was allowed, VirtualGrasp was no longer significantly more complex or stressful than the other methods, as it was when users were restricted to a static location. Therefore, VirtualGrasp may be suitable for applications involving user movement when accuracy is not the primary goal. Isaza et al. reported a similar finding: proxemic selection methods were more effective at fostering dynamic movement <ref type="bibr">[27]</ref>. In addition, VirtualGrasp can be intuitive and fun when users can walk freely. Snapping methods can compensate for the limited precision if the precise target location is known <ref type="bibr">[34]</ref>. One possible explanation for VirtualGrasp requiring longer fine-tuning than the other two techniques is that users had to control multiple axes simultaneously, since how the mobile device is held dictates the object's position across multiple axes. On the other hand, 3DSlide was the most restrictive (i.e., one axis of control at a time), which may benefit the fine-tuning phase at the cost of slower overall completion time. P1-29 supported this in explaining their preference for 3DSlide, saying "manipulating each axis individually offered more preciseness in moving the object around", and P2-21 stated, "it was the one that was easiest to control and offer the most precise control". However, having to switch between axes can impact efficiency. 
Joystick appears to offer a good compromise between speed and accuracy, allowing both simultaneous DoF control (X-Z axes) and single-axis control depending on how users operate the UI. This performance advantage of simultaneous DoF control is supported by Benko and Feiner <ref type="bibr">[5]</ref>: their 3-DoF positioning technique restricted movement to either 2-DoF or 1-DoF translation without requiring an explicit mode switch, similar to Joystick.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.2">Users do not prefer to move too much</head><p>Participants overall preferred less physical movement, with Joystick ranking low in mental demand, physical demand, effort, and frustration, and receiving the best perceived performance based on the NASA-TLX results. Participants liked Joystick because of the little movement required to translate an object, as their justifications in the exit interviews showed. P1-1 stated, "I prefer the [Joystick] method as it does not require me to physically move...", and P2-20 stated, "it was the easiest to quickly get the answer with little to no movement". VirtualGrasp, which required participants to adjust their tablets with hand movements, was rated as having the highest mental demand, physical demand, effort, and frustration, with the worst perceived performance.</p><p>Results for 3DSlide and Joystick, including workload and preference, remained similar across User Studies 1 and 2, as participants typically chose not to move despite being encouraged to do so in User Study 2. Participants in User Study 2 frequently mentioned relying on depth cues, saying they did not feel the need to move to understand the positions of the objects. When designing handheld VR interactions, depth cues allow users to align their perspective with the distance of virtual objects <ref type="bibr">[55]</ref>, especially in larger, room-scale translations.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.3">Device-based translation can be physically fatiguing</head><p>For VirtualGrasp, a frequent complaint in both user studies was fatigue from having to hold the device up for long periods, and VirtualGrasp received the significantly lowest manipulability score in User Study 1. In the other two, non-device-based methods, the orientation of the device did not affect the selected object's position. In User Study 1, because participants were stationary, they often tilted the device and held it steady to translate, rather than adjusting their perspective and distance toward the target as participants did in User Study 2. This fatigue is quite common in device-based methods <ref type="bibr">[5,</ref><ref type="bibr">18,</ref><ref type="bibr">57]</ref>, as users have to hold their arms steady to "aim" at the target correctly. In VirtualGrasp, given that the device's orientation and position dictate the movable object's position, any shift in the device's holding gesture changes the object's position proportionally to the object's distance. P1-17's statement represents this struggle well: "I didn't like the [VirtualGrasp] ... I had to be very steady while grabbing the object otherwise, if the tablet moved even slightly, I would lose the correct position; I felt it needed too much physical effort compared to the other two". This result is inconsistent with previous work, which found HOMER to be an effective translation technique for HMD VR <ref type="bibr">[7]</ref>. While device-based methods are often the easiest and most intuitive way to control translation in VR <ref type="bibr">[29]</ref>, the coupling between the device orientation and the grabbed object's position makes VirtualGrasp more challenging in handheld VR than in HMD VR. Future designers should take this into consideration by limiting device-based techniques in tasks that require high levels of precision, to avoid fatiguing users.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.4">When free to move, device-based control is preferable</head><p>According to the preference results, more participants preferred VirtualGrasp in User Study 2, when movement was allowed, and positive impressions of it increased overall in User Study 2. Supporting evidence can be seen in User Study 1, where VirtualGrasp had significantly higher mental demand and frustration than the other two techniques, which was not the case in User Study 2. More participants also described VirtualGrasp as intuitive in User Study 2 than in User Study 1. The results of Mossel et al. indicated that users preferred HOMER-S over 3DTouch for 3D and 2D object translation tasks, given that it resembles the real-world metaphor of translation, i.e., grabbing and moving <ref type="bibr">[41]</ref>. As VirtualGrasp was derived from HOMER-S, and the original evaluation of HOMER-S allowed participant movement <ref type="bibr">[41]</ref>, device-based methods appear preferable in scenarios where movement is allowed, since participants can use the real-world translation metaphor. The results from our study align with the room-scale handheld AR evaluation of Hellmuth et al. <ref type="bibr">[23]</ref>; despite their device-based technique performing worst, participants noted that they "had fun with this method". Increased visual cues from being able to see and grab the object from different locations could have contributed to matching the perception and control structures <ref type="bibr">[28]</ref>.</p><p>The implications of this study give designers insight into which translation techniques to use when creating room-scale handheld VR experiences. 
For example, if the goal of the application is to encourage challenging, dynamic, and physical behaviors for engagement (e.g., a game), VirtualGrasp can be a good option; for applications where efficiency and precision matter, Joystick is the better choice. These implications also help handheld VR/AR developers understand what can be improved as mobile computing technology matures, enabling better tracking and interaction at larger spatial scales (e.g., outdoor or multi-user indoor games).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.5">Handheld VR follows Fitts' law in 3D environment</head><p>Overall, we observed that all three translation methods followed the fundamental principles of Fitts' law, with greater completion time when target sizes were smaller and when the distance to the target was greater. At the outset of this study, we posed the question of whether the translation task in handheld VR would be better explained in 2D or 3D settings, considering the 2D input system and 3D task environment. For all studied translation techniques, Cha and Myung's model exhibited the best predictability by including both inclination and azimuth angles between the object and the target location. This suggests that while the interaction occurred through an interface laid on a handheld 2D device, it more likely resembled natural movement in a 3D environment. Namely, a higher inclination angle required a broader tilt of the device and the arm holding it, and a larger azimuth angle required a bigger movement upward, both of which resulted in longer completion times. While VirtualGrasp and Joystick showed that 3D models outperformed those solely based on target size and distance, 3DSlide demonstrated its second-best fit with Hoffman's model, which does not consider any angle. This suggests that the impact of angles on predictability was relatively lower for 3DSlide. One explanation could be that with 3DSlide, object translation was limited to 1 DoF at a time, resembling multiple 1D translation tasks. In light of these findings, designers should acknowledge the importance of understanding both the environment and the characteristics of the translation technique to gain insight into the underlying dynamics of the task.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7">LIMITATIONS AND FUTURE WORK</head><p>Our study evaluated different translation techniques in room-scale handheld VR and how their design could affect usability. Across translation techniques, various components changed, such as the UI, input mechanism, and translation sensitivity, and it is not clear which factor drove the performance differences. Instead, our work evaluates three translation methods that are commonly studied in the literature and used in practice. In addition, there are many other factors that could have affected performance but that we did not consider in this study, including angles, object sizes, reachability, and virtual environment design. Future work could isolate the effects of each design factor to determine which influenced translation performance.</p><p>For Joystick, an alternative design could have distributed the two joysticks to opposite sides of the screen; placing the height control on the right side might have changed Joystick's performance.</p><p>We acknowledge various topics that were not within the scope of this work but are still necessary for a comprehensive understanding of room-scale handheld VR interaction. While we believe that scaling and rotation would not change drastically in room-scale environments, future work could explore how scale impacts the precision of these transformations. Future investigations could also examine additional translation techniques. One such technique is gesture-based interaction, which we did not examine due to the current technological limitations of a single handheld device and its limited camera field of view. However, with future devices whose sensors can capture hand gestures without requiring users to hold their hands in front of the camera, researchers could evaluate the viability of incorporating gestures into handheld VR.</p></div></body>
		</text>
</TEI>
