<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Learning-based Spotlight Position Optimization for Non-Line-of-Sight Human Localization and Posture Classification</title></titleStmt>
			<publicationStmt>
				<publisher>IEEE</publisher>
				<date>01/03/2024</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10537757</idno>
					<idno type="doi">10.1109/WACV57701.2024.00417</idno>
					
					<author>Sreenithy Chandran</author><author>Tatsuya Yatagawa</author><author>Hiroyuki Kubo</author><author>Suren Jayasuriya</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Non-line-of-sight imaging (NLOS) is the process of estimating information about a scene that is hidden from the direct line of sight of the camera. NLOS imaging typically requires time-resolved detectors and a laser source for illumination, which are both expensive and computationally intensive to handle. In this paper, we propose an NLOS-based localization and posture classification technique that uses an off-the-shelf projector and camera system. We leverage a message-passing neural network to learn a visible scene geometry and predict the best position to be spotlighted by the projector that can maximize the NLOS signal. The neural network is trained end-to-end and the network parameters are optimized to maximize the NLOS performance. Unlike prior deep-learning-based NLOS techniques that assume planar relay walls, our system allows us to handle line-of-sight scenes where scene geometries are more arbitrary. Our method demonstrates state-of-the-art performance in object localization and position classification using both synthetic and real scenes.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Non-line-of-sight (NLOS) imaging refers to the technique of imaging hidden parts of a scene that are not within the field of view of a camera. This involves interpreting the illumination reflected or scattered from the NLOS object onto visible surfaces. NLOS imaging has been employed for the identification, tracking, and 3D shape reconstruction of hidden objects. NLOS imaging techniques are rapidly developing <ref type="bibr">[11]</ref> and currently have numerous applications, such as search and rescue <ref type="bibr">[43]</ref>, endoscopy <ref type="bibr">[26]</ref>, and hidden pedestrian detection for autonomous driving <ref type="bibr">[2]</ref>.</p><p>Figure 1. (a) Capture setup. (b) Entire room with MoCap cameras. (c) Processing pipeline (LOS mesh, spotlight optimization with the MPNN, LOS wall photo, NLOS network, and tracking plot of a subject moving around the NLOS region). Given the polygonal mesh of a target scene, our method predicts which area of the scene to illuminate with a spotlight to maximize the light-scatter information from a hidden person. Then, we capture RGB images of the wall visible from the camera under the optimal illumination. Finally, our neural network predicts the 2D position and posture of the hidden person.</p><p>NLOS imaging was first demonstrated by Velten et al. <ref type="bibr">[41]</ref> using an ultra-fast laser and a streak camera. Subsequent research in transient imaging leveraged a pulsed laser with high-resolution temporal detectors such as single-photon avalanche diodes (SPADs) <ref type="bibr">[4,</ref><ref type="bibr">30,</ref><ref type="bibr">31,</ref><ref type="bibr">43]</ref>. Active transient imaging pulses a fast laser into the scene and measures the time that photons take to arrive back at the temporal detector. However, high temporal resolution with SPADs requires precise calibration and long acquisition times. Furthermore, SPAD data processing is too slow for large scenes and high-resolution images <ref type="bibr">[25,</ref><ref type="bibr">43]</ref>. An alternative is to use continuous-wave time-of-flight (ToF) cameras with modulated light sources <ref type="bibr">[15,</ref><ref type="bibr">17,</ref><ref type="bibr">27]</ref>. ToF cameras are cheaper than streak cameras and SPADs and are popular in real-time NLOS applications when high resolution is not needed <ref type="bibr">[27]</ref>.</p><p>This WACV paper is the Open Access version, provided by the Computer Vision Foundation. Except for this watermark, it is identical to the accepted version; the final published version of the proceedings is available on IEEE Xplore.</p><p>Cameras are by far the cheapest detectors, albeit lacking information on light transport, such as the ToF of light. 
Therefore, researchers have also explored NLOS techniques using conventional cameras and lasers <ref type="bibr">[8,</ref><ref type="bibr">16,</ref><ref type="bibr">20,</ref><ref type="bibr">23]</ref> and ambient illumination <ref type="bibr">[1,3,33&#8211;35,39]</ref>. Recently, the use of scene priors and deep learning has become popular to overcome the ill-posedness of the NLOS imaging problem <ref type="bibr">[17,</ref><ref type="bibr">29,</ref><ref type="bibr">33]</ref>.</p><p>In this work, we present an active data-driven NLOS posture classification and tracking pipeline that works with a standard RGB camera and single spotlight illumination. Our approach does not require optical alignment or system calibration. It combines a graph neural network with a physics-based differentiable renderer to determine the spotlight position that maximizes NLOS performance. Given knowledge of the LOS geometry, the goal of illumination estimation is to learn the illumination direction that maximizes the NLOS radiance reaching the camera. We leverage this to improve downstream NLOS imaging tasks. A major focus of our method is to move beyond small-scale imaging setups, in which the line-of-sight (LOS) walls or visible surfaces are mostly planar, toward scenes that are encountered in the real world. Chandran et al. <ref type="bibr">[7]</ref> proposed an approach to handle LOS scenes with occlusions. However, their imaging model assumed diffuse reflectance for the LOS wall and handled scenes with limited complexity and very small NLOS volumes (30 cm &#215; 30 cm &#215; 30 cm). For this purpose, we build a large dataset of realistic-looking synthetic scenes with complex geometry, textures, occlusions, etc. We also captured highly accurate real data with human NLOS subjects and validated our method using this dataset. 
Our specific contributions include the following:</p><p>&#8226; An end-to-end neural computational imaging method to learn the best illumination for a LOS scene mesh to maximize NLOS performance. Our pipeline consists of a novel message-passing neural network for estimating the spotlight position, a physics-based renderer, and a neural network for NLOS localization and posture classification.</p><p>&#8226; Owing to the use of differentiable rendering in our pipeline, the proposed method works well for realistic-scale scenes with non-diffuse surfaces and self-illuminating objects.</p><p>&#8226; We used synthetic and real data to demonstrate superior performance compared to several baselines.</p><p>Our method achieves highly accurate localization of unknown human subjects. We surpass the best competing methods by more than 45 cm in terms of root mean square error. Compared to methods that use only a single-intensity LOS wall image, our method based on optimizing the spotlight has clear advantages, as shown by experimental results and ablative studies. See our project page for more details.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Work</head><p>Active illumination in NLOS: Active illumination methods employ controlled illumination sources (e.g., lasers and projectors) and detectors to explore the hidden parts of scenes. Kirmani et al. <ref type="bibr">[18]</ref> proposed the first framework for transient imaging to &#8220;look around the corner.&#8221; Velten et al. <ref type="bibr">[41]</ref> introduced a backprojection technique for NLOS scene reconstruction, which was later used with gated systems <ref type="bibr">[22]</ref> and SPADs <ref type="bibr">[4]</ref>. Furthermore, non-impulse illumination was also shown to be useful for NLOS tasks <ref type="bibr">[19]</ref>.</p><p>Passive illumination in NLOS: Passive illumination methods <ref type="bibr">[1,</ref><ref type="bibr">3,</ref><ref type="bibr">21,</ref><ref type="bibr">28,</ref><ref type="bibr">36]</ref> employ ambient light for NLOS imaging tasks. For instance, some treated objects in the scene as pinspecks or pinholes <ref type="bibr">[33,</ref><ref type="bibr">34,</ref><ref type="bibr">39]</ref>, while others utilized occluders <ref type="bibr">[3,</ref><ref type="bibr">45]</ref>, such as doorways <ref type="bibr">[21]</ref>, to reconstruct the hidden scene. Moreover, Sharma et al. <ref type="bibr">[36]</ref> leveraged raw signals from a LOS wall to perform NLOS tasks, while Medin et al. <ref type="bibr">[28]</ref> leveraged cast shadows of objects on diffuse LOS walls and inferred biometric information of humans in an NLOS region.</p><p>Deep learning for NLOS: For NLOS tasks, deep learning techniques have been used with both ToF and conventional RGB data. Caramazza et al. <ref type="bibr">[6]</ref> introduced a neural network, trained with data captured using a SPAD setup, to perform localization and identification tasks. Chen et al. <ref type="bibr">[9]</ref> proposed a deep-learning-based method that uses scene priors. 
They trained a neural network using a differentiable transient renderer to perform the NLOS imaging tasks. Xu et al. <ref type="bibr">[44]</ref> performed human pose recognition for a transient NLOS dataset characterized by the confocal NLOS model. Chen et al. <ref type="bibr">[8]</ref> utilized a U-Net-like architecture to reconstruct the scene geometry from steady-state NLOS data. Cao et al. <ref type="bibr">[5]</ref> introduced CNN-Based NLOS Localization Under Changing Ambient Illumination (NLOS-LUCAI). He et al. <ref type="bibr">[13]</ref> introduced a deep learning framework for simultaneous real-time imaging and tracking of dynamic targets using an RGB camera.</p><p>The work closest to ours is by Chandran et al. <ref type="bibr">[7]</ref>. They proposed an adaptive lighting framework using physics-based optimization, estimating where on a LOS wall the projector should illuminate to maximize NLOS information. They also proposed a deep-learning-based approach to predict the locations of NLOS objects from intensity images. However, they worked only with approximately planar diffuse walls and small NLOS regions. In contrast, our work handles walls with complex geometry, occlusions, and varying materials.</p><p>Differentiable rendering for NLOS: The use of differentiable rendering has been increasing recently, especially for analysis-by-synthesis (AbS), also known as inverse rendering. Klein et al. <ref type="bibr">[20]</ref> used AbS to track NLOS objects, formulating the problem as a non-linear optimization based on data from light transport simulation and real measurements. Tsai et al. <ref type="bibr">[40]</ref> employed a SPAD setup to simultaneously acquire the shapes and reflectance properties of NLOS objects in an AbS manner. 
We propose an end-to-end approach that utilizes a differentiable path tracer to transmit information from the image domain to the polygon mesh that represents the scene domain.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Method</head><p>This section outlines our proposed method. Section 3.1 describes our problem statement. Then, we describe the proposed processing pipeline consisting of three components. Section 3.2 introduces the first component, the Illumination Estimation Network (IEN), a graph neural network that estimates the optimal lighting position to maximize the quantity of NLOS information. Section 3.3 discusses the second component, a differentiable rendering engine that uses the illumination information given by the first component. Section 3.4 describes the last component, a neural network that estimates the position and posture of the human subject from the RGB image rendered by the second component.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Problem Statement</head><p>Our imaging system consists of a projector P as our illumination source and a camera C as our detector. We use the projector only to illuminate a single spot, as opposed to projecting spatially varying illumination. The imaging system is positioned without a direct view of the NLOS object, as shown in Figure <ref type="figure">2</ref>. We represent the visible surface as a polygonal mesh. The light from the projector P hits the visible surface at a triangle t = (v&#8320;, v&#8321;, v&#8322;), then reaches the NLOS object O before returning to the LOS surface at another triangle t&#8242;, and is finally captured by the camera C. The hidden NLOS object has a location l = (x, y) and a posture associated with it. We restrict our attention to light effects from three-bounce paths of the form P &#8594; t &#8594; O &#8594; t&#8242; &#8594; C, that is, paths connecting the source P and the camera C that interact with the NLOS object only once, as shown in Fig. <ref type="figure">2</ref>. This simplification is motivated by previous observations that photons following higher-order paths are difficult to detect using existing sensors. The image of the LOS surface I is related to the location of an NLOS object l = (x, y) by a function F, i.e.,</p><p>I = F(l, &#945;, &#981;),</p><p>where &#945; refers to the position of the illumination on the LOS surface and &#981; refers to the other parameters that affect the captured image, such as the material of the LOS surface, the NLOS subject's posture, and noise. The forward function F models the light transport matrix of the setup. The goal of our study is to invert this function F and optimize &#945; to more accurately recover the object location l.</p></div>
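<div xmlns="http://www.tei-c.org/ns/1.0"><p>To make the role of &#945; concrete, the following toy sketch scores candidate spotlight positions with an inverse-square proxy for the three-bounce path. It is not the renderer used in this work: it assumes point-like diffuse patches, ignores cosine terms, reflectance, and occlusion, and assumes the hidden object's position is known, whereas the IEN only sees the LOS mesh. The names three_bounce_proxy and best_spot and the example coordinates are hypothetical.</p><p>
```python
def dist2(a, b):
    """Squared Euclidean distance between two 3D points."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def three_bounce_proxy(spot, obj, wall_patches):
    """Unnormalized proxy for light following spot -> obj -> wall patch.

    Assumption: every surface is a point-like diffuse patch, so each hop
    contributes only inverse-square falloff (no cosines, no occlusion).
    """
    to_obj = 1.0 / dist2(spot, obj)
    back = sum(1.0 / dist2(obj, w) for w in wall_patches if w != spot)
    return to_obj * back


def best_spot(candidates, obj, wall_patches):
    """Return the candidate spotlight position with the largest proxy value."""
    return max(candidates, key=lambda s: three_bounce_proxy(s, obj, wall_patches))


# Hypothetical scene: wall patches on the plane z = 0, hidden object at z = 1.
wall = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
hidden = (2.0, 0.0, 1.0)
print(best_spot(wall, hidden, wall))
```
</p><p>In this toy setting the patch nearest to the hidden object scores highest; with non-diffuse materials and occluders this no longer holds in general, which is the motivation for learning &#945; from the scene mesh.</p></div>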
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Illumination Estimation Network</head><p>The active light source used for the NLOS problem plays an important role in improving the signal-to-noise ratio (SNR) of the NLOS information, as demonstrated by Chandran et al. <ref type="bibr">[7]</ref>. The primary question that our study addresses is where to shine the spotlight on a visible surface. To this end, we introduce an illumination estimation network (IEN). The IEN takes a mesh of a scene and outputs the vertices of the triangle that should be illuminated to maximize NLOS information. Here, the size of the mesh and the position of the camera relative to the visible surface can be specified in an arbitrary unit of distance that need not correspond to a physical unit (e.g., centimeters or millimeters).</p><p>The IEN is based on the message-passing neural network (MPNN) of Gilmer et al. <ref type="bibr">[12]</ref> so that it can handle LOS meshes of arbitrary sizes. We represent the input LOS mesh (acquired through 3D scanning in practice) as a triangle mesh M = (V, F), where V and F correspond to sets of vertices and faces, respectively. A 3D mesh is transformed into a graph G = (X, A), where X has dimension (|V|, 3) and defines the spatial xyz-features for each node, and the adjacency matrix A with dimension (|V|, |V|) defines the connected neighborhood of each node. The vertex attributes of the graph are passed to a multilayer perceptron (MLP), i.e., the vertex-wise feed-forward network, to obtain the vertex-level features. Then, the output from this encoder is passed to the graph convolutional network of Verma et al. <ref type="bibr">[42]</ref>. In each graph convolution layer l, the feature update combines a bias vector b with neighboring vertex features weighted by attention coefficients &#945;. These coefficients are computed from learnable parameters u_m^(l) and c_m^(l), specific to each layer l, and are normalized so that they sum to 1.</p><p>Figure: IEN architecture (LOS wall mesh, position-wise feed-forward, MPNN, vertex feature aggregation, face-wise feed-forward, softmax, predicted face to be spotlighted).</p><p>The encoded node-level features are then sequentially passed through a stack of three feature-steered convolutional layers. Each of these layers aggregates messages from two attention heads. The two labels correspond to either light on or light off. Finally, the refined node-level features are passed to the prediction block built with an MLP, which outputs, for each triangle, the probability that it should be spotlighted.</p></div>
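<div xmlns="http://www.tei-c.org/ns/1.0"><p>The mesh-to-graph conversion and one feature-steered update can be sketched in plain Python as follows. This is a minimal illustration, not the IEN itself: it assumes the commonly used translation-invariant form of the feature-steered convolution, y_i = b + (1/|N_i|) &#931;_j &#931;_m q_m(x_i, x_j) W_m x_j with q_m given by a softmax over the heads of u_m &#183; (x_j - x_i) + c_m, and the sign convention and normalization used in this work may differ. Neighbor lists stand in for the adjacency matrix A, and all function names are hypothetical.</p><p>
```python
import math


def mesh_to_graph(vertices, faces):
    """Return node features X (xyz per vertex) and neighbor lists from faces."""
    neighbors = {i: set() for i in range(len(vertices))}
    for a, b, c in faces:
        for p, q in ((a, b), (b, c), (c, a)):
            neighbors[p].add(q)
            neighbors[q].add(p)
    features = [list(v) for v in vertices]
    return features, {i: sorted(n) for i, n in neighbors.items()}


def head_weights(xi, xj, u, c):
    """Soft assignment of edge (i, j) to the heads; the weights sum to 1."""
    logits = [sum(w * (b - a) for w, a, b in zip(um, xi, xj)) + cm
              for um, cm in zip(u, c)]
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def feast_layer(X, neighbors, W, u, c, bias):
    """One update: y_i = bias + mean over neighbors j of sum_m q_m * (W_m x_j)."""
    out = []
    for i, xi in enumerate(X):
        acc = [0.0] * len(bias)
        for j in neighbors[i]:
            q = head_weights(xi, X[j], u, c)
            for qm, Wm in zip(q, W):
                msg = matvec(Wm, X[j])
                acc = [a + qm * m for a, m in zip(acc, msg)]
        n = max(len(neighbors[i]), 1)
        out.append([b + a / n for b, a in zip(bias, acc)])
    return out


# Hypothetical example: a unit square split into two triangles, two heads.
verts = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0), (0.0, 1.0, 0.0)]
tris = [(0, 1, 2), (0, 2, 3)]
X, nbrs = mesh_to_graph(verts, tris)
eye = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
y = feast_layer(X, nbrs, [eye, eye],
                [[0.3, -0.2, 0.1], [-0.5, 0.4, 0.0]], [0.1, -0.1],
                [0.0, 0.0, 0.0])
```
</p><p>With identical identity weights in both heads, the update reduces to averaging the neighbor positions, which is a convenient sanity check; in practice a library implementation such as the feature-steered convolution layer in PyG would replace this loop.</p></div>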
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Physics-Based Differentiable Rendering</head><p>We exploit a differentiable renderer in our proposed pipeline. Our rendering engine is built upon &#8220;redner&#8221; <ref type="bibr">[24]</ref>, a differentiable renderer based on edge sampling. With this engine, we can obtain an RGB image of the LOS surface visible from the camera through physically-based rendering in a differentiable manner. Since our pipeline is trained end-to-end, the differentiable path tracer is essential to backpropagate the image-domain features to the mesh domain. In our case, the goal of the renderer is to compute the gradients of the image of the illuminated LOS surface with respect to the position of the light used for the illumination. This is the core of our contribution: identifying the best position at which a spotlight should shine on the LOS surface.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4.">NLOS Network</head><p>The goal here is to perform NLOS localization and posture classification, that is, to identify the posture of the hidden human and to obtain the 2D location of the person. We assume that the best position of the light depends largely on the location of the hidden human and not on the posture. Thus, to train our pipeline, we use the mean square error (MSE) between the predicted location l&#8242; and the NLOS ground-truth location l. In addition, as shown in Fig. <ref type="figure">3</ref>, we have an NLOS subnetwork that predicts the posture from input RGB images, for which we use the standard cross-entropy loss between the predicted posture label and the ground-truth posture label. We use a ResNet-18 <ref type="bibr">[14]</ref> as our feature extractor, whose output is then fed to two subnetworks. For both the tracking and posture classification tasks, we use an MLP decoder consisting of three fully connected layers with ReLU activations followed by a final fully connected layer. For tracking, the final layer outputs the (x, y) coordinates; for posture classification, the final layer is instead activated with the softmax function.</p></div>
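<div xmlns="http://www.tei-c.org/ns/1.0"><p>The two training objectives named above can be written out as a short sketch. The functions below are plain-Python stand-ins for the usual framework losses, assuming a mean reduction for the MSE and an integer class index for the posture label; how the two terms are weighted or combined is not stated here, so they are kept separate.</p><p>
```python
import math


def mse_loss(pred_xy, true_xy):
    """Mean squared error between predicted and ground-truth (x, y)."""
    return sum((p - t) ** 2 for p, t in zip(pred_xy, true_xy)) / len(true_xy)


def softmax(logits):
    """Numerically stable softmax over a list of class scores."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Negative log-probability assigned to the ground-truth posture class."""
    return -math.log(softmax(logits)[label])


# Hypothetical values: a location estimate and scores for six posture classes.
loc_loss = mse_loss((1.0, 2.0), (0.0, 0.0))
cls_loss = cross_entropy([0.0, 0.0, 0.0, 0.0, 0.0, 0.0], 2)
```
</p><p>For uniform scores over six classes the cross-entropy equals log 6, the value expected from an untrained classifier.</p></div>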
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.5.">Training and Inference</head><p>All of our training is done on synthetic data, and the inference performance is evaluated with both synthetic and real data. Refer to Sec. 4.1 for details of the simulated data used for training. During training, the entire pipeline, including the IEN, the differentiable renderer, and the NLOS network, is trained end-to-end. During inference on real data, we use the trained weights of the IEN to estimate where the spotlight should be placed. Refer to Sec. 4.2 for specific details of the real data capture. The captured LOS mesh is decimated and then passed into the IEN, which gives the estimate of the spotlight position. After that, we capture an RGB image of the visible surface under the given illumination. Lastly, the RGB image is passed to the NLOS network to obtain localization or posture classification results. During inference, our method takes about 7 ms per estimation on average to process the RGB input and output the posture classification and tracking predictions. More details are available in the supplemental material.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Dataset</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Simulated Data</head><p>Our goal for training was to generate a dataset that is close to real-world scenarios both in terms of realism (textures, compositions, objects, occlusions, etc.) and in terms of scale. We generated 30 LOS scenes for this purpose. Most learning-based active or passive NLOS methods make use of very small-scale NLOS setups and NLOS objects. For example, NLOS objects such as a 3D-printed bunny or dragon have been conventionally used in the imaging community. They are not as realistic as the variety of objects that can be found in real-world settings. Thus, to enhance the realism of synthetic scenes, we collect publicly available meshes and arrange them in the scenes using SolidWorks.</p><p>Since our goal is posture classification and localization of human subjects, we use human models to generate our data. We use activities similar to the ones in Sharma et al. <ref type="bibr">[36]</ref>. These include standing, sitting, crouching, hands at 90&#176; with respect to the floor, and hands at 45&#176; (to mimic waving). In addition, we generate random gestures to be classified as unrecognized activity. Generating a great deal of scene data is a lengthy process. To increase the number of synthetic scenes, we have implemented a data augmentation step in Blender. We have created a plugin for Blender to do this, the specifics of which are in the supplemental material.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Real Data</head><p>We collected a real-world dataset consisting of 8 indoor scenes with 5 human subjects of heights between 5.0 and 6.2 ft. This includes LOS surfaces in classrooms, conference rooms, storage rooms, and bedrooms. Some sample scenes are shown in the supplemental document. The subject was at a distance of approximately 2.0&#8211;8.0 ft from the LOS surface. We used an InFocus IN3138HDa projector to create a spotlight illumination and a Sony &#945;6000 mirrorless camera to capture the illuminated LOS scene. We also considered the presence of ambient light while adjusting the exposure parameters.</p><p>LOS mesh: We use the LiDAR capture feature of the Polycam app on the iPhone 13 Pro to get a mesh of the visible surface. The LiDAR sensor on the iPhone has a maximum range of 5.00 m. The captured LOS meshes originally consisted of 3000&#8211;10,000 vertices, depending on the complexity of the wall. These meshes were decimated to 500&#8211;1000 vertices to reduce computational complexity.</p><p>Ground truth acquisition: We used the OptiTrack motion capture system to obtain high-quality ground-truth locations with 0.50 mm precision. The human subjects wore a suit with IR markers for motion capture. To increase the diversity of data, we also captured several indoor scenes without a motion-capture rig. This was performed with a USB camera on the ceiling and an ArUco marker placed on the subject's head. Given the marker size in the image captured by a camera with calibrated intrinsic and extrinsic parameters, we obtained the 2D position of the subject in an NLOS region using off-the-shelf pose estimation software.</p></div>
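<div xmlns="http://www.tei-c.org/ns/1.0"><p>The idea of recovering a position from the apparent marker size can be illustrated with the pinhole camera model. The sketch below is a simplification and not the pose estimation software used for the dataset: it assumes a marker that faces the camera squarely, square pixels, and no lens distortion, and it returns coordinates in the camera frame, so the extrinsic calibration would still be needed to map them to the room. All names and numbers are hypothetical.</p><p>
```python
def marker_depth(focal_px, marker_side_m, marker_side_px):
    """Pinhole model: depth = focal length * real size / apparent size."""
    return focal_px * marker_side_m / marker_side_px


def backproject(u, v, cx, cy, focal_px, depth):
    """Pixel (u, v) at a known depth -> (X, Y) in the camera frame, in meters."""
    return ((u - cx) * depth / focal_px, (v - cy) * depth / focal_px)


# Hypothetical ceiling camera: 800 px focal length, principal point (640, 360),
# a 0.2 m marker that appears 64 px wide and is centered at pixel (960, 540).
depth = marker_depth(800.0, 0.2, 64.0)
position = backproject(960.0, 540.0, 640.0, 360.0, 800.0, depth)
```
</p><p>Here the marker is about 2.5 m from the camera and is offset by roughly 1.0 m and 0.56 m along the two image axes.</p></div>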
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Experiments</head><p>In this section, we cover the training specifics, the metrics used to assess the performance of our approach, and the competing methods we compared it to. We then present the results of our proposed method and provide a more in-depth analysis of it. We made the following assumptions in our experiments. When we shine a light on the spot proposed by the IEN, we manually focus the projector on that spot, although some light may also fall on adjacent triangles. For all of our experiments, we assume that only one human subject is active in the NLOS region at a time.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.">Training Details</head><p>The pipeline is implemented using PyTorch, where the graph convolutional network is constructed using the MessagePassing module provided by the PyG library <ref type="bibr">[10]</ref>. We train the network using the Adam optimizer with a learning rate of 10&#8315;&#178; and a weight decay of 10&#8315;&#8309;. On a computer with two NVIDIA GTX 1080 Ti graphics cards, the training takes approximately two days. Note that the test dataset consisted of LOS surfaces that were not present in the training dataset.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.">Comparisons</head><p>Because our NLOS network operates on RGB images, our method differs fundamentally from methods based on SPADs. Given this difference in the input data, a direct comparison with SPAD-based methods would not provide meaningful insights. Instead, we compare our method against the following state-of-the-art methods, chosen specifically for their similarity in acquisition setup.</p><p>RGB Images: We directly used the RGB images without active illumination (only ambient light) in the scene to train the proposed NLOS localization and posture classification networks. The goal of this baseline is to reveal the importance of the IEN in our method.</p><p>Adaptive Lighting: This setup is the one presented by Chandran et al. <ref type="bibr">[7]</ref>, who propose an adaptive lighting method to determine on which one or more spots of the scene the light should be focused. This approach uses an optimization technique rather than our learning-based approach to determine where to shine the light. We use the code shared by the authors for our implementation. For all the NLOS scenes in our training dataset, we used only the LOS geometry and obtained the best illumination patch according to their method. We then render the scenes in the dataset using the given illumination and train their CNN architecture on them.</p><p>Flash Photography: This setup is the one presented by Tancik et al. <ref type="bibr">[38]</ref>, who use a regression network and a generative network for NLOS-based scene reconstruction. This setup is similar to ours in that it involves a flashlight and a normal RGB camera. We re-implement the network described in <ref type="bibr">[38]</ref> and train it on our dataset. Their regression network performs both localization and classification, and we adapt the architecture of the classification network so that the last layer accounts for our 6 posture classes.</p><p>DL-NLOS: This setup is the one presented by He et al. <ref type="bibr">[13]</ref>, who introduce deep learning for NLOS localization solely on RGB images captured under ambient illumination. Their localization network consists of five convolutional layers followed by three fully connected layers. We re-implement their proposed network architecture based on the details in the paper. We also replace the last layers with a softmax output to perform posture classification. We have adjusted the baselines to the best of our ability to adapt them to our problem statement.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3.">Posture Classification Performance</head><p>The full results for both synthetic and real data are shown in Table <ref type="table">1</ref>. Our posture classification network identified human posture with 96.1% accuracy for 10 unknown synthetic LOS scenes. In comparison, RGB-only training achieved an average accuracy of 53.2% and the flash photography method 67.2%; these were bettered by He et al. <ref type="bibr">[13]</ref> with 73.9% and Chandran et al. <ref type="bibr">[7]</ref> with about 74.8%. For real scenes, our method has a classification accuracy of 87.4%, and the best-performing competing methods were <ref type="bibr">[7,</ref><ref type="bibr">13]</ref> with 70.4% accuracy. Refer to the supplemental material for further analysis.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.4.">Localization Performance</head><p>We evaluate the localization accuracy using the average distance (i.e., localization error) between the ground-truth positions along the moving trajectory and those predicted by our method. For both synthetic and real data, Tab. 2 shows comparisons of our method with competing methods, where the average distances are given in centimeters. For synthetic scenes, the average localization error of our method is 6.33 cm for subjects performing known activities and 9.86 cm for subjects performing unknown activities that the network did not see during training. For real scenes, the errors for known and unknown activities are 31.45 cm and 45.14 cm, respectively. In comparison, the errors of the baseline methods are 124.76 cm for RGB-only training, 87.09 cm for the adaptive lighting method <ref type="bibr">[7]</ref>, 100.83 cm for flash photography <ref type="bibr">[38]</ref>, and 85.90 cm for DL-NLOS <ref type="bibr">[13]</ref>. According to these results, the performance improvement over the network trained only on RGB images validates the importance of the IEN. Moreover, our method outperforms all competing methods, surpassing the best of them by more than 50.00 cm for both known and unknown activities. Fig. <ref type="figure">5</ref> shows tracking trajectories obtained by our method for real data. We have included real video test results in our supplemental video. It should be noted that the trajectories of our method in the figure have been smoothed by the Savitzky&#8211;Golay filter <ref type="bibr">[32]</ref>, which refines the noisy raw output of the network and improves the trajectory estimate. This smoothing operation is a practical step that can be seamlessly integrated into our system, which justifies including it in the evaluation.</p><p>Figure: (a) Bird's eye view, (b) LOS wall, (c) trajectory plots.</p></div>
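<div xmlns="http://www.tei-c.org/ns/1.0"><p>The error metric and the smoothing step can be sketched as follows. The window length and polynomial order of the Savitzky&#8211;Golay filter are not stated here, so the sketch assumes the classical 5-point quadratic filter and leaves the two samples at each end of the trajectory unchanged; the function names are hypothetical.</p><p>
```python
import math


def mean_localization_error(pred, truth):
    """Average Euclidean distance between predicted and true 2D positions."""
    return sum(math.dist(p, t) for p, t in zip(pred, truth)) / len(truth)


# Classical 5-point quadratic Savitzky-Golay smoothing coefficients.
SG5 = (-3 / 35, 12 / 35, 17 / 35, 12 / 35, -3 / 35)


def savgol5(values):
    """Smooth a 1D sequence; the two samples at each end are left unchanged."""
    out = list(values)
    for i in range(2, len(values) - 2):
        out[i] = sum(c * values[i + k - 2] for k, c in enumerate(SG5))
    return out


def smooth_trajectory(points):
    """Apply the filter independently to the x and y coordinates."""
    xs = savgol5([p[0] for p in points])
    ys = savgol5([p[1] for p in points])
    return list(zip(xs, ys))


# Hypothetical check: a quadratic signal passes through the filter unchanged.
quadratic = [float(i * i) for i in range(7)]
smoothed = savgol5(quadratic)
error = mean_localization_error([(0.0, 0.0), (3.0, 4.0)], [(0.0, 0.0), (0.0, 0.0)])
```
</p><p>Because the filter fits a local quadratic, it removes frame-to-frame jitter while preserving smooth motion; a library routine with a tuned window would normally be used instead of this hand-written version.</p></div>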
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.5.">Importance of Spotlight Position Optimization</head><p>We assess the contribution of the IEN to the NLOS task by conducting an additional experiment. We compare our Table <ref type="table">3</ref>. Results of ablative studies to validate the effectiveness of the illumination predictions. We group the results based on the average trajectory error and average posture classification across all the test data. The localization metrics are presented in units of centimeters, and the correct recognition ratios are in units of %. The results shown here are for simulated data.</p><p>Task IEN +CNN AL +ResNet Random +ResNet Center +ResNet Ours Localization (&#8595;) (Average Error [cm]) 10.71 14.23 19.16 22.63 8.10 Posture Classification (&#8593;) (Accuracy [%]) 91.28 79.84 67.36 64.09 96.13 (a) Sample Scene (b) Our Method (c) Adaptive Lighting (d) Center of LOS (e) Random Illumination method with the following four alternatives.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>IEN+CNN:</head><p>We construct a model that comprises the IEN followed by the CNN for localization and classification proposed by Chandran et al. <ref type="bibr">[7]</ref>.</p><p>AL+ResNet: We use the spotlight position selected by adaptive lighting <ref type="bibr">[7]</ref> as input to the NLOS network, consisting of ResNet+MLP, used in our proposed method. As in the adaptive lighting technique, the walls of the line-of-sight (LOS) region are assumed to be diffuse.</p><p>Random+ResNet: We also compare with an alternative in which the spotlight location is selected at random on the LOS surface. The NLOS tasks use the same ResNet+MLP network as ours.</p><p>Center+ResNet: Similarly, we compare with an alternative in which the spotlight always illuminates the center of the field of view. Again, the NLOS tasks use the same ResNet+MLP network as ours.</p><p>A visual comparison of our method and these alternatives is shown in Fig. <ref type="figure">6</ref>. Table <ref type="table">3</ref> compares the proposed method with these alternatives. The table shows that our approach outperforms the other options, indicating that our method successfully identifies the most suitable area to illuminate and thereby maximizes the NLOS signal reaching the detector. The proposed method outperforms Random+ResNet and Center+ResNet, which are based on simple heuristics.</p><p>Our method also outperforms AL+ResNet, the adaptive lighting method <ref type="bibr">[7]</ref> extended by our NLOS network. The lower performance of AL+ResNet suggests that the diffuse assumption of <ref type="bibr">[7]</ref> does not hold well when the scene includes specular surfaces (e.g., mirrors, glass), metallic surfaces, translucent materials (e.g., wax, plastics), and strongly textured surfaces. The adaptive lighting method indeed tends to illuminate a position on a diffuse surface. For example, the bottom row of Fig. <ref type="figure">6</ref> shows that adaptive lighting <ref type="bibr">[7]</ref> overlooks the refrigerator on the right, which may reflect more light. In contrast, our method shines the light on the refrigerator, which reflects the most light from the NLOS object. Comparing IEN+CNN with our method shows that the ResNet backbone improves the NLOS performance. We did not evaluate the effects of different network backbones, feature extractors, etc. on the NLOS task, as our aim is to demonstrate the significance of spotlight optimization. Note also that the size of the spotlight is directly related to the area of the decimated patch to be illuminated.</p><p>Figure 7. Effect of decimation on posture classification accuracy (panel "Posture Classification": average classification accuracy [%] versus decimation factor) and on average trajectory error (panel "Localization": mean positional error [cm] versus decimation factor). The decimation factor varies from 0.1 to 1, where 1 corresponds to the highest captured resolution and 0.1 corresponds to reducing the total number of vertices in the LOS mesh by a factor of 10.</p></div>
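<div xmlns="http://www.tei-c.org/ns/1.0"><p>To make the gaps in Table 3 easier to read, the following minimal sketch computes the relative reduction in localization error and the gain in posture classification accuracy of the proposed method over each alternative. The numbers are transcribed from Table 3 (simulated data); the function and variable names are illustrative assumptions and are not part of the original work.</p><p>
```python
# Values transcribed from Table 3 (simulated data).
# All names below are hypothetical and chosen for illustration only.
LOCALIZATION_ERROR_CM = {
    "IEN+CNN": 10.71,
    "AL+ResNet": 14.23,
    "Random+ResNet": 19.16,
    "Center+ResNet": 22.63,
    "Ours": 8.10,
}
POSTURE_ACCURACY_PCT = {
    "IEN+CNN": 91.28,
    "AL+ResNet": 79.84,
    "Random+ResNet": 67.36,
    "Center+ResNet": 64.09,
    "Ours": 96.13,
}


def error_reduction_pct(baseline_cm, ours_cm):
    """Relative reduction in localization error, in percent of the baseline."""
    return 100.0 * (baseline_cm - ours_cm) / baseline_cm


def accuracy_gain_points(baseline_pct, ours_pct):
    """Absolute gain in classification accuracy, in percentage points."""
    return ours_pct - baseline_pct


def summarize():
    """Return (name, error reduction [%], accuracy gain [points]) per alternative."""
    ours_err = LOCALIZATION_ERROR_CM["Ours"]
    ours_acc = POSTURE_ACCURACY_PCT["Ours"]
    rows = []
    for name, err in LOCALIZATION_ERROR_CM.items():
        if name == "Ours":
            continue
        reduction = round(error_reduction_pct(err, ours_err), 1)
        gain = round(accuracy_gain_points(POSTURE_ACCURACY_PCT[name], ours_acc), 2)
        rows.append((name, reduction, gain))
    return rows


if __name__ == "__main__":
    for name, reduction, gain in summarize():
        print(f"{name}: error reduced by {reduction}%, accuracy up by {gain} points")
```
</p><p>Under this reading of the table, the error reduction relative to AL+ResNet is roughly 43%, and the reduction relative to IEN+CNN is roughly 24%, which separates the contribution of the spotlight selection from that of the NLOS network backbone.</p></div>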
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.6.">Effect of Mesh Decimation</head><p>To understand how the mesh resolution of the LOS area affects the NLOS performance, we decimate the scene mesh at different ratios. The LOS meshes we captured have widely varying numbers of vertices, as described in Sec. 4.2. To ensure a fair evaluation across the test set, we select LOS meshes with approximately the same number of vertices (i.e., 9000 to 11000 vertices) and reduce them to as little as about one-tenth of their original size (approximately 1000 vertices). The meshes with different resolutions are then input to the pipeline. The experimental results in Fig. <ref type="figure">7</ref> show that the performance of the NLOS task does not improve significantly merely by using a higher-resolution mesh. We attribute this to the increasing difficulty of extracting adequate features from a higher-resolution mesh. This observation suggests that the original high-resolution meshes contain far more geometric detail than is required to interpret the scene geometry. Therefore, we may decrease the mesh resolution to approximately 50% of the original, at which the geometric details are still visually retained. This also indicates that our technique is robust to the accuracy of the LOS scan.</p></div>
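<div xmlns="http://www.tei-c.org/ns/1.0"><p>As an illustration of what reducing a mesh by a decimation factor involves, the sketch below implements a simple vertex-clustering decimation in plain Python. The text does not state which decimation algorithm was used, so vertex clustering is a stand-in chosen only because it fits in a few lines; the function names, the grid cell size, and the flat test mesh are all assumptions made for this example.</p><p>
```python
from typing import Dict, List, Tuple

Vertex = Tuple[float, float, float]
Face = Tuple[int, int, int]


def cluster_decimate(vertices: List[Vertex], faces: List[Face], cell: float):
    """Vertex-clustering decimation (an assumed stand-in, not the paper's method).

    Vertices falling into the same cubic grid cell of side `cell` are merged
    into their centroid; faces that collapse or repeat are dropped.
    """
    if not cell > 0:
        raise ValueError("cell size must be positive")
    cell_to_new: Dict[Tuple[int, int, int], int] = {}
    sums: List[List[float]] = []  # per-cluster coordinate sums
    counts: List[int] = []        # per-cluster vertex counts
    remap: List[int] = []         # old vertex index -> new vertex index
    for x, y, z in vertices:
        key = (int(x // cell), int(y // cell), int(z // cell))
        if key not in cell_to_new:
            cell_to_new[key] = len(sums)
            sums.append([0.0, 0.0, 0.0])
            counts.append(0)
        idx = cell_to_new[key]
        sums[idx][0] += x
        sums[idx][1] += y
        sums[idx][2] += z
        counts[idx] += 1
        remap.append(idx)
    new_vertices = [(s[0] / c, s[1] / c, s[2] / c) for s, c in zip(sums, counts)]
    new_faces: List[Face] = []
    seen = set()
    for a, b, c in faces:
        tri = (remap[a], remap[b], remap[c])
        if len(set(tri)) != 3:
            continue  # triangle collapsed to an edge or a point
        key = tuple(sorted(tri))
        if key in seen:
            continue  # same triangle already emitted
        seen.add(key)
        new_faces.append(tri)
    return new_vertices, new_faces


def make_grid(n: int):
    """Flat n-by-n vertex grid with unit spacing, two triangles per quad."""
    vertices = [(float(x), float(y), 0.0) for y in range(n) for x in range(n)]
    faces = []
    for y in range(n - 1):
        for x in range(n - 1):
            i = y * n + x
            faces.append((i, i + 1, i + n))
            faces.append((i + 1, i + n + 1, i + n))
    return vertices, faces


if __name__ == "__main__":
    verts, tris = make_grid(10)
    new_verts, new_tris = cluster_decimate(verts, tris, cell=2.0)
    factor = len(new_verts) / len(verts)
    print(f"{len(verts)} vertices reduced to {len(new_verts)} (factor {factor:.2f})")
```
</p><p>Vertex clustering controls the output size only indirectly, through the cell size. To reach a prescribed decimation factor such as 0.1 or 0.5, a target-count method (for example quadric edge collapse, as offered by common mesh-processing libraries) would be the more natural choice.</p></div>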
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Conclusion</head><p>In this work, we demonstrate the importance of choosing the optimal position to be illuminated in an active LOS imaging system using a projector and a standard RGB camera. We verified our method with synthetic data and with real data from real-world scenes with a human in the NLOS region. We showed the proposed method's state-of-the-art tracking and posture classification performance in challenging scenarios where the LOS region may be partly occluded and may contain components with non-diffuse materials. The proposed method was successful in posture classification for unknown real-world scenes, achieving an accuracy of approximately 87%. It also localized unknown human subjects moving around the NLOS region with a root mean square error of approximately 45 cm. The localization error of our method is approximately one-half of that obtained by the best of the state-of-the-art methods we compared against. These results highlight the importance of optimizing the position of the spotlight, the primary focus of this study.</p><p>For future work, we plan to explore the use of spatially varying illumination, which could be more effective than a single spotlight. The size of the NLOS region that our method can handle is currently limited by the low-SNR signals from the NLOS objects. Hence, our method was tested only on a single human subject in the NLOS region. To overcome the limitations on subject type and number of subjects, we would like to investigate incorporating computational imaging hardware into the end-to-end optimization loop <ref type="bibr">[27,</ref><ref type="bibr">37]</ref>.</p></div></body>
		</text>
</TEI>
