<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Relighting Neural Radiance Fields with Shadow and Highlight Hints</title></titleStmt>
			<publicationStmt>
				<publisher>ACM</publisher>
				<date>07/23/2023</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10519209</idno>
					<idno type="doi">10.1145/3588432.3591482</idno>
					
					<author>Chong Zeng</author><author>Guojun Chen</author><author>Yue Dong</author><author>Pieter Peers</author><author>Hongzhi Wu</author><author>Xin Tong</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[This paper presents a novel neural implicit radiance representation for free viewpoint relighting from a small set of unstructured photographs of an object lit by a moving point light source different from the view position. We express the shape as a signed distance function modeled by a multi-layer perceptron. In contrast to prior relightable implicit neural representations, we do not disentangle the different light transport components, but model both the local and global light transport at each point by a second multi-layer perceptron that, in addition to density features, the current position, the normal (from the signed distance function), view direction, and light position, also takes shadow and highlight hints to aid the network in modeling the corresponding high frequency light transport effects. These hints are provided as a suggestion, and we leave it up to the network to decide how to incorporate these in the final relit result. We demonstrate and validate our neural implicit representation on synthetic and real scenes exhibiting a]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">INTRODUCTION</head><p>The appearance of real-world objects is the result of complex light transport interactions between the lighting and the object's intricate geometry and associated material properties. Digitally reproducing the appearance of real-world objects and scenes has been a longstanding goal in computer graphics and computer vision. Inverse rendering methods attempt to undo the complex light transport to determine a sparse set of model parameters that, together with the chosen models, replicates the appearance when rendered. However, teasing apart the different entangled components is ill-posed and often leads to ambiguities. Furthermore, inaccuracies in one model can adversely affect the accuracy at which other components can be disentangled, thus requiring strong regularization and assumptions.</p><p>In this paper we present a novel, NeRF-inspired <ref type="bibr">[Mildenhall et al. 2020]</ref> neural implicit radiance representation for free viewpoint relighting of general objects and scenes. Instead of using analytical reflectance models and inverse rendering of the neural implicit representations, we follow a data-driven approach and refrain from decomposing the appearance into different light transport components. Therefore, unlike the majority of prior work in relighting neural implicit representations <ref type="bibr">[Boss et al. 2021a, 2022;</ref><ref type="bibr">Kuang et al. 2022;</ref><ref type="bibr">Srinivasan et al. 2021;</ref><ref type="bibr">Zheng et al. 2021]</ref>, we relax and enrich the lighting information embedded in handheld captured photographs of the object by illuminating each view from a random point light position. This provides us with a broader unstructured sampling of the space of appearance changes of an object, while retaining the convenience of handheld acquisition. 
Furthermore, to improve the reproduction quality of difficult-to-learn components, we provide shadow and highlight hints to the neural radiance representation. Critically, we do not impose how these hints are combined with the estimated radiance (e.g., shadow mapping by multiplying with the light visibility), but instead leave it up to the neural representation to decide how to incorporate these hints in the final result.</p><p>Our hint-driven implicit neural representation is easy to implement, requires an order of magnitude fewer photographs than prior relighting methods with similar capabilities, and uses the same number of photographs as state-of-the-art methods that offer less flexibility in the shape and/or materials that can be modeled. Compared to fixed lighting implicit representations such as NeRF <ref type="bibr">[Mildenhall et al. 2020]</ref>, we only require five times more photographs and twice the render cost while gaining relightability. We demonstrate the effectiveness and validate the robustness of our representation on a variety of challenging synthetic and real objects (e.g., Figure <ref type="figure">1</ref>) containing a wide range of materials (e.g., subsurface scattering, rough specular materials, etc.), variations in shape complexity (e.g., thin features, ill-defined furry shapes, etc.), and global light transport effects (e.g., interreflections, complex shadowing, etc.).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">RELATED WORK</head><p>We focus the discussion of related work on seminal and recent work in image-based relighting, inverse rendering, and relighting neural implicit representations. For an in-depth overview we refer to recent surveys in neural rendering <ref type="bibr">[Tewari et al. 2022]</ref>, (re)lighting <ref type="bibr">[Einabadi et al. 2021]</ref>, and appearance modeling <ref type="bibr">[Dong 2019]</ref>.</p><p>Image-based Relighting. The staggering advances in machine learning in the last decade have also had a profound effect on image-based relighting <ref type="bibr">[Debevec et al. 2000]</ref>, enabling new capabilities and improving quality <ref type="bibr">[Bemana et al. 2020;</ref><ref type="bibr">Ren et al. 2015;</ref><ref type="bibr">Xu et al. 2018]</ref>. Deep learning has subsequently been applied to more specialized relighting tasks for portraits <ref type="bibr">[Bi et al. 2021;</ref><ref type="bibr">Meka et al. 2019;</ref><ref type="bibr">Pandey et al. 2021;</ref><ref type="bibr">Sun et al. 2019, 2020]</ref>, full bodies <ref type="bibr">[Guo et al. 2019;</ref><ref type="bibr">Kanamori and Endo 2018;</ref><ref type="bibr">Meka et al. 2020;</ref><ref type="bibr">Yeh et al. 2022;</ref><ref type="bibr">Zhang et al. 2021a]</ref>, and outdoor scenes <ref type="bibr">[Griffiths et al. 2022;</ref><ref type="bibr">Meshry et al. 2019;</ref><ref type="bibr">Philip et al. 2019]</ref>. It is unclear how to extend these methods to handle scenes that contain objects with ill-defined shapes (e.g., fur) and translucent and specular materials.</p><p>Our method can also be seen as a free-viewpoint relighting method that leverages highlight and shadow hints to help model these challenging effects. <ref type="bibr">Philip et al. [2019]</ref> follow a deep shading approach <ref type="bibr">[Nalbach et al. 
2017]</ref> for relighting, mostly diffuse, outdoor scenes under a simplified sun+cloud lighting model. Relit images are created in a two-stage process, where an input and output shadow map computed from a proxy geometry is refined, and subsequently used, together with additional render buffers, as input to a relighting network. <ref type="bibr">Zhang et al. [2021a]</ref> introduce a semi-parametric model with residual learning that leverages a diffuse parametric model (i.e., radiance hint) on a rough geometry, and a learned representation that models non-diffuse and global light transport embedded in texture space. To accurately model the non-diffuse effects, Zhang et al. require a large number (&#8764; 8,000) of structured photographs captured with a light stage. Deferred Neural Relighting <ref type="bibr">[Gao et al. 2020]</ref> is closest to our method in terms of capabilities; it can perform free-viewpoint relighting on objects with ill-defined shape with full global illumination effects and complex light-matter interactions (including subsurface scattering and fur). Similar to <ref type="bibr">Zhang et al. [2021a]</ref>, Gao et al. embed learned features in the texture space of a rough geometry that are projected to the target view and multiplied with radiance cues. These radiance cues are visualizations of the rough geometry with different BRDFs (i.e., diffuse and glossy BRDFs with 4 different roughnesses) under the target lighting with global illumination. The resulting images are then used as guidance hints for a neural renderer, trained per scene from a large number (&#8764; 10,000) of unstructured photographs of the target scene under random point light-viewpoint combinations, to reproduce the reference appearance. <ref type="bibr">Philip et al. [2021]</ref> also use radiance hints (limited to diffuse and mirror radiance) to guide a neural renderer. However, unlike <ref type="bibr">Zhang et al. 
and Gao et al.</ref>, they pretrain a neural renderer that does not require per-scene finetuning, and that takes radiance cues for both the input and output conditions. Philip et al. require about the same number of input images as our method, albeit lit by a single fixed natural lighting condition and limited to scenes with hard surfaces and BRDF-like materials. All four methods rely on multi-view stereo, which can fail for complex scenes. In contrast, our method employs a robust neural implicit representation. Furthermore, all four methods rely on an image-space neural renderer to produce the final relit image. In contrast, our method provides the hints during volume rendering of the neural implicit representation, and thus it is independent of view-dependent image contexts. Our method can relight scenes with the same complexity as <ref type="bibr">Gao et al. [2020]</ref> while only using a similar number of input photographs as <ref type="bibr">Philip et al. [2021]</ref> without sacrificing robustness.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Model-based Inverse</head><p>Rendering. An alternative to data-driven relighting is inverse rendering (a.k.a. analysis-by-synthesis), where a set of trial model parameters is optimized based on the difference between images rendered with those parameters and reference photographs. Inverse rendering at its core is a complex non-linear optimization problem. Recent advances in differentiable rendering <ref type="bibr">[Li et al. 2018;</ref><ref type="bibr">Loper and Black 2014;</ref><ref type="bibr">Nimier-David et al. 2019;</ref><ref type="bibr">Xing et al. 2022]</ref> have enabled more robust inverse rendering for more complex scenes and capture conditions. BID-R++ <ref type="bibr">[Chen et al. 2021]</ref> combines differentiable ray tracing and rasterization to model spatially varying reflectance parameters and spherical Gaussian lighting for a known triangle mesh. <ref type="bibr">Munkberg et al. [2022]</ref> alternate between optimizing an implicit shape representation (i.e., a signed distance field), and reflectance and lighting defined on a triangle mesh. <ref type="bibr">Hasselgren et al. [2022]</ref> extend the work of <ref type="bibr">Munkberg et al. [2022]</ref> with a differentiable Monte Carlo renderer to handle area light sources, and embed a denoiser to mitigate the adverse effects of Monte Carlo noise on the gradient computation that drives the non-linear optimizer. Similarly, <ref type="bibr">Fujun et al. [2021]</ref> also employ a differentiable Monte Carlo renderer for estimating shape and spatially-varying reflectance from a small set of colocated view/light photographs. All of these methods focus on direct lighting only and can produce suboptimal results for objects or scenes with strong interreflections. A notable exception is the method of <ref type="bibr">Cai et al. 
[2022]</ref> that combines explicit and implicit geometries and demonstrates inverse rendering under known lighting on a wide range of opaque objects while taking indirect lighting into account. All of the above methods eventually express the shape as a triangle mesh, limiting their applicability to objects with well-defined surfaces. Furthermore, the accuracy of these methods is inherently limited by the representational power of the underlying BRDF and lighting models.</p><p>Neural Implicit Representations. A major challenge in inverse rendering with triangle meshes is to efficiently deal with changes in topology during optimization. An alternative to triangle mesh representations is to use a volumetric representation where each voxel contains an opacity/density estimate and a description of the reflectance properties. While agnostic to topology changes, voxel grids are memory intensive and, even with grid warping <ref type="bibr">[Bi et al. 2020]</ref>, fine-scale geometrical details are difficult to model.</p><p>To avoid the inherent memory overhead of voxel grids, NeRF <ref type="bibr">[Mildenhall et al. 2020]</ref> models the continuous volumetric density and spatially varying color with two multi-layer perceptrons (MLPs) parameterized by position (and also view direction for color). The MLPs in NeRF are trained per scene such that the accumulated density and color ray marched along a view ray matches the observed radiance in reference photographs. NeRF has been shown to be exceptionally effective in modeling the outgoing radiance field of a wide range of object types, including those with ill-defined shapes and complex materials. One of the main limitations of NeRF is that the illumination present at capture-time is baked into the model. Several methods have been introduced to support post-capture relighting under a restricted lighting model <ref type="bibr">[Li et al. 2022;</ref><ref type="bibr">Martin-Brualla et al. 
2021]</ref>, or by altering the color MLP to produce the parameters that drive an analytical model of the appearance of objects <ref type="bibr">[Boss et al. 2021a, 2021b, 2022;</ref><ref type="bibr">Kuang et al. 2022;</ref><ref type="bibr">Srinivasan et al. 2021;</ref><ref type="bibr">Yao et al. 2022;</ref><ref type="bibr">Zhang et al. 2021c]</ref>, participating media <ref type="bibr">[Zheng et al. 2021]</ref>, or even whole outdoor scenes <ref type="bibr">[Rudnev et al. 2022]</ref>.</p><p>Due to the high computational cost of ray marching secondary rays, na&#239;vely computing shadows and indirect lighting is impractical. <ref type="bibr">Zhang et al. [2021c]</ref>, <ref type="bibr">Li et al. [2022]</ref>, and <ref type="bibr">Yang et al. [2022]</ref> avoid tracing shadow rays by learning an additional MLP to model the ratio of light occlusion. However, all three methods ignore indirect lighting. <ref type="bibr">Zheng et al. [2021]</ref> model the indirect lighting inside a participating medium using an MLP that returns the coefficients of a 5-band expansion. NeILF <ref type="bibr">[Yao et al. 2022]</ref> embeds the indirect lighting and shadows in a (learned) 5D incident light field for a scene with known geometry. NeRV <ref type="bibr">[Srinivasan et al. 2021]</ref> modifies the color MLP to output BRDF parameters and a visibility field that models the distance to the nearest 'hard surface' and lighting visibility. The visibility field allows them to bypass the expensive ray marching step for shadow computation and one-bounce indirect illumination. A disadvantage of these solutions is that they do not guarantee that the estimated density field and the occlusions are coupled. In contrast, our method directly ties occlusions to the estimated implicit geometry, reproducing more faithful shadows. 
Furthermore, these methods rely on BRDFs to model the surface reflectance, precluding scenes with complex light-matter interactions.</p><p>NeLF <ref type="bibr">[Sun et al. 2021]</ref> aims to relight human faces, and thus accurately reproducing subsurface scattering is critical. Therefore, Sun et al. characterize the radiance and global light transport by an MLP. We also leverage an MLP to model local and global light transport. A key difference is that our method parameterizes this MLP in terms of view and light directions, whereas NeLF directly outputs a full light transport vector and computes a relit color via an inner product with the lighting. While better suited for relighting with natural lighting, NeLF is designed for relighting human faces, which only exhibit limited variations in shape and reflectance.</p><p>Similar in spirit to our method, <ref type="bibr">Lyu et al. [2022]</ref> model light transport using an MLP, named a Neural Radiance Transfer Field (NRTF). However, unlike us, Lyu et al. train the MLP on synthetic training data generated from a rough BRDF approximation, obtained through physically based inverse rendering on a triangle mesh extracted from a neural signed distance field <ref type="bibr">[Wang et al. 2021]</ref> computed from unstructured observations of the scene under static natural lighting. To correct the errors due to the rough BRDF approximation, a final refinement step of the MLP is performed using the captured photographs. Similar to Lyu et al., we also use an MLP to model light transport, including indirect lighting. However, unlike Lyu et al., we do not rely solely on an MLP to model high frequency light transport effects such as light occlusions and specular highlights. Instead we provide shadow and highlight hints to the radiance network and let the training process discover how to best leverage these hints. 
Furthermore, we rely on a neural representation for shape jointly optimized with the radiance, allowing us to capture scenes with ill-defined geometry. In contrast, Lyu et al. optimize shape (converted to a triangle mesh) and radiance separately, making their method sensitive to shape errors and restricted to objects with a well-defined shape.</p><p>An alternative to using an implicit neural density field is to model the shape via a signed distance field (SDF). Similar to the majority of NeRF-based methods, PhySG <ref type="bibr">[Zhang et al. 2021b]</ref> and IRON <ref type="bibr">[Zhang et al. 2022a]</ref> also rely on an MLP to represent volumetric BRDF parameters. However, due to the high computational cost, these methods do not take shadowing or indirect lighting into account. <ref type="bibr">Zhang et al. [2022b]</ref> model indirect lighting separately, and train an additional incident light field MLP using the incident lighting computed at each point via ray casting the SDF geometry. While our method also builds on a neural implicit representation <ref type="bibr">[Wang et al. 2021]</ref>, our method does not rely on an underlying parametric BRDF model, but instead models the full light transport via an MLP; relighting is then achieved by scaling the radiance with the light source color (i.e., exploiting the linearity of light transport).</p><p>Given the output from the density network d as well as the output from the radiance network s, the color c along a view ray starting at the camera position o in a direction v is given by: c(o, v) = &#8747; w(t) s(p, n, v, l, f&#772;, &#920;) dt, (1) where the sample position along the view ray is p = o + tv at depth t, n is the normal computed as the normalized SDF gradient: n = &#8711;d(p) / ||&#8711;d(p)||, (2) v is the view direction, l is the point light position, f&#772; the corresponding feature vector from the density MLP, and &#920; is a set of additional hints provided to the radiance network (described in subsection 3.2). Analogous to NeuS, the view direction, light position, and hints are all frequency encoded with 4 bands. 
Finally, w(t) is the unbiased density weight <ref type="bibr">[Wang et al. 2021]</ref> computed by: w(t) = T(t) &#961;(t), (3) with T(t) the transmittance over the opacity &#961;(t), which is derived from &#934;, the CDF of the PDF used to compute the density from the SDF d. To speed up the computation of the color, the integral in Equation 1 is computed by importance sampling the density field along the view ray.</p><p>In the spirit of image-based relighting, we opt to have the relightable radiance MLP include global light transport effects such as interreflections and occlusions. While MLPs are in theory universal approximators, some light transport components are easier to learn (e.g., diffuse reflections) than others. Especially high frequency light transport components such as shadows and specular highlights pose a problem. At the same time, shadows and specular highlights are highly correlated with the geometry of the scene, and thus with the density field. To leverage this embedded knowledge, we provide the relightable radiance MLP with additional shadow and highlight hints.</p></div>
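The per-ray compositing of Equation 1 and the SDF-gradient normal of Equation 2 can be sketched in a few lines of numpy. This is an illustrative sketch, not the paper's implementation: `render_ray_color` and `sdf_normal` are our own names, the weights are assumed to come from a NeuS-style sampler, and we use finite differences for the gradient where an implementation would use automatic differentiation.

```python
import numpy as np

def render_ray_color(weights, radiances):
    """Discretized Equation 1: composite per-sample radiance along a view ray.

    weights:   (N,) unbiased density weights w(t_i) along the ray (NeuS-style).
    radiances: (N, 3) relightable radiance MLP outputs s(p, n, v, l, f, hints)
               evaluated at each sample.
    Returns the (3,) ray color c = sum_i w_i * s_i.
    """
    return (weights[:, None] * radiances).sum(axis=0)

def sdf_normal(sdf_fn, p, eps=1e-4):
    """Equation 2: normal as the normalized gradient of the SDF at point p,
    approximated here by central finite differences."""
    grad = np.array([
        (sdf_fn(p + eps * e) - sdf_fn(p - eps * e)) / (2.0 * eps)
        for e in np.eye(3)
    ])
    return grad / np.linalg.norm(grad)
```

For a unit-sphere SDF, the normal at a point on the positive x-axis points along x, and a ray whose weights sum to one reproduces a constant radiance exactly.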
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Light Transport Hints</head><p>Shadow Hints. While the relightable radiance network is able to roughly model the effects of light source occlusion, the resulting shadows typically lack sharpness and detail. Yet, light source occlusion can be relatively easily evaluated by collecting the density along a shadow ray towards the light source. While this process is relatively cheap for a single shadow ray, performing a secondary ray march for each primary ray's sampled position increases the computation cost by an order of magnitude, quickly becoming too expensive for practical training. However, we observe that for most primary rays, the ray samples are closely packed together around the zero level-set of the SDF due to the importance sampling of the density along the view ray. Hence, we propose to approximate light source visibility by shooting a single shadow ray at the zero level-set, and use the same light source visibility for each sample along the view ray. To determine the depth of the zero level-set, we compute the density-weighted depth along the view ray: t&#772; = &#8747; w(t) t dt.</p><p>While for an opaque surface a single shadow ray is sufficient, for non-opaque or ill-defined surfaces a single shadow ray offers a poor estimate of the light occlusion. Furthermore, using the shadow information as a hard mask ignores the effects of indirect lighting. We therefore provide the shadow information as an additional input to the radiance network, allowing the network to learn whether to include or ignore the shadowing information, as well as to blend in any indirect lighting in the shadow regions.</p><p>Highlight Hints. Similar to shadows, specular highlights are sparsely distributed high frequency light transport effects. Inspired by <ref type="bibr">Gao et al. [2020]</ref>, we provide specular highlight hints to the radiance network by evaluating 4 microfacet BRDFs with a GGX distribution <ref type="bibr">[Walter et al. 
2007]</ref> with roughness parameters {0.02, 0.05, 0.13, 0.34}. Unlike Gao et al., we compute the highlight hints using local shading, which only depends on the surface normal computed from the SDF (Equation <ref type="formula">2</ref>), and pass them to the radiance MLP as an additional input. Similar to shadow hints, we compute one set of highlight hints per view ray and reuse it for all samples along the view ray.</p></div>
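The two hint computations above can be sketched as follows. This is a simplified sketch under our own assumptions: the density-weighted depth is normalized for robustness, the single shadow ray is marched with uniform steps (an implementation would sphere-trace the SDF), and for the highlight hints we evaluate only the GGX normal distribution term with alpha equal to the listed roughness, a parameterization the text does not specify.

```python
import numpy as np

ROUGHNESS_LEVELS = (0.02, 0.05, 0.13, 0.34)  # roughness values from the paper

def density_weighted_depth(weights, depths):
    """Approximate zero level-set depth: t_bar = sum_i w_i t_i
    (normalized in case the weights do not sum exactly to one)."""
    return (weights * depths).sum() / max(weights.sum(), 1e-8)

def shadow_hint(sdf_fn, x, light_pos, n_steps=64, eps=1e-3):
    """Single-shadow-ray visibility: march from the zero level-set point x
    toward the light; 1.0 if no occluder is crossed, else 0.0. The network
    receives this value as a hint, not as a hard mask."""
    d = light_pos - x
    dist = np.linalg.norm(d)
    d = d / dist
    ts = np.linspace(10 * eps, dist - eps, n_steps)
    min_sdf = min(sdf_fn(x + t * d) for t in ts)
    return 1.0 if min_sdf > eps else 0.0

def highlight_hints(n, v, light_dir, roughness=ROUGHNESS_LEVELS):
    """GGX normal distribution evaluated at the half vector, one hint per
    roughness level (local shading only, no occlusion)."""
    h = v + light_dir
    h = h / np.linalg.norm(h)
    ndoth = max(float(n @ h), 0.0)
    hints = []
    for r in roughness:
        a2 = r * r
        denom = np.pi * (ndoth * ndoth * (a2 - 1.0) + 1.0) ** 2
        hints.append(a2 / denom)
    return hints
```

For a unit sphere, a surface point facing the light is reported visible while a point on the far side is shadowed, and at mirror alignment the sharpest lobe produces the strongest hint.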
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Loss &amp; Training</head><p>We jointly train the density and radiance network using an image reconstruction loss Lc and an SDF regularization loss Lreg. The image reconstruction loss is defined as the L1 distance between the observation c&#772;(o, v) and the corresponding estimated color c(o, v) computed using Equation 1: Lc = ||c&#772; - c||&#8321;, for a random sampling of pixels (and thus view rays) in the captured training images (subsection 3.4). Furthermore, we follow NeuS and regularize the density MLP with the Eikonal loss <ref type="bibr">[Gropp et al. 2020]</ref> to ensure a valid SDF: Lreg = (||&#8711;d(p)||&#8322; - 1)&#178;. For computational efficiency, we do not back-propagate gradients from the shadow and highlight hints.</p></div>
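The two training losses are simple enough to state directly in code. A minimal numpy sketch, averaging over a batch of sampled rays (function names are ours; an implementation would compute these on autograd tensors):

```python
import numpy as np

def reconstruction_loss(pred_rgb, gt_rgb):
    """L1 image reconstruction loss over a batch of rays:
    mean |c_pred - c_gt| over all ray/channel entries."""
    return np.abs(pred_rgb - gt_rgb).mean()

def eikonal_loss(sdf_grads):
    """Eikonal regularizer (Gropp et al. 2020): penalize the SDF gradient
    norm deviating from 1 at sampled points, (||grad d(p)|| - 1)^2."""
    norms = np.linalg.norm(sdf_grads, axis=-1)
    return ((norms - 1.0) ** 2).mean()
```

A perfect prediction gives zero reconstruction loss, and unit-norm gradients give zero Eikonal loss, so both terms bottom out exactly where the text says they should.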
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4">Data Acquisition</head><p>Training the implicit representation requires observations of the scene viewed from random viewpoints and lit from a different random light position such that shadows and interreflections are included. We follow the procedure from <ref type="bibr">Gao et al. [2020]</ref>: a handheld camera is used to capture photographs of the scene from random viewpoints while a second camera captures the scene with its colocated flash light enabled. The images from the second camera are only used to calibrate the light source position. To aid camera calibration, the scene is placed on a checkerboard pattern.</p><p>All examples in this paper are captured with a Sony A7II as the primary camera, and an iPhone 13 Pro as the secondary camera. The acquisition process takes approximately 10 minutes; the main bottleneck is moving the cameras around the scene. In practice we capture a video sequence from each camera and randomly select 500-1,000 frames as our training data. The video is captured using S-log encoding to minimize overexposure. For the synthetic scenes, we simulate the acquisition process by randomly sampling view and light positions on the upper hemisphere around the scene, with a random distance between 2 and 2.5 times the size of the scene. The synthetic scenes are rendered with global light transport using Blender Cycles.</p></div>
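The simulated capture for the synthetic scenes can be sketched as follows: draw a uniformly random direction on the upper hemisphere (via a reflected Gaussian sample) and scale it by a random distance of 2 to 2.5 times the scene size. The function name and the Gaussian-direction trick are ours, not from the paper.

```python
import numpy as np

def sample_upper_hemisphere(rng, scene_size, d_min=2.0, d_max=2.5):
    """Sample a view or light position on the upper hemisphere around the
    scene, at a random distance of 2-2.5x the scene size."""
    v = rng.normal(size=3)          # isotropic Gaussian -> uniform direction
    v[2] = abs(v[2])                # fold onto the upper hemisphere (z >= 0)
    v /= np.linalg.norm(v)
    return v * rng.uniform(d_min, d_max) * scene_size
```

Every sampled position lies above the ground plane with its distance from the origin inside the stated range.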
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.5">Viewpoint Optimization</head><p>Imperfections in camera calibration can cause inaccurate reconstructions of thin geometrical features as well as lead to blurred results. To mitigate the impact of camera calibration errors, we jointly optimize the viewpoints and the neural representation.</p><p>Given an initial view orientation R0 and view position t0, we formulate the refined camera orientation and position as R = &#916;R R0 and t = t0 + &#916;t, where &#916;R &#8712; SO(3) and &#916;t &#8712; R&#179; are learnable correction transformations. During training, we back-propagate the reconstruction loss not only to the relightable radiance network, but also to the correction transformations. We assume that the error on the initial camera calibration is small, and thus we limit the viewpoint changes by using a 0.06&#215; smaller learning rate for the correction transformations.</p></div>
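Applying the learnable correction to an initial pose can be sketched as below, parameterizing the rotation correction as an axis-angle vector mapped to SO(3) via Rodrigues' formula. The axis-angle parameterization and function names are our assumption; the text only states that the corrections live in SO(3) and R³.

```python
import numpy as np

def axis_angle_to_matrix(w):
    """Rodrigues' formula: rotation matrix for axis-angle vector w."""
    theta = np.linalg.norm(w)
    if theta < 1e-12:
        return np.eye(3)
    k = w / theta
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def refined_pose(R0, t0, delta_w, delta_t):
    """Apply the learnable corrections: R = dR @ R0, t = t0 + dt."""
    return axis_angle_to_matrix(delta_w) @ R0, t0 + delta_t
```

With zero corrections the initial pose is returned unchanged, which matches the assumption that calibration errors are small; the training loop would then update `delta_w` and `delta_t` with the reduced learning rate.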
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">RESULTS</head><p>We implemented our neural implicit radiance representation in PyTorch <ref type="bibr">[Paszke et al. 2019]</ref>. We train each model for 1,000k iterations using the Adam optimizer [Kingma and Ba 2015] with &#946;1 = 0.9 and &#946;2 = 0.999, with 512 samples per iteration randomly drawn from the training images. We follow the same warmup and cosine decay learning rate schedule as in NeuS <ref type="bibr">[Wang et al. 2021]</ref>. Training a single neural implicit radiance representation takes approximately 20 hours on four Nvidia V100 GPUs.</p><p>We extensively validate the relighting capabilities of our neural implicit radiance representation on 17 synthetic and 7 captured scenes (including 4 from <ref type="bibr">[Gao et al. 2020]</ref>), covering a wide range of different shapes, materials, and lighting effects. Synthetic Scenes. Figure <ref type="figure">3</ref> shows relit results of different synthetic scenes. For each example, we list PSNR, SSIM, and LPIPS <ref type="bibr">[Zhang et al. 2018]</ref> error statistics computed over 100 test images different from the 500 training images. Our main test scene contains a vase and two dice; the scene features a highly concave object (vase) and complex interreflections between the dice. We include several versions of the main test scene with different material properties: Diffuse, Metallic, Glossy-Metal, Rough-Metal, Anisotropic-Metal, Plastic, Glossy-Plastic, Rough-Plastic and Translucent; note, some versions are only included in the supplemental material. We also include two versions with modified geometry: Short-Fur and Long-Fur to validate the performance of our method on shapes with ill-defined geometry. In addition, we also include a Fur-Ball scene which exhibits even longer fur. 
To validate the performance of the shadow hints, we also include scenes with complex shadows: a Basket scene containing thin geometric features and a Layered Woven Ball which combines complex visibility and strong interreflections. In addition to these specially engineered scenes to systematically probe the capabilities of our method, we also validate our neural implicit radiance representation on commonly used synthetic scenes in neural implicit modeling: Hotdog, Lego and Drums <ref type="bibr">[Mildenhall et al. 2020</ref>]. Based on the error statistics, we see that the error correlates with the geometric complexity of the scene (vase and dice, Hotdog, and Layered Woven Ball perform better than the Fur scenes as well as scenes with small details such as the Lego and the Drums scene), and with the material properties (highly specular materials such as Metallic and Anisotropic-Metal incur a higher error). Visually, differences are most visible in specular reflections and for small geometrical details.</p><p>Captured Scenes. We demonstrate the capabilities of our neural implicit relighting representation by modeling 3 new scenes captured with handheld setups (Figure <ref type="figure">4</ref>). The Pikachu Statue scene contains glossy highlights and significant self-occlusion. The Cat on Decor scene showcases the robustness of our method on real-world objects with ill-defined geometry. The Cup and Fabric scene exhibits translucent materials (cup), specular reflections of the balls, and anisotropic reflections on the fabric. We refer to the supplementary material for additional video sequences of these scenes visualized for rotating camera and light positions.</p><p>Comparisons. Figure <ref type="figure">5</ref> compares our method to IRON <ref type="bibr">[Zhang et al. 2022b</ref>], an inverse rendering method that adopts a neural representation for geometry as a signed distance field. 
From these we can see that IRON fails to correctly reconstruct the shape and reflections in the presence of strong interreflections. In a second comparison (Figure <ref type="figure">6</ref>), we compare our method to Neural Radiance Transfer Fields (NRTF) <ref type="bibr">[Lyu et al. 2022]</ref>; we skip the fragile inverse rendering step and train NRTF with 500 reference OLAT images and the reference geometry. To provide a fair comparison, we also train and evaluate our network under the same directional OLAT images by conditioning the radiance network on light direction instead of point light position. From this test we observe that NRTF struggles to accurately reproduce shadow edges and specular interreflections, and that our method can also be successfully trained with directional lighting. We furthermore compare to the relighting method of <ref type="bibr">Philip et al. [2021]</ref>; because multi-view stereo <ref type="bibr">[Sch&#246;nberger and Frahm 2016]</ref> fails for this scene, we input geometry reconstructed from the NeuS SDF as well as ground truth geometry. Finally, we also render the input images under the reference target lighting; our network is trained without access to the target lighting. Even under these favorable conditions, the relighting method of Philip et al. struggles to reproduce the correct appearance. Finally, we compare our method to Deferred Neural Lighting <ref type="bibr">[Gao et al. 2020]</ref> (using their data and trained model). Our method is able to achieve similar quality results from &#8764;500 input images compared to &#8764;10,000 input images for Deferred Neural Lighting. While visually very similar, the overall errors of Deferred Neural Lighting are slightly lower than with our method. This is mainly due to differences in how camera calibration errors are handled: Deferred Neural Lighting tries to minimize the differences for each frame separately, and thus it can embed camera calibration errors in the images. However, this comes at the cost of temporal "shimmering" when calibration is not perfect. 
Our method, on the other hand, optimizes the 3D representation, yielding better temporal stability (and thus requiring fewer photographs for view interpolation) at the cost of slightly blurring the images in the presence of camera calibration errors.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">ABLATION STUDIES</head><p>We perform several ablation experiments (visual and quantitative) on the synthetic datasets to evaluate the impact of each of the components that comprise our neural implicit radiance representation.</p><p>Shadow and Highlight Hints. A key contribution is the inclusion of shadow and highlight hints in the relightable radiance MLP. Figure <ref type="figure">9</ref> shows the impact of training without the shadow hint, the highlight hint, or both. Without shadow hints the method fails to correctly reproduce sharp shadow boundaries on the ground plane. This lack of sharp shadows is also reflected in the quantitative errors summarized in Table <ref type="table">1</ref>. Including the highlight hints yield a better highlight reproduction, e.g., in the mouth of the vase.</p><p>Impact of the Number of Shadow Rays. We currently only use a single shadow ray to compute the shadow hint. However, we can also shoot multiple shadow rays (by importance sampling points along the view ray) and provide a more accurate hint to the radiance network. Figure <ref type="figure">10</ref> shows the results of a radiance network trained with 16 shadow rays. While providing a more accurate shadow hint, there is marginal benefit at a greatly increased computational cost, justifying our choice of a single shadow ray for computing the shadow hint.</p><p>NeuS vs. NeRF Density MLP. While the relightable radiance MLP learns how much to trust the shadow hint (worst case it can completely ignore unreliable hints), the radiance MLP can in general not reintroduce high-frequency details if it is not included in the shadow hints. To obtain a good shadow hint, an accurate depth estimate of the mean depth along the view ray is needed. <ref type="bibr">Wang et al. [2021]</ref> noted that NeRF produces a biased depth estimate, and they introduced NeuS to address this problem. 
Replacing NeuS with NeRF for the density network (Figure <ref type="figure">10</ref>) leads to poor shadow reproduction due to the adverse impact of the biased depth estimates on the shadow hints.</p><p>Impact of the Number of Basis Materials for the Highlight Hints. Table <ref type="table">1</ref> shows the results of using 1, 2, 4, and 8 basis materials for computing the highlight hints. Additional highlight hints improve the results up to a point; when too many hints are provided, erroneous correlations can increase the overall error. Four basis materials strike a good balance between computational cost, network complexity, and quality.</p></div>
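To make the ablated hints concrete, the following is a minimal numpy sketch of how a single-shadow-ray hint (sphere tracing the SDF toward the light) and basis-material highlight hints could be computed. The toy sphere SDF and the four Blinn-Phong exponents standing in for the basis materials are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def sphere_sdf(p):
    """Toy signed distance function standing in for the learned SDF MLP."""
    return np.linalg.norm(p) - 1.0

def shadow_hint(surface_point, light_pos, sdf, n_steps=64, eps=1e-3):
    """Single-shadow-ray hint: sphere-trace from the surface point toward
    the light; report 1.0 (lit) or 0.0 (occluded)."""
    direction = light_pos - surface_point
    dist_to_light = np.linalg.norm(direction)
    direction = direction / dist_to_light
    t = 10 * eps  # small offset to escape the surface
    for _ in range(n_steps):
        d = sdf(surface_point + t * direction)
        if d < eps:  # hit an occluder before reaching the light
            return 0.0
        t += max(d, eps)
        if t >= dist_to_light:
            break
    return 1.0

def highlight_hints(normal, view_dir, light_dir, exponents=(8, 32, 128, 512)):
    """One specular lobe per 'basis material'; Blinn-Phong exponents are
    illustrative stand-ins for the paper's basis materials."""
    h = view_dir + light_dir
    h = h / np.linalg.norm(h)
    cos_h = max(np.dot(normal, h), 0.0)
    return np.array([cos_h ** n for n in exponents])
```

These scalar hints would be concatenated with the position, normal, view direction, and light position as extra inputs to the relightable radiance MLP, which remains free to down-weight them when they are unreliable.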
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Impact of</head><p>Number of Training Images. Figure 11 and Table 1 demonstrate the effect of varying the number of input images from 50, 100, 250 to 500. As expected, more training images improve the results, and with increasing number of images, the increase in improvement diminishes. With 250 images we already achieve plausible relit results. Decreasing the number of training images further introduces noticeable appearance differences. Effectiveness of Viewpoint Optimization. Figure 12 and Table 2 demonstrate the effectiveness of viewpoint optimization on real captured scenes. While the improvement in quantitative errors is limited, visually we can see that viewpoint optimization significantly enhances reconstruction quality with increased sharpness and better preservation of finer details.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">LIMITATIONS</head><p>While our neural implicit radiance representation greatly reduces the number of required input images for relighting scenes with complex shape and materials, it is not without limitations. Currently we provide shadow and highlight hints to help the relightable radiance MLP model high frequency light transport effects. However, other high frequency effects exist. In particular highly specular surfaces that reflect other parts of the scene pose a challenge to the radiance network. Na&#239;ve inclusion of 'reflection hints' and/or reparameterizations <ref type="bibr">[Verbin et al. 2022]</ref> fail to help the network, mainly due to the reduced accuracy of the surface normals (needed to predict the reflected direction) for sharp specular materials. Resolving this limitation is a key challenge for future research in neural implicit modeling for image-based relighting.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7">CONCLUSION</head><p>In this paper we presented a novel neural implicit radiance representation for free viewpoint relighting from a small set of unstructured photographs. Our representation consists of two MLPs: one for modeling the SDF (analogous to NeuS) and a second MLP for modeling the local and indirect radiance at each point. Key to our method is the inclusion of shadow and highlight hints to aid the relightable radiance MLP to model high frequency light transport effects. Our    <ref type="bibr">[Gao et al. 2020</ref>]. We train our neural implicit radiance representation using only 1/25th (&#8764;500) randomly selected frames for Gao et al.'s datasets, while achieving comparable results.</p><p>method is able to produce relit results from just &#8764; 500 photographs of the scene; a saving of one to two order of magnitude compared to prior work with similar capabilities.   w/o Viewpoint Optimization 31.43 | 0.9803 | 0.0375 w/ Viewpoint Optimization 35.08 | 0.9877 | 0.0.359 </p></div></body>
		</text>
</TEI>
