Title: ReShader: View-Dependent Highlights for Single Image View-Synthesis
In recent years, novel view synthesis from a single image has seen significant progress thanks to the rapid advancements in 3D scene representation and image inpainting techniques. While the current approaches are able to synthesize geometrically consistent novel views, they often do not handle the view-dependent effects properly. Specifically, the highlights in their synthesized images usually appear to be glued to the surfaces, making the novel views unrealistic. To address this major problem, we make a key observation that the process of synthesizing novel views requires changing the shading of the pixels based on the novel camera, and moving them to appropriate locations. Therefore, we propose to split the view synthesis process into two independent tasks of pixel reshading and relocation. During the reshading process, we take the single image as the input and adjust its shading based on the novel camera. This reshaded image is then used as the input to an existing view synthesis method to relocate the pixels and produce the final novel view image. We propose to use a neural network to perform reshading and generate a large set of synthetic input-reshaded pairs to train our network. We demonstrate that our approach produces plausible novel view images with realistic moving highlights on a variety of real-world scenes.
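The two-stage decomposition described above lends itself to a simple pipeline sketch. The PyTorch snippet below is a hypothetical illustration only: the `ReshadeNet` architecture, the per-pixel camera-motion encoding, and the `relocate` callable (standing in for an off-the-shelf single-image view synthesis method) are assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class ReshadeNet(nn.Module):
    """Hypothetical reshading network: adjusts per-pixel shading for a novel camera.

    Takes the source image plus a per-pixel encoding of the relative camera
    motion, and outputs a reshaded image (same viewpoint, new shading).
    """
    def __init__(self, in_channels=3 + 6, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, 3, 3, padding=1),
        )

    def forward(self, image, camera_motion_map):
        # image: (B, 3, H, W); camera_motion_map: (B, 6, H, W) relative-pose encoding
        return self.net(torch.cat([image, camera_motion_map], dim=1))

def synthesize_novel_view(image, camera_motion_map, reshade_net, relocate):
    """Two-stage pipeline sketch: (1) reshade the input pixels for the novel
    camera, (2) hand the reshaded image to an existing single-image view
    synthesis method (`relocate`) that moves pixels to their new locations."""
    reshaded = reshade_net(image, camera_motion_map)
    return relocate(reshaded)
```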
Award ID(s): 2238193
PAR ID: 10511746
Author(s) / Creator(s):
Publisher / Repository: ACM
Date Published:
Journal Name: ACM Transactions on Graphics
Volume: 42
Issue: 6
ISSN: 0730-0301
Page Range / eLocation ID: 1 to 9
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this
  1. We present EgoRenderer, a system for rendering full-body neural avatars of a person captured by a wearable, egocentric fisheye camera that is mounted on a cap or a VR headset. Our system renders photorealistic novel views of the actor and her motion from arbitrary virtual camera locations. Rendering full-body avatars from such egocentric images comes with unique challenges due to the top-down view and large distortions. We tackle these challenges by decomposing the rendering process into several steps, including texture synthesis, pose construction, and neural image translation. For texture synthesis, we propose Ego-DPNet, a neural network that infers dense correspondences between the input fisheye images and an underlying parametric body model, and extracts textures from the egocentric inputs. In addition, to encode dynamic appearances, our approach also learns an implicit texture stack that captures detailed appearance variation across poses and viewpoints. For correct pose generation, we first estimate body pose from the egocentric view using a parametric model. We then synthesize an external free-viewpoint pose image by projecting the parametric model to the user-specified target viewpoint. We next combine the target pose image and the textures into a combined feature image, which is transformed into the output color image using a neural image translation network. Experimental evaluations show that EgoRenderer is capable of generating realistic free-viewpoint avatars of a person wearing an egocentric camera. Comparisons to several baselines demonstrate the advantages of our approach.
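As a rough illustration of the staged rendering described above, the sketch below strings the three steps together. Every callable here (`ego_dpnet`, `texture_stack`, `pose_estimator`, `body_model`, `translator`) is a hypothetical placeholder for the corresponding component, not the system's actual interface.

```python
import torch

def render_free_viewpoint(fisheye_img, target_view, ego_dpnet, texture_stack,
                          pose_estimator, body_model, translator):
    """Hypothetical staging of the described pipeline:
    texture synthesis -> pose construction -> neural image translation."""
    # 1) Texture synthesis: dense correspondences to the body model,
    #    then texture extraction from the egocentric input.
    uv_correspondences = ego_dpnet(fisheye_img)
    texture = texture_stack(fisheye_img, uv_correspondences)

    # 2) Pose construction: estimate body pose from the egocentric view and
    #    project the parametric model to the user-specified target viewpoint.
    pose_params = pose_estimator(fisheye_img)
    target_pose_img = body_model.project(pose_params, target_view)

    # 3) Neural image translation: fuse pose image and texture into a combined
    #    feature image (both assumed to be (B, C, H, W)), then translate it
    #    into the output color image.
    features = torch.cat([target_pose_img, texture], dim=1)
    return translator(features)
```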
  2. Inferring the 3D structure underlying a set of multi-view images typically requires solving two co-dependent tasks -- accurate 3D reconstruction requires precise camera poses, and predicting camera poses relies on (implicitly or explicitly) modeling the underlying 3D. The classical framework of analysis by synthesis casts this inference as a joint optimization seeking to explain the observed pixels, and recent instantiations learn expressive 3D representations (e.g., Neural Fields) with gradient-descent-based pose refinement of initial pose estimates. However, given a sparse set of observed views, the observations may not provide sufficient direct evidence to obtain complete and accurate 3D. Moreover, large errors in pose estimation may not be easily corrected and can further degrade the inferred 3D. To allow robust 3D reconstruction and pose estimation in this challenging setup, we propose SparseAGS, a method that adapts this analysis-by-synthesis approach by: a) including novel-view-synthesis-based generative priors in conjunction with photometric objectives to improve the quality of the inferred 3D, and b) explicitly reasoning about outliers and using a discrete search with a continuous optimization-based strategy to correct them. We validate our framework across real-world and synthetic datasets in combination with several off-the-shelf pose estimation systems as initialization. We find that it significantly improves the base systems' pose accuracy while yielding high-quality 3D reconstructions that outperform the results from current multi-view reconstruction baselines. 
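The sketch below illustrates the general analysis-by-synthesis pattern the abstract refers to: continuous joint optimization of scene and poses, followed by a discrete re-search over candidate poses for views that remain poorly explained. All names (`render`, `photo_loss`, `prior_loss`, `candidate_poses`) are placeholders; this is not the SparseAGS implementation.

```python
import torch

def analysis_by_synthesis(views, init_poses, scene, render, photo_loss,
                          prior_loss, candidate_poses, steps=200, lr=1e-2):
    """Hypothetical joint refinement: optimize scene and poses against
    photometric and generative-prior objectives, then correct outlier
    poses with a discrete search over candidates."""
    poses = init_poses.clone().requires_grad_(True)
    opt = torch.optim.Adam([poses] + list(scene.parameters()), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = photo_loss(render(scene, poses), views) + prior_loss(scene)
        loss.backward()
        opt.step()

    # Outlier handling (sketch): for each view, try the candidate poses and
    # keep whichever explains the observation best.
    with torch.no_grad():
        for i in range(len(views)):
            errs = [photo_loss(render(scene, p[None]), views[i:i + 1])
                    for p in [poses[i], *candidate_poses]]
            best = int(torch.stack(errs).argmin())
            if best > 0:  # a candidate pose beats the optimized one
                poses[i] = candidate_poses[best - 1]
    return poses, scene
```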
  3. With the ever-increasing amount of 3D data being captured and processed, multi-view image compression is essential to various applications, including virtual reality and 3D modeling. Despite the considerable success of learning-based compression models on single images, limited progress has been made in multi-view image compression. In this paper, we propose an efficient approach to multi-view image compression by leveraging the redundant information across different viewpoints without explicitly using warping operations or camera parameters. Our method builds upon the recent advancements in Multi-Reference Entropy Models (MEM), which were initially proposed to capture correlations within an image. We extend the MEM models to employ cross-view correlations in addition to within-image correlations. Specifically, we generate latent representations for each view independently and integrate a cross-view context module within the entropy model. The estimation of entropy parameters for each view follows an autoregressive technique, leveraging correlations with the previous views. We show that adding this view context module further enhances the compression performance when jointly trained with the autoencoder. Experimental results demonstrate superior performance compared to both traditional and learning-based multi-view compression methods. 
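A minimal sketch of the cross-view idea follows, under the assumption that the entropy parameters for each view are predicted from the previously coded view's latent. The module below is hypothetical and conditions only on the immediately preceding view, whereas the described method may use richer within-image and cross-view context.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossViewEntropyModel(nn.Module):
    """Hypothetical cross-view context module: each view's latent is produced
    independently, and the entropy parameters (mean/scale) for view i are
    predicted autoregressively from the latent of view i-1."""
    def __init__(self, latent_channels=192, hidden=192):
        super().__init__()
        self.context = nn.Sequential(
            nn.Conv2d(latent_channels, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, 2 * latent_channels, 3, padding=1),
        )

    def forward(self, latents):
        # latents: list of (B, C, H, W) tensors, one per view, in coding order.
        params = []
        prev = torch.zeros_like(latents[0])  # first view has no cross-view context
        for y in latents:
            mean, scale = self.context(prev).chunk(2, dim=1)
            params.append((mean, F.softplus(scale)))  # positive scale
            prev = y  # condition the next view on this one
        return params
```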
  4. We propose the Large View Synthesis Model (LVSM), a novel transformer-based approach for scalable and generalizable novel view synthesis from sparse-view inputs. We introduce two architectures: (1) an encoder-decoder LVSM, which encodes input image tokens into a fixed number of 1D latent tokens, functioning as a fully learned scene representation, and decodes novel-view images from them; and (2) a decoder-only LVSM, which directly maps input images to novel-view outputs, completely eliminating intermediate scene representations. Both models bypass the 3D inductive biases used in previous methods—from 3D representations (e.g., NeRF, 3DGS) to network designs (e.g., epipolar projections, plane sweeps)—addressing novel view synthesis with a fully data-driven approach. While the encoder-decoder model offers faster inference due to its independent latent representation, the decoder-only LVSM achieves superior quality, scalability, and zero-shot generalization, outperforming previous state-of-the-art methods by 1.5 to 3.5 dB PSNR. Comprehensive evaluations across multiple datasets demonstrate that both LVSM variants achieve state-of-the-art novel view synthesis quality. Notably, our models surpass all previous methods even with reduced computational resources (1-2 GPUs). 
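The decoder-only variant can be pictured as a plain transformer over input-image tokens and target-view ray tokens, with the target tokens decoded directly to pixels. The sketch below is a hypothetical, heavily simplified stand-in (token sizes, the patch embedding, and the 6-channel ray encoding are assumptions), not the LVSM architecture itself.

```python
import torch
import torch.nn as nn

class DecoderOnlySketch(nn.Module):
    """Hypothetical decoder-only design: no explicit 3D representation,
    just a transformer mapping input-view tokens and target-ray tokens
    to the target view's pixel patches."""
    def __init__(self, dim=256, depth=6, heads=8, patch=8):
        super().__init__()
        self.embed_src = nn.Linear(3 * patch * patch, dim)  # input image patches
        self.embed_tgt = nn.Linear(6 * patch * patch, dim)  # target ray-encoding patches
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, depth)
        self.to_rgb = nn.Linear(dim, 3 * patch * patch)

    def forward(self, src_patches, tgt_ray_patches):
        # src_patches: (B, N_src, 3*p*p); tgt_ray_patches: (B, N_tgt, 6*p*p)
        tokens = torch.cat([self.embed_src(src_patches),
                            self.embed_tgt(tgt_ray_patches)], dim=1)
        out = self.backbone(tokens)
        n_tgt = tgt_ray_patches.shape[1]
        return self.to_rgb(out[:, -n_tgt:])  # decode only the target tokens
```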
  5. Human novel view synthesis aims to synthesize target views of a human subject given input images taken from one or more reference viewpoints. Despite significant advances in model-free novel view synthesis, existing methods present two major limitations when applied to complex shapes like humans. First, these methods mainly focus on simple and symmetric objects, e.g., cars and chairs, limiting their performance on fine-grained and asymmetric shapes. Second, existing methods cannot guarantee visual consistency across different adjacent views of the same object. To solve these problems, we present a learning framework for the novel view synthesis of human subjects, which explicitly enforces consistency across different generated views of the subject. Specifically, we introduce a novel multi-view supervision and an explicit rotational loss during the learning process, enabling the model to preserve detailed body parts and to achieve consistency between adjacent synthesized views. To show the superior performance of our approach, we present qualitative and quantitative results on the Multi-View Human Action (MVHA) dataset we collected (consisting of 3D human models animated with different Mocap sequences and captured from 54 different viewpoints), the Pose-Varying Human Model (PVHM) dataset, and ShapeNet. The qualitative and quantitative results demonstrate that our approach outperforms the state-of-the-art baselines in per-view synthesis quality and in preserving rotational consistency and complex shapes (e.g., fine-grained details, challenging poses) across multiple adjacent views in a variety of scenarios, for both humans and rigid objects.
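As a loose illustration of the consistency idea above, the snippet below combines per-view reconstruction with a pairwise term that warps one synthesized view toward its neighbor and penalizes the difference. The `warp` operators and the weighting are assumptions, not the paper's actual losses.

```python
import torch
import torch.nn.functional as F

def rotational_consistency_loss(view_a, view_b, warp_a_to_b):
    """Hypothetical consistency term between adjacent synthesized views:
    warp one generated view toward its neighbor's viewpoint and penalize
    the photometric difference. `warp_a_to_b` is a placeholder for whatever
    operator relates the two viewpoints."""
    return F.l1_loss(warp_a_to_b(view_a), view_b)

def multi_view_objective(pred_views, gt_views, warps, lam=0.1):
    """Per-view reconstruction plus pairwise consistency between adjacent
    synthesized views (a minimal stand-in for the supervision described above)."""
    recon = sum(F.l1_loss(p, g) for p, g in zip(pred_views, gt_views))
    consist = sum(rotational_consistency_loss(pred_views[i], pred_views[i + 1], warps[i])
                  for i in range(len(pred_views) - 1))
    return recon + lam * consist
```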