We present EgoRenderer, a system for rendering full-body neural avatars of a person captured by a wearable, egocentric fisheye camera that is mounted on a cap or a VR headset. Our system renders photorealistic novel views of the actor and her motion from arbitrary virtual camera locations. Rendering full-body avatars from such egocentric images come with unique challenges due to the top-down view and large distortions. We tackle these challenges by decomposing the rendering process into several steps, including texture synthesis, pose construction, and neural image translation. For texture synthesis, we propose Ego-DPNet, a neural network that infers dense correspondences between the input fisheye images and an underlying parametric body model, and to extract textures from egocentric inputs. In addition, to encode dynamic appearances, our approach also learns an implicit texture stack that captures detailed appearance variation across poses and viewpoints. For correct pose generation, we first estimate body pose from the egocentric view using a parametric model. We then synthesize an external free-viewpoint pose image by projecting the parametric model to the user-specified target viewpoint. We next combine the target pose image and the textures into a combined feature image, which is transformed into the output color image using a neural image translation network. Experimental evaluations show that EgoRenderer is capable of generating realistic free-viewpoint avatars of a person wearing an egocentric camera. Comparisons to several baselines demonstrate the advantages of our approach.
more »
« less
IMEXnet - A Forward Stable Deep Neural Network
Deep convolutional neural networks have revolutionized many machine learning and computer vision tasks, however, some remaining key challenges limit their wider use. These challenges include improving the network's robustness to perturbations of the input image and the limited ``field of view'' of convolution operators. We introduce the IMEXnet that addresses these challenges by adapting semi-implicit methods for partial differential equations. Compared to similar explicit networks, such as residual networks, our network is more stable, which has recently shown to reduce the sensitivity to small changes in the input features and improve generalization. The addition of an implicit step connects all pixels in each channel of the image and therefore addresses the field of view problem while still being comparable to standard convolutions in terms of the number of parameters and computational complexity. We also present a new dataset for semantic segmentation and demonstrate the effectiveness of our architecture using the NYU Depth dataset.
more »
« less
- Award ID(s):
- 1751636
- PAR ID:
- 10095794
- Date Published:
- Journal Name:
- International Conference on Machine Learning
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
Implicit neural networks are a general class of learning models that replace the layers in traditional feedforward models with implicit algebraic equations. Compared to traditional learning models, implicit networks offer competitive performance and reduced memory consumption. However, they can remain brittle with respect to input adversarial perturbations. This paper proposes a theoretical and computational framework for robustness verification of implicit neural networks; our framework blends together mixed monotone systems theory and contraction theory. First, given an implicit neural network, we introduce a related embedded network and show that, given an infinity-norm box constraint on the input, the embedded network provides an infinity-norm box overapproximation for the output of the original network. Second, using infinity-matrix measures, we propose sufficient conditions for well-posedness of both the original and embedded system and design an iterative algorithm to compute the infinity-norm box robustness margins for reachability and classification problems. Third, of independent value, we show that employing a suitable relative classifier variable in our analysis will lead to tighter bounds on the certified adversarial robustness in classification problems. Finally, we perform numerical simulations on a Non-Euclidean Monotone Operator Network (NEMON) trained on the MNIST dataset. In these simulations, we compare the accuracy and run time of our mixed monotone contractive approach with the existing robustness verification approaches in the literature for estimating the certified adversarial robustness.more » « less
-
This paper presents a neural incremental Structure-from-Motion (SfM) approach, Level-S2fM, which estimates the camera poses and scene geometry from a set of uncalibrated images by learning coordinate MLPs for the implicit surfaces and the radiance fields from the established key-point correspondences. Our novel formulation poses some new challenges due to inevitable two-view and few-view configurations in the incremental SfM pipeline, which complicates the optimization of coordinate MLPs for volumetric neural rendering with unknown camera poses. Nevertheless, we demonstrate that the strong inductive basis conveying in the 2D correspondences is promising to tackle those challenges by exploiting the relationship between the ray sampling schemes. Based on this, we revisit the pipeline of incremental SfM and renew the key components, including two-view geometry initialization, the camera poses registration, the 3D points triangulation, and Bundle Adjustment, with a fresh perspective based on neural implicit surfaces. By unifying the scene geometry in small MLP networks through coordinate MLPs, our Level-S2fM treats the zero-level set of the implicit surface as an informative top-down regularization to manage the reconstructed 3D points, reject the outliers in correspondences via querying SDF, and refine the estimated geometries by NBA (Neural BA). Not only does our Level-S2fM lead to promising results on camera pose estimation and scene geometry reconstruction, but it also shows a promising way for neural implicit rendering without knowing camera extrinsic beforehand.more » « less
-
Synthetic data is highly useful for training machine learning systems performing image-based 3D reconstruction, as synthetic data has applications in both extending existing generalizable datasets and being tailored to train neural networks for specific learning tasks of interest. In this paper, we introduce and utilize a synthetic data generation suite capable of generating data given existing 3D scene models as input. Specifically, we use our tool to generate image sequences for use with Multi-View Stereo (MVS), moving a camera through the virtual space according to user-chosen camera parameters. We evaluate how the given camera parameters and type of 3D environment affect how applicable the generated image sequences are to the MVS task using five pre-trained neural networks on image sequences generated from three different 3D scene datasets. We obtain generated predictions for each combination of parameter value and input image sequence, using standard error metrics to analyze the differences in depth predictions on image sequences across 3D datasets, parameters, and networks. Among other results, we find that camera height and vertical camera viewing angle are the parameters that cause the most variation in depth prediction errors on these image sequences.more » « less
-
Multi-view learning is a rapidly evolving research area focused on developing diverse learning representations. In neural data analysis, this approach holds immense potential by capturing spatial, temporal, and frequency features. Despite its promise, multi-view application to functional near-infrared spectroscopy (fNIRS) has remained largely unexplored. This study addresses this gap by introducing fNIRSNET, a novel framework that generates and fuses multi-view spatio-temporal representations using convolutional neural networks. It investigates the combined informational strength of oxygenated (HbO2) and deoxygenated (HbR) hemoglobin signals, further extending these capabilities by integrating with electroencephalography (EEG) networks to achieve robust multimodal classification. Experiments involved classifying neural responses to auditory stimuli with nine healthy participants. fNIRS signals were decomposed into HbO2/HbR concentration changes, resulting in Parallel and Merged input types. We evaluated four input types across three data compositions: balanced, subject, and complete datasets. Our fNIRSNET's performance was compared with eight baseline classification models and merged it with four common EEG networks to assess the efficacy of combined features for multimodal classification. Compared to baselines, fNIRSNET using the Merged input type achieved the highest accuracy of 83.22%, 81.18%, and 91.58% for balanced, subject, and complete datasets, respectively. In the complete set, the approach effectively mitigated class imbalance issues, achieving sensitivity of 83.58% and specificity of 95.42%. Multimodal fusion of EEG networks and fNIRSNET outperformed single-modality performance with the highest accuracy of 87.15% on balanced data. Overall, this study introduces an innovative fusion approach for decoding fNIRS data and illustrates its integration with established EEG networks to enhance performance.more » « less
An official website of the United States government

