Current deep neural network approaches for camera pose estimation rely on scene structure for 3D motion estimation, but this reliance decreases robustness and thereby makes cross-dataset generalization difficult. In contrast, classical approaches to structure from motion estimate 3D motion using optical flow and then compute depth. Their accuracy, however, depends strongly on the quality of the optical flow. To avoid this issue, direct methods have been proposed, which separate 3D motion from depth estimation but compute 3D motion using only image gradients in the form of normal flow. In this paper, we introduce a network, NFlowNet, for normal flow estimation, which is used to enforce robust and direct constraints. In particular, normal flow is used to estimate relative camera pose based on the cheirality (depth positivity) constraint. We achieve this by formulating the optimization problem as a differentiable cheirality layer, which allows for end-to-end learning of camera pose. We perform extensive qualitative and quantitative evaluation of the proposed DiffPoseNet's sensitivity to noise and of its generalization across datasets. We compare our approach to existing state-of-the-art methods on the KITTI, TartanAir, and TUM-RGBD datasets.
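The cheirality idea can be made concrete. Assuming the standard instantaneous-motion flow model u(x) = (1/Z) A(x) t + B(x) w, the normal flow at a pixel pins down the sign of the implied depth for any candidate pose, so depth positivity becomes a smooth penalty. The PyTorch-style sketch below, with illustrative shapes and a naive unrolled solver standing in for the paper's differentiable optimization layer, shows one way such a layer could look; none of the names or details are taken from the paper.

```python
# Hedged sketch of a depth-positivity ("cheirality") penalty and an unrolled,
# differentiable pose solver. All shapes and the solver are assumptions.
import torch
import torch.nn.functional as F

def cheirality_loss(n_flow, g, A, B, t, w):
    """Smooth penalty that grows wherever the implied depth would be negative.
    n_flow: (N,) normal-flow values; g: (N,2) unit gradient directions;
    A, B: (N,2,3) pixel-dependent translational/rotational flow matrices;
    t, w: (3,) candidate translation and rotation rates."""
    gA_t = torch.einsum('ni,nij,j->n', g, A, t)   # gradient-aligned translational term
    gB_w = torch.einsum('ni,nij,j->n', g, B, w)   # gradient-aligned rotational term
    # 1/Z = (n_flow - gB_w) / gA_t, so depth positivity <=> (n_flow - gB_w) * gA_t > 0
    return F.softplus(-(n_flow - gB_w) * gA_t).mean()

def solve_pose(n_flow, g, A, B, steps=50, lr=1e-2):
    """Naive unrolled gradient descent on the cheirality penalty."""
    t = torch.tensor([0.0, 0.0, 1.0], requires_grad=True)  # unit-norm translation guess
    w = torch.zeros(3, requires_grad=True)
    for _ in range(steps):
        loss = cheirality_loss(n_flow, g, A, B, t, w)
        gt, gw = torch.autograd.grad(loss, (t, w), create_graph=True)
        t = t - lr * gt
        t = t / t.norm()                                    # translation is scale-free
        w = w - lr * gw
    return t, w
```

Because the inner loop is built with create_graph=True, gradients flow back through the recovered pose to whatever network produced the normal flow, which is the property an end-to-end differentiable layer needs.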
Predicting Camera Viewpoint Improves Cross-dataset Generalization for 3D Human Pose Estimation
Monocular estimation of 3D human pose has attracted increased attention with the availability of large ground-truth motion capture datasets. However, the diversity of training data available is limited, and it is not clear to what extent methods generalize outside the specific datasets they are trained on. In this work we carry out a systematic study of the diversity and biases present in specific datasets and their effect on cross-dataset generalization across a compendium of 5 pose datasets. We specifically focus on systematic differences in the distribution of camera viewpoints relative to a body-centered coordinate frame. Based on this observation, we propose an auxiliary task of predicting the camera viewpoint in addition to pose. We find that models trained to jointly predict viewpoint and pose systematically show significantly improved cross-dataset generalization.
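As a rough illustration of the auxiliary-task idea (not the authors' architecture), a shared backbone with a pose head and a viewpoint head could look like the sketch below; the toy encoder, the 6D rotation output, and the loss weighting are all assumptions.

```python
# Hypothetical sketch: one backbone, two heads, trained to predict 3D pose and
# camera viewpoint jointly. Names and dimensions are illustrative only.
import torch
import torch.nn as nn

class PoseWithViewpoint(nn.Module):
    def __init__(self, feat_dim=2048, n_joints=17):
        super().__init__()
        self.backbone = nn.Sequential(          # stand-in for a real image encoder
            nn.Conv2d(3, 64, 7, stride=2, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim), nn.ReLU())
        self.pose_head = nn.Linear(feat_dim, n_joints * 3)  # body-centered 3D joints
        self.view_head = nn.Linear(feat_dim, 6)             # e.g. a 6D rotation parametrization

    def forward(self, img):
        f = self.backbone(img)
        return self.pose_head(f), self.view_head(f)

def joint_loss(pose_pred, pose_gt, view_pred, view_gt, lam=0.1):
    # the auxiliary viewpoint term regularizes the shared features
    return nn.functional.mse_loss(pose_pred, pose_gt) + \
           lam * nn.functional.mse_loss(view_pred, view_gt)
```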
- Award ID(s):
- 1813785
- PAR ID:
- 10296118
- Date Published:
- Journal Name:
- IEEE International Conference on Computer Vision workshops
- ISSN:
- 2473-9936
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
This work proposes a novel pose estimation model for object categories that can be effectively transferred to previously unseen environments. Deep convolutional neural network (CNN) models for pose estimation are typically trained and evaluated on datasets specifically curated for object detection, pose estimation, or 3D reconstruction, which requires large amounts of training data. In this work, we propose a model for pose estimation that can be trained with a small amount of data and is built on top of generic mid-level representations [33] (e.g., surface normal estimation and re-shading). These representations are trained on a large dataset without requiring pose and object annotations. The predictions are then refined with a small CNN that exploits object masks and silhouette retrieval. The presented approach achieves superior performance on the Pix3D dataset [26] and shows a nearly 35% improvement over existing models when only 25% of the training data is available. We show that the approach is favorable when it comes to generalization and transfer to novel environments. Towards this end, we introduce a new pose estimation benchmark for commonly encountered furniture categories on the challenging Active Vision Dataset [1] and evaluate models trained on the Pix3D dataset.
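A minimal sketch of the refinement stage, assuming 3-channel surface normals, 1-channel re-shading, and a unit-quaternion rotation output (all illustrative choices, not the paper's specification):

```python
# Hypothetical sketch: a small refinement CNN on top of frozen mid-level
# representations plus an object mask. Channel counts and the pose
# parametrization are assumptions for illustration.
import torch
import torch.nn as nn

class PoseRefiner(nn.Module):
    def __init__(self):
        super().__init__()
        # inputs: 3-ch surface normals + 1-ch re-shading + 1-ch object mask
        self.net = nn.Sequential(
            nn.Conv2d(5, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, 4))                    # e.g. a quaternion for object rotation

    def forward(self, normals, reshading, mask):
        x = torch.cat([normals, reshading, mask], dim=1)
        q = self.net(x)
        return q / q.norm(dim=1, keepdim=True)   # normalize to a unit quaternion
```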
-
Monocular depth predictors are typically trained on large-scale training sets that are naturally biased w.r.t. the distribution of camera poses. As a result, trained predictors fail to make reliable depth predictions for testing examples captured under uncommon camera poses. To address this issue, we propose two novel techniques that exploit the camera pose during training and prediction. First, we introduce a simple perspective-aware data augmentation that synthesizes new training examples with more diverse views by perturbing the existing ones in a geometrically consistent manner. Second, we propose a conditional model that exploits the per-image camera pose as prior knowledge by encoding it as a part of the input. We show that jointly applying the two methods improves depth prediction on images captured under uncommon and even never-before-seen camera poses. We show that our methods improve performance when applied to a range of different predictor architectures. Lastly, we show that explicitly encoding the camera pose distribution improves the generalization performance of a synthetically trained depth predictor when evaluated on real images.
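The second technique lends itself to a short sketch. One plausible encoding, assumed here for illustration rather than taken from the paper, tiles the camera pitch and roll into extra input channels so a standard depth CNN can condition on them:

```python
# Hypothetical sketch of pose conditioning: broadcast per-image camera angles
# over the image and concatenate them as extra input channels.
import torch

def condition_on_pose(image, pitch, roll):
    """image: (B,3,H,W); pitch, roll: (B,) camera angles in radians."""
    b, _, h, w = image.shape
    pose = torch.stack([pitch, roll], dim=1)              # (B,2)
    pose_maps = pose[:, :, None, None].expand(b, 2, h, w) # tile over spatial dims
    return torch.cat([image, pose_maps], dim=1)           # (B,5,H,W) input to the CNN
```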
-
Automatic speech emotion recognition provides computers with critical context to enable user understanding. While methods trained and tested within the same dataset have been shown successful, they often fail when applied to unseen datasets. To address this, recent work has focused on adversarial methods to find more generalized representations of emotional speech. However, many of these methods have issues converging, and only involve datasets collected in laboratory conditions. In this paper, we introduce Adversarial Discriminative Domain Generalization (ADDoG), which follows an easier-to-train "meet in the middle" approach. The model iteratively moves representations learned for each dataset closer to one another, improving cross-dataset generalization. We also introduce Multiclass ADDoG, or MADDoG, which extends the proposed method to more than two datasets simultaneously. Our results show consistent convergence for the introduced methods, with significantly improved results when not using labels from the target dataset. We also show how, in most cases, ADDoG and MADDoG can be used to improve upon baseline state-of-the-art methods when target dataset labels are added and in-the-wild data are considered. Even though our experiments focus on cross-corpus speech emotion, these methods could be used to remove unwanted factors of variation in other settings.
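A hedged sketch of one training step in this spirit, with the "middle" implemented as a soft 0.5 target for a dataset critic; the networks, optimizers, and loss weighting are placeholders, not the published ADDoG recipe:

```python
# Hypothetical sketch of adversarial dataset alignment: a critic learns to tell
# which dataset a feature came from, and the encoder pushes both feature
# distributions toward the critic's decision boundary ("the middle").
import torch
import torch.nn.functional as F

def alignment_step(encoder, critic, task_head, enc_opt, crit_opt, x_a, x_b, y_a):
    """x_a, y_a: labeled batch from dataset A; x_b: unlabeled batch from dataset B."""
    # 1) train the critic to separate dataset A from dataset B
    f_a, f_b = encoder(x_a).detach(), encoder(x_b).detach()
    la, lb = critic(f_a), critic(f_b)
    crit_loss = F.binary_cross_entropy_with_logits(la, torch.ones_like(la)) + \
                F.binary_cross_entropy_with_logits(lb, torch.zeros_like(lb))
    crit_opt.zero_grad(); crit_loss.backward(); crit_opt.step()

    # 2) train encoder + task head: classify emotion on A and "meet in the
    #    middle" by targeting 0.5 critic confidence on both datasets
    f_a, f_b = encoder(x_a), encoder(x_b)
    la, lb = critic(f_a), critic(f_b)
    meet = F.binary_cross_entropy_with_logits(la, torch.full_like(la, 0.5)) + \
           F.binary_cross_entropy_with_logits(lb, torch.full_like(lb, 0.5))
    loss = F.cross_entropy(task_head(f_a), y_a) + 0.1 * meet
    enc_opt.zero_grad(); loss.backward(); enc_opt.step()
```

Note that only dataset A contributes label supervision here, matching the setting where target-dataset labels are unavailable.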
-
Deep learning approaches currently achieve the state-of-the-art results on camera-based vital signs measurement. One of the main challenges with using neural models for these applications is the lack of sufficiently large and diverse datasets. Limited data increases the chances of overfitting models to the available data, which in turn can harm generalization. In this paper, we show that the generalizability of imaging photoplethysmography models can be improved by augmenting the training set with "magnified" videos. These augmentations are specifically designed to reveal useful features for recovering the photoplethysmogram. We show that using augmentations of this form is more effective at improving model robustness than other commonly used data augmentation approaches. We show better within-dataset and especially cross-dataset performance with our proposed data augmentation approach on three publicly available datasets.
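As a loose illustration only, linearly amplifying each frame's deviation from the clip's temporal mean captures the gist of "magnifying" pulse-induced variation; the paper's augmentations are proper video-magnification methods, which this stand-in does not reproduce.

```python
# Hypothetical stand-in for a video "magnification" augmentation: boost subtle
# temporal color changes by amplifying deviations from the temporal mean frame.
import torch

def magnify_clip(clip, alpha=2.0):
    """clip: (T,3,H,W) float video tensor in [0,1]; alpha > 1 amplifies motion/color."""
    mean = clip.mean(dim=0, keepdim=True)        # temporal mean frame
    magnified = mean + alpha * (clip - mean)     # amplify temporal variation
    return magnified.clamp(0.0, 1.0)
```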