skip to main content

Search for: All records

Award ID contains: 1911230

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

  1. Recent volumetric 3D reconstruction methods can produce very accurate results, with plausible geometry even for unobserved surfaces. However, they face an undesirable trade-off when it comes to multi-view fusion. They can fuse all available view information by global averaging, thus losing fine detail, or they can heuristically cluster views for local fusion, thus restricting their ability to consider all views jointly. Our key insight is that greater detail can be retained without restricting view diversity by learning a view-fusion function conditioned on camera pose and image content. We propose to learn this multi-view fusion using a transformer. To this end, we introduce VoRTX, 1 an end-to-end volumetric 3D reconstruction network using transformers for wide-baseline, multi-view feature fusion. Our model is occlusion-aware, leveraging the transformer architecture to predict an initial, projective scene geometry estimate. This estimate is used to avoid back-projecting image features through surfaces into occluded regions. We train our model on ScanNet and show that it produces better reconstructions than state-of-the-art methods. We also demonstrate generalization without any fine-tuning, outperforming the same state-of-the-art methods on two other datasets, TUM-RGBD and ICL-NUIM.
  2. We present 3DVNet, a novel multi-view stereo (MVS) depth-prediction method that combines the advantages of previous depth-based and volumetric MVS approaches. Our key idea is the use of a 3D scene-modeling network that iteratively updates a set of coarse depth predictions, resulting in highly accurate predictions which agree on the underlying scene geometry. Unlike existing depth-prediction techniques, our method uses a volumetric 3D convolutional neural network (CNN) that operates in world space on all depth maps jointly. The network can therefore learn meaningful scene-level priors. Furthermore, unlike existing volumetric MVS techniques, our 3D CNN operates on a feature-augmented point cloud, allowing for effective aggregation of multi-view information and flexible iterative refinement of depth maps. Experimental results show our method exceeds state-of-the-art accuracy in both depth prediction and 3D reconstruction metrics on the ScanNet dataset, as well as a selection of scenes from the TUM-RGBD and ICL-NUIM datasets. This shows that our method is both effective and generalizes to new settings.
  3. Synthetic data is highly useful for training machine learning systems performing image-based 3D reconstruction, as synthetic data has applications in both extending existing generalizable datasets and being tailored to train neural networks for specific learning tasks of interest. In this paper, we introduce and utilize a synthetic data generation suite capable of generating data given existing 3D scene models as input. Specifically, we use our tool to generate image sequences for use with Multi-View Stereo (MVS), moving a camera through the virtual space according to user-chosen camera parameters. We evaluate how the given camera parameters and type of 3D environment affect how applicable the generated image sequences are to the MVS task using five pre-trained neural networks on image sequences generated from three different 3D scene datasets. We obtain generated predictions for each combination of parameter value and input image sequence, using standard error metrics to analyze the differences in depth predictions on image sequences across 3D datasets, parameters, and networks. Among other results, we find that camera height and vertical camera viewing angle are the parameters that cause the most variation in depth prediction errors on these image sequences.
  4. Imperfect labels are ubiquitous in real-world datasets. Several recent successful methods for training deep neural networks (DNNs) robust to label noise have used two primary techniques: filtering samples based on loss during a warm-up phase to curate an initial set of cleanly labeled samples, and using the output of a network as a pseudo-label for subsequent loss calculations. In this paper, we evaluate different augmentation strategies for algorithms tackling the "learning with noisy labels" problem. We propose and examine multiple augmentation strategies and evaluate them using synthetic datasets based on CIFAR-10 and CIFAR-100, as well as on the real-world dataset Clothing1M. Due to several commonalities in these algorithms, we find that using one set of augmentations for loss modeling tasks and another set for learning is the most effective, improving results on the state-of-the-art and other previous methods. Furthermore, we find that applying augmentation during the warm-up period can negatively impact the loss convergence behavior of correctly versus incorrectly labeled samples. We introduce this augmentation strategy to the state-of-the-art technique and demonstrate that we can improve performance across all evaluated noise levels. In particular, we improve accuracy on the CIFAR-10 benchmark at 90% symmetric noise by more than 15% inmore »absolute accuracy, and we also improve performance on the Clothing1M dataset.« less
  5. Modern machine learning algorithms typically require large amounts of labeled training data to fit a reliable model. To minimize the cost of data collection, researchers often employ techniques such as crowdsourcing and web scraping. However, web data and human annotations are known to exhibit high margins of error, resulting in sizable amounts of incorrect labels. Poorly labeled training data can cause models to overfit to the noise distribution, crippling performance in real-world applications. In this work, we investigate the viability of using data augmentation in conjunction with semi-supervised learning to improve the label noise robustness of image classification models. We conduct several experiments using noisy variants of the CIFAR-10 image classification dataset to benchmark our method against existing algorithms. Experimental results show that our augmentative SSL approach improves upon the state-of-the-art.
  6. Location-based or Out-of-Home Entertainment refers to experiences such as theme and amusement parks, laser tag and paintball arenas, roller and ice skating rinks, zoos and aquariums, or science centers and museums among many other family entertainment and cultural venues. More recently, location-based VR has emerged as a new category of out-of-home entertainment. These VR experiences can be likened to social entertainment options such as laser tag, where physical movement is an inherent part of the experience versus at-home VR experiences where physical movement often needs to be replaced by artificial locomotion techniques due to tracking space constraints. In this work, we present the first VR study to understand the impact of natural walking in a large physical space on presence and user preference. We compare it with teleportation in the same large space, since teleportation is the most commonly used locomotion technique for consumer, at-home VR. Our results show that walking was overwhelmingly preferred by the participants and teleportation leads to significantly higher self-reported simulator sickness. The data also shows a trend towards higher self-reported presence for natural walking.
  7. Many researchers and industry professionals believe Augmented Reality (AR) to be the next step in personal computing. However, the idea of an always-on context-aware AR device presents new and unique challenges to the way users organize multiple streams of information. What does multitasking look like and when should applications be tied to specific elements in the environment? In this exploratory study, we look at one such element: physical objects, and explore an object-centric approach to multitasking in AR. We developed 3 prototype applications that operate on a subset of objects in a simulated test environment. We performed a pilot study of our multitasking solution with a novice user, domain expert, and system expert to develop insights into the future of AR application design.
  8. In this paper, we look at how depth data can benefit existing object masking methods applied in occluded scenes. Masking the pixel locations of objects within scenes helps computers get a spatial awareness of where objects are within images. The current state-of-the-art algorithm for masking objects in images is Mask R-CNN, which builds on the Faster R-CNN network to mask object pixels rather than just detecting their bounding boxes. This paper examines the weaknesses Mask R-CNN has in masking people when they are occluded in a frame. It then looks at how depth data gathered from an RGB-D sensor can be used. We provide a case study to show how simply applying thresholding methods on the depth information can aid in distinguishing occluded persons. The intention of our research is to examine how features from depth data can benefit object pixel masking methods in an explainable manner, especially in complex scenes with multiple objects.