skip to main content

Search for: All records

Creators/Authors contains: "Frahm, Jan-Michael"

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

  1. We present a new approach, EgoGlass, towards egocentric motion-capture and human pose estimation. EgoGlass is a lightweight eyeglass frame with two cameras mounted on it. Our first contribution is a new egocentric motion-capture device that adds next to no extra burden on the user and a dataset of real people doing a diverse set of actions captured by EgoGlass. Second, we propose to utilize body part information for human pose detection - to help tackle the problems of limited body coverage and self-occlusions caused by the egocentric viewpoint and cameras’ proximity to the human body. We also propose a concept of pseudo-limb mask as an alternative for segmentation mask when ground truth segmentation mask is absent for egocentric images with real subject. We demonstrate that our method achieves better results than the counterpart method without body part information on our dataset. We also test our method on two existing egocentric datasets: xR-EgoPose and EgoCap. Our method achieves state-of-the-art results on xR-EgoPose and is on par with existing method for EgoCap without requiring temporal information or personalization for each individual user.
  2. We propose a boundary-aware multi-task deep-learning- based framework for fast 3D building modeling from a sin- gle overhead image. Unlike most existing techniques which rely on multiple images for 3D scene modeling, we seek to model the buildings in the scene from a single overhead im- age by jointly learning a modified signed distance function (SDF) from the building boundaries, a dense heightmap of the scene, and scene semantics. To jointly train for these tasks, we leverage pixel-wise semantic segmentation and normalized digital surface maps (nDSM) as supervision, in addition to labeled building outlines. At test time, buildings in the scene are automatically modeled in 3D using only an input overhead image. We demonstrate an increase in building modeling performance using a multi-feature net- work architecture that improves building outline detection by considering network features learned for the other jointly learned tasks. We also introduce a novel mechanism for ro- bustly refining instance-specific building outlines using the learned modified SDF. We verify the effectiveness of our method on multiple large-scale satellite and aerial imagery datasets, where we obtain state-of-the-art performance in the 3D building reconstruction task.
  3. Despite remarkable improvements in speed and accuracy, convolutional neural networks (CNNs) still typically operate as monolithic entities at inference time. This poses a challenge for resource-constrained practical applications, where both computational budgets and performance needs can vary with the situation. To address these constraints, we propose the Any-Width Network (AWN), an adjustable-width CNN architecture and associated training routine that allow for fine-grained control over speed and accuracy during inference. Our key innovation is the use of lower-triangular weight matrices which explicitly address width-varying batch statistics while being naturally suited for multi-width operations. We also show that this design facilitates an efficient training routine based on random width sampling. We empirically demonstrate that our proposed AWNs compare favorably to existing methods while providing maximally granular control during inference.
  4. Vision-based perception systems are crucial for profitable autonomous-driving vehicle products. High accuracy in such perception systems is being enabled by rapidly evolving convolution neural networks (CNNs). To achieve a better understanding of its surrounding environment, a vehicle must be provided with full coverage via multiple cameras. However, when processing multiple video streams, existing CNN frameworks often fail to provide enough inference performance, particularly on embedded hardware constrained by size, weight, and power limits. This paper presents the results of an industrial case study that was conducted to re-think the design of CNN software to better utilize available hardware resources. In this study, techniques such as parallelism, pipelining, and the merging of per-camera images into a single composite image were considered in the context of a Drive PX2 embedded hardware platform. The study identifies a combination of techniques that can be applied to increase throughput (number of simultaneous camera streams) without significantly increasing per-frame latency (camera to CNN output) or reducing per-stream accuracy.