
Title: Dynamic-GAN: Learning Spatial-Temporal Attention for Dynamic Object Removal in Feature Dense Environments
This paper presents an attention-based deep learning framework that converts robot camera frames containing dynamic content into static frames, making it easier to apply simultaneous localization and mapping (SLAM) algorithms. The vast majority of SLAM methods have difficulty when dynamic objects appear in the environment and occlude the area being captured by the camera. Despite past attempts to deal with dynamic objects, reconstructing large occluded areas with complex backgrounds remains challenging. Our proposed Dynamic-GAN framework employs a generative adversarial network to remove dynamic objects from a scene and inpaint a static image free of dynamic objects. The framework utilizes spatial-temporal transformers and a novel spatial-temporal loss function. Dynamic-GAN was evaluated comprehensively, both quantitatively and qualitatively, on benchmark datasets and on a mobile robot in indoor navigation environments. As people appeared dynamically in close proximity to the robot, results showed that large, feature-rich occluded areas can be accurately reconstructed with our attention-based deep learning framework for dynamic object removal. Through experiments we demonstrate that our proposed algorithm performs up to 25% better on average than standard benchmark algorithms.
Page Range / eLocation ID: 12189 to 12195
Sponsoring Org: National Science Foundation
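
As a concrete illustration of the mechanism, below is a minimal PyTorch-style sketch of one adversarial inpainting training step of the kind the abstract describes: dynamic objects are masked out, the generator inpaints a static frame, and training combines an adversarial term with a reconstruction term over the occluded region. The network classes, mask source, and loss weight are placeholder assumptions, not the authors' implementation, which additionally uses spatial-temporal transformers and a spatial-temporal loss.

import torch
import torch.nn.functional as F

def train_step(generator, discriminator, opt_g, opt_d, frames, masks, static_gt):
    # frames: (B, T, C, H, W) dynamic sequence; masks: 1 where dynamic objects occlude.
    masked = frames * (1.0 - masks)
    fake = generator(masked, masks)                          # inpainted static frames

    # Discriminator update: real static frames vs. generated ones.
    d_real = discriminator(static_gt)
    d_fake = discriminator(fake.detach())
    loss_d = F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) \
           + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Generator update: fool the discriminator and reconstruct the occluded pixels.
    d_fake = discriminator(fake)
    loss_adv = F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))
    loss_rec = F.l1_loss(fake * masks, static_gt * masks)    # focus on occluded region
    loss_g = loss_adv + 10.0 * loss_rec                      # weight is a placeholder
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_g.item(), loss_d.item()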
More Like this
  1. Intelligent multi-purpose robotic assistants have the potential to assist nurses with a variety of non-critical tasks, such as object fetching, disinfecting areas, or supporting patient care. This paper focuses on enabling a multi-purpose robot to guide patients while walking. The proposed robotic framework aims to enable a robot to learn how to navigate a crowded hospital environment while maintaining contact with the patient. Two deep reinforcement learning models are developed; the first model considers only dynamic obstacles (e.g., humans), while the second model considers both static and dynamic obstacles in the environment. The models output the robot’s velocity based on the following inputs: the patient’s gait velocity, computed with a leg detection method, and spatial and temporal information about the environment, the humans in the scene, and the robot. The proposed models demonstrate promising results. Finally, the model that considers both static and dynamic obstacles is successfully deployed in the Gazebo simulation environment.
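     A rough sketch of the kind of velocity policy this item describes, with assumed input sizes and layers (not the paper's architecture):

     import torch
     import torch.nn as nn

     class GuidancePolicy(nn.Module):
         def __init__(self, env_dim=64, hidden=128):
             super().__init__()
             self.net = nn.Sequential(
                 nn.Linear(env_dim + 1, hidden), nn.ReLU(),
                 nn.Linear(hidden, hidden), nn.ReLU(),
                 nn.Linear(hidden, 2),   # (linear, angular) robot velocity
             )

         def forward(self, env_features, patient_gait_velocity):
             # env_features: spatial-temporal summary of obstacles, humans, and the robot.
             x = torch.cat([env_features, patient_gait_velocity.unsqueeze(-1)], dim=-1)
             return self.net(x)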
  2. Localizing the camera in a known indoor environment is a key building block for scene mapping, robot navigation, AR, etc. Recent advances estimate the camera pose via optimization over the 2D-3D or 3D-3D correspondences established between coordinates in 2D/3D camera space and 3D world space. Such a mapping is estimated with either a convolutional neural network or a decision tree using only the static input image sequence, which makes these approaches vulnerable to dynamic indoor environments that are quite common yet challenging in the real world. To address these issues, in this paper we propose a novel outlier-aware neural tree which bridges two worlds: deep learning and decision tree approaches. It builds on three important blocks: (a) a hierarchical space partition over the indoor scene to construct the decision tree; (b) a neural routing function, implemented as a deep classification network, employed for better 3D scene understanding; and (c) an outlier rejection module used to filter out dynamic points during the hierarchical routing process. Our proposed algorithm is evaluated on the RIO-10 benchmark developed for camera relocalization in dynamic indoor environments. It achieves robust neural routing through space partitions and outperforms the state-of-the-art approaches by around 30% on camera pose accuracy, while running comparably fast for evaluation.
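     An illustrative sketch of hierarchical neural routing with an outlier branch; the node structure, classifier, and threshold are assumptions rather than the paper's code:

     import torch

     def route_point(point_feature, node, outlier_threshold=0.5):
         # Descend the space-partition tree; reject points classified as dynamic.
         while not node.is_leaf:
             probs = torch.softmax(node.classifier(point_feature), dim=-1)
             if probs[-1] > outlier_threshold:   # last class reserved for outliers
                 return None                     # dynamic point filtered out during routing
             node = node.children[int(torch.argmax(probs[:-1]))]
         return node                             # leaf holds the 3D scene coordinates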
  3. Video sequences contain rich dynamic patterns, such as dynamic texture patterns that exhibit stationarity in the temporal domain, and action patterns that are non-stationary in either spatial or temporal domain. We show that an energy-based spatial-temporal generative ConvNet can be used to model and synthesize dynamic patterns. The model defines a probability distribution on the video sequence, and the log probability is defined by a spatial-temporal ConvNet that consists of multiple layers of spatial-temporal filters to capture spatial-temporal patterns of different scales. The model can be learned from the training video sequences by an “analysis by synthesis” learning algorithm that iterates the following two steps. Step 1 synthesizes video sequences from the currently learned model. Step 2 then updates the model parameters based on the difference between the synthesized video sequences and the observed training sequences. We show that the learning algorithm can synthesize realistic dynamic patterns. We also show that it is possible to learn the model from incomplete training sequences with either occluded pixels or missing frames, so that model learning and pattern completion can be accomplished simultaneously. 
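     A toy sketch of the two-step loop, assuming model(x) returns an energy (lower energy means higher probability); step sizes and iteration counts are placeholders:

     import torch

     def learn(model, optimizer, observed, outer_steps=100, langevin_steps=20, delta=0.01):
         for _ in range(outer_steps):
             # Step 1: synthesize sequences from the current model via Langevin dynamics.
             synth = torch.randn_like(observed).requires_grad_(True)
             for _ in range(langevin_steps):
                 energy = model(synth).sum()
                 grad, = torch.autograd.grad(energy, synth)
                 synth = (synth - 0.5 * delta ** 2 * grad
                          + delta * torch.randn_like(synth)).detach().requires_grad_(True)
             # Step 2: update parameters from the difference between observed and synthesized data.
             loss = model(observed).mean() - model(synth.detach()).mean()
             optimizer.zero_grad(); loss.backward(); optimizer.step()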
  4. Objective. Dynamic positron emission tomography (PET) imaging, which can provide information on dynamic changes in physiological metabolism, is now widely used in clinical diagnosis and cancer treatment. However, reconstruction from dynamic data is extremely challenging due to the limited counts received in individual frames, especially in ultra-short frames. Recently, unrolled model-based deep learning methods have shown promising results for low-count PET image reconstruction with good interpretability. Nevertheless, existing model-based deep learning methods mainly focus on spatial correlations while ignoring the temporal domain. Approach. In this paper, inspired by the learned primal dual (LPD) algorithm, we propose the spatio-temporal primal dual network (STPDnet) for dynamic low-count PET image reconstruction. Both spatial and temporal correlations are encoded by 3D convolution operators. The physical projection of PET is embedded in the iterative learning process of the network, which provides physical constraints and enhances interpretability. Main results. Experiments on both simulated data and real rat scan data show that the proposed method achieves substantial noise reduction in both the temporal and spatial domains and outperforms maximum likelihood expectation maximization, the spatio-temporal kernel method, LPD, and FBPnet. Significance. Experimental results show that STPDnet achieves better reconstruction performance in low-count situations, which makes the proposed method particularly suitable for whole-body dynamic imaging and parametric PET imaging that require extremely short frames and usually suffer from high levels of noise.
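     A simplified sketch of one unrolled primal-dual iteration in the spirit of the description above; forward_project, back_project, the channel counts, and the 3D-convolution blocks are placeholders rather than the STPDnet architecture:

     import torch
     import torch.nn as nn

     class PrimalDualIteration(nn.Module):
         def __init__(self, channels=8):
             super().__init__()
             # 3D convolutions couple the spatial and temporal dimensions.
             self.dual_block = nn.Conv3d(channels + 2, channels, kernel_size=3, padding=1)
             self.primal_block = nn.Conv3d(channels + 1, channels, kernel_size=3, padding=1)

         def forward(self, primal, dual, sinogram, forward_project, back_project):
             # Dual update: compare the projected current estimate with the measured data.
             proj = forward_project(primal[:, :1])            # embed the PET physics
             dual = dual + self.dual_block(torch.cat([dual, proj, sinogram], dim=1))
             # Primal update: pull the dual variable back into image space.
             bp = back_project(dual[:, :1])
             primal = primal + self.primal_block(torch.cat([primal, bp], dim=1))
             return primal, dual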
  5. To obtain more consistent measurements through the course of a wheat growing season, we conceived and designed an autonomous robotic platform that performs collision avoidance while navigating in crop rows using spatial artificial intelligence (AI). The main constraint the agronomists have is not to run over the wheat while driving. Accordingly, we trained a spatial deep learning model that helps the robot navigate autonomously in the field while avoiding collisions with the wheat. To train this model, we used publicly available databases of pre-labeled images of wheat, along with images of wheat that we collected in the field. We used the MobileNet single shot detector (SSD) as our deep learning model to detect wheat in the field. To increase the frame rate for real-time robot response to field environments, we trained MobileNet SSD on the wheat images and used a new stereo camera, the Luxonis Depth AI Camera. Together, the newly trained model and camera could achieve a frame rate of 18–23 frames per second (fps), fast enough for the robot to process its surroundings once every 2–3 inches of driving. Once we confirmed that the robot accurately detects its surroundings, we addressed its autonomous navigation. The new stereo camera allows the robot to determine its distance from the trained objects. In this work, we also developed a navigation and collision avoidance algorithm that utilizes this distance information to help the robot see its surroundings and maneuver in the field, thereby precisely avoiding collisions with the wheat crop. Extensive experiments were conducted to evaluate the performance of our proposed method. We also compared the quantitative results obtained by our proposed MobileNet SSD model with those of other state-of-the-art object detection models, such as the YOLO V5 and Faster region-based convolutional neural network (R-CNN) models. The detailed comparative analysis reveals the effectiveness of our method in terms of both model precision and inference speed.
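     A simplified sketch of a distance-based avoidance rule of the kind described above; the thresholds, speeds, and detection format are assumptions rather than the paper's algorithm:

     def avoidance_command(detections, stop_distance=0.3, slow_distance=0.8):
         # detections: list of (label, distance_m, bearing_rad) from the depth camera.
         wheat = [d for d in detections if d[0] == "wheat"]
         if not wheat:
             return {"linear": 0.5, "angular": 0.0}   # clear row: drive straight
         label, dist, bearing = min(wheat, key=lambda d: d[1])
         if dist < stop_distance:
             return {"linear": 0.0, "angular": 0.0}   # too close: stop
         steer = -0.6 if bearing > 0 else 0.6         # turn away from the nearest plant
         speed = 0.2 if dist < slow_distance else 0.5
         return {"linear": speed, "angular": steer}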
