skip to main content

Title: Sparse Representations for Object- and Ego-Motion Estimations in Dynamic Scenes
Disentangling the sources of visual motion in a dynamic scene during self-movement or ego motion is important for autonomous navigation and tracking. In the dynamic image segments of a video frame containing independently moving objects, optic flow relative to the next frame is the sum of the motion fields generated due to camera and object motion. The traditional ego-motion estimation methods assume the scene to be static, and the recent deep learning-based methods do not separate pixel velocities into object- and ego-motion components. We propose a learning-based approach to predict both ego-motion parameters and object-motion field (OMF) from image sequences using a convolutional autoencoder while being robust to variations due to the unconstrained scene depth. This is achieved by: 1) training with continuous ego-motion constraints that allow solving for ego-motion parameters independently of depth and 2) learning a sparsely activated overcomplete ego-motion field (EMF) basis set, which eliminates the irrelevant components in both static and dynamic segments for the task of ego-motion estimation. In order to learn the EMF basis set, we propose a new differentiable sparsity penalty function that approximates the number of nonzero activations in the bottleneck layer of the autoencoder and enforces sparsity more effectively than L1- more » and L2-norm-based penalties. Unlike the existing direct ego-motion estimation methods, the predicted global EMF can be used to extract OMF directly by comparing it against the optic flow. Compared with the state-of-the-art baselines, the proposed model performs favorably on pixelwise object- and ego-motion estimation tasks when evaluated on real and synthetic data sets of dynamic scenes. « less
; ;
Award ID(s):
1813785 2120019
Publication Date:
Journal Name:
IEEE Transactions on Neural Networks and Learning Systems
Page Range or eLocation-ID:
1 to 14
Sponsoring Org:
National Science Foundation
More Like this
  1. We present the first event-based learning approach for motion segmentation in indoor scenes and the first event-based dataset – EV-IMO – which includes accurate pixel-wise motion masks, egomotion and ground truth depth. Our approach is based on an efficient implementation of the SfM learning pipeline using a low parameter neural network architecture on event data. In addition to camera egomotion and a dense depth map, the network estimates independently moving object segmentation at the pixel-level and computes per-object 3D translational velocities of moving objects. We also train a shallow network with just 40k parameters, which is able to compute depthmore »and egomotion. Our EV-IMO dataset features 32 minutes of indoor recording with up to 3 fast moving objects in the camera field of view. The objects and the camera are tracked using a VICON motion capture system. By 3D scanning the room and the objects, ground truth of the depth map and pixel-wise object masks are obtained. We then train and evaluate our learning pipeline on EV-IMO and demonstrate that it is well suited for scene constrained robotics applications.« less
  2. Event-based cameras have been designed for scene motion perception - their high temporal resolution and spatial data sparsity converts the scene into a volume of boundary trajectories and allows to track and analyze the evolution of the scene in time. Analyzing this data is computationally expensive, and there is substantial lack of theory on dense-in-time object motion to guide the development of new algorithms; hence, many works resort to a simple solution of discretizing the event stream and converting it to classical pixel maps, which allows for application of conventional image processing methods. In this work we present a Graphmore »Convolutional neural network for the task of scene motion segmentation by a moving camera. We convert the event stream into a 3D graph in (x,y,t) space and keep per-event temporal information. The difficulty of the task stems from the fact that unlike in metric space, the shape of an object in (x,y,t) space depends on its motion and is not the same across the dataset. We discuss properties of of the event data with respect to this 3D recognition problem, and show that our Graph Convolutional architecture is superior to PointNet++. We evaluate our method on the state of the art event-based motion segmentation dataset - EV-IMO and perform comparisons to a frame-based method proposed by its authors. Our ablation studies show that increasing the event slice width improves the accuracy, and how subsampling and edge configurations affect the network performance.« less
  3. We present an unsupervised learning framework for simultaneously training single-view depth prediction and optical flow estimation models using unlabeled video sequences. Existing unsupervised methods often exploit brightness constancy and spatial smoothness priors to train depth or flow models. In this paper, we propose to leverage geometric consistency as additional supervisory signals. Our core idea is that for rigid regions we can use the predicted scene depth and camera motion to synthesize 2D optical flow by backprojecting the induced 3D scene flow. The discrepancy between the rigid flow (from depth prediction and camera motion) and the estimated flow (from optical flowmore »model) allows us to impose a cross-task consistency loss. While all the networks are jointly optimized during training, they can be applied independently at test time. Extensive experiments demonstrate that our depth and flow models compare favorably with state-of-the-art unsupervised methods.« less
  4. Drilling and milling operations are material removal processes involved in everyday conventional productions, especially in the high-speed metal cutting industry. The monitoring of tool information (wear, dynamic behavior, deformation, etc.) is essential to guarantee the success of product fabrication. Many methods have been applied to monitor the cutting tools from the information of cutting force, spindle motor current, vibration, as well as sound acoustic emission. However, those methods are indirect and sensitive to environmental noises. Here, the in-process imaging technique that can capture the cutting tool information while cutting the metal was studied. As machinists judge whether a tool ismore »worn-out by the naked eye, utilizing the vision system can directly present the performance of the machine tools. We proposed a phase shifted strobo-stereoscopic method (Figure 1) for three-dimensional (3D) imaging. The stroboscopic instrument is usually applied for the measurement of fast-moving objects. The operation principle is as follows: when synchronizing the frequency of the light source illumination and the motion of object, the object appears to be stationary. The motion frequency of the target is transferring from the count information of the encoder signals from the working rotary spindle. If small differences are added to the frequency, the object appears to be slowly moving or rotating. This effect can be working as the source for the phase-shifting; with this phase information, the target can be whole-view 3D reconstructed by 360 degrees. The stereoscopic technique is embedded with two CCD cameras capturing images that are located bilateral symmetrically in regard to the target. The 3D scene is reconstructed by the location information of the same object points from both the left and right images. In the proposed system, an air spindle was used to secure the motion accuracy and drilling/milling speed. As shown in Figure 2, two CCDs with 10X objective lenses were installed on a linear rail with rotary stages to capture the machine tool bit raw picture for further 3D reconstruction. The overall measurement process was summarized in the flow chart (Figure 3). As the count number of encoder signals is related to the rotary speed, the input speed (unit of RPM) was set as the reference signal to control the frequency (f0) of the illumination of the LED. When the frequency was matched with the reference signal, both CCDs started to gather the pictures. With the mismatched frequency (Δf) information, a sequence of images was gathered under the phase-shifted process for a whole-view 3D reconstruction. The study in this paper was based on a 3/8’’ drilling tool performance monitoring. This paper presents the principle of the phase-shifted strobe-stereoscopic 3D imaging process. A hardware set-up is introduced, , as well as the 3D imaging algorithm. The reconstructed image analysis under different working speeds is discussed, the reconstruction resolution included. The uncertainty of the imaging process and the built-up system are also analyzed. As the input signal is the working speed, no other information from other sources is required. This proposed method can be applied as an on-machine or even in-process metrology. With the direct method of the 3D imaging machine vision system, it can directly offer the machine tool surface and fatigue information. This presented method can supplement the blank for determining the performance status of the machine tools, which further guarantees the fabrication process.« less
  5. In order for robots to operate effectively in homes and workplaces, they must be able to manipulate the articulated objects common within environments built for and by humans. Kinematic models provide a concise representation of these objects that enable deliberate, generalizable manipulation policies. However, existing approaches to learning these models rely upon visual observations of an object’s motion, and are subject to the effects of occlusions and feature sparsity. Natural language descriptions provide a flexible and efficient means by which humans can provide complementary information in a weakly supervised manner suitable for a variety of different interactions (e.g., demonstrations andmore »remote manipulation). In this paper, we present a multimodal learning framework that incorporates both vision and language information acquired in situ to estimate the structure and parameters that de- fine kinematic models of articulated objects. The visual signal takes the form of an RGB-D image stream that opportunistically captures object motion in an unprepared scene. Accompanying natural language descriptions of the motion constitute the linguistic signal. We model linguistic information using a probabilistic graphical model that grounds natural language descriptions to their referent kinematic motion. By exploiting the complementary nature of the vision and language observations, our method infers correct kinematic models for various multiple-part objects on which the previous state-of-the- art, visual-only system fails. We evaluate our multimodal learning framework on a dataset comprised of a variety of household objects, and demonstrate a 23% improvement in model accuracy over the vision-only baseline.« less