Future Near-Collision Prediction from Monocular Video: Feasibility, Dataset, and Challenges
We explore the possibility of using a single monocular camera to forecast the time to collision between a suitcase-shaped robot being pushed by its user and other nearby pedestrians. We develop a purely image-based deep learning approach that directly estimates the time to collision without relying on explicit geometric depth estimates or velocity information to predict future collisions. While previous work in the context of navigating Unmanned Aerial Vehicles has focused on detecting imminent collisions, that detection was limited to a binary label (i.e., collision or no collision). We propose a more fine-grained approach to collision forecasting that predicts the time to collision in milliseconds, which is more useful for collision avoidance in the context of dynamic path planning. To evaluate our method, we collected a novel large-scale dataset of over 13,000 indoor video segments, each showing the trajectory of at least one person ending in close proximity to (a near collision with) a camera mounted on a mobile suitcase-shaped platform. Using this dataset, we conduct extensive experiments with different temporal windows as input, using a broad set of state-of-the-art convolutional neural network (CNN) architectures. Our results show that our proposed multi-stream CNN is the best model for predicting time to near-collision, with an average prediction error of 0.75 seconds across our test environments. The project webpage can be found at https://aashi7.github.io/NearCollision.html.
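The abstract gives no implementation details, but the described approach — a multi-stream CNN that maps a short temporal window of monocular frames to a continuous time-to-near-collision value — can be sketched roughly as follows. This is a minimal PyTorch sketch under assumed choices (ResNet-18 backbones, one stream per frame, a 6-frame window, an L1 regression loss in seconds); the backbone, layer sizes, and loss are illustrative assumptions, not the authors' actual architecture.

```python
# Minimal sketch (not the authors' code): a multi-stream CNN that regresses
# time to near-collision from a short temporal window of monocular frames.
# Assumptions: one ResNet-18 stream per frame, features concatenated and fed
# to a small regression head, L1 loss on the predicted time in seconds.
import torch
import torch.nn as nn
import torchvision.models as models


class MultiStreamTTC(nn.Module):
    def __init__(self, num_frames: int = 6, feat_dim: int = 512):
        super().__init__()
        self.num_frames = num_frames
        # One CNN stream per input frame; the choice of backbone and whether
        # streams share weights are assumptions made for this sketch.
        self.streams = nn.ModuleList()
        for _ in range(num_frames):
            backbone = models.resnet18(weights=None)
            backbone.fc = nn.Identity()  # keep the 512-d pooled feature
            self.streams.append(backbone)
        # Fuse the per-frame features and regress a single scalar:
        # the predicted time to near-collision in seconds.
        self.head = nn.Sequential(
            nn.Linear(num_frames * feat_dim, 256),
            nn.ReLU(inplace=True),
            nn.Linear(256, 1),
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, num_frames, 3, H, W)
        feats = [stream(frames[:, i]) for i, stream in enumerate(self.streams)]
        fused = torch.cat(feats, dim=1)
        return self.head(fused).squeeze(1)  # (batch,) seconds to near-collision


if __name__ == "__main__":
    model = MultiStreamTTC(num_frames=6)
    clips = torch.randn(2, 6, 3, 224, 224)      # two clips of 6 RGB frames
    pred_seconds = model(clips)                  # predicted time to near-collision
    target = torch.tensor([1.5, 0.8])            # illustrative ground-truth seconds
    loss = nn.functional.l1_loss(pred_seconds, target)
    print(pred_seconds.shape, float(loss))
```

Under this formulation, the reported metric (0.75 s average error) would correspond to the mean absolute difference between the regressed and ground-truth times over held-out segments.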
- Award ID(s): 1637927
- PAR ID: 10304295
- Date Published:
- Journal Name: IEEE/RSJ International Conference on Intelligent Robots and Systems
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation