Title: Forecasting Time-to-Collision from Monocular Video: Feasibility, Dataset, and Challenges
Abstract: We explore the possibility of using a single monocular camera to forecast the time to collision between a suitcase-shaped robot, pushed by its user, and nearby pedestrians. We develop a purely image-based deep learning approach that estimates the time to collision directly, without relying on explicit geometric depth estimates or velocity information to predict future collisions. While previous work has focused on detecting imminent collisions for navigating Unmanned Aerial Vehicles, that detection was limited to a binary output (collision or no collision). We propose a more fine-grained approach to collision forecasting that predicts the exact time to collision in milliseconds, which is more helpful for collision avoidance in dynamic path planning. To evaluate our method, we collected a novel dataset of over 13,000 indoor video segments, each showing the trajectory of at least one person ending in close proximity (a near-collision) to the camera mounted on a mobile suitcase-shaped platform. Using this dataset, we conduct extensive experiments with different temporal windows as input, using a comprehensive set of state-of-the-art convolutional neural networks (CNNs). Our results show that our proposed multi-stream CNN is the best model for predicting time to near-collision, with an average prediction error of 0.75 seconds across the test videos. The project webpage can be found at https://aashi7.github.io/NearCollision.html.
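The abstract describes a multi-stream CNN that maps a short window of monocular frames to a continuous time-to-near-collision value. The sketch below is only a minimal illustration of that setup, not the paper's architecture: the number of streams, the ResNet-18 backbone, the input window length, and the regression head are all assumptions.

```python
# Minimal sketch of a multi-stream CNN that regresses time-to-near-collision
# from a short window of monocular frames. Backbone, stream count, and head
# sizes are illustrative assumptions, not the paper's actual architecture.
import torch
import torch.nn as nn
import torchvision.models as models

class MultiStreamTTC(nn.Module):
    def __init__(self, num_frames: int = 6):
        super().__init__()
        # One CNN stream per input frame; each stream outputs a 512-d feature.
        self.streams = nn.ModuleList()
        for _ in range(num_frames):
            backbone = models.resnet18(weights=None)
            backbone.fc = nn.Identity()          # keep the 512-d pooled feature
            self.streams.append(backbone)
        # Fuse per-frame features and regress a single scalar (seconds).
        self.head = nn.Sequential(
            nn.Linear(512 * num_frames, 256),
            nn.ReLU(inplace=True),
            nn.Linear(256, 1),
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, num_frames, 3, H, W)
        feats = [stream(frames[:, i]) for i, stream in enumerate(self.streams)]
        return self.head(torch.cat(feats, dim=1)).squeeze(1)

if __name__ == "__main__":
    model = MultiStreamTTC(num_frames=6)
    clip = torch.randn(2, 6, 3, 224, 224)   # two dummy 6-frame windows
    print(model(clip).shape)                 # torch.Size([2]) -> predicted seconds
```

Training such a model would pair each input window with the annotated time until the near-collision and minimize a regression loss (e.g., L1 or MSE); the paper's actual streams, fusion, and training procedure may differ.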
Award ID(s): 1637927
PAR ID: 10308758
Journal Name: 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this
  1. We present the first event-based learning approach for motion segmentation in indoor scenes and the first event-based dataset – EV-IMO – which includes accurate pixel-wise motion masks, egomotion and ground truth depth. Our approach is based on an efficient implementation of the SfM learning pipeline using a low-parameter neural network architecture on event data. In addition to camera egomotion and a dense depth map, the network estimates independently moving object segmentation at the pixel level and computes per-object 3D translational velocities of moving objects. We also train a shallow network with just 40k parameters, which is able to compute depth and egomotion. Our EV-IMO dataset features 32 minutes of indoor recording with up to 3 fast moving objects in the camera field of view. The objects and the camera are tracked using a VICON motion capture system. By 3D scanning the room and the objects, ground truth of the depth map and pixel-wise object masks is obtained. We then train and evaluate our learning pipeline on EV-IMO and demonstrate that it is well suited for scene-constrained robotics applications.
  2. As camera trapping has become a standard practice in wildlife ecology, developing techniques to extract additional information from images will increase the utility of the generated data. Despite rapid advancements in camera trapping practices, methods for estimating animal size or distance from the camera using captured images have not been standardized. Deriving animal sizes directly from images creates opportunities to collect wildlife metrics such as growth rates or changes in body condition. Distances to animals may be used to quantify important aspects of sampling design, such as the effective area sampled or the distribution of animals in the camera's field of view. We present a method of using pixel measurements in an image to estimate animal size or distance from the camera using a conceptual model in photogrammetry known as the 'pinhole camera model' (a minimal sketch of this relation appears after this list). We evaluated the performance of this approach both using stationary three-dimensional animal targets and in a field setting using live captive reindeer (Rangifer tarandus) ranging in size and distance from the camera. We found that the total mean relative error of estimated animal sizes or distances from the cameras was −3.0% and 3.3% in our simulation and −8.6% and 10.5% in our field setting, respectively. In our simulation, the mean relative error of size or distance estimates was not statistically different between image settings within camera models, between camera models, or between the measured dimensions used in calculations. We provide recommendations for applying the pinhole camera model in a wildlife camera trapping context. Our approach produced robust estimates from a single image while remaining easy to implement and generalizable to different camera trap models and installations, enhancing its utility for a variety of camera trap applications and expanding opportunities to use camera trap images in novel ways.
  3. Vedaldi, Andrea; Bischof, Horst; Brox, Thomas; Frahm, Jan-Michael (Ed.)
    This paper focuses on the problem of predicting future trajectories of people in unseen scenarios and camera views. We propose a method to efficiently utilize multi-view 3D simulation data for training. Our approach finds the hardest camera view to mix up with adversarial data from the original camera view during training, thus enabling the model to learn robust representations that generalize to unseen camera views. We refer to our method as SimAug. We show that SimAug achieves the best results on three out-of-domain real-world benchmarks, as well as state-of-the-art performance on the Stanford Drone and VIRAT/ActEV datasets with in-domain training data. We will release our models and code.
  4. We present Splat-Nav, a real-time robot navigation pipeline for Gaussian splatting (GSplat) scenes, a powerful new 3-D scene representation. Splat-Nav consists of two components: Splat-Plan, a safe planning module, and Splat-Loc, a robust vision-based pose estimation module. Splat-Plan builds a safe-by-construction polytope corridor through the map based on mathematically rigorous collision constraints and then constructs a Bézier curve trajectory through this corridor (a minimal sketch of the Bézier evaluation appears after this list). Splat-Loc provides real-time recursive state estimates given only an RGB feed from an on-board camera, leveraging the point-cloud representation inherent in GSplat scenes. Working together, these modules give robots the ability to recursively replan smooth and safe trajectories to goal locations. Goals can be specified with position coordinates or with language commands by using a semantic GSplat. We demonstrate improved safety compared to point-cloud-based methods in extensive simulation experiments. In a total of 126 hardware flights, we demonstrate safety and speed equivalent to motion capture and visual odometry, but without the manual frame alignment required by those methods. We show online replanning at more than 2 Hz and pose estimation at about 25 Hz, an order of magnitude faster than neural-radiance-field-based navigation methods, thereby enabling real-time navigation.
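The camera-trap abstract above estimates animal size or distance from pixel measurements via the pinhole camera model. The sketch below is a minimal worked example of that relation only; the focal-length value and the example measurements are assumptions, and the paper's calibration procedure and error analysis are not reproduced.

```python
# Minimal sketch of the pinhole camera model used to estimate distance or
# real-world size from pixel measurements. Assumes the focal length is known
# in pixels (f_px = focal_length_mm / sensor_height_mm * image_height_px).

def distance_from_size(real_height_m: float, height_px: float, f_px: float) -> float:
    """Distance to an object of known real height, given its height in pixels."""
    return f_px * real_height_m / height_px

def size_from_distance(distance_m: float, height_px: float, f_px: float) -> float:
    """Real height of an object at a known distance, given its height in pixels."""
    return height_px * distance_m / f_px

if __name__ == "__main__":
    f_px = 1200.0  # hypothetical focal length in pixels
    # A 1.2 m tall animal that spans 180 px in the image:
    print(distance_from_size(1.2, 180.0, f_px))   # 8.0 -> about 8 m from the camera
    # An object 160 px tall known to be 5.0 m away:
    print(size_from_distance(5.0, 160.0, f_px))   # ~0.67 -> about 0.67 m tall
```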
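The Splat-Nav abstract above represents trajectories as Bézier curves routed through a safe polytope corridor. The sketch below only shows how a cubic Bézier segment is evaluated from its control points; the control-point values are hypothetical, and the corridor construction and collision constraints of Splat-Plan are not shown.

```python
# Minimal sketch of evaluating a cubic Bézier curve from four 3-D control
# points. Because a Bézier curve lies in the convex hull of its control points,
# keeping all control points inside a convex safe corridor keeps the whole
# curve inside it; the corridor construction itself is not shown here.
import numpy as np

def cubic_bezier(p0, p1, p2, p3, t):
    """Point on the cubic Bézier curve at parameter t in [0, 1]."""
    p0, p1, p2, p3 = map(np.asarray, (p0, p1, p2, p3))
    return ((1 - t) ** 3 * p0
            + 3 * (1 - t) ** 2 * t * p1
            + 3 * (1 - t) * t ** 2 * p2
            + t ** 3 * p3)

if __name__ == "__main__":
    # Hypothetical control points for one corridor segment (x, y, z in meters).
    ctrl = ([0.0, 0.0, 1.0], [1.0, 0.5, 1.0], [2.0, 0.5, 1.2], [3.0, 0.0, 1.2])
    for t in np.linspace(0.0, 1.0, 5):
        print(np.round(cubic_bezier(*ctrl, t), 3))
```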