The task of instance segmentation in videos aims to consistently
identify objects at pixel level throughout the entire video sequence.
Existing state-of-the-art methods either follow the tracking-by-detection
paradigm with multi-stage pipelines or directly
train a complex deep model to process entire video clips as 3D
volumes. However, these methods are typically slow and resource-consuming,
so they are often limited to offline processing.
In this paper, we propose SRNet, a simple and efficient framework
for joint segmentation and tracking of object instances in videos.
The key to achieving both high efficiency and accuracy in our
framework is to formulate the instance segmentation and tracking
problem as a unified spatial-relation learning task in which each
pixel in the current frame relates to its object center, and each object
center relates to its location in the previous frame. This unified
formulation allows our framework to perform joint instance
segmentation and tracking in a single stage while keeping
overhead low across the different learning tasks. Our proposed
framework can handle two different task settings and achieves
performance comparable to state-of-the-art methods on two different
benchmarks while running significantly faster.
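
The two spatial relations described above can be illustrated with a minimal sketch. It is not the SRNet implementation: it assumes a network has already produced a per-pixel offset to the object center and a per-center offset to the previous-frame location, and it uses a simple greedy nearest-neighbor rule for both grouping and linking. The function names, the dictionary-based data layout, and the `max_dist` threshold are all hypothetical choices made for illustration.

```python
from math import hypot


def group_pixels(fg_pixels, pixel_offsets, centers):
    """Assign each foreground pixel to the center nearest to where its offset points.

    fg_pixels: list of (y, x); pixel_offsets: dict (y, x) -> (dy, dx);
    centers: list of (y, x). Returns dict (y, x) -> index into centers.
    """
    labels = {}
    for (y, x) in fg_pixels:
        dy, dx = pixel_offsets[(y, x)]
        ty, tx = y + dy, x + dx  # location this pixel "votes" for
        labels[(y, x)] = min(
            range(len(centers)),
            key=lambda i: hypot(centers[i][0] - ty, centers[i][1] - tx),
        )
    return labels


def link_centers(centers, center_offsets, prev_tracks, max_dist=2.0):
    """Greedily link current centers to previous-frame tracks (assumed rule).

    center_offsets: list of (dy, dx) pointing from each current center to its
    predicted previous-frame location; prev_tracks: dict track_id -> (y, x).
    Returns one track id per center; unmatched centers start new tracks.
    """
    ids = []
    used = set()
    next_id = max(prev_tracks, default=-1) + 1
    for (cy, cx), (dy, dx) in zip(centers, center_offsets):
        py, px = cy + dy, cx + dx  # predicted position in the previous frame
        candidates = [
            (hypot(py - ty, px - tx), tid)
            for tid, (ty, tx) in prev_tracks.items()
            if tid not in used
        ]
        best = min(candidates, default=None)
        if best is not None and best[0] <= max_dist:
            ids.append(best[1])
            used.add(best[1])
        else:
            ids.append(next_id)  # no close previous track: new object
            next_id += 1
    return ids


# Toy example with two objects and hand-written offsets.
centers = [(1, 1), (5, 5)]
fg_pixels = [(0, 1), (1, 2), (5, 4), (6, 5)]
pixel_offsets = {(0, 1): (1, 0), (1, 2): (0, -1), (5, 4): (0, 1), (6, 5): (-1, 0)}
labels = group_pixels(fg_pixels, pixel_offsets, centers)

prev_tracks = {7: (1, 0), 3: (5, 6)}
track_ids = link_centers(centers, [(0, -1), (0, 1)], prev_tracks)
```

In this toy example each pixel's offset points exactly at its object's center, and each center's offset points exactly at a previous-frame track, so segmentation and identity assignment both reduce to nearest-neighbor lookups on the outputs of a single forward pass.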