Title: Robust Long-Term Object Tracking via Improved Discriminative Model Prediction
We propose an improved discriminative model prediction method for robust long-term tracking based on a pre-trained short-term tracker. The baseline pre-trained short-term tracker is SuperDiMP, which combines the bounding-box regressor of PrDiMP with the standard DiMP classifier. Our tracker, RLT-DiMP, improves SuperDiMP in the following three aspects: (1) Uncertainty reduction using random erasing: to make our model robust, we treat the agreement among multiple images, each with small random rectangular areas erased, as a measure of certainty, and we correct the tracking state of our model accordingly. (2) Random search with spatio-temporal constraints: we propose a robust random search method with a score penalty that prevents sudden detections at a distant location. (3) Background augmentation for more discriminative feature learning: we augment with various backgrounds that are not included in the search area to train a model that is more robust to background clutter. In experiments on the VOT-LT2020 benchmark dataset, the proposed method achieves performance comparable to state-of-the-art long-term trackers. The source code is available at:
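The random-erasing idea in aspect (1) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the (x, y, w, h) box format, the IoU-based agreement measure, and the `predict_box` callback (a stand-in for any tracker that maps an image to a box) are all assumptions made for illustration.

```python
import numpy as np


def random_erase(image, rng, max_frac=0.2):
    """Return a copy of `image` with one small random rectangle set to the image mean."""
    h, w = image.shape[:2]
    # Rectangle size is at most `max_frac` of each side (an assumed setting).
    eh = int(rng.integers(1, max(2, int(h * max_frac) + 1)))
    ew = int(rng.integers(1, max(2, int(w * max_frac) + 1)))
    top = int(rng.integers(0, h - eh + 1))
    left = int(rng.integers(0, w - ew + 1))
    out = image.copy()
    out[top:top + eh, left:left + ew] = image.mean()
    return out


def iou(a, b):
    """Intersection over union of two boxes given as (x, y, w, h)."""
    iw = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def erasing_agreement(image, predict_box, n_views=8, iou_thresh=0.5, seed=0):
    """Fraction of randomly erased views whose predicted box agrees with the
    prediction on the unmodified image (hypothetical certainty measure)."""
    rng = np.random.default_rng(seed)
    reference = predict_box(image)
    hits = sum(
        iou(reference, predict_box(random_erase(image, rng))) >= iou_thresh
        for _ in range(n_views)
    )
    return hits / n_views
```

A low agreement value would suggest that the prediction depends on a small image region and may be unreliable; how such a value should change the tracking state is left open here.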
Editor(s):
Bartoli, A; Fusiello, A
Journal Name:
Computer Vision – ECCV 2020 Workshops
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Self-tracking using commodity wearables such as smartwatches can help older adults reduce sedentary behaviors and engage in physical activity. However, the activity recognition applications typically deployed on these wearables tend to be trained on datasets that best represent younger adults. We explore how our activity recognition model, a hybrid of long short-term memory and convolutional layers pre-trained on smartwatch data from younger adults, performs on older adult data. We report results on week-long data from two older adults, collected in a preliminary in-the-wild study with ground-truth annotations based on activPAL, a thigh-worn sensor. We find that activity recognition for older adults remains challenging even when comparing our model's performance to state-of-the-art deployed models such as the Google Activity Recognition API. Moreover, we show that models trained on younger adults tend to perform worse on older adults.
  2. Avidan, S. (Ed.)
    Despite the success of fully supervised human skeleton sequence modeling, self-supervised pre-training for skeleton sequence representation learning has been an active field because acquiring task-specific skeleton annotations at large scale is difficult. Recent studies focus on learning video-level temporal and discriminative information using contrastive learning, but overlook the hierarchical spatial-temporal nature of human skeletons. Unlike such superficial supervision at the video level, we propose a self-supervised hierarchical pre-training scheme incorporated into a hierarchical Transformer-based skeleton sequence encoder (Hi-TRS) to explicitly capture spatial, short-term, and long-term temporal dependencies at the frame, clip, and video levels, respectively. To evaluate the proposed self-supervised pre-training scheme with Hi-TRS, we conduct extensive experiments covering three skeleton-based downstream tasks: action recognition, action detection, and motion prediction. Under both supervised and semi-supervised evaluation protocols, our method achieves state-of-the-art performance. Additionally, we demonstrate that the prior knowledge learned by our model in the pre-training stage transfers well to different downstream tasks.
  3. Visual tracking has achieved remarkable success in recent decades, but it remains a challenging problem due to appearance variations over time and complex, cluttered backgrounds. In this paper, we adopt a tracking-by-verification scheme to overcome these challenges by determining the patch in the subsequent frame that is most similar to the target template and most distinct from the background context. A multi-stream deep similarity learning network is proposed to learn the similarity comparison model. The loss function of our network encourages the distance between a positive patch in the search region and the target template to be smaller than that between the positive patch and the background patches. Within the learned feature space, even if the distance between positive patches becomes large because of appearance change or interference from background clutter, our method can use the relative distance to distinguish the target robustly. In addition, the learned model is used directly for tracking with no need for model updating or parameter fine-tuning, and it can run at 45 fps on a single GPU. Our tracker achieves state-of-the-art performance on the visual tracking benchmark compared with other recent real-time trackers, and shows better capability in handling background clutter, occlusion, and appearance change.
  4. Synapses change on multiple timescales, ranging from milliseconds to minutes, due to a combination of short- and long-term plasticity. Here we develop an extension of the common generalized linear model to infer both short- and long-term changes in the coupling between a pre- and postsynaptic neuron based on observed spiking activity. We model short-term synaptic plasticity using additive effects that depend on the presynaptic spike timing, and we model long-term changes in both synaptic weight and baseline firing rate using point process adaptive smoothing. Using simulations, we first show that this model can accurately recover time-varying synaptic weights (1) for both depressing and facilitating synapses, (2) with a variety of long-term changes (including realistic changes, such as those due to STDP), (3) with a range of pre- and postsynaptic firing rates, and (4) for both excitatory and inhibitory synapses. We then apply our model to two experimentally recorded putative synaptic connections. We find that simultaneously tracking fast changes in synaptic weights, slow changes in synaptic weights, and unexplained variations in baseline firing is essential. Omitting any one of these factors can lead to spurious inferences for the others. Altogether, this model provides a flexible framework for tracking short- and long-term variation in spike transmission.
  5. The more new features are added to smartphones, the harder it becomes for users to find them. This is because feature names are usually short and there are too many of them for users to remember the exact words. Users are more comfortable asking contextual queries that describe the features they are looking for, but standard term-frequency-based search cannot process such queries. This paper presents a novel retrieval system for mobile features that accepts intuitive and contextual search queries. We trained a relevance model via contrastive learning from a pre-trained language model to perceive the contextual relevance between a query embedding and indexed mobile features. To make it run efficiently on-device with minimal resources, we also applied knowledge distillation to compress the model without much loss in performance. To verify the feasibility of our method, we collected test queries and conducted comparative experiments with the currently deployed search baselines. The results show that our system outperforms the others on contextual sentence queries and even on ordinary keyword-based queries.