Defining hand-crafted feature representations needs expert knowledge, requires timeconsuming manual adjustments, and besides, it is arguably one of the limiting factors of object tracking. In this paper, we propose a novel solution to automatically relearn the most useful feature representations during the tracking process in order to accurately adapt appearance changes, pose and scale variations while preventing from drift and tracking failures. We employ a candidate pool of multiple Convolutional Neural Networks (CNNs) as a data-driven model of different instances of the target object. Individually, each CNN maintains a specific set of kernels that favourably discriminate object patches from their surrounding background using all available low-level cues. These kernels are updated in an online manner at each frame after being trained with just one instance at the initialization of the corresponding CNN. Given a frame, the most promising CNNs in the pool are selected to evaluate the hypothesises for the target object. The hypothesis with the highest score is assigned as the current detection window and the selected models are retrained using a warm-start back-propagation which optimizes a structural loss function. In addition to the model-free tracker, we introduce a class-specific version of the proposed method that is tailored for tracking of a particular object class such as human faces. Our experiments on a large selection of videos from the recent benchmarks demonstrate that our method outperforms the existing state-of-the-art algorithms and rarely loses the track of the target object.
Alexey DosovitskiyPhilipp FischerJost Tobias SpringenbergMartin RiedmillerThomas Brox
Jianyuan GuoYuhui YuanChao Zhang