Good Features to Correlate for Visual Tracking


Thesis Type: Doctoral (PhD)

Institution: Orta Doğu Teknik Üniversitesi (Middle East Technical University), Faculty of Engineering, Department of Electrical and Electronics Engineering, Turkey

Approval Date: 2017

Student: ERHAN GÜNDOĞDU

Supervisor: ABDULLAH AYDIN ALATAN

Abstract:

Estimating object motion is one of the key components of video processing and the first step in applications that require video representation. Visual object tracking is one way of extracting this component, and it is one of the major problems in the field of computer vision. Numerous discriminative and generative machine learning approaches have been employed to solve this problem. Recently, correlation filter based (CFB) approaches have become popular due to their computational efficiency and notable performance on benchmark datasets. The ultimate goal of CFB approaches is to find a filter (i.e., a template) that produces high correlation outputs around the actual object location and low correlation outputs at locations far from the object. Nevertheless, CFB visual tracking methods suffer from many challenges, such as occlusion, abrupt appearance changes, fast motion and object deformation. The main reasons for these shortcomings are that past poses of the object are forgotten due to the simple update stages of CFB methods, that the model update rate is non-optimal, and that the features are not invariant to appearance changes of the target object.

To address these disadvantages of CFB visual tracking methods, this thesis makes three major contributions. First, a spatial window learning method is proposed to improve the correlation quality. For this purpose, a window that is element-wise multiplied by the object observation (or the correlation filter) is learned by a novel gradient descent procedure. The learned window is capable of suppressing or highlighting the relevant regions of the object, and can improve the tracking performance in the case of occlusion and object deformation.

As the second contribution, an ensemble-of-trackers algorithm is proposed to handle the issues of a non-optimal learning rate and the forgetting of past object poses. The trackers in the ensemble are organized in a binary tree, which stores individual expert trackers at its nodes. During the course of tracking, the expert trackers relevant to the most recent object appearance are activated and utilized in the localization and update stages. The proposed ensemble method significantly improves tracking accuracy, especially when the expert trackers are CFB trackers that utilize the proposed window learning method.

The final contribution of the thesis addresses the feature learning problem, focused specifically on the CFB visual tracking loss function. For this loss function, a novel backpropagation algorithm is developed to train any fully convolutional deep neural network. The proposed gradient calculation, which is required for backpropagation, is performed efficiently in both the frequency and image domains, and has linear complexity in the number of feature maps. The network model is trained on carefully curated datasets that include well-known difficulties of visual tracking, e.g., occlusion, object deformation and fast motion. When the learned features are integrated into state-of-the-art CFB visual trackers, favorable tracking performance is obtained on benchmark datasets against CFB methods that employ hand-crafted features or deep features extracted from pre-trained classification models.
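For reference, the filter described above is typically obtained in the correlation filter tracking literature by solving a multi-channel ridge regression problem of the following standard form; the exact objective, regularization and windowing used in the thesis may differ, so this is only the common textbook formulation:

    \min_{h_1,\dots,h_D}\; \Big\| \sum_{d=1}^{D} x_d \star h_d - y \Big\|_2^2 \;+\; \lambda \sum_{d=1}^{D} \| h_d \|_2^2

Here x_d is the d-th feature channel of the target patch, h_d the corresponding filter channel, \star denotes circular correlation, y is a desired response with a sharp peak at the object location and low values elsewhere, and \lambda is a regularization weight. Circular correlation diagonalizes under the discrete Fourier transform, which is what makes CFB trackers computationally efficient.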
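A minimal single-channel sketch of this train/detect cycle in the Fourier domain, in the spirit of MOSSE-style correlation filters, is given below. It illustrates the general mechanism only, not the tracker proposed in the thesis; the function names and the Gaussian-shaped desired response are assumptions made here for concreteness.

    import numpy as np

    def gaussian_peak(shape, sigma=2.0):
        # Desired correlation output: a sharp Gaussian peak at the patch centre.
        h, w = shape
        ys, xs = np.mgrid[0:h, 0:w]
        cy, cx = h // 2, w // 2
        return np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2.0 * sigma ** 2))

    def train_filter(patch, y, lam=1e-2):
        # Closed-form ridge-regression solution for the (conjugate) filter
        # in the Fourier domain; lam is the regularization weight.
        X = np.fft.fft2(patch)
        Y = np.fft.fft2(y)
        return (Y * np.conj(X)) / (X * np.conj(X) + lam)

    def detect(h_conj, patch):
        # Correlate the filter with a new search patch; the location of the
        # response peak gives the estimated object displacement.
        X = np.fft.fft2(patch)
        response = np.real(np.fft.ifft2(h_conj * X))
        return response, np.unravel_index(np.argmax(response), response.shape)

In a complete tracker the filter is also updated online, e.g. with a running average over frames; the choice of that update (learning) rate and the loss of older object poses it causes are exactly the issues targeted by the ensemble contribution summarized above.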
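The spatial window learning idea of the first contribution can be illustrated with a toy gradient step on a squared-error correlation loss over an element-wise window w; this is only a sketch of the underlying idea under those assumptions, not the actual objective or optimization procedure of the thesis. It assumes h_hat is the 2-D FFT of a fixed correlation filter and y the desired response from the sketch above.

    import numpy as np

    def window_gradient_step(w, x, h_hat, y, lr=0.1):
        # One descent step on L(w) = || IFFT( FFT(w * x) . conj(h_hat) ) - y ||^2,
        # i.e. the squared error between the correlation response of the
        # windowed observation w * x and the desired output y.
        z = w * x                                                    # windowed observation
        r = np.real(np.fft.ifft2(np.fft.fft2(z) * np.conj(h_hat)))   # correlation response
        e = r - y                                                    # residual
        # The adjoint of circular correlation multiplies by h_hat (no conjugate),
        # so the gradient w.r.t. z is IFFT( FFT(2e) . h_hat ).
        grad_z = np.real(np.fft.ifft2(np.fft.fft2(2.0 * e) * h_hat))
        grad_w = grad_z * x                                          # chain rule through w * x
        return w - lr * grad_w

Repeating such steps suppresses window entries whose image regions hurt the correlation quality and emphasizes those that help, which is the intuition behind the robustness to occlusion and deformation described in the abstract.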
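The tree-structured ensemble of the second contribution can likewise be sketched as a binary tree whose nodes store expert templates and which is traversed toward the expert most similar to the most recent appearance; the node layout and the normalized-correlation similarity below are illustrative assumptions, not the exact organization or activation rule used in the thesis.

    import numpy as np
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class ExpertNode:
        # One expert tracker, reduced here to the appearance template it stores.
        template: np.ndarray
        left: Optional["ExpertNode"] = None
        right: Optional["ExpertNode"] = None

    def _similarity(template, observation):
        # Zero-mean normalized correlation between a stored template and the
        # current object appearance.
        a = (template - template.mean()).ravel()
        b = (observation - observation.mean()).ravel()
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

    def select_expert(root, observation):
        # Walk down the tree, at each level following the child whose template
        # better matches the most recent object appearance; the selected expert
        # is then used for localization and update.
        node = root
        while node.left is not None and node.right is not None:
            node = (node.left
                    if _similarity(node.left.template, observation)
                       >= _similarity(node.right.template, observation)
                    else node.right)
        return node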