Deep learning is the discipline of training computational models that are composed of multiple layers and these methods have recently improved the state of the art in many areas as a virtue of large labeled datasets, increase in the computational power of current hardware and unsupervised training methods. Although such a dataset may not be available for lots of application areas, the representations obtained by the well-designed networks that have a large representation capacity and trained with enough data are claimed to have the ability to generalize for transfer learning. As an example application, in this work, we investigate the use of stacked autoencoders for visual object tracking, which is a challenging yet very important task in computer vision. Training of autoencoders is achieved via an auxiliary dataset and the resultant representations are utilized within the tracking-by-detection framework. Experiments, realized using a challenge toolkit, indicate that exploiting the intricate structure in auxiliary dataset via hierarchical representations contributes to the solution of visual object tracking problem.