Clustering of manifold-modeled data based on tangent space variations

Thesis Type: Master's

Institution: Middle East Technical University, Faculty of Engineering, Department of Electrical and Electronics Engineering, Turkey

Thesis Approval Date: 2017

Student: GÖKHAN GÖKDOĞAN

Advisor: ELİF VURAL

Abstract:

An important research topic in recent years has been the understanding and analysis of data collections for clustering and classification applications. In many data analysis problems, the data sets at hand have an intrinsically low-dimensional structure and admit a manifold model. Most state-of-the-art clustering methods developed for data with nonlinear, low-dimensional structure are based on local linearity assumptions. However, clustering algorithms based on locally linear representations can tolerate difficult sampling conditions only to some extent, and may fail for sparsely sampled data manifolds or in high-curvature regions. In this thesis, we consider a setting where each cluster is concentrated around a manifold and propose a manifold clustering algorithm that relies on the observation that the variation of the tangent space must be consistent along curves over the same data manifold. We argue that the nonlinear geometric structure of manifold-modeled data sets can be better handled by taking into account the global data geometry via the change in the tangent space over the whole manifold. We first theoretically characterize some properties of manifolds of bounded curvature. We then use these observations to develop a geometry-based clustering approach. Finally, we evaluate the performance of the presented method with experiments on synthetic and real data sets. The results show that, on some types of data sets, the proposed method outperforms the compared manifold clustering algorithms based on Euclidean distance, geodesic distance, and sparse representations. Our study suggests that geometry-based dissimilarity measures can provide promising tools for the clustering of intrinsically low-dimensional data sets.
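
The abstract does not give the details of the algorithm, so the following Python sketch is only a rough illustration of what "tangent space variation" can mean in practice, not the method of the thesis. It assumes tangent spaces are estimated by local PCA over nearest neighbors and compared through their largest principal angle. The function names `local_tangent_basis` and `tangent_angle`, the neighborhood size `k`, and the unit-circle toy data are all hypothetical choices made for this example.

```python
import numpy as np

def local_tangent_basis(points, index, k, dim):
    """Estimate an orthonormal tangent basis at points[index] by PCA
    on its k nearest neighbors (an illustrative assumption)."""
    dists = np.linalg.norm(points - points[index], axis=1)
    neighbors = np.argsort(dists)[1:k + 1]  # skip the point itself
    centered = points[neighbors] - points[neighbors].mean(axis=0)
    # Right singular vectors span the directions of largest local variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:dim].T  # shape: (ambient_dim, dim)

def tangent_angle(basis_a, basis_b):
    """Largest principal angle (radians) between two tangent subspaces."""
    singular_values = np.linalg.svd(basis_a.T @ basis_b, compute_uv=False)
    return float(np.arccos(np.clip(singular_values.min(), -1.0, 1.0)))

# Toy data: 200 evenly spaced points on a unit circle (a 1-D manifold).
theta = np.linspace(0.0, 2.0 * np.pi, 200, endpoint=False)
circle = np.column_stack([np.cos(theta), np.sin(theta)])

t0 = local_tangent_basis(circle, 0, k=6, dim=1)
t5 = local_tangent_basis(circle, 5, k=6, dim=1)
t50 = local_tangent_basis(circle, 50, k=6, dim=1)

near_angle = tangent_angle(t0, t5)   # nearby points: small tangent change
far_angle = tangent_angle(t0, t50)   # quarter turn away: large tangent change
print(near_angle, far_angle)
```

On a smooth curve of bounded curvature, the angle between tangent spaces grows gradually with the distance travelled along the curve. A dissimilarity built on this quantity could, under these assumptions, penalize pairs of points whose tangent spaces differ more than the path between them can explain.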