Clustering of manifold-modeled data based on tangent space variations


Thesis Type: Postgraduate

Institution Of The Thesis: Orta Doğu Teknik Üniversitesi, Faculty of Engineering, Department of Electrical and Electronics Engineering, Turkey

Approval Date: 2017

Student: GÖKHAN GÖKDOĞAN

Supervisor: ELİF VURAL

Abstract:

An important research topic in recent years has been understanding and analyzing data collections for clustering and classification applications. In many data analysis problems, the data sets at hand have an intrinsically low-dimensional structure and admit a manifold model. Most state-of-the-art clustering methods developed for data with a nonlinear, low-dimensional structure are based on local linearity assumptions. However, clustering algorithms based on locally linear representations can tolerate difficult sampling conditions only to some extent, and may fail for sparsely sampled data manifolds or in high-curvature regions. In this thesis, we consider a setting where each cluster is concentrated around a manifold, and we propose a manifold clustering algorithm that relies on the observation that the variation of the tangent space must be consistent along curves over the same data manifold. We argue that the nonlinear geometric structure of manifold-modeled data sets can be better handled by taking into account the global data geometry via the change in the tangent space over the whole manifold. We first theoretically characterize some properties of manifolds of bounded curvature. We then use these observations to develop a geometry-based clustering approach. Finally, we evaluate the performance of the presented method with experiments on synthetic and real data sets. The results show that, on some types of data sets, the proposed method outperforms the compared manifold clustering algorithms based on Euclidean distance, geodesic distance, and sparse representations. Our study suggests that geometry-based dissimilarity measures can provide promising tools for the clustering of intrinsically low-dimensional data sets.
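
The idea of tangent space variation mentioned in the abstract can be illustrated with a minimal sketch. This is not the algorithm of the thesis, which is not specified here. It assumes that tangent spaces are estimated by local PCA over the k nearest neighbors and that two tangent spaces are compared through their principal angles; the function names `local_tangent_basis` and `tangent_distance`, the parameter values, and the toy circle data are all hypothetical choices made for illustration.

```python
import numpy as np

def local_tangent_basis(points, index, k, dim):
    """Estimate an orthonormal tangent basis at points[index] by local PCA.

    Assumption for illustration: the tangent space is spanned by the top
    `dim` principal directions of the k nearest neighbors.
    """
    dists = np.linalg.norm(points - points[index], axis=1)
    neighbors = np.argsort(dists)[1:k + 1]  # skip the point itself
    centered = points[neighbors] - points[neighbors].mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:dim].T  # shape: (ambient_dim, dim)

def tangent_distance(basis_a, basis_b):
    """Distance between two subspaces from the sines of their principal angles."""
    cosines = np.linalg.svd(basis_a.T @ basis_b, compute_uv=False)
    cosines = np.clip(cosines, 0.0, 1.0)
    return float(np.sqrt(np.sum(1.0 - cosines ** 2)))

# Toy data: 200 points on the unit circle, a 1-dimensional manifold in the plane.
angles = np.linspace(0.0, 2.0 * np.pi, 200, endpoint=False)
circle = np.column_stack([np.cos(angles), np.sin(angles)])

b0 = local_tangent_basis(circle, 0, k=6, dim=1)
b1 = local_tangent_basis(circle, 1, k=6, dim=1)
b50 = local_tangent_basis(circle, 50, k=6, dim=1)

near = tangent_distance(b0, b1)   # adjacent samples: small tangent change
far = tangent_distance(b0, b50)   # a quarter turn apart: orthogonal tangents
print(near, far)
```

On a smooth manifold of bounded curvature, the tangent space changes gradually between nearby samples, so `near` is small while `far` is large. A clustering method in the spirit of the abstract could use such a quantity as a dissimilarity measure along paths between points, although how the thesis actually does this is not stated in the abstract.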