Mixed integer programming and heuristics approaches for clustering with cluster-based feature selection

Tezin Türü: Yüksek Lisans

Tezin Yürütüldüğü Kurum: Orta Doğu Teknik Üniversitesi, Fen Bilimleri Enstitüsü, Yöneylem Araştırması Anabilim Dalı, Türkiye

Tezin Onay Tarihi: 2019

Tezin Dili: İngilizce

Öğrenci: SENA ÖNEN ÖZ

Danışman: Cem İyigün

Özet:

Cluster analysis tries to figure out the hidden similarities between data points in orderto place similar data points into the same group and different data points into separategroups using unlabeled data. Understanding the data becomes difficult and the powerof obtaining informative clusters for an algorithm decreases as the dimensionality ofthe data set gets high. Identifying the relevant features of high dimensional data setsis the mostly used technique in order to increase the performance of the algorithm tofind the best clusters. However, selecting or deselecting the features comes up withthe assumption that all the selected features have the same relevance for all clusters.In this study, it is assumed that the features to be used in clustering may differ foreach cluster. Number of clusters and number of relevant features in each clusterare given in advance. By using a center-based clustering approach, identifying thecluster centers, assigning data points to a cluster and selecting relevant features foreach cluster are performed simultaneously. A mixed integer mathematical model isproposed which minimizes the total distance between data points and their clustercenter by using the selected features for each cluster. Since the proposed model is not linear, mathematical models using different linearization methods have been used tosolve the problem. In addition to those mathematical models, we propose BendersDecomposition solution method implemented on our problem. Besides, two differentheuristic algorithms have been developed by taking into account the nature of thementioned problem. The proposed mathematical models and heuristic algorithmshave been experimented on several data sets in different problem sizes in terms of number of clusters, number of relevant features and number of data points.