Improving search result clustering by integrating semantic information from Wikipedia

Çağatay Çallı

Thesis Type: Master's

Institution: Middle East Technical University, Graduate School of Natural and Applied Sciences, Turkey

Thesis Approval Date: 2010

Student: Çağatay Çallı

Co-Supervisors: Göktürk Üçoluk, Onur Tolga Şehitoğlu

Open Archive Collection: AVESİS Open Access Collection

Abstract:

Suffix Tree Clustering (STC) is a search result clustering (SRC) algorithm that generates overlapping clusters with meaningful labels in linear time. It demonstrated the feasibility of SRC, but subsequent studies introduced description-first algorithms that generate better labels and achieve higher precision. STC nevertheless remained the fastest SRC algorithm, and later studies addressed various shortcomings of STC. In this thesis, semantic relations between cluster labels and documents are exploited to filter out noisy labels and to improve the merging phase of STC. Wikipedia is used to identify these relations, and methods for integrating semantic information into STC are suggested. Semantic features are shown to be effective for the SRC task when used together with term frequency vectors. Furthermore, no SRC studies on Turkish had been conducted before this work. In this thesis, a dataset for Turkish is introduced and a number of methods are tested on Turkish.