Developing a text categorization template for Turkish news portals


Toraman Ç., Can F., Koçberber S.

2011 International Symposium on INnovations in Intelligent SysTems and Applications, INISTA 2011, Istanbul-Kadikoy, Türkiye, 15-18 June 2011, pp. 379-383

  • Publication Type: Conference Paper / Full-Text Paper
  • DOI: 10.1109/inista.2011.5946096
  • City of Publication: Istanbul-Kadikoy
  • Country of Publication: Türkiye
  • Pages: pp. 379-383
  • Keywords: news portals, text categorization, Turkish news
  • Affiliated with Middle East Technical University: No

Abstract

In news portals, category information is needed to present news stories. However, for many stories this information is unavailable, incorrectly assigned, or too generic, which makes text categorization a necessary tool for news portals. Automated text categorization (ATC) is a multifaceted and difficult process that involves decisions about tuning several parameters, term weighting, word stemming, stop word removal, and feature selection. In this study, we aim to find a categorization setup that provides highly accurate ATC results for Turkish news portals. We also examine other aspects, such as the effect of training set size and robustness issues. Two Turkish test collections with different characteristics are created using the Bilkent News Portal. Experiments are conducted with four classification methods: C4.5, KNN, Naive Bayes, and SVM (using polynomial and RBF kernels). Our results recommend a text categorization template for Turkish news portals and provide some pointers for future research. © 2011 IEEE.
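
To make the pipeline stages named in the abstract concrete, the following is a minimal sketch of stop word removal, a crude stemming step, and a multinomial Naive Bayes classifier in plain Python. It is not the setup evaluated in the paper: the stop word list, the fixed-length prefix truncation used in place of a real stemmer, the toy training stories, and the category labels `spor` and `ekonomi` are all assumptions made for illustration, and term weighting and feature selection are omitted.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical stop word list for illustration; not taken from the paper.
STOP_WORDS = {"ve", "bir", "bu", "ile", "için", "da", "de"}


def preprocess(text, prefix_len=5):
    """Lowercase, tokenize, drop stop words, and truncate each token.

    Prefix truncation is an assumed stand-in for a real Turkish stemmer.
    """
    tokens = re.findall(r"\w+", text.lower())
    return [t[:prefix_len] for t in tokens if t not in STOP_WORDS]


def train(docs):
    """Collect per-class document and term counts from (text, label) pairs."""
    class_docs = Counter()
    term_counts = defaultdict(Counter)
    vocab = set()
    for text, label in docs:
        class_docs[label] += 1
        for term in preprocess(text):
            term_counts[label][term] += 1
            vocab.add(term)
    return {"class_docs": class_docs, "term_counts": term_counts, "vocab": vocab}


def classify(model, text):
    """Return the label with the highest log posterior (Laplace smoothing)."""
    total_docs = sum(model["class_docs"].values())
    vocab_size = len(model["vocab"])
    terms = [t for t in preprocess(text) if t in model["vocab"]]
    best_label, best_score = None, float("-inf")
    for label, n_docs in model["class_docs"].items():
        counts = model["term_counts"][label]
        total_terms = sum(counts.values())
        score = math.log(n_docs / total_docs)
        for term in terms:
            score += math.log((counts[term] + 1) / (total_terms + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training stories, invented for this sketch.
TRAIN = [
    ("takım maçı kazandı ve gol attı", "spor"),
    ("futbol ligi bu hafta başladı", "spor"),
    ("borsa ve dolar bugün yükseldi", "ekonomi"),
    ("merkez bankası faiz kararı açıkladı", "ekonomi"),
]

model = train(TRAIN)
print(classify(model, "dolar ve faiz yükseldi"))
print(classify(model, "takım gol attı"))
```

In a realistic setting, the choices hard-coded here (stemming method, stop word list, classifier) are exactly the parameters that a categorization template of the kind the abstract describes would have to fix, and each would need to be validated on held-out news stories.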